Migrating from AWS to AWS

This is a story about how we migrated from AWS Elastic Beanstalk to a Terraform + Consul + Nomad cluster in AWS EC2.

Some background

PDVend started on Heroku. It’s a very nice option for startups: you just deploy your application, without having to deal with complex infrastructure concerns or elaborate CI/CD flows.

As you grow, there comes a time when you realize Heroku may no longer be cost-efficient, depending on your needs; when that moment came for us, we started migrating to AWS.

Along with the move to AWS, we started a new project that brought new technology needs to our stack — caught up in the AWS hype, we chose Amazon services for about 90% of it.

Things just worked. Like magic.

Our simplified AWS Stack

Our motivation to change

Why would we change course with a big (and still incomplete) project on our hands? At some point, we landed our first “future client”. They were a big case for our new product.

Over the course of the business and technical meetings, we agreed that the application would run in their datacenters, on premises. Of course we tried to keep our SaaS model, but without success.

Our first thought was: “Oh, okay. The application is dockerized, so running it in another datacenter won’t be a big issue.” Then we remembered our AWS dependencies. You can’t run AWS managed services on premises.

We had to replace them with open-source software: Apache Kafka for the background job queue and telemetry event delivery, eMQTTd instead of AWS IoT, among other choices.

This was not a big issue, since our application is built on the Onion Architecture. Writing a new adapter was easy peasy, and we got absolutely zero issues out of it.

The most worrying part of this transition was having to deal with bare-metal datacenters. Our product was born cloud-native, our tech team’s mindset is cloud-oriented, and one of our differentiators is bringing our customers to the cloud, to ubiquity. We saw it as a shame!

The big question was: how can we manage auto-scaling, resiliency, orchestration, etc. with the same ease on bare metal as cloud?

We looked at a few tools that could maybe do this for us: Kubernetes, Apache Mesos, and Docker Swarm. But the one that made our eyes shine was Nomad — it was love at first sight.

The point of all these options is to abstract the infrastructure away from the application. Imagine two machines with 2 GB of RAM each acting as one big 4 GB machine running our application (a simplistic view, of course).

Nomad worked well for us because of its simple installation and wide range of possibilities: you just drop a binary in /usr/local/bin and start using it.

Even though Nomad was the simplest tool we found, that doesn’t mean it was trivial. We suspect all of these tools would present similar difficulties, mostly because of the learning curve. Anyway, in one working day we built a PoC that showed us it was viable to abstract the infrastructure in our case.

Nomad on Steroids

We also used Consul, even though it is not a requirement. Consul takes away from Nomad (and pretty much from us) the responsibility of discovering and registering new nodes.

Consul keeps Nomad’s simplicity: drop a binary and that’s all. You don’t even need to tell Nomad to use Consul for service discovery — it detects the local agent automatically.
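As an illustration, minimal server-node configurations might look like the sketch below (paths, addresses, and the three-server quorum are assumptions for the example, not our actual files):

```hcl
# consul.hcl — hypothetical Consul server config (Consul also accepts JSON).
server           = true
bootstrap_expect = 3          # expect a 3-server quorum before electing a leader
data_dir         = "/opt/consul"

# nomad.hcl — hypothetical Nomad server config. Note there is no consul {}
# block: Nomad automatically talks to the local Consul agent on 127.0.0.1:8500.
data_dir = "/opt/nomad"
server {
  enabled          = true
  bootstrap_expect = 3
}
```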

After our PoC, we saw it would be possible to manage our application independently of the infrastructure underneath. Then it was time to replace Elastic Beanstalk with it.

Elastic Beanstalk Replacement Saga

We worried a lot about having to maintain servers individually. We believe infrastructure should be highly replaceable, and leaving this task to human hands causes delays.

Elastic Beanstalk solved this problem, as it guarantees (as far as possible) that servers will be there with the software, hardware, and network configuration we defined. So this became a prerequisite for our Nomad cluster.

EC2 by itself can’t guarantee this, but an EC2 Auto Scaling Group (ASG) can. The big problem was that configuring everything we needed to create the cluster could mean a lot of manual work and preparation.

Then Terraform came to the rescue — we opted for it because of our previous technology choices: since we were using other HashiCorp tools with so much success, this one would probably be nice too.

Terraform lets you manage infrastructure as code. All you need to do is describe what you need and provide the right credentials, and it does the job. I would like to highlight its speed: building and destroying everything our new infrastructure needs, from zero, takes only a few minutes.

At the time of writing, our infrastructure is composed of:

  • 1 Bastion/SSH host (EC2)
  • 6 Consul/Nomad servers and agents (EC2)
  • 1 Auto Scaling Group (ASG), which also demands the creation of a launch configuration
  • 1 external Application Load Balancer (ALB), which demands the creation of some target groups and listeners
  • 1 internal Application Load Balancer (ALB), which demands the creation of some target groups and listeners
  • 1 external Classic Load Balancer (ELB)
  • 1 VPC with 2 subnets, 1 internet gateway, and 2 security groups

We built a Terraform configuration based on this project, which explains how it works.
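To give an idea of the shape of that configuration, here is a hedged sketch of the ASG part (resource names, AMI, sizes, and subnet references are placeholders, not our real setup):

```hcl
# Hypothetical launch configuration for the cluster nodes: a pre-baked AMI
# with the Consul and Nomad binaries already dropped in /usr/local/bin.
resource "aws_launch_configuration" "nomad_node" {
  name_prefix   = "nomad-node-"
  image_id      = "ami-00000000"   # placeholder AMI id
  instance_type = "t2.medium"

  lifecycle {
    create_before_destroy = true   # replace nodes instead of editing them
  }
}

# The ASG keeps the desired number of interchangeable nodes alive.
resource "aws_autoscaling_group" "nomad_nodes" {
  launch_configuration = "${aws_launch_configuration.nomad_node.name}"
  min_size             = 3
  max_size             = 6
  vpc_zone_identifier  = ["${aws_subnet.a.id}", "${aws_subnet.b.id}"]
}
```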

At this point we could:

  • Create and destroy all the infrastructure in less than 5 minutes, with the only human interaction needed being running a command.
  • Version the infrastructure, as it is code.
  • Abstract the infrastructure from the application. (We haven’t tested it yet, but on one of our hack days we will surely try to smoothly run our application on a Raspberry Pi cluster 😋)
  • Rely on Nomad to keep things running and on Consul to let new services join the cluster.

Replacing AWS dependencies with OSS

At the moment we decided to start this process, we had four dependencies on AWS:

  • S3: To store static files
  • IoT: To send push notifications (due to technology restrictions)
  • SQS: To keep the queue of background jobs
  • Firehose: To transfer telemetry events to a Redshift database

Our choices of replacement were:

  • Store static files in a DB or file servers (we aren’t sure yet)
  • EMQ for push notifications
  • Kafka for the background job queue and for transporting telemetry events

Each of the new services can run in Docker, so the only work was writing a complete Nomad job and then simply running nomad run our-job.hcl
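For illustration, a job for one of these services could look roughly like this (image name, ports, and resource sizes are guesses for the sketch, not our production job):

```hcl
job "emq" {
  datacenters = ["dc1"]
  type        = "service"

  group "brokers" {
    count = 3   # one EMQ broker per allocation

    task "broker" {
      driver = "docker"

      config {
        image = "emqttd:latest"   # placeholder image tag
        port_map {
          mqtt = 1883
        }
      }

      resources {
        cpu    = 500   # MHz
        memory = 512   # MB
        network {
          port "mqtt" {}
        }
      }

      # Registered in Consul automatically, so other nodes can discover it.
      service {
        name = "emq"
        port = "mqtt"
      }
    }
  }
}
```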

Okay, it was not that easy — not because of Nomad itself, but because of clustering issues with the Docker images we use. We saw two behaviours:

  • The image can automatically create the cluster (e.g. EMQ)
  • The image needs a list of the nodes to form the cluster/replica set (e.g. ZooKeeper, a Kafka dependency)

Our first shot was to send and listen for broadcast messages with socat to find other nodes of the same service and form the cluster; but since we have more than one subnet, there were cases where, depending on how the nodes/services were laid out, it didn’t work.

In both cases our savior was Consul — which we had chosen just for its ease of use, without predicting this case. It exposes a REST API on localhost to every node, giving access to the list of nodes exposing a given service. So, as we say in Brazil, we “ran to the hug”.

We counted on JSON.sh to parse the API response, as it was easier than writing a snippet in a language we would probably have to install in our containers (most of them run Alpine).
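Putting the two together, the discovery step boiled down to something like the sketch below. The function names are ours for illustration, and since JSON.sh is an external dependency, the parsing here is a stand-in sed expression covering the same “give me the node addresses” case:

```shell
#!/bin/sh
# Pull every node "Address" field out of a Consul catalog response.
# (In production we parsed with JSON.sh; this sed stand-in keeps the
# sketch dependency-free on Alpine/busybox.)
parse_addresses() {
  tr '{' '\n' | sed -n 's/.*,"Address":"\([^"]*\)".*/\1/p'
}

# List the addresses of every node running a given service, using the
# REST API that Consul exposes on localhost to every node.
consul_service_nodes() {
  curl -s "http://localhost:8500/v1/catalog/service/$1" | parse_addresses
}

# Example: peers for the ZooKeeper ensemble that Kafka needs.
# consul_service_nodes zookeeper
```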

Our final production structure

Our final workflow

The final deploy process is composed of three agents:

  • GitHub: Our choice for versioning code. When new commits are pushed, its webhooks notify Semaphore.
  • Semaphore: When notified of new code in the repository, it builds the application’s Docker image, tests it, and returns the status check to GitHub, allowing us to merge the feature/fix branch into master. 
    When the master branch’s tests pass, it automatically deploys to our Nomad cluster. The “deploy” step is basically running the job again, which forces Nomad to pull the latest image from our private Docker registry.
  • Nomad Cluster: Where the magic happens. :D
    It could just as well be a Bare Metal + Consul + Nomad cluster. (Or a Raspberry Pi + Consul + Nomad cluster 😂)

Perspectives for the future

We expect this structure to let us truly abstract the infrastructure in more cases than just this client and our SaaS application. That is the primary goal of this whole saga.

As for secondary objectives, we expect to add CI/CD for the infrastructure itself. Since it is versioned, we could plug it into a pipeline that applies it automatically (with manual approval).

Also, using Terraform allows us to think about going multi-cloud. Of course, that’s a long-term consideration for us, but we know it won’t be that hard.

Conclusions

  • HashiCorp definitely won a new team of fans. ❤️
  • It is possible (and viable) to abstract the infrastructure from the application, but it requires the team to have this mindset and the application to be prepared for it.
  • The Onion Architecture was a decisive factor in the ease and speed of this process.

That’s all, folks! Feel free to leave your comment or question!