Migrating our Rails application to AWS Fargate

Bernat Rafales · Published in Code & Wild · Jan 2, 2019 · 8 min read

Despite being one of the fastest growing companies in the UK, our Engineering team here at Bloom & Wild is smaller than you might imagine. We use Rails to power most of our backend systems, and we went through a pretty standard startup journey when it comes to hosting and deploying our code. We started on Heroku; then, as we grew and container technology became more mature and popular, we moved to a Docker-based environment by hosting our own Rancher 1.6 cluster. Rancher served us well, but was problematic for a few reasons:

  • The version of Rancher we used (1.6) required us to maintain custom deploy scripts to shift traffic from an old version of our application to a new one.
  • We sell flowers. This means we are a highly seasonal business. The amount of pressure our systems are under during Mother’s Day is orders of magnitude higher than during the rest of the year, so we need to be able to scale our infrastructure up and down with ease. With Rancher 1.6 that was not as simple as it could be: new EC2 instances had to be added to the cluster when scaling up, and that process was neither fully automated nor as easy or straightforward as we’d like. Scaling down had the same issues.
  • In order to do a blue-green deployment of sorts, we always had to keep quite a lot of spare capacity that was only used when deploys happened (which in our case is still rarely more than once per day).
  • Last but not least, we wanted a solution that required as little maintenance as possible, as we want to focus our resources on building and maintaining features for our customers rather than having to worry about managing our infrastructure, especially for a team our size.

Meet AWS Fargate

Amongst the different solutions we looked at, AWS Fargate seemed to tick all the boxes:

  • It works with a model very similar to the one we already had in Rancher: Docker containers running in a cluster. This means we could reuse most of our existing Dockerfiles and CI/CD processes if needed.
  • It’s in AWS, where we have most of our infrastructure already, so it was easy to integrate with other services and we are all already familiar with AWS.
  • It’s serverless and fully managed, as weird as that may sound for a Docker-based system. You only have to worry about how many containers of your Rails app you want to run and how much CPU and memory you want to assign to those containers, and AWS will handle the rest.

How we run our backend infrastructure with AWS Fargate

Our current stack is pretty standard:

  • We have a Rails powered JSON API.
  • We have a few other Rails-based web services for internal purposes.
  • We run background jobs with ActiveJob using Sidekiq as a backend.
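
For context, pointing ActiveJob at Sidekiq is a one-line Rails setting. A minimal sketch (the application module name here is a placeholder, not our real one):

```ruby
# config/application.rb — minimal sketch, module name is illustrative
module OurApp
  class Application < Rails::Application
    # Route all ActiveJob jobs through Sidekiq's Redis-backed queues
    config.active_job.queue_adapter = :sidekiq
  end
end
```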

Below is a high-level diagram of our existing systems:

[Diagram: our main AWS Fargate cluster]

AWS Fargate allows you to group your applications in what it calls clusters. In each cluster you can have multiple services. Each service runs a number of tasks, based on task definitions. And finally, a task definition is a blueprint of your application. I like to think of task definitions as souped-up Docker Compose files: they let you specify a number of Docker images to run, each with its own environment configuration, and assign CPU and memory constraints to the task as a whole. This gives you plenty of room to architect how you want to run your system. You may want to group all your background workers in a single task definition, or have one task definition per worker, or a mixture of both.
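
To make the analogy concrete, here is a minimal, illustrative Fargate task definition; every name, ARN and size below is a placeholder rather than our real configuration:

```json
{
  "family": "api-web",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "web",
      "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/api:latest",
      "essential": true,
      "portMappings": [{ "containerPort": 3000, "protocol": "tcp" }],
      "environment": [{ "name": "RAILS_ENV", "value": "production" }],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/api-web",
          "awslogs-region": "eu-west-1",
          "awslogs-stream-prefix": "web"
        }
      }
    }
  ]
}
```

The cpu and memory values at the top level apply to the task as a whole, and that is what Fargate uses to provision capacity for it.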

In our case, we decided to create separate services for each logical part of our stack, so we ended up with one service for the API, another one for each of our other web services, and finally one service per type of background worker. The reasons behind this configuration are essentially:

  • We get more flexibility when scaling. If the API is under stress because we have lots of customers on the site, we can scale up only the number of tasks running on the API service. Similarly, if it’s our internal tools that are under pressure, we can do the same without impacting the rest of the system. The same goes for background workers: we can fine-tune how many workers we want processing each of the Sidekiq queues we use, or even apply more specific autoscaling policies if needed (there’s a sketch of one after this list).
  • It gives us a higher degree of fault tolerance. AWS Fargate will ensure a minimum number of tasks for each service is always up and running, but if something does go wrong with a given container, having it isolated in its own service means the failure will not have an impact on other parts of the running application.
  • We get better insight into the load on our systems: each of our task definitions includes a DataDog agent (more on that later), which means we have a very good view of what’s going on inside our different services. With each service in charge of doing only one thing, we can more easily find which parts of our system come under stress in certain circumstances, and allocate resources accordingly.
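
As an illustration of the per-service scaling mentioned in the first point above, this is roughly what a target tracking autoscaling policy looks like with the Ruby AWS SDK; the cluster, service, region and threshold values are made up:

```ruby
require "aws-sdk-applicationautoscaling"

autoscaling = Aws::ApplicationAutoScaling::Client.new(region: "eu-west-1")

# Let the API service scale between 4 and 20 tasks
autoscaling.register_scalable_target(
  service_namespace: "ecs",
  resource_id: "service/main/api",
  scalable_dimension: "ecs:service:DesiredCount",
  min_capacity: 4,
  max_capacity: 20
)

# Add or remove tasks to keep average CPU across the service at around 60%
autoscaling.put_scaling_policy(
  policy_name: "api-cpu-target-tracking",
  service_namespace: "ecs",
  resource_id: "service/main/api",
  scalable_dimension: "ecs:service:DesiredCount",
  policy_type: "TargetTrackingScaling",
  target_tracking_scaling_policy_configuration: {
    target_value: 60.0,
    predefined_metric_specification: {
      predefined_metric_type: "ECSServiceAverageCPUUtilization"
    }
  }
)
```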

Application Load Balancers, routing requests and deploying our code

As mentioned earlier, we run a few web services using our Rails application. Fortunately, AWS Fargate integrates very well with Application Load Balancers. The journey of a typical HTTP request to our systems looks like this:

  • A request comes into the load balancer.
  • The load balancer inspects the request.
  • Based on custom rules (which can match things like the request’s domain name, or certain request headers), the load balancer routes that request to a target group, which is linked to a specific AWS Fargate service (there’s a sketch of such a rule after this list).
  • The request is then sent to one of the running tasks of that service.
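
As a rough sketch of the routing rule step above, this is what adding a host-based rule to the load balancer could look like with the Ruby AWS SDK; the hostname, priority, region and ARNs are illustrative placeholders:

```ruby
require "aws-sdk-elasticloadbalancingv2"

elb = Aws::ElasticLoadBalancingV2::Client.new(region: "eu-west-1")

# Send requests for an internal tools hostname to that service's target group
elb.create_rule(
  listener_arn: "arn:aws:elasticloadbalancing:eu-west-1:123456789012:listener/app/main/abc/def",
  priority: 10,
  conditions: [
    { field: "host-header", values: ["tools.example.com"] }
  ],
  actions: [
    { type: "forward", target_group_arn: "arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/tools/123" }
  ]
)
```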

And this is how we ensure traffic gets routed to the right place in our systems. Each task definition can expose certain ports, so web services can be listening on port 3000, and each service registers itself with an AWS Target Group, which is how the load balancer knows where to send a request.
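
That target group registration happens when the Fargate service is created. A simplified sketch with the Ruby AWS SDK (the names, subnets, region and ARNs are placeholders):

```ruby
require "aws-sdk-ecs"

ecs = Aws::ECS::Client.new(region: "eu-west-1")

ecs.create_service(
  cluster: "main",
  service_name: "api",
  task_definition: "api-web",   # latest active revision of the family
  desired_count: 4,
  launch_type: "FARGATE",
  network_configuration: {
    awsvpc_configuration: {
      subnets: ["subnet-aaaa1111", "subnet-bbbb2222"],
      security_groups: ["sg-cccc3333"],
      assign_public_ip: "DISABLED"
    }
  },
  # Register the web container's port 3000 with the load balancer's target group
  load_balancers: [
    {
      target_group_arn: "arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/api/456",
      container_name: "web",
      container_port: 3000
    }
  ]
)
```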

In the case of our workers, there’s no need to do any fancy stuff as they never get any incoming HTTP traffic.

[Diagram: task definitions and services working together]

Deploys also become really easy, as Fargate manages the traffic shifting for you. In a nutshell, when a new version of the code needs to be deployed, the process works as follows (there’s a sketch of how we trigger it after this list):

  • Fargate spins up new containers with the new Docker image.
  • Once the containers are up and running, Fargate pings their health check and marks them as “green” once the health check is successful.
  • Then all new traffic gets routed to the new containers.
  • After a configurable amount of time, old containers will get deleted.
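
For illustration, this is roughly what triggering such a deploy looks like from our side once CI has built a new task definition revision; the service name, revision number and region are made up, and the deployment configuration controls how much extra capacity Fargate may use during the rollout:

```ruby
require "aws-sdk-ecs"

ecs = Aws::ECS::Client.new(region: "eu-west-1")

ecs.update_service(
  cluster: "main",
  service: "api",
  task_definition: "api-web:42",   # revision that references the new Docker image
  deployment_configuration: {
    maximum_percent: 200,          # run new tasks alongside the old ones
    minimum_healthy_percent: 100   # never drop below the desired count mid-deploy
  }
)
```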

The really good news is that you get all of the above with minimal effort, as most of it works out of the box.

Monitoring and logging

When it comes to monitoring and logging, things are relatively straightforward as well. We use DataDog to monitor our infrastructure and also to parse our logs. Fortunately for us, DataDog integrates well not only with ECS but also with Fargate. In order to get detailed metrics from our containers, the only thing we need to do is make sure all of our task definitions include the DataDog agent container.

Find below a sample snippet for a task definition that includes the DataDog agent (adapted from their blog post explaining how to do the integration; the image names, sizes and API key below are illustrative placeholders, and the execution role and log configuration are omitted for brevity):
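
```json
{
  "family": "api-web",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "1024",
  "memory": "2048",
  "containerDefinitions": [
    {
      "name": "web",
      "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/api:latest",
      "essential": true,
      "portMappings": [{ "containerPort": 3000 }]
    },
    {
      "name": "datadog-agent",
      "image": "datadog/agent:latest",
      "essential": true,
      "environment": [
        { "name": "ECS_FARGATE", "value": "true" },
        { "name": "DD_API_KEY", "value": "<your DataDog API key>" }
      ]
    }
  ]
}
```

In practice the API key is better injected through the Secrets support mentioned later in this post, rather than hard-coded in the task definition.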

The DataDog agent is also capable of using Docker image tags to tag the metrics, so you can filter them more effectively in your dashboards and alerts. This gives us near real-time metrics on CPU and memory usage for all of our running containers.

As for logging, things are a bit more complicated. Fargate only logs to CloudWatch by default, and the DataDog agents in the task definitions have no access to those logs. Fortunately, DataDog have a solution for us: they’ve written a Lambda function that gets triggered when new log lines are written to a given CloudWatch Log Group and pipes them to their servers. We have slightly modified the function so all logs we send to DataDog are tagged with the service name, for easy filtering.
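
A minimal sketch of wiring a service’s log group up to that forwarder with the Ruby AWS SDK; the log group name, region and Lambda ARN are placeholders, and granting CloudWatch Logs permission to invoke the function is omitted:

```ruby
require "aws-sdk-cloudwatchlogs"

logs = Aws::CloudWatchLogs::Client.new(region: "eu-west-1")

# Stream every new log event from this log group to the forwarder Lambda
logs.put_subscription_filter(
  log_group_name: "/ecs/api-web",
  filter_name: "datadog-forwarder",
  filter_pattern: "",   # an empty pattern forwards everything
  destination_arn: "arn:aws:lambda:eu-west-1:123456789012:function:datadog-log-forwarder"
)
```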

The journey so far

We’ve been on Fargate for a few months now, and so far we are very pleased with it. We have accomplished our main goals:

  • Scaling is easy. We just jump into the ECS console and add more tasks to our services. Job done. This can be a bit faffy, so we are now looking at autoscaling policies, or at making scaling up and down easier through scripts or Slack bots (there’s a sketch of the kind of script we mean after this list).
  • While it took some work to move to Fargate, the fundamental technology we use hasn’t changed. We are still using the same Docker images we used before and are familiar with.
  • We don’t have to manage any of the nitty-gritty of starting and stopping containers and making sure deploys are safe and involve no downtime. This frees up a lot of the engineering team’s time, which can be spent building more differentiating features for our customers.
  • AWS keeps improving and investing in AWS Fargate, which means we’ll slowly but surely get access to more features, some of which could prove to be very useful to us. They have recently added support for Secrets handling, for example.
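
The scaling script idea from the first point is small enough to sketch here; the cluster, service, region and count are illustrative:

```ruby
require "aws-sdk-ecs"

ecs = Aws::ECS::Client.new(region: "eu-west-1")

# Bump the API service to 12 tasks ahead of a peak day
ecs.update_service(cluster: "main", service: "api", desired_count: 12)
```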

All that glitters is not gold, though. As usual, making technical decisions is all about finding the right compromises. In our case this means:

  • We have tied ourselves even more to AWS. While it’s not usually safe to put all your eggs in the same basket, this is not something that particularly worries us as most of the underlying infrastructure is not actually that coupled to AWS Fargate and would be reasonably easy to move to a different container orchestration technology.
  • We have lost control over certain things. That is to be expected when you move to a fully managed service. At this stage we’d rather focus our engineering efforts on other aspects of our business rather than our tech infrastructure, so long as AWS Fargate meets all of our needs.
  • Containers are much slower to boot up. As is usually the case with long-lived monolithic Rails applications, our Docker images are not small, and AWS Fargate doesn’t seem to do any sort of image caching. This results in containers that are quite slow to spin up, which makes scaling up quickly not really feasible, so we cannot react to spiky traffic as fast as we’d like to.
  • We have a bigger AWS bill to pay at the end of the month. From a pure price per CPU/memory per minute perspective, we are paying considerably more with Fargate than we would by managing our own cluster and EC2 instances. However, we save a lot of engineering time, and therefore money, that would otherwise be spent tinkering with risky infrastructure when scaling up and down, and for us that is now far more important.
