Migration from Heroku to AWS ECS

victor piscue
21 Buttons Engineering
13 min read · Nov 26, 2018

Considerations

A few months ago we started to look for alternatives for hosting our backend infrastructure.

On Heroku we were running Python apps, using a Procfile with web and other workers. As our user base grew, the infrastructure running on Heroku became constrained: limits on the number of dynos per app, poor visibility inside New Relic, a lack of debugging options when deploying containers, and public communication with our data stores.

We wanted to use containers in production (not only for CI tests), in a more secure way in terms of exposure and communication with the rest of our platform and services. As we already run most of those services inside AWS, we considered two options: AWS EKS and AWS ECS.

We did a quick trial with AWS EKS, the official managed Kubernetes offering from AWS. It has some nice features, such as the cluster master being provisioned by Amazon, but (at the time of writing this article) it does not manage the worker nodes, which have to be provisioned and upgraded by yourself. A strong argument for EKS is the use of open source technology compatible with other providers and on-premise setups, although EKS uses AWS-specific annotations to run services.

After considering AWS ECS, it seemed the more mature option for running containers inside AWS without third-party tools: it saves costs compared to Heroku, allows running containers in private environments, and lets us set custom ways to scale out.

Comparison Between AWS ECS on EC2 and FARGATE

Using ECS you can deploy services on EC2 instances or using FARGATE:

On ECS EC2:
- Manage EC2 instances (self-managed, more maintenance)
- Able to debug (docker exec)
- Have to calculate EC2 capacity allocation
- Pay per EC2 instance
- Easy access to logs

With ECS Fargate:
- “Serverless” and a black box
- Managed by AWS
- Just decide CPU and RAM
- Pay per CPU and RAM per minute
- Logs on Cloudwatch

At our company we chose the newer model, FARGATE. It has the downside of no direct access to the host that runs the containers (no access to the Docker socket or debugging a specific container), but it lets us forget about managing EC2 instances. We just run services that deploy containers on hosts (presumably EC2) managed by AWS.

AWS ECS Basics

ECS is short for Elastic Container Service; basically, it is a Docker orchestrator created by AWS.

The components follow this hierarchy:

Clusters => Services => Tasks => Task definitions

Clusters

Clusters are the logical grouping of Services and Tasks inside ECS. At 21 Buttons we set up a Cluster per environment, so Services/Tasks run in isolation (separate Cluster and VPC) for each environment (development, staging, production).

Services

Services are in charge of maintaining and orchestrating the tasks/containers (restarting those that were killed or stopped) and autoscaling based on a metric.

For Services that have a listener (the ones called web on Heroku), we autoscale based on the average CPU metric. For the rest of the workers that need autoscaling, we created a Cloudwatch metric based on the queues reported by Django-rq; more on that later.
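We actually declare this in Terraform (more on that below), but as an illustrative sketch of the same CPU target-tracking setup through boto3 (cluster, service and values here are hypothetical):

import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "service/production/api-gateway-web"  # hypothetical cluster/service

# Allow the Service's desired count to move between 2 and 20 tasks
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=20,
)

# Track an average CPU of 60% across the Service's tasks
autoscaling.put_scaling_policy(
    PolicyName="cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
    },
)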

A Service also manages the deployments when you update it, allowing you to choose the best strategy for your service.

Tasks

Tasks are the unit of work inside ECS. They run one or more containers until they stop or are killed.

You can run tasks under a service or manually. We only run tasks manually for the migrations; more on that later too.

Task definitions

As the name suggests, a task definition defines how to run the task:

- Docker Image (ECR or Docker Hub)
- Command to run
- CPU and Memory Assigned
- Environment Variables
- Container Networking

ECR

ECR (Elastic Container Registry) is the container registry we chose to store our container images. It lets you have a private registry, and you pay for storage and egress. It also integrates nicely with the IAM permissions that all AWS services take advantage of.

Let’s dive in

Terraform

We like to use Terraform for everything, so ECS was also a good candidate to own some .tf files. In Terraform we provision the following:

- ECR repositories for the images: one per microservice, no distinction between environments
- A Cluster per environment
- A “dummy” Task definition that acts as a template for Jenkins to later do its own work
- A Service and Task definition for each process to be run
- A Task definition to allocate the migration process
- The autoscaling based on Cloudwatch alarms
- Alarms for the Services and Load Balancers
- The Load Balancers that forward requests to the microservice container listeners; we set up an ALB for every “web” process type

This “dummy” Task definition

It’s based on a JSON template that declares:
- An Nginx image
- It listens on TCP 8000
- It returns 200 on / and /health/
- A log stream per process type’s Service

[{
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "name": "${env}-${microservice}",
  "image": "${image}",
  "essential": true,
  "portMappings": [
    {
      "containerPort": ${port},
      "protocol": "${protocol}"
    }
  ],
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "/ecs/${env}-${microservice}-${process_type}",
      "awslogs-region": "${region}",
      "awslogs-stream-prefix": "ecs"
    }
  },
  "mountPoints": null,
  "volumesFrom": null,
  "hostname": null,
  "user": null,
  "workingDirectory": null,
  "extraHosts": null,
  "ulimits": null,
  "dockerLabels": null,
  "healthCheck": {
    "command": [
      "ls"
    ],
    "interval": 30,
    "timeout": 30,
    "retries": 3,
    "startPeriod": 0
  },
  "volumes": [],
  "networkMode": "awsvpc",
  "memory": ${memory},
  "cpu": ${cpu}
}]

This allows Terraform to bring up the whole cluster by itself, configuring the Load Balancer and all the process types. It’s a starting point so Jenkins can deploy the code onto an already running cluster/services, and it keeps the boundary between code and infrastructure separate.

To avoid enforcing the current cluster setup only through Terraform, we ignore changes to the Task Definitions and to the current desired count of the services. This allows Jenkins to change the Task Definitions, or us to scale the running containers manually via the Console, without Terraform overwriting it on every terraform apply.

Jenkins

This is the part where the party starts and everything comes together. To allow a smooth migration we set up a custom pipeline that deploys to both Heroku and ECS, so we can test that everything runs smoothly on ECS and decide when to point to the new service.

We like to be clean and tidy, so we use a Jenkins shared library for all this wizardry: the ECS upload logic lives in Jenkins and doesn’t clutter the code repositories, which keeps the developers happy.

Jenkins is in charge of quite a few things. Let’s start:

Entry points

Before building the container with the code, we create entry points for web and the rest of the workers.

The idea is to avoid escaping or workarounds on each command the containers need to run, so we parse the Procfile and create entry point files (entrypoint-PROCESSTYPE.sh) for every worker with the command inside.

for PROCESS_TYPE in $(cut -f1 -d: Procfile); do
  sed -n "s/^${PROCESS_TYPE}: *//p" Procfile > "entrypoint-${PROCESS_TYPE}.sh"; done

The full path of the entry point is what the container will run when it starts. Those files are copied in at the container build stage.

Deployment

When we deploy to each stage, we do several things related to ECS.

Upload to ECR

Our image URLs are constructed using the following patterns:

For the development environment:

$REPO_NAME:$BRANCH_NAME-$BUILD_ID-$GIT_COMMIT

These images are created for every pull request or push to one. We also set a lifecycle policy so we don’t store an infinite number of images we no longer use.

For staging and production:

IMAGEECR=$AWS_ECR_URL/$REPO_NAME:master-$GIT_COMMIT

The image for staging and production is the same, so we test exactly what we ship. We also have a different lifecycle policy for them to ensure we can roll back safely. More on rollbacks later on.
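As an illustration of such a lifecycle policy, here is a minimal sketch with boto3 (the rule and repository name are hypothetical, not our real ones):

import json
import boto3

ecr = boto3.client("ecr")

# Keep only the most recent images; older ones expire automatically
lifecycle_policy = {
    "rules": [
        {
            "rulePriority": 1,
            "description": "Expire old development images",
            "selection": {
                "tagStatus": "any",
                "countType": "imageCountMoreThan",
                "countNumber": 100,
            },
            "action": {"type": "expire"},
        }
    ]
}

ecr.put_lifecycle_policy(
    repositoryName="my-microservice",  # hypothetical repository
    lifecyclePolicyText=json.dumps(lifecycle_policy),
)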

Vault for Vars

In any case, we need a way to create and manage secrets and vars. On Heroku, we used the environment variables that you can set for every app. ECS has its own environment variables inside the Task definitions, but we wanted to automate and centralize this better, so we are not forced to change many Task definitions when we want to change a var on a microservice.

We decided to use Vault from HashiCorp. Before triggering any pipeline that deploys to ECS, we expect some vars to be set in Vault so we can retrieve them at this stage. We set vars for every microservice and environment, so we can point to different databases, Redis instances or hosts using the same code in the containers.

We download those variables as a JSON file that we will use later.
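A minimal sketch of that step, assuming a KV version 1 secrets engine, the hvac client and hypothetical paths:

import json
import os
import hvac

# Authenticate against Vault; the address and token come from the Jenkins environment
client = hvac.Client(url=os.environ["VAULT_ADDR"], token=os.environ["VAULT_TOKEN"])

# Hypothetical path layout: one secret per environment and microservice
secret = client.read("secret/production/api-gateway")
env_vars = secret["data"]

# Dump the vars to a JSON file that the task-definition script consumes later
with open("vault-vars.json", "w") as fh:
    json.dump(env_vars, fh)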

Retrieving actual ECS state

As we set up our cluster using Terraform, we need to know what’s there in order to deploy from Jenkins.

For every worker or listener, we download the current task definition in JSON format from ECS. This lets us forget whether the current definition was set by Terraform, by Jenkins, or changed via the AWS Console.

Python wizardry

The real fun starts here, and Python gets to use its powers. We created a script that gathers:

- The Task definition we fetched from ECS
- The image URL uploaded to ECR
- The command the container will run
- The JSON from Vault with the current variables

This Python script shakes all of this data together to create a new JSON with the Task Definition that will be registered as a new revision.

Registering a new Task Definition is not yet the deployment; we need to update the Service to point to the new revision to do that.
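A trimmed-down sketch of what that script does with boto3 — the merge logic is simplified and the helper below is illustrative, not our exact code:

import json
import boto3

ecs = boto3.client("ecs")

def build_new_revision(family, image, command, vault_vars_file):
    # Fetch the task definition currently registered for this worker/listener
    current = ecs.describe_task_definition(taskDefinition=family)["taskDefinition"]
    container = current["containerDefinitions"][0]

    # Override image, command and environment with the values for this build
    container["image"] = image
    container["command"] = command
    with open(vault_vars_file) as fh:
        container["environment"] = [
            {"name": k, "value": str(v)} for k, v in json.load(fh).items()
        ]

    # Register the result as a new revision of the same family
    response = ecs.register_task_definition(
        family=family,
        requiresCompatibilities=["FARGATE"],
        networkMode="awsvpc",
        cpu=current["cpu"],
        memory=current["memory"],
        executionRoleArn=current["executionRoleArn"],
        containerDefinitions=[container],
    )
    return response["taskDefinition"]["taskDefinitionArn"]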

Migrations

As in any deployment, we need to be aware of whether a migration should be applied before the new code is deployed. As most of our code is written in Django, we use Django migrations.

Using the same approach as on Heroku, we use the “release” command in the Procfile: if it exists (and therefore the release entry point exists), it will run the migration.

The migration itself is not a service; it just runs the container with the release command and waits until it finishes. If the container exits cleanly, the pipeline continues; if not, it fails.

Also, we print the logs of that container to know what it did, using a specific AWS tool called ecs-cli.
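For illustration, running the release task and waiting for it could look like this with boto3 (subnets, security groups and names are placeholders; the log streaming with ecs-cli is omitted):

import boto3

ecs = boto3.client("ecs")

# Run the release entry point as a one-off FARGATE task in the private subnets
run = ecs.run_task(
    cluster="production",
    launchType="FARGATE",
    taskDefinition="production-api-gateway-release",  # hypothetical family
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-aaaa1111"],
            "securityGroups": ["sg-bbbb2222"],
            "assignPublicIp": "DISABLED",
        }
    },
)
task_arn = run["tasks"][0]["taskArn"]

# Block until the migration container stops, then check how it exited
ecs.get_waiter("tasks_stopped").wait(cluster="production", tasks=[task_arn])
stopped = ecs.describe_tasks(cluster="production", tasks=[task_arn])
exit_code = stopped["tasks"][0]["containers"][0].get("exitCode")
if exit_code != 0:
    raise RuntimeError(f"Migration task failed with exit code {exit_code}")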

Finally, the Deployment

If the migration stage went well, we deploy our brand new containers. To do that, we just need to update our Service with the new revision of the Task Definition we constructed with Python.
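That update boils down to a single API call; a minimal sketch with boto3 (the names are placeholders):

import boto3

ecs = boto3.client("ecs")

# Point the Service at the revision registered in the previous step;
# ECS then rolls the tasks following the Service's deployment strategy
ecs.update_service(
    cluster="production",
    service="production-api-gateway-web",            # hypothetical service name
    taskDefinition="production-api-gateway-web:42",  # new family:revision or ARN
)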

Jenkins Notes

We found unexpected behavior when deploying to Heroku and ECS in parallel, regarding the code in Pull Requests:

  • The code deployed to Heroku was only the content of the branch.
  • The code deployed to ECS was the content of the branch merged to master.

This behavior can create confusion depending on how you write and test the code. If you code locally and expect a certain result, it can be different once that code is merged to master, so be aware.

This was introduced by Jenkins, which does the checkout scm, and that is the code that generates the container. Uploading to Heroku was a git push of the specific branch.

Autoscaling

The web workers scale based on a target CPU, so they flex depending on the current load. For the other workers, we like to autoscale based on the current queues. On Heroku we were using HireFire, which can read from Django-rq and autoscale dynos based on thresholds.

So what did we do? We created our own HireFire using Lambda :)

As we use Vault to set vars, we store the HireFire token there.

Using a Lambda function, we look up all the microservices in the cluster to get the HireFire token defined in each task definition. We build the URL and download the JSON exposed by the web process.

Having this JSON, we know the current queue sizes and push those values as Cloudwatch metrics.
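A simplified sketch of that Lambda, assuming the HireFire-style /hirefire/&lt;token&gt;/info endpoint and hypothetical environment-variable names for the token and host:

import boto3
import requests

ecs = boto3.client("ecs")
cloudwatch = boto3.client("cloudwatch")

def handler(event, context):
    cluster = "production"
    for service_arn in ecs.list_services(cluster=cluster)["serviceArns"]:
        service = ecs.describe_services(cluster=cluster, services=[service_arn])["services"][0]
        task_def = ecs.describe_task_definition(taskDefinition=service["taskDefinition"])["taskDefinition"]
        env = {e["name"]: e["value"] for e in task_def["containerDefinitions"][0].get("environment", [])}
        token = env.get("HIREFIRE_TOKEN")  # hypothetical var name set through Vault
        host = env.get("PUBLIC_HOST")      # hypothetical var with the web hostname
        if not token or not host:
            continue

        # Ask the web process for the current queue sizes, HireFire style
        queues = requests.get(f"https://{host}/hirefire/{token}/info", timeout=5).json()
        for queue in queues:
            cloudwatch.put_metric_data(
                Namespace="Custom/DjangoRQ",
                MetricData=[{
                    "MetricName": "QueueSize",
                    "Dimensions": [{"Name": "Queue", "Value": queue["name"]}],
                    "Value": queue["quantity"],
                }],
            )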

Coming back to Terraform, we use that metric to set up the autoscaling.

HireFire complete!

Backdoor

As we run on FARGATE, we cannot debug the actual running containers when needed. So we also created our own backdoor.

For each microservice, we run an extra service called backdoor. These are only accessible via VPN (all the containers run in private subnets behind the Load Balancers). They run a special command so we can attach to those containers, with the same variables as the other containers. This is quite useful for running commands or debugging specific code without affecting the serving containers.

To track the current IP of each backdoor container, we set them up using AWS Service Discovery. It runs on top of AWS Route 53, makes it easy to create a hostname for services that change over time, and ECS Services support it directly.

Container Logs to S3

Also, on Heroku we were using the Papertrail add-on. It allowed us to parse and filter the logs of the api-gateway for analytics and for triggering alarms. As we are not running inside EC2, we cannot access the Docker socket, so the only way to get the stdout of the containers is via Cloudwatch.

As you may guess, we created our own Papertrail:

Each process of each microservice (as mentioned before, for each microservice we created a web process for HTTP API requests and worker processes for background tasks) has a Log Group in Cloudwatch that captures all the stdout output the application prints.

For analytics purposes, we needed to collect only the api-gateway microservice HTTP requests (from the web process) and store them in S3. Once uploaded, the analytics team runs their tools by downloading the logs from S3.

We took advantage of the Subscription Filter feature of Cloudwatch Logs to stream logs to a custom Lambda. This association (between Logs and Lambda) allows us to re-format the logs the way the analytics team likes, filter out logs that do not belong to api-gateway requests (since CloudWatch Logs captures everything from the containers, even Python stack traces, we tell the Lambda which logs we want), and send them to S3 under a time-based prefix.

Because the streaming of logs is nearly real time, the Lambda is invoked several times per second, with between 4 and 8 log events each time. For this reason the Lambda function must be as light as possible, with a duration of a few milliseconds. Basically, this is what the Lambda does (a sketch follows the list):

- Uncompress the log payload coming from Cloudwatch Logs
- Check that the log does not match a blacklist
- Format the log with the information needed
- PUT the stream into an S3 bucket
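A minimal sketch of that handler, with a hypothetical bucket name and blacklist, and without the re-formatting the analytics team asked for:

import base64
import gzip
import json
import time

import boto3

s3 = boto3.client("s3")
BLACKLIST = ("Traceback", "ELB-HealthChecker")  # hypothetical patterns to drop

def handler(event, context):
    # CloudWatch Logs delivers the payload base64-encoded and gzip-compressed
    payload = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))

    lines = [
        e["message"]
        for e in payload["logEvents"]
        if not any(pattern in e["message"] for pattern in BLACKLIST)
    ]
    if not lines:
        return

    # Store the batch under a time-based prefix; another Lambda compacts these later
    key = time.strftime("api-gateway/%Y/%m/%d/%H%M%S") + f"-{context.aws_request_id}.log"
    s3.put_object(Bucket="my-analytics-logs", Key=key, Body="\n".join(lines).encode())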

A different Lambda function, triggered by an every-5-minutes Event Rule, compacts the streams into a single compressed file used by analytics.

Monitoring

Photo by Chris Liverani on Unsplash

We set up many alarms for the containers and LBs to keep us aware of 5xx errors, memory, CPU and so on.

Besides, since we have almost all our infrastructure inside AWS, we get better visibility into each microservice. We set up some Grafana dashboards that include everything ECS related + LBs + Redis + RDS + queues.

Also, from the Jenkins pipeline we mark deployments on Grafana, so we can relate deployments to metrics very easily.

Changes on code

For the web listeners, we made some changes to the code for two things:

Health check: we created specific endpoints for the health check; this integrates nicely with the Application Load Balancer to mark which targets are healthy or not.

Stdout output: we unified the way the “web” container prints the output of each request, so if we need to filter them in Cloudwatch Logs we can do it easily.
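For illustration, the health check endpoint can be as simple as a Django view like this (a minimal sketch, not our exact code):

# health/views.py
from django.http import HttpResponse

def health(request):
    # The ALB target group hits this path; any 200 marks the task as healthy
    return HttpResponse("ok")

# urls.py
from django.urls import path
from health.views import health

urlpatterns = [
    path("health/", health),
]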

Rollback

Photo by Laura Fuhrman on Unsplash

On Heroku, rolling back to previous code was fairly easy.

If you only have one service per microservice, you can easily do it by selecting another Task definition revision in the service, but with many workers per microservice, each with its own revision numbers, it becomes more complex and susceptible to human error.

We set up a custom job in Jenkins that allows selecting the latest deployments for each microservice per environment. Using Python again, we are able to roll back to a specific revision for all workers of the selected microservice.
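A rough sketch of what that rollback does with boto3 (family and service names here are hypothetical; the real job derives them from the microservice and environment selected in Jenkins):

import boto3

ecs = boto3.client("ecs")

def rollback(cluster, service, family, revisions_back=1):
    # List the registered revisions of this family, newest first
    arns = ecs.list_task_definitions(familyPrefix=family, sort="DESC")["taskDefinitionArns"]
    target = arns[revisions_back]  # e.g. the previous revision

    # Point the Service back at the older revision
    ecs.update_service(cluster=cluster, service=service, taskDefinition=target)
    return target

# Example: roll the web process of api-gateway in production back one revision
rollback("production", "production-api-gateway-web", "production-api-gateway-web")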

This rollback job also creates a red mark on Grafana.

AWS downsides

The main bottleneck in this migration was waiting for Amazon to increase the service limits for certain resources per region:

- Specific FARGATE tasks
- Load Balancers
- Network interfaces

Network planning strategy

We also had to redistribute our subnets and private IP addresses. Each running container has an associated IP address in a subnet; our network configuration had 3 private subnets per VPC, which wasn’t enough for running our infrastructure, so we had to add more private subnets and configure the ECS services to use them in order to allocate more private containers.

Improvements

Secrets: as we use Vault to write vars into the Task definitions, those are exposed to anyone who can read them. It might be better to use another service to manage secrets at container startup.

Knowing when a deployment is finished: when we update the service from the pipeline, we cannot be sure the deployment has finished or, in the worst case, whether it succeeded. So we need to check in the AWS ECS Console that the deployment ran and finished.
