Why we migrated to Amazon Web Services
As startups grow from idea to business, their infrastructure needs change over time. In the beginning, founders favor fast, managed and easy-to-use platforms so they can ship code and experiment. However, as the company changes, the infrastructure needs to change as well.
Bond Touch started life as a Heroku deployment. Heroku is a cloud Platform as a Service (PaaS) that was started in 2007 and bought by Salesforce in 2010. It has been a pathway for many companies to start developing their software infrastructure. It’s a particularly good way to deploy software in the early stages of a company, as Heroku fully manages deployed software and provides simple but quite effective monitoring and instrumentation.
As we grew, Heroku’s simplicity became troublesome for us.
- We developed ETL pipelines that required logging. We used Heroku’s add-ons to add logging to our infrastructure, but since the pipelines had a more complex architecture and required us to use AWS, integrating the pipeline’s logging with Heroku’s was difficult. So we had to resort to using a third-party logging provider.
- We had several events that required us to look into our infrastructure to determine bottlenecks. Given the managed nature of Heroku, it was really difficult to instrument dynos and perform root cause analysis.
As such, our engineering team spent the last few months working on a project to migrate our product infrastructure to AWS. We finally achieved the milestone of having 100% of the traffic going to our AWS deployment. This migration will allow us to do several things:
- Reduce costs by committing to long-term AWS resources upfront at a lower price.
- Increase fault tolerance by having a reproducible deployment that we can easily migrate to different regions.
- Become carbon neutral on infrastructure by 2024, as per AWS’ carbon footprint commitment.
What we had
Our backend API was deployed in Heroku. “Heroku is a platform as a service (PaaS) that enables developers to build, run, and operate applications entirely in the cloud.” With it, we were able to quickly deploy new releases into the Staging and Production environments, monitor the global availability status of the server and manually increase and decrease service capacity.
Heroku provides some KPIs, such as response times, throughput and memory usage, that we can monitor so we can take action if the service behaves differently from the expected pattern.
It also offers a set of add-ons that allow us to easily integrate the service with other products. We used those add-ons for LogDNA and Redis.
Regarding deployments of new releases, Heroku provides a Git remote to which we push our code. It then automatically builds the code according to the stack and runs containers with the new version.
In addition, our service is connected to a MongoDB database (provided by MongoDB Atlas) and to Firebase, which we use for authentication, the realtime database and cloud storage.
Now the problem
More than once, we received alerts about degraded performance and troubleshooting took longer than expected, because our lack of visibility into the infrastructure made it harder to find the cause of certain problems.
Our dependencies on the add-ons led to some limitations in the management of some components. We felt that there was a lack of control over everything that was managed by Heroku, both in operational and cost terms. Troubleshooting became more complex because the KPIs ended up being scattered.
At the same time, we were already using other solutions for other backend processes, such as GitLab Runners and AWS, which gave us more control and stability.
One problem we identified with this approach was that as soon as Heroku detected a new version in the repository, it would build and start the service as it was, even if the new version had a major problem that was detected immediately. In these cases we had to roll back to the previous version, which caused application downtime.
Also, if an overload was detected, Heroku didn’t have the capability to automatically increase the service capacity.
Migrating to a new approach
We started to work on a new approach where we would have much more control over all components. For that, we defined some requirements.
First, we wanted to have all backend solutions in one place. As other processes were already in AWS, AWS was the cloud technology we chose for the backend API.
Second, some KPIs would be defined and monitored so that we could have a global view of the service’s health and ongoing costs.
Third, we would create CI/CD pipelines responsible for running unit tests on new releases, building and pushing new container images into a registry, and making deployments easy.
Fourth, we wanted a reliable infrastructure that could be easily managed, that ran periodic health checks on containers and that scaled automatically on overload. With this, we also wanted to guarantee that every new version only became available after a successful health check, and that the older version would still be available.
How we did it
With all of this in mind, this was the process we followed to migrate from Heroku to GitLab and AWS.
We already had a cluster running in a VPC (Virtual Private Cloud). The VPC is the virtual network that the components configured in it use to communicate with each other. Those components can be other cluster services, EC2 instances or Load Balancers.
We created a secret in “Secrets Manager” that stores all the API environment variables, like external API keys and database connection strings. This replaced the previous “Config Vars” from Heroku.
A CloudFormation template was then built so we could easily deploy all the pieces needed for the service to work. Some components had been created beforehand to be used as inputs to this template:
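To illustrate how a single secret can stand in for Heroku’s Config Vars, here is a minimal Python sketch. It assumes the secret is stored as one JSON object of key/value pairs. The key names, the example values and the `load_config` helper are hypothetical and are not taken from our codebase. In a real deployment the JSON string would come from Secrets Manager rather than being hard-coded.

```python
import json


def load_config(secret_string, overrides=None):
    """Turn a JSON secret payload into a flat dict of settings.

    secret_string: JSON object as text (assumed shape: {"NAME": "value", ...}).
    overrides: optional dict whose values win over the secret.
    """
    values = json.loads(secret_string)
    if not isinstance(values, dict):
        raise ValueError("secret payload must be a JSON object")
    # Environment variables are always strings, so normalise every value.
    config = {key: str(value) for key, value in values.items()}
    config.update(overrides or {})
    return config


# Hypothetical payload: real keys and values live only in Secrets Manager.
example_secret = '{"MONGODB_URI": "mongodb://example.invalid/db", "REDIS_PORT": 6379}'
config = load_config(example_secret, overrides={"LOG_LEVEL": "debug"})
print(sorted(config))
```

Keeping every variable in one secret means the service reads its configuration the same way it did on Heroku, only from a different source.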
- An Elastic Container Registry (ECR) with at least one container image (new images are deployed as a result of the CI/CD pipeline that is detailed in the next section)
- Elastic Container Service (ECS) cluster
- VPC and 2 subnets
- HTTPS certificate to be used by the secured listener
- Secret with Environment variables from Secrets Manager
With these components as input, the CloudFormation template then creates the following:
- Managed policy, execution and task roles — ensuring that the resulting tasks will be able to read the secret from the Secrets Manager
- A task definition and corresponding container using the ECR deployed image
- A target group where every running task definition container will register
- A Load Balancer that:
  - Redirects every request on port 80 to HTTPS (port 443)
  - Forwards HTTPS requests to a target group
  - Uses the previously created certificate and is exposed to the internet
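The two Load Balancer listeners described above can be sketched as CloudFormation resources. The snippet below builds them as Python dictionaries and prints them as JSON. It is only an illustration: the logical names (`LoadBalancer`, `TargetGroup`, `HttpListener`, `HttpsListener`), the `build_listeners` helper and the certificate ARN placeholder are assumptions, not the contents of our actual template.

```python
import json


def build_listeners(load_balancer_ref, target_group_ref, certificate_arn):
    """Return two listener resources: HTTP redirect and HTTPS forward."""
    http_listener = {
        "Type": "AWS::ElasticLoadBalancingV2::Listener",
        "Properties": {
            "LoadBalancerArn": {"Ref": load_balancer_ref},
            "Port": 80,
            "Protocol": "HTTP",
            # Permanently redirect plain HTTP requests to HTTPS on port 443.
            "DefaultActions": [{
                "Type": "redirect",
                "RedirectConfig": {
                    "Protocol": "HTTPS",
                    "Port": "443",
                    "StatusCode": "HTTP_301",
                },
            }],
        },
    }
    https_listener = {
        "Type": "AWS::ElasticLoadBalancingV2::Listener",
        "Properties": {
            "LoadBalancerArn": {"Ref": load_balancer_ref},
            "Port": 443,
            "Protocol": "HTTPS",
            "Certificates": [{"CertificateArn": certificate_arn}],
            # Forward decrypted traffic to the containers in the target group.
            "DefaultActions": [{
                "Type": "forward",
                "TargetGroupArn": {"Ref": target_group_ref},
            }],
        },
    }
    return {"HttpListener": http_listener, "HttpsListener": https_listener}


# Placeholder ARN: the real certificate is created beforehand and passed in.
resources = build_listeners(
    "LoadBalancer",
    "TargetGroup",
    "arn:aws:acm:REGION:ACCOUNT_ID:certificate/EXAMPLE",
)
print(json.dumps({"Resources": resources}, indent=2))
```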
The Load Balancer has a health check configured that ensures the target group keeps the desired number of healthy running containers. Unhealthy containers are stopped and new ones are started when needed.
AWS provides useful information on the Load Balancer pages, where we can check metrics about requests and their responses, such as the number of requests, the average response time and the count of each response code. Those metrics are also available per target, so it is possible to detect problems in a specific container.
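Health checks of this kind are usually threshold based: a target is only marked healthy after several consecutive successful checks, and only marked unhealthy after several consecutive failures. The following Python sketch replays a sequence of check results to show that behaviour. The threshold values and the `health_states` function are illustrative assumptions, not our production settings.

```python
def health_states(check_results, healthy_threshold=3, unhealthy_threshold=2,
                  start_healthy=False):
    """Replay check results and return the target state after each check."""
    healthy = start_healthy
    successes = 0
    failures = 0
    states = []
    for passed in check_results:
        if passed:
            successes += 1
            failures = 0
            # Promote only after enough consecutive successes.
            if not healthy and successes >= healthy_threshold:
                healthy = True
        else:
            failures += 1
            successes = 0
            # Demote only after enough consecutive failures.
            if healthy and failures >= unhealthy_threshold:
                healthy = False
        states.append(healthy)
    return states


results = [True, True, True, False, True, False, False]
print(health_states(results))
```

A single failed check does not take a container out of rotation here. Only the two consecutive failures at the end of the sequence do, which avoids replacing containers because of a one-off slow response.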
As part of this migration, we had to come up with solutions to replace the integration with Heroku add-ons.
The LogDNA Heroku add-on works on the stdout of each container, so we had to change our source code to implement a LogDNA Appender and send logs directly from the service. We created a new account and stored the API key in the previously created secret.
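The idea of an appender that ships logs directly from the service can be sketched with Python’s standard `logging` module. This is not our appender and it does not call any LogDNA API: `BatchingHandler` and the `send_batch` callable are hypothetical, and in a real service `send_batch` would post each batch to the logging provider using the API key read from the secret.

```python
import logging


class BatchingHandler(logging.Handler):
    """Buffer log lines and hand them to a transport in batches."""

    def __init__(self, send_batch, batch_size=10):
        super().__init__()
        self.send_batch = send_batch  # hypothetical transport callable
        self.batch_size = batch_size
        self.pending = []

    def emit(self, record):
        try:
            self.pending.append(
                {"level": record.levelname, "line": self.format(record)}
            )
            if len(self.pending) >= self.batch_size:
                self.flush()
        except Exception:
            self.handleError(record)

    def flush(self):
        # The handler lock is re-entrant, so flush may be called from emit.
        self.acquire()
        try:
            if self.pending:
                batch, self.pending = self.pending, []
                self.send_batch(batch)
        finally:
            self.release()


sent = []  # stands in for the remote logging service
logger = logging.getLogger("bond-server-example")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = BatchingHandler(sent.append, batch_size=2)
logger.addHandler(handler)

logger.info("service started")
logger.info("listening on port %d", 8080)
logger.info("first request served")
handler.flush()  # push whatever is still buffered
print(len(sent), "batches sent")
```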
To replace the Redis add-on, we created an Amazon MemoryDB for Redis cluster in the same VPC and changed all connection strings to the newly created cluster.
The continuous delivery pipeline was something we spent a lot of effort developing. With it, we aimed to have our versions automatically tested, built and pushed into a registry, and to make deployments easy.
As we had all of our repositories in GitLab, we started to implement a CI/CD pipeline there, using GitLab Runners. We also created a Docker image with a GitLab Runner configured in it, which is responsible for building new versions of the Bond-server container images. The pipeline has the following stages:
- Check — Compile and static code analysis (lint)
- Test — Run unit tests over the compiled image from the previous step
- Build — Build a docker container image and push it to AWS Elastic Container Registry (ECR)
- Deploy — Manual action to update the service task container with the latest image in ECR
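The control flow of these four stages can be sketched in a few lines of Python: stages run in order, the pipeline stops at the first failure, and the deploy stage waits for a manual trigger. This is only a model of the behaviour, not our GitLab configuration. The `run_pipeline` function and the lambda placeholders are hypothetical.

```python
def run_pipeline(stages, approve_deploy=False):
    """Run stages in order, stop on failure, and gate manual stages.

    stages: list of (name, action, manual) tuples, where action() returns
    True on success and manual marks a stage that needs a human trigger.
    """
    results = []
    for name, action, manual in stages:
        if manual and not approve_deploy:
            results.append((name, "waiting for manual trigger"))
            break
        succeeded = action()
        results.append((name, "passed" if succeeded else "failed"))
        if not succeeded:
            break  # later stages never see a broken build
    return results


# Placeholder actions: real stages would lint, test, build and deploy.
stages = [
    ("check", lambda: True, False),
    ("test", lambda: True, False),
    ("build", lambda: True, False),
    ("deploy", lambda: True, True),
]
for name, status in run_pipeline(stages):
    print(name, "->", status)
```

Keeping the deploy step manual means an image can be built and pushed on every change while a person still decides when it reaches the running service.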
We have been using Cloudflare for some time now, and in the final phase of this process, we configured the load balancing tool to gradually deliver the requests to the new infrastructure and abandon Heroku.
Gradually increasing the percentage of requests delivered to the AWS infrastructure let us evaluate the service’s capacity needs at every step with a high level of confidence.
As traffic to AWS increased, we defined some autoscaling policies for the number of containers running simultaneously as targets of our Load Balancer. Those policies were based on:
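A gradual cutover like this can be pictured as weighted routing between two origins. The Python sketch below simulates it with a seeded random generator. It does not reproduce Cloudflare’s actual load balancing algorithm, and the origin names, shares and the `simulate` function are assumptions made for illustration.

```python
import random


def pick_origin(weights, rng):
    """Choose one origin at random, in proportion to its weight."""
    origins = list(weights)
    return rng.choices(origins, weights=[weights[o] for o in origins], k=1)[0]


def simulate(aws_share, requests=10_000, seed=42):
    """Count how many requests each origin receives for a given AWS share."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    weights = {"heroku": 1.0 - aws_share, "aws": aws_share}
    counts = {"heroku": 0, "aws": 0}
    for _ in range(requests):
        counts[pick_origin(weights, rng)] += 1
    return counts


# Hypothetical rollout steps: 10%, 50%, then all traffic on AWS.
for share in (0.1, 0.5, 1.0):
    print(share, simulate(share))
```

Each increase in the AWS share exposes the new infrastructure to a larger but still bounded load, so capacity problems show up while most traffic can still be served by the old deployment.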
- Average requests per minute per target
- Average CPU utilization per target
- Average Memory utilization per target
If AWS detects that any of these values has crossed some limit, it will trigger an alarm and automatically start a new container.
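The scaling rule described above, where any metric crossing its limit starts one more container, can be sketched as follows. The limit values, the maximum of 10 containers and the names `Metrics`, `LIMITS` and `desired_containers` are hypothetical and are not our real policy settings.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Averages per target over the evaluation period."""
    requests_per_minute: float
    cpu_percent: float
    memory_percent: float


# Hypothetical limits, chosen only for this example.
LIMITS = Metrics(requests_per_minute=1000, cpu_percent=70, memory_percent=80)


def desired_containers(current, metrics, limits=LIMITS, maximum=10):
    """Add one container when any metric exceeds its limit, up to a cap."""
    breached = (
        metrics.requests_per_minute > limits.requests_per_minute
        or metrics.cpu_percent > limits.cpu_percent
        or metrics.memory_percent > limits.memory_percent
    )
    return min(current + 1, maximum) if breached else current


calm = Metrics(requests_per_minute=400, cpu_percent=35, memory_percent=50)
busy = Metrics(requests_per_minute=400, cpu_percent=85, memory_percent=50)
print(desired_containers(2, calm), desired_containers(2, busy))
```

Only CPU is over its limit in the `busy` sample, yet that alone is enough to request another container, which matches the "any of these values" wording.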
Results and future work
If you haven’t noticed any of this happening, that is a milestone in itself: we planned for the migration to happen without any downtime for our customers.
Cost will always be an ongoing topic, but for now we can clearly see that we are saving money. We decreased our Heroku capacity (Heroku is still up so that we have a backup system) and increased our AWS spend, and together they cost less than what we were spending on Heroku before.
Also, we now have better visibility into the state of each component and greater control over the release process. Every new release only becomes available after it has passed a health check.
Because the migration took place gradually, we were able to see that autoscaling works when a higher load is detected.
As a last step, we want to replicate the current environment into a new region in order to have a backup system and abandon Heroku completely.