Gradually switching traffic from Heroku to EKS

Adam Medziński · Published in Fresha Engineering · Jun 25, 2019

This is the first post in a series about the challenges we encountered during the migration from Heroku to AWS. In this part, we focus on how we gradually switched all of the traffic to our EKS cluster without causing any downtime.

Why Heroku was no longer suitable for us

On the Internet you can find a lot of blog posts about why yet another startup moved from Heroku to X. This is not surprising: every application eventually reaches the point where its current platform is too restrictive. In our case, the following things were too problematic and needed a change:

  • Heroku provides very limited monitoring options (compared to Kubernetes)
  • Nobody really understands Heroku’s CPU shares
  • Limited flexibility of Heroku’s offering (few options in terms of sizing)
  • AWS provides a wider range of services (e.g. CDN); on Heroku we had to buy them from other providers
  • Limited configuration options of Heroku PostgreSQL (no admin credentials)
  • Heroku PostgreSQL databases are publicly available (if you don’t use Heroku Private Spaces)

General architecture and technologies used

The best option for us was to introduce some kind of proxy in front of both the Heroku and AWS infrastructure. Below is a diagram showing the architecture of our solution. Since Heroku also runs on AWS, we decided to spawn an EKS cluster and the proxy in our new AWS infrastructure in the same region in which Heroku launched our dynos. Thanks to this we were able to add a load balancer that could switch traffic between both platforms without any noticeable latency overhead.

Architecture overview

We decided to use HAProxy because AWS native load balancers do not support weighted load balancing and we needed to switch the traffic gradually. To minimize the latency overhead of double TLS termination (first on HAProxy and then on each platform), HAProxy was set to TCP mode and TLS was terminated on Heroku and on the ELB in front of the EKS cluster. As you might expect, we have different applications exposed under separate FQDNs, and we wanted to switch them one by one. How does HAProxy distinguish one from another if the traffic is encrypted, you may ask? The load balancer recognized individual applications via SNI. The migration of an individual application started by changing its DNS entries to point to HAProxy. After a successful migration we changed the entries to point directly to the ELB in front of EKS.
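
To make this concrete, below is a minimal haproxy.cfg sketch of the idea: a TCP-mode frontend waits for the TLS ClientHello, picks a backend by SNI, and the backend splits connections between Heroku and the ELB using server weights. The hostnames, names and weights are placeholders, not our actual configuration.

    frontend fe_migration
        mode tcp
        bind *:443
        # Wait for the TLS ClientHello so the SNI value can be inspected
        tcp-request inspect-delay 5s
        tcp-request content accept if { req.ssl_hello_type 1 }
        # One routing rule per application, matched on SNI
        use_backend be_app_one if { req.ssl_sni -i app-one.example.com }

    backend be_app_one
        mode tcp
        balance roundrobin
        # Weighted split: roughly 90% of connections stay on Heroku, 10% go to EKS
        server heroku app-one.herokuapp.com:443 weight 90
        server eks    app-one-eks-elb.example.com:443 weight 10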

Of course you don’t want to run a single instance of a critical network element like a proxy. Our high-availability setup consisted of multiple HAProxy instances in different AWS Availability Zones, with some kind of load balancer sitting in front of them; we chose AWS Network Load Balancer. We ended up with many layers of load balancers, but each of them had a simple setup (each HAProxy server worked independently; we decided against clustering them because floating an Elastic IP address between multiple HAProxies can be even more complicated).

During the migration of a single application, we had to deploy it simultaneously on Heroku and AWS. The architecture of our system came to our aid here: it required the individual components to be released sequentially in a defined order. This enabled us to create two independent Jenkins pipelines. The first one was responsible for the release on Heroku and, if it succeeded, it triggered the release on AWS. Applications were migrated in reverse order from how they were released, which significantly simplified the whole process: as one application after another was migrated, the number of apps deployed on Heroku decreased while the number on AWS increased.
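
As an illustration, the chaining of the two pipelines could look roughly like the declarative Jenkinsfile below; the job and script names are hypothetical, not our actual setup.

    // Hypothetical pipeline for the Heroku side of one application
    pipeline {
        agent any
        stages {
            stage('Release on Heroku') {
                steps {
                    sh './scripts/deploy_heroku.sh'  // placeholder deploy script
                }
            }
            stage('Trigger release on AWS') {
                steps {
                    // Only reached if the Heroku release stage succeeded
                    build job: 'release-on-aws', wait: false
                }
            }
        }
    }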

Other considered solutions

The above solution is not the first one we considered. At the beginning of the project we tested Cloudflare Load Balancing, but we were not satisfied with its latency overhead. The other possible option was DNS load balancing, but that idea unfortunately fell through due to DNS propagation lag, which would lengthen the rollback time, and this was not acceptable for us.

Advantages and disadvantages of the solution

Advantage #1: “Canary” migration

It is an extremely safe approach. We gradually introduce the new environment to handle production traffic. Possible mistakes can affect only a small percentage of customers, and rollback is a matter of seconds (the HAProxy configuration reload time).
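
In practice a rollback boils down to restoring the previous weights in haproxy.cfg and reloading; a rough sketch of what that looks like on a single host (the paths and systemd unit name are assumptions):

    # Validate the restored configuration before applying it
    haproxy -c -f /etc/haproxy/haproxy.cfg

    # Graceful reload: new workers pick up the config, old ones drain existing connections
    systemctl reload haproxy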

Advantage #2: Easy capacity planning

Because we have control over the amount of traffic routed to AWS, we are able to tune the application container resource requests and limits on the fly (if the current values are not enough).
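
For reference, this is the kind of Kubernetes container spec fragment that gets tuned as the traffic share grows; the values below are illustrative placeholders, not the ones we actually used.

    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "1"
        memory: "1Gi"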

Advantage #3: Small latency costs

Before starting the actual migration we tested how much it would really cost us to have an extra hop between users and our applications. Based on the results we didn’t see any significant difference in latency up to the 99th percentile (20 ms at most in one test run). Below are two latency plots for an application on Heroku: the first without the additional load balancer and the second with it.

Direct Heroku connection latencies
Extra hop (HAProxy) connection latencies

Disadvantage #1: The solution adds maintenance (it’s not “as a Service”)

The solution we created is not the simplest one. The previous diagram is obviously simplified for readability purposes. A change in the percentage split of traffic between Heroku and AWS was deployed by an appropriately prepared Ansible playbook, which was not the easiest thing to run properly. Because the whole process was intended to exist only for the duration of the migration, we decided to create a dedicated Jenkins pipeline to change the HAProxy weights. You also have to remember that the whole setup was prepared and maintained by us, so it required a lot of effort to properly configure monitoring, provision the HAProxy hosts and automate the whole infrastructure.
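
Conceptually, the playbook had to re-render haproxy.cfg with the new weights and reload the service on each proxy host. A minimal sketch of that idea is below; the host group, variables, paths and handler are assumptions, not our actual playbook.

    - hosts: haproxy
      become: true
      vars:
        heroku_weight: 80
        aws_weight: 20
      tasks:
        - name: Render haproxy.cfg with the new Heroku/AWS weights
          ansible.builtin.template:
            src: haproxy.cfg.j2
            dest: /etc/haproxy/haproxy.cfg
          notify: Reload haproxy
      handlers:
        - name: Reload haproxy
          ansible.builtin.service:
            name: haproxy
            state: reloaded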

Disadvantage #2: Routing the traffic through a private HAProxy results in the loss of the source IP

Maybe this is not immediately visible, but HAProxy in our solution was not publicly available (it didn’t have a public IP), while at the same time it was responsible for load balancing traffic between two public domains. Traffic leaving HAProxy was routed to the Internet, so it passed through an AWS NAT Gateway, which causes the loss of the client source IP (as you might expect, spoofing the source IP is not an option for serious cloud providers and ISPs 😀). Because HAProxy was set to TCP mode and all traffic was encrypted, it was not possible to add an X-Forwarded-For HTTP header. We also couldn’t enable the PROXY protocol, because in front of EKS we still had a layer 7 ELB, which does not support it. For applications for which knowledge of the source IP is necessary, you would have to switch HAProxy to layer 7, decrypt the whole traffic stream to add the necessary headers and encrypt it again, which would have an impact on overall latency.
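
For completeness, the layer 7 alternative we decided against would look roughly like the fragment below: TLS terminated on HAProxy, the client IP injected via X-Forwarded-For, and the connection re-encrypted towards the ELB. Certificate paths and hostnames are placeholders.

    frontend fe_https
        mode http
        bind *:443 ssl crt /etc/haproxy/certs/app-one.pem
        option forwardfor            # injects X-Forwarded-For with the client source IP
        default_backend be_app_one_http

    backend be_app_one_http
        mode http
        # Re-encrypting towards the ELB is the extra TLS work mentioned above
        server eks app-one-eks-elb.example.com:443 ssl verify required ca-file /etc/ssl/certs/ca-certificates.crt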

Conclusion

In the end, the migration took place without any problems. There was no downtime, and no user experienced latency increases caused by the additional load balancer layer. Switching the first services took longer (as expected) because we were still learning how the whole system would behave in the new environment, and those migrations were prepared and executed one at a time. Later on, much of the work was parallelized and applications were switched day after day.

The next step in the migration is moving the databases from Heroku to AWS, but this is a topic for another story.
