Migrating from ECS to Kubernetes

Deepak Narayana Rao
CasaOne Engineering
Aug 17, 2020



We migrated from ECS to Kubernetes (AWS EKS) a few months ago. In this post, we’ll talk about the why, the how, the surprises, and our learnings so far. We hope this helps folks exploring Kubernetes or EKS, especially those planning to migrate from ECS.

This post was not generated by GPT-3, so please excuse the typos :)

Why did we migrate?

We had been running our services on AWS ECS since CasaOne started. Our setup had the following problems, stemming from a combination of our own tech debt and the shortcomings of AWS ECS.

  • We were initially using one cluster per service (e.g. orders, inventory, website). This was very inefficient in terms of resource utilization.
  • We were using load balancers even for internal services. There was an opportunity to reduce cost by using internal service discovery and thereby cutting down the number of load balancers.
  • Deploying stateful services wasn’t straightforward in ECS. To keep things simple, we were running observability services like Prometheus & Elasticsearch (for APM) using docker-compose on EC2 instances. We needed a platform with out-of-the-box support for stateful services.
  • Our infrastructure was a mix of manual setup & shell scripts from the early days and Terraform for newer services. This was unorganized and inconsistent.
  • We had created a lot of internal Terraform modules and tooling to abstract service deployment. These added cognitive load when onboarding new developers, due to the lack of documentation and support for the modules. We needed a platform that already has good tooling, a knowledge base, and a community.

We needed a bit of an overhaul to fix these issues and scale for the future. We decided to go with Kubernetes as our deployment platform, which has recently become the ubiquitous choice for container orchestration.

We were also aware of the warnings about Kubernetes becoming a white elephant for a small team like ours. We kept this in mind while choosing the tools and technology for our journey.

Planning

We started planning with the following series of decisions on the platform and tools to achieve our goals:

  • AWS managed Kubernetes (EKS) vs a self-managed cluster in AWS using tools like kops: this was an easy choice considering our team size.
  • Terraform vs eksctl for cluster setup: we chose Terraform, in favor of infrastructure as code and based on our earlier experience with it.
  • terraform-aws-eks vs writing our own module for the EKS setup: terraform-aws-eks works well with the terraform-aws-vpc module for the VPC setup.
  • Helm vs using kubectl directly: we went with Helm (a rough sketch of the resulting workflow follows this list).
  • aws-alb-ingress-controller vs nginx-ingress-controller for ingress routing: we started with aws-alb-ingress-controller, which didn’t work well for us; more details towards the end of the post.
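To make the toolchain concrete, here is a rough sketch of what the resulting workflow looks like. The directory layout, cluster name, region variable, and chart paths below are illustrative placeholders, not our exact setup.

```
# 1. Provision the VPC and EKS cluster with Terraform
#    (terraform-aws-vpc and terraform-aws-eks modules, assumed to live in ./infra)
cd infra
terraform init
terraform apply -var="cluster_name=casaone-test"

# 2. Point kubectl and helm at the new cluster
aws eks update-kubeconfig --name casaone-test --region "$AWS_REGION"

# 3. Deploy a service with Helm instead of raw kubectl manifests
helm upgrade --install orders ./charts/orders \
  --namespace orders --create-namespace \
  -f ./charts/orders/values-test.yaml
```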

We were all set to try this in our test environment, and there was a big surprise waiting for us: EKS was not available in our AWS region, N. California (us-west-1), at the time.

This was a huge bummer! After careful deliberation, we decided to move out of the N. California region in AWS. At a high level, we considered access to the latest AWS services, reduced cost, and more Availability Zones for better availability. This decision made our migration more challenging!

We will skip the nuances of the AWS region migration in this post.

We laid down the following ground rules before starting this work:

  • Migration should have close to zero downtime.
  • Migration of the production environment should be done during a low traffic period to reduce any adverse impact on the user experience.
  • Migration should be well tested in the test environment before attempting it in production.

Execution

We started with Helm charts for low-traffic platform services in our test environment. After we gained confidence in deploying these services, we paired with the teams owning the other services to write their Helm charts and rolled them out in the test environment. We observed and fixed issues over a few days.

We created a playbook for the switchover & rollback and did several runs of it in the test environment. We automated the time-consuming manual steps to keep the process fast. After a few weeks of stable operation in the test environment, we picked a day and time for the production migration and informed the relevant stakeholders.

The switchover playbook included the following steps at a high level (a sketch of the DNS and cronjob steps follows below):

  • Reduce the TTL for DNS entries pointing to services in the ECS cluster
  • Deploy services in the EKS cluster with cronjobs disabled
  • Disable cronjobs in the ECS cluster services
  • Switch DNS entries to the EKS cluster services
  • Enable cronjobs in the new (EKS) services

The rollback playbook followed similar steps to switch back from the services in the EKS cluster to the services in the ECS cluster.
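To give a flavor of the automated steps, here is a minimal sketch of the DNS and cronjob parts of the playbook. The hosted zone ID, record names, and cronjob names are placeholders, not our actual values.

```
# 1. Lower the TTL on the DNS record that points at the ECS service
aws route53 change-resource-record-sets --hosted-zone-id Z0EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "orders.internal.example.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "ecs-orders-lb.example.com"}]
      }
    }]
  }'

# 2. Deploy to EKS with cronjobs suspended; keep them suspended until DNS is switched
kubectl patch cronjob orders-report-sync -n orders -p '{"spec":{"suspend":true}}'

# ... switch the DNS record to the EKS load balancer and verify traffic ...

# 3. Re-enable cronjobs once the EKS services are serving traffic
kubectl patch cronjob orders-report-sync -n orders -p '{"spec":{"suspend":false}}'
```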

A simplified diagram is shown below

Diagram: switchover and rollback representation

We completed the migration with ~3 minutes of disruption, mainly caused by DNS entries cached on existing user machines.

A high-level view of a deployment environment

Issues after the migration

Although we pulled off the AWS region + EKS migration with minor disruption, a few issues surfaced after running in production for a while:

  • Intermittent 503 errors after a service deployment: this was due to an issue in aws-alb-ingress-controller, which didn't update the ALB target group with the IP addresses of the new pods, so traffic was still being routed to terminated pod IPs. We switched to nginx-ingress-controller to overcome this issue.
  • CPU throttling due to misconfigured resource allocation: it took us a while to tune the requests and limits of containers with respect to the total CPU available in the EKS cluster. Unfortunately, there is no general guideline for this; you’ll have to observe the traffic patterns of your services to get it right. It is a continuous process: observe, alert & fix (a small sketch follows this list).
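For reference, a hedged sketch of the two fixes above; the release names, namespaces, and resource values are hypothetical, and the real numbers came out of observing our own traffic.

```
# Switch ingress controllers: install nginx-ingress via its community Helm chart
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace

# Tune requests/limits iteratively: observe current usage, then adjust
kubectl top pods -n orders                       # requires metrics-server
kubectl set resources deployment orders -n orders \
  --requests=cpu=250m,memory=512Mi --limits=cpu=1,memory=1Gi
```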

Benefits

  • By using service discovery in k8s, the number of load balancers came down from 14 (at that point in time) to 5. We can still add newer services without needing additional load balancers.
  • The total cost of our EC2 instances reduced by ~20%, while letting us use larger EC2 instances with better CPU and networking capabilities.
  • We now have a uniform way of deploying all the services. This has increased our infrastructure automation coverage to ~90%.
  • Community-driven Helm charts made it very easy to install and configure commonly used OSS services like Prometheus, the EFK stack, etc.
  • We are using k8s CronJobs to replace our earlier approach to scheduled tasks. This helped reduce the uneven load on the web-traffic containers (a minimal example follows this list).
  • Tools like the k8s dashboard and Lens provide very good visualization of the cluster’s resources and their utilization.
  • Using namespaces allowed us to logically separate the services of different product stacks within a single cluster.
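A few minimal examples of the patterns above; the namespace, image, schedule, and chart choices are hypothetical.

```
# One namespace per product stack within a single cluster
kubectl create namespace rental-stack

# A scheduled task as a k8s CronJob instead of running it on web-traffic containers
kubectl create cronjob inventory-sync -n rental-stack \
  --image=registry.example.com/inventory-sync:1.2.3 \
  --schedule="*/15 * * * *"

# Community Helm charts for common OSS services, e.g. Prometheus
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace
```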

Gotchas & Tips

  • Helm retains only a few releases in its history by default. This limits your ability to roll back when there have been multiple failed releases. We set --history-max to 100 in our helm deploy script to overcome this (see the example after this list).
  • helm upgrade --install doesn’t work if the very first release has failed. You will have to delete the failed release to overcome this issue.
  • kube-ps1 is very useful for adding the k8s context to your shell prompt.
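The Helm gotchas above translate to something like this in a deploy script; the release and chart names are placeholders.

```
# Keep more release history so rollbacks remain possible after several failed deploys
helm upgrade --install orders ./charts/orders -n orders --history-max 100

# Roll back to a known-good revision if needed (42 is a placeholder revision number)
helm history orders -n orders
helm rollback orders 42 -n orders

# If the very first release failed, `upgrade --install` can get stuck;
# delete the failed release and re-run the upgrade
helm uninstall orders -n orders
```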

Conclusion

Compared to a few years ago, Kubernetes has come a long way in terms of documentation (outside GCP), managed service providers, and the tooling that makes it accessible to small teams. Overall, it was a pleasant surprise and a good experience for us. This exercise also helped us clear a good amount of tech debt in our infrastructure.

We have only scratched the surface of the k8s landscape; there is a lot more to explore and learn!

Thanks to

  • Gaurev Katoch for playing a major role in planning and execution.
  • Madhusudan Kagwad for the review and feedback on this post.
  • All our colleagues who are part of this journey.

PS: We are hiring. If you are interested in solving challenging engineering problems like these, join us by applying here.
