The challenge of scale: nonstop growth with Kubernetes and Helm

Every day at iFood, we make over a million customers happy in Brazil and Colombia. And we keep growing on a daily basis!

Published in iFood Engineering · 5 min read · Jan 28, 2021

Like every unicorn startup, we started small. But how do you keep growing at such a steady pace? And, most importantly, how do you keep growing without:

Impacting customers, restaurants and drivers?
Preventing our Engineering team from delivering new features every day?
Breaking the budget?

Among several possible answers, we chose to migrate our Amazon EC2 infrastructure to a Kubernetes (K8S) model. In this article, we’ll briefly describe our migration strategy and how we succeeded in it, while migrating [5]:

550 applications
~3500 Amazon EC2 instances
1000+ deploys/month

Why Kubernetes (K8S)?

You may already be familiar with the concept of Virtual Machines (VMs), which is what we used to run on Amazon EC2.

EC2 is a perfectly fine solution, but for our needs we weren’t getting the scale we needed when we needed it, and the costs were becoming prohibitive.

Why is that? Let’s take a look:

A Kubernetes cluster is still made of several EC2 machines, with the same specifications as we had before, but each VM now runs several applications, leading to better resource utilization, faster application startup and lower costs.

What if we needed to scale 4X one of our heavyweight applications?

Let’s take a look at it:

On average we can save 40% of the costs while keeping the same service level, even using more application instances (that means a better granularity while scaling).
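The granularity point can be illustrated with a little arithmetic. The sketch below uses made-up numbers only: the load of 1100 requests per second and the per-unit capacities are hypothetical, not iFood measurements, and `provision` is a helper named just for this example.

```python
import math


def provision(load: int, unit_capacity: int) -> tuple[int, int]:
    """Return (units needed, spare capacity) when scaling in whole units."""
    units = math.ceil(load / unit_capacity)
    spare = units * unit_capacity - load
    return units, spare


# Hypothetical numbers, for illustration only.
load = 1100          # requests per second to serve
vm_capacity = 1000   # one application instance per large VM
pod_capacity = 250   # a smaller pod sharing a node with other applications

vm_units, vm_spare = provision(load, vm_capacity)
pod_units, pod_spare = provision(load, pod_capacity)

print(f"VMs:  {vm_units} units, {vm_spare} req/s of idle capacity")
print(f"Pods: {pod_units} units, {pod_spare} req/s of idle capacity")
```

With these assumed figures, the smaller pod-sized units leave far less capacity idle, which is the sense in which more, smaller instances give better granularity while scaling.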

And why Helm?

Helm is an abstraction on top of Kubernetes, making the conversion of EC2-hosted applications to K8S pods a breeze.

If you are familiar with Kubernetes, you surely know how painful it can be to deploy an application from the ground up. Several YAML files, code repetition here and there for every application and so on.

Thanks to the stars of our SRE team, we have a definition of how an iFood application is expected to be deployed, and this is enforced by means of the Helm Charts.

Helm Charts are templates of application deployments (YAML files) that allow us to deploy any application to the K8S cluster in a standardized way, using a single file as a parameter!

This single file can look like this:

Some of the things our iFood chart lets us do are:

Define scaling policies (based on CPU and/or external metrics like Amazon SQS)
Automatically get secrets from encrypted sources
Set up environment variables
Specify cron jobs
Expose monitoring metrics
Assemble monitoring dashboards

All of this without repeating boilerplate code for every application: we fill in only the information that is strictly needed.
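The idea of one shared template plus one small per-application file can be sketched in a few lines. Helm itself uses Go templates and YAML values files; the Python below is only an analogy built on the standard library's `string.Template`. The names `CHART_TEMPLATE`, `DEFAULTS` and `render_manifest`, the `orders` application and its image URL are all hypothetical, and none of this is iFood's actual chart.

```python
from string import Template

# Hypothetical shared template: written once, reused by every application.
CHART_TEMPLATE = Template("""\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: $name
spec:
  replicas: $replicas
  selector:
    matchLabels:
      app: $name
  template:
    metadata:
      labels:
        app: $name
    spec:
      containers:
        - name: $name
          image: $image
""")

# Chart-wide defaults; an application overrides only what it needs.
DEFAULTS = {"replicas": 2}


def render_manifest(values: dict) -> str:
    """Merge per-application values over the defaults and fill the template."""
    merged = {**DEFAULTS, **values}
    # substitute() raises KeyError if a required value is missing.
    return CHART_TEMPLATE.substitute(merged)


# The only application-specific input (hypothetical values).
orders_values = {"name": "orders", "image": "registry.example.com/orders:1.0.0"}
orders_manifest = render_manifest(orders_values)
print(orders_manifest)
```

Each application then carries only its own short set of values, while the structure of the deployment lives in one place and stays uniform.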

Lift and Shift

The migration process we used can be summarized in four steps:

The main goal of this process was to migrate our applications while still processing millions of transactions and keeping it (mostly) transparent to our developers.

Each of those steps could take an article of its own, so for the sake of simplicity we’re keeping it short. But in case you’re eager for more, we’ve left some references ([1][2][3][4]) at the end of this article. Enjoy!

Show me the data

We went down from ~3500 EC2 instances to ~900 K8S nodes divided among 14 clusters [5]!

Looking only at the migration of iFood’s Logistics applications, we’re talking about roughly $4.5K in monthly savings (look at our heavyweight champion in the last column):

Deploys are also significantly faster (before and after):

And we’ve got some fine-grained scaling, as shown on a daily profile of one of our applications:
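Scaling of this kind follows a simple rule. The Kubernetes Horizontal Pod Autoscaler documentation gives the target as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below applies that rule to a made-up series of CPU readings. The readings, the 60% target and the replica bounds are assumptions for illustration, not iFood data, and the real autoscaler also applies a tolerance and stabilization windows that are left out here.

```python
import math


def desired_replicas(current, current_metric, target_metric,
                     min_replicas=2, max_replicas=20):
    """Apply the HPA ratio rule, then clamp to the configured bounds."""
    raw = math.ceil(current * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical average CPU utilization (%) observed at points through a day.
cpu_readings = [30, 45, 90, 120, 60, 25]
target_cpu = 60  # assumed target utilization (%)

replicas = 4  # assumed starting replica count
for cpu in cpu_readings:
    replicas = desired_replicas(replicas, cpu, target_cpu)
    print(f"cpu={cpu:>3}%  ->  replicas={replicas}")
```

Because the replica count moves one pod at a time rather than one VM at a time, capacity can follow the load curve closely in both directions.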

All in all, migrating to Kubernetes was a must in order to meet our scalability needs while keeping costs under control. Even heavyweight applications benefited from this migration, while we’re refactoring them to better fit into a K8S environment.

Using Helm allowed us to get started really fast, avoiding error-prone hand-written setup files and giving us a uniform deploy environment.

Integrating Helm into the CI/CD flow while keeping the old EC2 deployments was fundamental: it gave us instant feedback on the behavior in both environments without blocking the deployment of new application features.

And finally, blue-green deployments allowed us to get fast feedback on the migration status, keeping the risks under control. In short: go for it (and order an iFood in between your deployments)! 😄

References

Some things you might find interesting: