Behind the scenes: Migrating Rapido to a scalable infrastructure

Rapido Labs · Apr 29, 2020

In the last part, we explored some of the challenges (link) that Rapido was facing with its platform. In this part, we cover how we overcame those challenges by migrating to a scalable infrastructure without disrupting the services.

With a fast-growing user base of 10 million active customers and 500,000 captains regularly accessing the app and the services, Rapido had to quickly scale its platform to serve all the users.

At the time of writing, Rapido operates on a Kubernetes cluster consisting of ~80 services, ~50 nodes, and ~250 pods.

But reaching that point required extreme diligence in managing the transformation. The entire tech team, with buy-in from business leadership, set out to find the “optimal” solution for our needs and to migrate services to the new infrastructure gradually.

A look into the existing infrastructure revealed that some of Rapido’s critical workloads were hosted across multiple cloud service providers, including DigitalOcean and Google Cloud Platform. All of these workloads ran on VMs that had been set up manually.

It was clear that the entire process needed to be structured and automated to scale. The idea was to move toward containerization and a low-touch deployment process, with application builds, deployments, and infrastructure defined closely alongside the code.

Finding the right tooling

The process started with consciously identifying the available options and choosing the one best suited to the problem. After much deliberation, we decided to go with Kubernetes.

Some of the reasons why Kubernetes was our top pick:

  • Auto-scaling: We could automatically change the number of running containers based on CPU utilization or other application-provided metrics, and could also scale manually through a command or the interface (see the sketch after this list).
  • Automated rollouts and rollbacks: Rollouts of new versions or updates usually happen without downtime, since Kubernetes monitors container health; if a rollout goes wrong, it is rolled back automatically.
  • Canary deployments: We could test a new deployment in production alongside the existing version, mitigating risk by rolling changes out incrementally.
  • Traffic routing and load balancing: We could automate sending requests to the appropriate containers. Kubernetes also comes with built-in load balancing, so we could shift resources in response to outages or periods of high traffic.
  • Auto-healing: Kubernetes checks the health of nodes and containers and replaces or restarts failed pods automatically, so a crashed container doesn’t demand immediate human intervention (the liveness probe in the sketch below is an example).
  • Service mesh: With a homogeneous infrastructure on Kubernetes, we could leverage a service mesh for deeper insight into service-to-service communication across the infrastructure. More on the service mesh and how it helped Rapido in upcoming posts.
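
To make the auto-scaling and auto-healing points concrete, here is a minimal sketch of the corresponding Kubernetes manifests. The service name (orders), image, and thresholds are hypothetical illustrations, not Rapido’s actual configuration:

```yaml
# Hypothetical HPA: keep average CPU around 70% by scaling the
# "orders" Deployment between 3 and 20 replicas.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: orders
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70
---
# Hypothetical Deployment with a liveness probe: if /healthz stops
# responding, the kubelet restarts the container (auto-healing).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orders
  template:
    metadata:
      labels:
        app: orders
    spec:
      containers:
        - name: orders
          image: registry.example.com/orders:1.0.0
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 15
```

Manual scaling and rollbacks are equally terse: `kubectl scale deployment orders --replicas=10` and `kubectl rollout undo deployment/orders`.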

Getting ready for migration

With the solution chosen, we were ready to migrate. To get started, though, there was debate about the starting point: should we build a completely new environment and then cut production over to it, or migrate services into the new environment one by one, as each became ready to be deployed?

Given how critical the business was and the expectation of minimal disruption, the team decided to keep the existing infrastructure untouched. In parallel, we created a new infrastructure with the necessary automation and best practices in place to manage repetitive tasks and handle unexpected surges.

Migrating to Kubernetes

We carried out the migration in phases. We began by containerizing all the existing services, then deployed them to a series of staging environments hosted on Kubernetes, and finally moved our production services over systematically.

We automated cluster provisioning on our cloud service provider. Initially we ran everything in one universal node group, but we later separated workloads by scope onto different node groups and instance types, which helped us make better use of the available resources (a sketch follows).
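
As an illustration of that separation, a workload can be pinned to a dedicated node group using labels and taints. A minimal sketch, assuming a hypothetical rides-api service and a node group labeled workload-type=api (the names are illustrative, not Rapido’s):

```yaml
# Hypothetical example: a latency-sensitive service pinned to a node
# group reserved for API workloads.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rides-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: rides-api
  template:
    metadata:
      labels:
        app: rides-api
    spec:
      nodeSelector:
        workload-type: api        # label applied to the node group
      tolerations:
        - key: workload-type
          operator: Equal
          value: api
          effect: NoSchedule      # taint keeps unrelated pods off these nodes
      containers:
        - name: rides-api
          image: registry.example.com/rides-api:1.0.0
```

The taint keeps unrelated pods off the dedicated nodes, while the matching toleration and nodeSelector steer the service onto them.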

We even created tools to automate the dockerization of services spread across multiple programming stacks. Dockerfiles, Helm charts, and CI/CD build configuration are all generated with a single command whenever an existing service is to be migrated to a Kubernetes environment. This sped up the migration and kept the Dockerfiles and Helm charts consistent for every service (a sketch of such generated output follows).
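
The generator itself is internal, but its output for, say, a Node.js service might resemble the multi-stage Dockerfile below. The base image, port, and build commands are assumptions for illustration, not the actual generated template:

```dockerfile
# Hypothetical generated Dockerfile for a Node.js service.
# Stage 1: install dependencies and build.
FROM node:12-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 2: copy only what's needed at runtime, keeping the image small.
FROM node:12-alpine
WORKDIR /app
COPY --from=build /app/package*.json ./
RUN npm ci --only=production
COPY --from=build /app/dist ./dist
EXPOSE 8080
CMD ["node", "dist/server.js"]
```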

The critical thing we had to pay attention to while preparing for migration was reviewing and streamlining the configuration of the services in a standard way: even a single misconfiguration could have led to disaster. A lot of effort went into this area during the migration; we made a few mistakes early on and learned our lessons the hard way.

While we had all the groundwork done to deploy an existing service into the Kubernetes environment with full automation, the challenge was to route traffic into Kubernetes in a phased manner, with all the hooks and levers available to roll back at any time if things didn’t go our way. We used NGINX load balancers to achieve this. Every service in the existing infrastructure already had an NGINX fronting it, and all other services depending on it were routed through that NGINX. We used the same NGINX to route part of the traffic to the Kubernetes environment via backend weights (sketched below), which gave us very fine-grained control over how and where to route traffic for every individual service.
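
A minimal sketch of that weighted routing, assuming a hypothetical payments service; the addresses and the 90/10 split are illustrative:

```nginx
# Hypothetical NGINX fronting the "payments" service: nine out of ten
# requests go to the legacy VM backend, one goes to Kubernetes.
upstream payments_backend {
    server 10.0.1.11:8080 weight=9;           # existing VM deployment
    server k8s-ingress.internal:80 weight=1;  # Kubernetes environment
}

server {
    listen 80;
    server_name payments.internal;

    location / {
        proxy_pass http://payments_backend;
        proxy_set_header Host $host;
    }
}
```

Shifting more traffic is a matter of adjusting the weights and reloading NGINX; rolling back means marking the Kubernetes backend `down` (or removing it) and reloading.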

There were many other things we had to put in place before starting the migration: a better observability stack, a service mesh, centralized logging, finer access controls so that teams could manage their own deployments, container registry isolation, and versioning. Watch this space, as we will cover these topics over the course of this series.

Slowly and steadily, we migrated the services to the new infrastructure, and the results were visible fairly quickly.

The results

Before Kubernetes, deploying any change to production or any other environment required a lot of context and care. There was no standard way of branching or managing configuration across services. If a single service was deployed on multiple nodes (VMs), it had to be deployed to each node manually, a recipe for disaster because of human error. Rollback was slow, since the same process had to be reversed across all nodes. Operating-system patches were hard to apply, and the heterogeneous environment made the system hard to monitor. Horizontal scaling used to take minutes, as opposed to a few seconds in Kubernetes, and there was no such thing as auto-healing if a process crashed or a system ran out of memory or other resources.

With Kubernetes, all of this is handled out of the box, and developers can focus on building ideas that add business value and improve the customer experience instead of worrying about how to deploy and manage infrastructure. A new service now goes from creation to production deployment in seconds, through a completely automated pipeline into the Kubernetes infrastructure. Every service gets monitoring, centralized logging, centralized config and secrets management, auto-scaling, and auto-healing by default, without any additional effort.

This is the second part of an ongoing series exploring the transformational journey at Rapido. In the next part, we will look at what we learned from the migration and the impact it had.

Also, we are always looking out for passionate people to join our Engineering team in Bangalore. Check out the link for open roles: https://bit.ly/2V08LNc
