Why and how we migrated Preply to Kubernetes
In this article, I’ll share our experience migrating the Preply platform to Kubernetes: how and why we did it, the difficulties we faced, and the benefits we’ve seen since the migration.
My name is Amet Umerov and I’m a DevOps Engineer at Preply.com. Let’s get started!
Kubernetes for Kubernetes? No, for business requirements!
There’s a lot of hype around Kubernetes. Many people say it will solve all your problems, while others argue you should avoid it because it’s not a silver bullet.
But that’s a discussion for another article. Let’s talk a little bit about our business requirements and how Preply worked before the Kubernetes era:
- With our Skullcandy flow, a pool of features was merged into the stage-rc branch, which was deployed to the staging environment. The QA team tested in this environment, the branch was merged into master, and then the deploy to production started. It took 3–4 hours to test and deploy to an environment, and we were able to deploy 0–2 times a day.
- When we deployed broken code to production, we had to revert all the features in the scope. It was also hard to find which task had caused the problem that broke production.
- We used AWS Elastic Beanstalk for application hosting. Every Beanstalk deploy took 45 minutes (the whole pipeline with tests took 90 minutes), and rolling back to the previous app version took another 45 minutes.
To improve our product and processes, we wanted to:
- Migrate to microservices
- Deploy faster and more often
- Be able to roll back faster
- Change our current development flow because our old one wasn’t effective anymore
Our needs
Changing the development flow
To implement features using our previous development flow, we had to create a dynamic staging environment for every feature branch. In our old Elastic Beanstalk configuration, this was complicated and expensive. We needed to create environments that:
- Were easy and quick to deploy (preferably containers)
- Worked with Spot Instances
- Were as close as possible to production
We decided to switch to Trunk-Based Development. With Trunk-Based Development, every feature gets its own short-lived branch that can be merged directly into master, and master can be deployed at any time.
Deploying faster and more often
The new Trunk-Based flow allowed us to deliver features to master one by one. Because of this, we could find broken code quickly and revert to working code easily. However, we still had long deploy (90 min) and rollback (45 min) times, which limited us to 4–5 deploys per day.
We also faced challenges using SOA with Elastic Beanstalk. The most obvious solution was to use containers with a container orchestrator. We already used Docker and docker-compose for local development.
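For context, our local development setup was the usual docker-compose one. A minimal, hypothetical sketch (service names, image versions, and credentials are placeholders, not our real configuration):

# Hypothetical docker-compose file for local development; the services,
# image versions, and credentials below are placeholders.
version: "3.7"
services:
  web:
    build: .                      # build the application image from the local Dockerfile
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgres://app:app@db:5432/app
    depends_on:
      - db
  db:
    image: postgres:11            # local database for development only
    environment:
      - POSTGRES_USER=app
      - POSTGRES_PASSWORD=app
      - POSTGRES_DB=app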
Our next step was to research popular container orchestrators:
- AWS ECS
- Swarm
- Apache Mesos
- Nomad
- Kubernetes
We decided to use Kubernetes. Every other container orchestrator had drawbacks: ECS is a vendor lock-in solution. Swarm has been losing ground to Kubernetes. Apache Mesos, with its ZooKeepers, is like a spaceship. Nomad sounded interesting, but it’s inefficient to use unless your infrastructure is already built on HashiCorp products, and there are no namespaces in Nomad’s free version.
Despite the steep learning curve, Kubernetes is the de facto standard in container orchestration. It can be used as a service on every large cloud provider. It’s in active development with a huge community and strong documentation.
We expected to complete our migration to Kubernetes in 1 year. Two platform engineers without any Kubernetes experience worked half-time on the migration.
Starting to use Kubernetes
We started with a Kubernetes proof of concept: we created a testing cluster and documented all of our work. We decided to use kops, since Amazon’s EKS only became available in Europe in September 2018.
We tested many things, including cluster-autoscaler, cert-manager, Prometheus, HashiCorp Vault, and Jenkins integration. We also experimented with rolling-update strategies for the self-hosted cluster while upgrading our test cluster, and worked through DNS issues and a few AWS-related networking issues while troubleshooting it.
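To give one example, wiring up cert-manager on the test cluster mostly comes down to an issuer resource like the hypothetical sketch below (the apiVersion depends on the cert-manager release; the email and secret name are placeholders):

# Hypothetical ClusterIssuer for Let's Encrypt certificates via cert-manager;
# the apiVersion varies by cert-manager release, names and email are placeholders.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: devops@example.com
    privateKeySecretRef:
      name: letsencrypt-account-key     # secret storing the ACME account key
    solvers:
      - http01:
          ingress:
            class: nginx                # solve HTTP-01 challenges through the NGINX ingress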
For cost optimization, we used Spot Instances. To handle Spot Instance termination notices, we used kube-spot-termination-notice-handler, and we found the Spot Instance Advisor useful for checking how often Spot Instances get interrupted.
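In kops, a Spot node pool is just an instance group with a maximum price set. A simplified, hypothetical example (cluster name, machine type, sizes, and price are placeholders):

# Hypothetical kops instance group running on Spot Instances;
# cluster name, machine type, size, and maxPrice are placeholders.
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: stage.k8s.example.com
  name: nodes-spot
spec:
  role: Node
  machineType: m5.large
  minSize: 3
  maxSize: 10
  maxPrice: "0.10"               # setting a bid price makes the group use Spot Instances
  subnets:
    - eu-west-1a
    - eu-west-1b
    - eu-west-1c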
We started the migration from the Skullcandy flow to Trunk-Based Development, running a separate staging environment in Kubernetes for every pull request. This reduced feature delivery to production from 4–6 hours to 1–2 hours.
We used a testing cluster for these dynamic environments, and every dynamic environment was in a separate namespace. Developers had access to the Kubernetes dashboard for debugging.
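Each dynamic environment is essentially a Helm release installed into its own namespace on the testing cluster. A hypothetical sketch of what gets created per pull request (the naming scheme and label are illustrative only):

# Hypothetical namespace for a single pull request's dynamic environment;
# the pr-<number> naming scheme and the label are placeholders.
apiVersion: v1
kind: Namespace
metadata:
  name: pr-1234                          # one namespace per open pull request
  labels:
    environment: dynamic                 # could be used by a cleanup job to remove stale environments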
We started to get value from the testing cluster 1–2 months after launching our proof of concept, a result we’re proud of!
Staging and production clusters
Here is the setup of our staging and production clusters (a simplified kops sketch follows the list):
- kops and Kubernetes 1.11 (the latest version supported by kops at the time of setup)
- 3 master nodes in different availability zones
- Private network topology with a dedicated bastion host, Calico CNI
- Prometheus running in the same cluster, with a PVC for metrics (we don’t need long-term storage for our metrics)
- Datadog agent for APM
- Dex + dex-k8s-authenticator to give developers access to the staging cluster
- Staging nodes run on Spot Instances
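Put together, the cluster definition is close to the following simplified kops spec (a hypothetical sketch: names, CIDRs, versions, and the Dex endpoint are placeholders, and many required fields are omitted):

# Simplified, hypothetical kops cluster spec for the setup described above.
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: prod.k8s.example.com
spec:
  kubernetesVersion: 1.11.10
  topology:
    masters: private                     # masters and nodes live in private subnets
    nodes: private
    bastion:
      bastionPublicName: bastion.prod.k8s.example.com   # dedicated bastion host
  networking:
    calico: {}                           # Calico CNI
  subnets:                               # one private subnet per availability zone
    - { name: eu-west-1a, type: Private, zone: eu-west-1a, cidr: 10.0.1.0/24 }
    - { name: eu-west-1b, type: Private, zone: eu-west-1b, cidr: 10.0.2.0/24 }
    - { name: eu-west-1c, type: Private, zone: eu-west-1c, cidr: 10.0.3.0/24 }
    - { name: utility-eu-west-1a, type: Utility, zone: eu-west-1a, cidr: 10.0.10.0/24 }
  etcdClusters:                          # etcd members spread across the three master groups
    - name: main
      etcdMembers:
        - { instanceGroup: master-eu-west-1a, name: a }
        - { instanceGroup: master-eu-west-1b, name: b }
        - { instanceGroup: master-eu-west-1c, name: c }
  kubeAPIServer:                         # OIDC settings that let Dex authenticate developers
    oidcIssuerURL: https://dex.example.com
    oidcClientID: dex-k8s-authenticator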
While operating the clusters, we ran into some problems. For example, the NGINX Ingress and Datadog agent versions differed between the clusters, so things that worked fine on staging broke on production. To solve this, we made the staging and production clusters identical.
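One way to keep shared components identical across clusters is to pin the same chart versions for both, for example with helmfile (an illustration only, not necessarily how we manage it; chart versions below are placeholders):

# Hypothetical helmfile pinning identical add-on versions for staging and production.
repositories:
  - name: stable
    url: https://charts.helm.sh/stable
releases:
  - name: nginx-ingress
    namespace: ingress-nginx
    chart: stable/nginx-ingress
    version: 1.24.4                      # the same pinned version applied to both clusters
  - name: datadog
    namespace: datadog
    chart: stable/datadog
    version: 1.38.3                      # placeholder version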
Migrating production to Kubernetes
Now that staging and production clusters were ready, we began the migration. Here is the simplified structure of our monorepo:
.
├── microservice1
│ ├── Dockerfile
│ ├── Jenkinsfile
│ └── ...
├── microservice2
│ ├── Dockerfile
│ ├── Jenkinsfile
│ └── ...
├── microserviceN
│ ├── Dockerfile
│ ├── Jenkinsfile
│ └── ...
├── helm
│ ├── microservice1
│ │ ├── Chart.yaml
│ │ ├── ...
│ │ ├── values.prod.yaml
│ │ └── values.stage.yaml
│ ├── microservice2
│ │ ├── Chart.yaml
│ │ ├── ...
│ │ ├── values.prod.yaml
│ │ └── values.stage.yaml
│ ├── microserviceN
│ │ ├── Chart.yaml
│ │ ├── ...
│ │ ├── values.prod.yaml
│ │ └── values.stage.yaml
└── Jenkinsfile
The main Jenkinsfile contains a map of microservice names to their directories. When a developer merges a PR into master, a tag is created in GitHub, and Jenkins deploys that tag according to the Jenkinsfile.
There are Helm charts for every microservice in the helm directory, with separate values files for production and staging. We use Skaffold to deploy multiple Helm charts to staging. We also tried an umbrella chart, but it didn’t scale well for us.
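A minimal skaffold.yaml for the staging deploys might look like the sketch below (release names mirror the placeholder layout above; the apiVersion depends on the Skaffold release in use):

# Hypothetical Skaffold config deploying several Helm charts with their staging values.
apiVersion: skaffold/v2beta12            # depends on the Skaffold release in use
kind: Config
deploy:
  helm:
    releases:
      - name: microservice1
        chartPath: helm/microservice1
        valuesFiles:
          - helm/microservice1/values.stage.yaml
      - name: microservice2
        chartPath: helm/microservice2
        valuesFiles:
          - helm/microservice2/values.stage.yaml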
Following the twelve-factor app methodology, every new microservice we run in production writes logs to stdout, reads secrets from Vault, and has a set of basic alerts (replica count, 5xx errors, and latency on the ingress checks).
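As an illustration, the baseline alerts can be expressed as Prometheus rules like the hypothetical ones below (the service name, metric labels, and thresholds are placeholders, not our actual alerting configuration):

# Hypothetical baseline alerts for a single microservice.
groups:
  - name: microservice1-basic
    rules:
      - alert: MicroserviceReplicasUnavailable
        expr: kube_deployment_status_replicas_available{deployment="microservice1"} < 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "microservice1 has no available replicas"
      - alert: MicroserviceHigh5xxRate
        expr: |
          sum(rate(nginx_ingress_controller_requests{ingress="microservice1", status=~"5.."}[5m]))
            / sum(rate(nginx_ingress_controller_requests{ingress="microservice1"}[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "More than 5% of requests to microservice1 return 5xx"
      - alert: MicroserviceHighLatency
        expr: |
          histogram_quantile(0.95, sum(rate(nginx_ingress_controller_request_duration_seconds_bucket{ingress="microservice1"}[5m])) by (le)) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency for microservice1 is above 1 second"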
Whether or not a new feature is delivered as a set of microservices, there is still some core functionality in Django, and at this point it was still running on Elastic Beanstalk.
We used AWS CloudFront as our CDN because it made canary deploys easy during the migration. We started migrating the monolith, testing it on a few language versions of our site and on the admin pages.
This smooth canary migration allowed us to find and fix bugs in production and polish our deploys over a few iterations. For several weeks we watched the new platform, its load, and our monitoring, and eventually we switched 100% of our traffic to Kubernetes.
After that, we stopped using Elastic Beanstalk.
UPD (Nov 2019): we now use Skaffold for production deploys as well, instead of nested Jenkinsfiles.
Summary
The full migration took us 11 months, a good result: we had expected it to take a year.
Outcomes:
- Deploy time reduced from 90 min to 40 min
- Deploy count increased from 0–2/day to 10–15/day (and still growing!)
- Rollback time decreased from 45 min to 1–2 min
- We can easily deliver new microservices to production
- We changed our monitoring, logging, and secret management infrastructure to be centralized and written as code
It was an awesome experience working on the migration, and we are still making improvements.
Don’t forget to read this cool article on Kubernetes written by our former colleague Yura, a YAML engineer who helped make it possible to use Kubernetes at Preply.
Also, subscribe to the Preply Engineering Blog for more interesting articles about engineering at Preply.
See ya!