From Mesos to Kubernetes: one year after migration

Adevinta Tech Blog · Apr 8, 2021

By Sébastien Georget (Architect Lead), Benjamin Riou (Site Reliability Engineer), Serhii Tarasov (Software Engineer)

In May 2020, we migrated our content moderation platform from Mesos to Kubernetes (see part 1 and part 2). In this post, we’ll share some learnings from our first year in this new environment.

TL;DR

This year has confirmed our first impressions: from a user perspective, Kubernetes is much simpler than Mesos. We can easily change our runtime architecture and leverage common patterns, and there are many resources and tools to simplify our day-to-day work. Of course, every migration comes with a few glitches, and we’ve learned from them to improve our system’s resilience.


Helmsman for a smooth 1.9 to 1.15 migration

A few months after our go-live on Kubernetes (1.9), Adevinta’s Platform Services team (the team that provides our Kubernetes service) suggested migrating to a new cluster running Kubernetes 1.15.

To be honest, we were initially slightly afraid of going for a second migration so early in the process. The migration from Mesos had already impacted our roadmap and at this point, we wanted to focus on core business needs.

We hoped our architecture and technology choices were flexible enough to make a new migration easy! As detailed in the previous article, we’re using Helmsman to automate our Helm chart deployments. In a few files stored in git, we describe every Helm chart that has to be deployed in order to have a fully operational instance of our component for a tenant. We also have some Helmsman manifests to describe our shared services (like authentication).
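To give an idea of what these files look like, here is a minimal Helmsman desired state file in YAML. All names, versions and URLs are illustrative, not our actual configuration:

# Minimal Helmsman desired state file (all names, versions and URLs are made up).
settings:
  kubeContext: "moderation-prod"  # the target cluster; pointing at another cluster starts here

namespaces:
  moderation:
    protected: false

helmRepos:
  our-charts: "https://charts.example.com"

apps:
  moderation-api:
    namespace: "moderation"
    enabled: true
    chart: "our-charts/moderation-api"
    version: "1.4.2"
    valuesFile: "values/tenant-a.yaml"

With such files in git, recreating the whole stack elsewhere boils down to updating kubeContext and running helmsman with the --apply flag.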

Thanks to Helmsman, running our application in a new cluster is as simple as changing the target cluster in a Kubernetes configuration file and then deploying everything. The tricky part was managing the two deployments in parallel. We were fine from a runtime point of view, as our microservices are stateless, but the challenge was on the deployment side: we didn’t want to block deployments in our main cluster due to issues occurring in the new one. Breaking changes were the main risk: a deployment whose new version is not compatible with the previous one could break whichever cluster lagged behind. This doesn’t happen often, so we asked the team not to perform such deployments during the migration, while keeping them possible in case of urgent need.

So we switched on all the services in the second cluster and had all of our applications running in both of them. We cloned the default deployment pipeline and configured it to use the new cluster, so we could deploy changes to both clusters at the same time.

We used the “route-dispatcher” component (from the previous article) to progressively route traffic from one cluster to the other (from 1% to 100%). We then changed all CNAMEs to point to the new cluster’s ingresses.
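The route-dispatcher configuration itself is specific to our setup, so the snippet below is only a hypothetical illustration of the idea: a weighted split between the two clusters, shifted step by step.

# Hypothetical weighted-split configuration; the real route-dispatcher format differs.
routes:
  moderation-api:
    old-cluster: 99  # decreased at each step
    new-cluster: 1   # increased progressively from 1 to 100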

Finally, after configuring the default pipeline to use the new cluster and removing the legacy pipeline, we were able to delete all Helm releases from the old cluster. That was the last step!

K9s to the rescue for day-to-day operations

Once upon a time, there was kubectl…

Kubernetes is a container scheduler, but above all it’s an API. And with any API come credentials, HTTP requests, verbs like get, put and delete, and so on. To simplify our daily lives, Kubernetes comes with its own API client named kubectl, a command-line tool whose syntax reads like subject-verb-object sentences. Typical daily commands look like this:

kubectl describe pod MyPod
kubectl edit configmap Moderation
kubectl get ingress ui-leboncoin

You’re a professional engineer, but are you a professional typist too?

The main problem you may face when working with Kubernetes is that you regularly need to interact with the kubectl CLI, issuing long and repetitive commands. Each kubectl interaction can return a huge result set, in which finding the right information requires extra grep-ing or filtering by labels or annotations.

On top of that, each result set is frozen at a single point in time, so monitoring changes (e.g. an active deployment) means typing the same commands over and over (kubectl get pods --watch helps, but only for the simplest cases).

K9s, the unnecessary tool you can’t let go of

K9s is a “terminal UI” tool that helps you manage Kubernetes clusters of any size. It lets you interact with any object: quickly list them, fuzzy-search within them and act on them just as you would with the kubectl CLI. Every piece of information in K9s is updated in real time, with a very convenient colour scheme that points out failures and newly created objects.

K9s can be compared to an IDE for software development. You don’t need an IDE to develop, but it makes it so much easier that you won’t go back to a simple editor.

K9s screen

K9s is easy to use and intuitive: there are only four keys you need to know to get the most out of it.

  • : switches between the Kubernetes object types (pods, deployments, …)
  • / fuzzy-searches for an entry within the list
  • ? displays the filtering options
  • Escape returns to the previous screen

Besides its stability and practicality, the great advantage of this tool is that it empowers developers to track their deployments and access their application logs on their own, even if they’re not familiar with Kubernetes. K9s also gives a crystal-clear representation of the current cluster state. At Adevinta, we use it to spot failing deployments while a new version is going to production.

You don’t need K9s to operate Kubernetes, but once you’ve tried it, you won’t go back.

YAML to increase our resilience

One of the main benefits of our migration to Kubernetes is that we can now easily adapt and simplify our runtime architecture. In Mesos, these changes had to go through developments in the C codebase of our Mesos framework. Doing the same thing in Kubernetes is just a matter of creating a new Helm chart with the required resources. To put it differently: all you need is a few lines of YAML. Super simple, yet extremely powerful!


Sometimes it’s not even necessary to create a Helm chart. In Mesos, we had a single entry point for all our tenants. During the Kubernetes migration, we decided to instantiate one dedicated entry point per tenant in order to isolate resources. This is advantageous as some tenants can experience spam attacks and we don’t want to impact the others in such circumstances.

Since we already had a dedicated DNS entry for each tenant, we proceeded in two simple steps:

  • We adapted the Helmsman configuration for each tenant to deploy a dedicated entry point, i.e. we instantiated a Helm chart containing an ingress and a deployment (see the sketch after this list)
  • We progressively changed the DNS entry from the old entry point (ingress) to the new dedicated one
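As an illustration, a dedicated entry point can be as small as the following ingress; names and hosts are made up, and we use the networking.k8s.io/v1beta1 API that 1.15-era clusters expose:

# Illustrative per-tenant entry point (names and hosts are hypothetical).
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: entrypoint-tenant-a
  namespace: moderation
spec:
  rules:
    - host: tenant-a.moderation.example.com
      http:
        paths:
          - path: /
            backend:
              serviceName: entrypoint-tenant-a
              servicePort: 80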

We’re now considering having dedicated entry points for each kind of content type (ads and messages). Once again, it should be simple: we’ll just have to edit our Helmsman configuration to deploy two instances of the Helm chart (one per content type).
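Assuming the same hypothetical names as above, that change would boil down to two entries in the apps section of the Helmsman file:

# Two instances of the same chart, one per content type (names are illustrative).
apps:
  entrypoint-ads:
    namespace: "moderation"
    enabled: true
    chart: "our-charts/entrypoint"
    version: "2.0.0"
    valuesFile: "values/ads.yaml"
  entrypoint-messages:
    namespace: "moderation"
    enabled: true
    chart: "our-charts/entrypoint"
    version: "2.0.0"
    valuesFile: "values/messages.yaml"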

With Kubernetes, isolating tenants is just a matter of adding a few YAML lines describing the dedicated components we want for each of them. We don’t manually edit any YAML files; they’re templated and instantiated for each tenant to avoid mistakes and to simplify maintenance (like updates to deploy on all tenants).
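For instance, a chart template can inject the tenant name wherever it’s needed, so a single chart serves every tenant. A minimal fragment, with a hypothetical tenant value:

# Fragment of a hypothetical templates/ingress.yaml: the tenant comes from values.
metadata:
  name: entrypoint-{{ .Values.tenant }}
spec:
  rules:
    - host: {{ .Values.tenant }}.moderation.example.com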

Sidecar deprecation

To make the migration from Mesos easier, we decided not to change our microservices. One consequence of keeping the same microservices was that we kept our custom legacy communication protocol (JSON-RPC) and used a facade sidecar to expose the microservices over HTTP.

This had two side-effects:

  • Higher costs: the sidecars consumed resources. With ~1,000 running pods, even a small sidecar requesting 0.1 CPU and 64MB of memory adds up to roughly 100 CPUs and 64GB of memory reserved for sidecars alone
  • Lower stability: the sidecar introduced an intermediate hop that was a potential bottleneck. As a result, we had some incidents under high load, related to the extra latency and to the sidecar’s timeout and connection-pool settings

That’s why we decided to remove this component by adapting our microservices to directly expose an HTTP endpoint. Removing the sidecars helped reduce our costs, and in the long run it’ll improve the maintainability of our solution.


We still use a Datadog sidecar per pod to fetch our application metrics and send them to Datadog’s servers. We’re considering removing this one too, to reduce the resources each pod consumes. It’s not a priority, however, so we’ll do it during the deprecation of our core library (the C part that we no longer want to modify).

Right-sizing and cost savings

During the migration process, we deliberately chose conservative resource reservations. After a few months, we were confident enough to adjust the resources and replicas of some services. The migration to Kubernetes 1.15 also helped, as auto-scaling became much more stable (with the additional benefit that we can now auto-scale on multiple criteria if needed, as sketched below).
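For example, on a 1.15 cluster the autoscaling/v2beta2 API lets a HorizontalPodAutoscaler combine several criteria. The deployment name and thresholds below are illustrative:

# Illustrative multi-metric HPA (autoscaling/v2beta2 is available since Kubernetes 1.12).
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: moderation-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: moderation-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80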

To optimise the use of resources, we used a dashboard, provided by Adevinta’s Platform Services team, that shows the daily cost of each deployment. We used this dashboard to identify the pods that were consuming less than their reservation and adjusted their requests accordingly (see the sketch below).
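Concretely, right-sizing means lowering a container’s requests to match observed usage. The numbers here are only an example:

# Illustrative container resources after right-sizing (values are made up).
resources:
  requests:
    cpu: 100m      # was e.g. 500m, while observed usage stayed well below
    memory: 128Mi  # was e.g. 512Mi
  limits:
    cpu: 500m
    memory: 256Mi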

Thanks to this work and the sidecar deprecation, we’ve reduced our costs by 50%.


What’s next

We now have a stable and cost-effective setup that requires very little maintenance. As a result, we’ll now be able to focus on deprecating legacy libraries that are still dependent on the Mesos framework.

In parallel, we’re also looking at refining our internal organisation to better align the squads with the business problems we’re addressing. This work is likely to lead to some changes on the infrastructure side, but hopefully, thanks to Kubernetes, it will just be a matter of writing new Helm charts and Helmsman files. So keep an eye on our blog as we might cover these changes in another post!

