Migrating our content moderation solution to Kubernetes

Part 2: How we changed our strategy and went live!

Adevinta Tech Blog · Oct 15, 2020


By Sébastien Georget, Architect Lead

tl;dr: in six months, and without stopping the service, we migrated our 1,100-container workload from Mesos to Kubernetes to keep moderating 10M ads(*) and 100M messages(*) per day.

(*) These figures cover all the events we receive and moderate. They may differ slightly from the numbers of moderated ads and messages published by the marketplaces, because we may moderate related content multiple times (e.g. when an ad is edited).

In the previous article, we described our motivation to migrate to Kubernetes and our first re-engineering approach. Here, we look at how we pivoted to meet our deadlines, and at the results of the migration.

  1. The lift-and-shift strategy: how we met our deadlines (Q1 2020)
  2. The rollout: how we went live (May 2020)
  3. The learnings: what we discovered during this journey

The lift-and-shift strategy (January 2020)

With three months left to migrate, we had to make a decision: continue with the “re-engineering” or find another way. And guess what? We changed to a “lift-and-shift” strategy!


What exactly does that mean? Keep it simple: don’t re-engineer anything, just make the minimum changes needed to run in Kubernetes. We had to drop a few weeks of work, but in doing so we were able to refocus our efforts on the real challenges of the migration: keeping the auto-moderation system working and preparing its move to Kubernetes.

So on the Kubernetes side we went back to a generic chart to deploy all our modules. Instead of creating ad-hoc charts from the start, we created one chart for all of them and forked it only when necessary. In the end, we had four charts to deploy 20 modules.
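As an illustration, a per-module values file for such a generic chart could look like the sketch below; the module name, image and values are made up for this example, not our actual configuration:

  # Hypothetical values file for the generic chart: one chart for all
  # modules, parameterised per module (all names and values illustrative).
  module:
    name: image-filter
    image:
      repository: example.registry/serenity/image-filter
      tag: "2.3.1"

  replicaCount: 2

  resources:
    requests:
      cpu: 500m
      memory: 512Mi

  service:
    port: 8080

Forking the chart only when a module really needs something specific keeps the maintenance cost of 20 modules close to that of a single one.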

We also started to look at continuous deployment, and decided to go with helmsman to manage our Helm deployments from Git. The deployment of a marketplace’s stack is simply described in a helmsman file (YAML) and some value files (YAML) generated from our existing configuration.
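As a sketch, the helmsman desired-state file for one marketplace could look like this; the cluster context, chart names and file paths below are invented for the example:

  # Hypothetical helmsman desired-state file for one marketplace
  # (cluster context, chart names and paths are illustrative).
  settings:
    kubeContext: "moderation-cluster"

  namespaces:
    marketplace-a:
      protected: false

  apps:
    image-filter:
      namespace: marketplace-a
      enabled: true
      chart: "serenity/generic-module"
      version: "1.0.0"
      valuesFiles:
        - values/marketplace-a/image-filter.yaml

Running helmsman against such a file (from CI, for example) makes Git the single source of truth for what is deployed where.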

On the application side, we introduced a new component that played a key role in the migration: the “route-dispatcher”. Placed between the marketplaces and Serenity, its purpose was to ask the marketplaces to migrate to a new URL once, and then to manage the switch from the Mesos backend to the Kubernetes backend entirely on our side. This is usually described as the strangler pattern. And it worked pretty well!
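The route-dispatcher itself is home-made, but its routing rules can be pictured with a configuration sketch like the one below; all names, URLs and fields here are hypothetical:

  # Hypothetical route-dispatcher configuration (illustrative only):
  # per marketplace, the share of traffic sent to the Kubernetes
  # backend; the remainder still goes to the Mesos backend.
  backends:
    mesos: "http://serenity.mesos.internal"
    kubernetes: "http://serenity.k8s.internal"

  routes:
    marketplace-a:
      kubernetesPercent: 1    # first verification step
    marketplace-b:
      kubernetesPercent: 0    # not migrated yet

Raising a single percentage per marketplace is what made the progressive rollout described below possible.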

The rollout (May 2020)

Yes, we were targeting the end of March, but in the end we started the migration at the end of April…

Thanks to the route-dispatcher, testing and observability, the actual migration was almost the easiest part. We were able to send 1% of the traffic to Kubernetes (e.g. a percentage of the ads for a given marketplace) to gather data, confirm that everything was fine and roll back if anything went wrong.

After this first verification step, we progressively increased the volume, watching for performance issues and adding 10% at a time for each marketplace until we reached 100%!

And on 12 May, we reached 100% of our traffic in Kubernetes!


The learnings

For such a big project, you need prioritisation, communication and focus. That’s why we discussed the benefits of this migration with our product team and agreed to include it in our OKRs. Bonus: it was perfectly aligned with the Trust & Transaction tribe’s “platformisation” OKR :)

When you have deadlines, the lift-and-shift strategy is the safer way to go. We now have plenty of time to perform some refactoring without any pressure.

Go live as soon as possible. We could have put some components into production earlier and learned from them there, rather than discovering issues during the actual traffic migration.

On the technical side, we have moved from a home-made deployment and scaling system to Kubernetes, so we have a lot of things to learn. The first lessons include:

  • auto-scaling rate limits: in our Kubernetes setup, we can only double the number of running pods every three minutes, so we had to adjust some values to handle the load (and its spikes) properly.
  • deployments and horizontal pod autoscaler incompatibility (when misused): before production we were deploying without any load, so everything was fine. It was only under real load that we discovered our deployments were resetting the auto-scaling values (e.g. a component scaled to 10 instances by the load was downscaled to 1…). Fortunately, someone in the team found a fix, shown in the sketch after this list.
  • the need for proper liveness and readiness probe tuning: once again, we wanted to optimise some parts by reducing some timeouts. The result was that pods were killed much earlier than in our previous system, so we went back to the values we had used in Mesos.
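To make these lessons concrete, here is a minimal sketch of a Deployment plus HorizontalPodAutoscaler that avoids the pitfalls above; every name, threshold and timing is an illustrative assumption, not our production configuration:

  # Minimal sketch (all names and values illustrative).
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: moderation-module
  spec:
    # No "replicas:" field: with an HPA attached, re-applying a manifest
    # that sets replicas resets the autoscaled count (our "scaled to 10,
    # downscaled to 1" incident).
    selector:
      matchLabels:
        app: moderation-module
    template:
      metadata:
        labels:
          app: moderation-module
      spec:
        containers:
          - name: moderation-module
            image: example.registry/serenity/moderation-module:1.0.0
            ports:
              - containerPort: 8080
            # Deliberately generous probe timings: our more aggressive
            # values had pods killed far earlier than under Mesos.
            readinessProbe:
              httpGet:
                path: /health
                port: 8080
              initialDelaySeconds: 30
              periodSeconds: 10
            livenessProbe:
              httpGet:
                path: /health
                port: 8080
              initialDelaySeconds: 60
              periodSeconds: 20
  ---
  apiVersion: autoscaling/v2beta2   # current API version at the time of writing
  kind: HorizontalPodAutoscaler
  metadata:
    name: moderation-module
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: moderation-module
    minReplicas: 2
    maxReplicas: 50
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70
    # Scale-up rate limit: at most double the pod count every three
    # minutes (the behavior field requires Kubernetes >= 1.18).
    behavior:
      scaleUp:
        policies:
          - type: Percent
            value: 100
            periodSeconds: 180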

Even though the migration is still recent, we can already feel some benefits. Of course we had some surprises along the way, but Kubernetes troubleshooting is much easier than it was in Mesos:

  • we can connect to pods to see what is going on (with the kubectl exec command)
  • we can send requests to them directly (with the kubectl port-forward command)
  • we can describe each component and each related resource in detail (for example, checking the auto-scaling values of each component with the kubectl get hpa command)

There is a rich tooling ecosystem around Kubernetes that is really helpful (like the k9s tool).

The next steps

We still have an important part to migrate from Mesos to Kubernetes: the handling of GDPR requests. Members of the team have been actively working on this for a few weeks, because we had to re-implement it: it was too closely tied to the Mesos framework. In fact, it could be completed by the time you read these lines ;)

Now that we’re running on Kubernetes, what should come next in terms of refactoring?

  • Our code still uses a C library from the previous environment: should we drop it?
  • We have a sidecar running on each pod as a gateway to our former communication protocol: should we replace it by implementing HTTP endpoints in our modules?
  • We now have a development environment: how do we want to use it?

Let us know what you think we should focus on next in the comments!

