Migrating Arbitrage to Apache Mesos

Grégoire Seux
Published in Criteo R&D Blog
Jul 28, 2020 · 10 min read

Lessons learned from migrating our largest application to our container platform.

Previous posts explained why we launched the migration of our apps running on static Windows machines to our container platform.

In this post, we’ll focus on the challenges (and solutions) we’ve encountered regarding the Mesos/Marathon stack.

A brief look at the stack

Each Criteo datacenter & environment is a separate cluster, which means we have 11 clusters in total. Clusters vary in size from 34 nodes in pre-production to 2,300 in our largest production datacenter, for a total of 9,300 nodes worldwide.

The base layer of our container platform is Apache Mesos (very close to upstream), on top of which run several frameworks: Flink, Aurora, Mozart (a home-made framework to help researchers & analysts access computing resources) and Marathon. Marathon is the equivalent of the init process for Mesos: you describe your service, and Marathon makes sure it stays alive as long as possible, relaunching instances when necessary.
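
To give a concrete idea of what "describing your service" means, here is a minimal sketch of an app definition submitted to Marathon's REST API. The endpoint, app id, command, and resource values are placeholders for the example; only the shape of the request matters.

```python
import requests

# Hypothetical Marathon endpoint and app values, for illustration only.
MARATHON_URL = "http://marathon.example.internal:8080"

app_definition = {
    "id": "/team/my-service",            # application id in Marathon's app tree
    "cmd": "./run-service --port $PORT0",
    "cpus": 0.5,                          # resources requested per instance
    "mem": 512,                           # MB
    "instances": 3,                       # Marathon keeps this many tasks alive
    "healthChecks": [{
        "protocol": "HTTP",
        "path": "/health",
        "gracePeriodSeconds": 60,
        "intervalSeconds": 10,
        "maxConsecutiveFailures": 3,
    }],
}

# Submitting the definition; Marathon then schedules the tasks on Mesos
# and relaunches them whenever they die.
response = requests.post(f"{MARATHON_URL}/v2/apps", json=app_definition)
response.raise_for_status()
```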

Tasks launched by Mesos are isolated within namespaces and cgroups to provide guarantees (and limits) on CPU consumption, available memory, disk usage, accessible GPU units, and other resources.
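
For the curious, a task can peek at the limits imposed on it by reading its own cgroup files. This sketch assumes cgroup v1 mounted under /sys/fs/cgroup and CFS quotas enabled on the agent (the quota file contains -1 otherwise); it is an illustration, not part of our tooling.

```python
from pathlib import Path

def own_cgroup_path(controller: str) -> Path:
    """Resolve this process's cgroup v1 path for a controller via /proc/self/cgroup."""
    for line in Path("/proc/self/cgroup").read_text().splitlines():
        _, controllers, path = line.split(":", 2)
        if controller in controllers.split(","):
            return Path("/sys/fs/cgroup") / controller / path.lstrip("/")
    raise RuntimeError(f"controller {controller} not found")

def read_int(path: Path) -> int:
    return int(path.read_text().strip())

if __name__ == "__main__":
    # Memory limit enforced on this task, in bytes.
    print("memory limit:", read_int(own_cgroup_path("memory") / "memory.limit_in_bytes"))
    # CPU quota/period give the number of CPUs, when CFS quotas are enabled.
    cpu = own_cgroup_path("cpu")
    quota = read_int(cpu / "cpu.cfs_quota_us")
    if quota > 0:
        print("cpu limit:", quota / read_int(cpu / "cpu.cfs_period_us"))
```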

Our stack leverages Consul for service discovery: all tasks are registered in Consul, and all load balancing is configured through this integration.
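
As an illustration of that integration, this is roughly how healthy instances of a service can be looked up through Consul's HTTP API (the service name is a placeholder, and the local agent is assumed to listen on the default port 8500):

```python
import requests

# A sketch of how a client (or a load-balancer configuration tool) can
# discover healthy instances of a service through Consul's HTTP API.
CONSUL_URL = "http://localhost:8500"

def healthy_instances(service_name: str) -> list[tuple[str, int]]:
    resp = requests.get(
        f"{CONSUL_URL}/v1/health/service/{service_name}",
        params={"passing": "true"},  # only instances passing their checks
    )
    resp.raise_for_status()
    return [
        (entry["Service"]["Address"] or entry["Node"]["Address"],
         entry["Service"]["Port"])
        for entry in resp.json()
    ]

print(healthy_instances("my-service"))
```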

Overall we run 18k containers in production across 800 distinct applications. The smallest app has 1 instance in one datacenter consuming 0.1 CPU; the largest one (Arbitrage, the subject of this post) has 5,300 instances worldwide, consuming 88k CPUs. As a reminder, Arbitrage is our application that selects the best advertising campaign for a given user at a given moment. At peak time it compares 530 million campaigns per second. Arbitrage is operated by a team of Product Reliability Engineers who focus on the safety, stability, and reliability of this application.

You may think we would have encountered scalability issues but, to be honest, we observed no raw scalability limits in either Marathon or Mesos. Both handle the load we ask them to cope with. However, many challenges arose from the way we operate or integrate with this stack. We also encountered bugs/scenarios that had a nearly invisible effect on small applications but revealed themselves when we started to scale up Arbitrage.

Now let’s dive into the challenges we encountered.

First challenge: The snowball effect ❄️

Arbitrage, our largest .NET application, has one strong specificity compared to all the other applications: it takes a very long time to start. From a task being started by Mesos to health checks passing (and thus being ready to serve traffic), it takes up to 15 minutes.

For us, it meant that losing too many instances at once led to serious incidents and a long recovery time. For other apps, even losing all instances in a datacenter simultaneously may only have a short-lived effect, since they are all replaced within a few seconds by Marathon.

On Arbitrage, it meant that losing too many instances would lead to a catastrophic event where the remaining instances are overloaded by traffic and eventually killed, without any hope of recovery for 15 minutes. We named that event: the snowball effect.

In this situation, the Marathon health check feature (which kills tasks that don’t pass their health check) is counterproductive. If a system that Arbitrage depends on is broken, all instances stop passing their health checks and are killed by Marathon. Recovery in this case would be long and painful.

To protect all apps against this risk, we patched Marathon to make better decisions upon health check failures. Marathon now considers the health of the whole pool and does not kill instances if fewer than a given ratio of instances are healthy. Even if all instances stop passing their health checks, Marathon will kill only a fraction of them. This gives the rest of the pool a chance to recover if the cause of the health check failures is fixed. Otherwise, instances will slowly be killed as new instances get up to speed. You’ll notice that the latter case still leads to an incident (but hey, we can’t do magic either!).
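
The sketch below is not the actual Marathon patch, only the gist of the decision rule: look at the health of the whole pool before killing anything, and if the pool is already below a safety threshold, kill at most a small fraction of the failing instances. The 70% and 10% values are arbitrary examples, not our production settings.

```python
def instances_to_kill(instances: list[bool], min_healthy_ratio: float = 0.7,
                      max_kill_fraction: float = 0.1) -> int:
    """Decide how many unhealthy instances may be killed right now.

    `instances` holds one boolean per instance: True if it passes its
    health check. The ratios are illustrative, not Criteo's real values.
    """
    total = len(instances)
    healthy = sum(instances)
    unhealthy = total - healthy

    # If the pool is already below the safety threshold, killing everything
    # would only feed the snowball: kill at most a small fraction, so the
    # remaining instances keep a chance to recover.
    if healthy < min_healthy_ratio * total:
        return min(unhealthy, max(1, int(max_kill_fraction * total)))

    # Pool is healthy enough: the usual behavior (kill every failing instance).
    return unhealthy
```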

Since Marathon performs the health checks, this feature protected us against all kinds of issues, such as DNS resolution not working in the datacenter for a few seconds (preventing health checks from succeeding). Without the anti-snowball mechanism, the whole application would have collapsed and recovery would have required redirecting traffic somewhere else.

This patch is not upstream yet as it changes the default behavior of Marathon.

Second challenge: Maintenance on our clusters 🔧

When Arbitrage was running on a dedicated fleet of Windows servers, maintenance was extremely rare: servers did not reboot very often, and configuration changes that required a restart of the app were infrequent.

Because our container platform is kept more up to date, it went (and still goes) through very frequent maintenance, during which we need to drain tasks from servers in order to operate on them.

Before Arbitrage ran on our platform, the maintenance model was simplistic: up to 10% of the servers within a cluster were allowed to enter maintenance, and a server left maintenance only once the tasks it had been running were rescheduled somewhere else. Maintenance was strongly coupled with application healthiness, which naturally paced it.

The long warmup of Arbitrage tasks was obviously an issue, but something else arose: we realized that Arbitrage startup was severely overloading some back ends (databases and caches preloaded by the application at startup). The experience of the Product Reliability Engineering team operating Arbitrage showed that we could safely allow 50 instances of Arbitrage to start at the same time.

Many other teams also challenged this model as too aggressive: because the 10% of servers were chosen randomly, maintenance could drain all of an application’s tasks at the same time.

It was the right moment to revisit the maintenance process and move to something safer, providing stronger guarantees. We migrated to a model where maintenance is automatically scheduled in advance and always operates on the same partitions of the cluster (tasks are forced to spread across this partitioning to minimize the impact of maintenance). For the Arbitrage case, it meant that we had to limit the number of machines that could enter maintenance at the same time to 50. It’s likely that we’ll revisit that limitation later to allow for faster maintenance.

The application driving maintenance and computing schedules 3 days in advance
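
A rough sketch of the scheduling idea, under simplified assumptions (in reality the schedule is computed days in advance by the dedicated application shown above): the cluster is split into fixed partitions, and each maintenance wave is capped so that no more than 50 servers (hence, for Arbitrage, 50 instances) are drained at once.

```python
from itertools import islice

def plan_maintenance(servers: list[str], partitions: int = 10,
                     max_concurrent: int = 50) -> list[list[str]]:
    """Split the cluster into fixed partitions and emit maintenance waves.

    Tasks are spread across partitions, so draining one partition touches at
    most a bounded share of each application; waves are further capped so no
    more than `max_concurrent` servers restart simultaneously.
    """
    # Fixed partitioning: server i always belongs to partition i % partitions,
    # so the schedule can be computed in advance and stays stable.
    waves = []
    for p in range(partitions):
        partition = [s for i, s in enumerate(servers) if i % partitions == p]
        # Cap each maintenance wave within the partition.
        it = iter(partition)
        while chunk := list(islice(it, max_concurrent)):
            waves.append(chunk)
    return waves

if __name__ == "__main__":
    servers = [f"mesos-agent-{i:04d}" for i in range(230)]
    for n, wave in enumerate(plan_maintenance(servers)):
        print(f"wave {n}: {len(wave)} servers")
```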

From the previous paragraph, you may wonder why limiting the number of servers entering maintenance to 50 is enough to avoid more than 50 instances starting simultaneously. That leads us to our third challenge.

Third challenge: Large instances and scheduling 🐘

All Criteo applications used to run on dedicated Windows servers, so there was no incentive for an application to use less than the physical resources of a server. Over the years, Arbitrage leveraged this to preload a large amount of data into caches to achieve extremely low latency. Technically, Arbitrage needs nearly all of the memory available on our “base” servers (48GB of memory and 32 hyperthreaded CPUs).

By contrast, our container platform favors applications with many small instances, because servers are shared between several dynamically scheduled tasks. It was obviously out of the question to rewrite the entire application to fit that model, so we had to deal with it.

The way Mesos sends resource offers to Marathon is non-deterministic (to say the least: it is actually explicitly random). In practice, it means most Mesos agents end up with a few tasks running (instead of keeping some agents free to schedule very large tasks). This effect is called fragmentation and raises the same challenges as fragmentation on filesystems: it makes the placement/scheduling of large tasks (or files) harder as the cluster (or disk) fills up.

On a Mesos cluster, it is normally the responsibility of the framework (like Marathon) to make sure servers are fully used before taking resources from another one. With other schedulers such as Kubernetes, a single process (the kube-scheduler) assigns resources and can try to decrease fragmentation (or actively do bin-packing). On our infrastructure, with multiple frameworks (and multiple Marathons), this would not work, as each framework only has a view of its own tasks.
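
To make the fragmentation effect concrete, here is a toy simulation (not Mesos code, numbers are arbitrary): when small tasks are placed on agents offered in random order, almost no agent keeps enough free memory for an Arbitrage-sized task, whereas a bin-packing order leaves plenty of empty agents.

```python
import random

AGENTS = 100
AGENT_MEM = 48          # GB per agent, arbitrary
SMALL_TASK = 4          # GB
LARGE_TASK = 44         # GB, a task needing almost a full agent

def place_small_tasks(n_tasks: int, randomized: bool) -> list[int]:
    free = [AGENT_MEM] * AGENTS
    for _ in range(n_tasks):
        order = list(range(AGENTS))
        if randomized:
            random.shuffle(order)   # Mesos-style random offer order
        # First agent with room wins: with a random order this spreads load,
        # with a fixed order it packs agents one by one.
        for agent in order:
            if free[agent] >= SMALL_TASK:
                free[agent] -= SMALL_TASK
                break
    return free

for randomized in (True, False):
    free = place_small_tasks(n_tasks=600, randomized=randomized)
    large_slots = sum(1 for f in free if f >= LARGE_TASK)
    kind = "random offers" if randomized else "bin-packing order"
    print(f"{kind}: {large_slots} agents can still host a large task")
```

On this half-full toy cluster, the random strategy typically leaves only one or two agents able to host a large task, against fifty for the packing order.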

Since we were not able to provide scheduling guarantees, we decided to create a subset of servers within our pool dedicated to running tasks that require most of a machine’s resources. This creates more toil, since we need to ensure this specific pool of machines is correctly sized and balanced across physical racks (in addition to checking those properties for the full cluster).
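
If the agents of that subset expose a custom Mesos attribute (the pool and rack attribute names below are hypothetical conventions, not something Mesos defines), pinning an app to the subset can be expressed with Marathon placement constraints in the app definition:

```python
# Fragment of a Marathon app definition (shown as a Python dict) pinning the
# app to agents that advertise a hypothetical "pool=large-tasks" attribute,
# and spreading instances across racks. Attribute names are assumptions.
app_definition_fragment = {
    "constraints": [
        ["pool", "LIKE", "large-tasks"],   # only agents of the dedicated subset
        ["rack", "GROUP_BY"],              # spread instances across racks
    ],
}
```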

This allowed the migration to move forward with strong scheduling guarantees for Arbitrage. The underlying issue is still not resolved today, but we are following interesting leads:

  • Improve the way Mesos sends offers (especially the order in which offers are sent) to influence fragmentation without requiring patches to all frameworks.
  • Have an out-of-band process that drains tasks from agents to keep fragmentation under control (a rough sketch follows).
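
A very rough sketch of what the second lead could look like, assuming a hypothetical view of each agent’s free memory; the draining itself would be delegated to the maintenance machinery described earlier:

```python
def pick_agents_to_drain(agents: dict[str, float],
                         large_task_mem: float = 44.0,
                         max_drains: int = 1) -> list[str]:
    """Choose lightly loaded agents to drain so they become fully empty.

    `agents` maps an agent id to its free memory (GB). Draining agents that
    are almost empty is cheap (few tasks to reschedule) and frees whole
    machines for large tasks. Thresholds are arbitrary.
    """
    # Agents that cannot host a large task yet are close to empty.
    candidates = [
        (free, agent) for agent, free in agents.items()
        if free < large_task_mem and free > 0.75 * large_task_mem
    ]
    # Drain the emptiest ones first: least work, most gain.
    candidates.sort(reverse=True)
    return [agent for _, agent in candidates[:max_drains]]
```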

Fourth challenge: Marathon leader elections 👑

For resiliency purposes, Marathon has several instances and elects a leader to interact with Mesos and the tasks on the cluster. Upon configuration changes, upgrades, or even crashes, the Marathon leader is demoted and another instance becomes leader. The migration of our Arbitrage application revealed an interesting bug when a leader election happened during a deployment of the application.

The Arbitrage application was configured to allow Marathon to replace ~50 instances at the same time during deployments. In practice, Marathon kills 50 instances, spawns replacements, and then waits for one new instance to become healthy before killing another one. However, the slow warmup of Arbitrage allowed us to observe that, right after a Marathon leader election, Marathon killed an extra batch of 50 instances. It triggered fun incidents, as you can imagine!
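
In Marathon, that batch size is typically derived from the app’s upgradeStrategy. The fragment below is an illustration with made-up numbers, not Arbitrage’s real configuration:

```python
# Hypothetical numbers: with ~1000 instances in a datacenter, a
# minimumHealthCapacity of 0.95 lets Marathon replace about 50 instances
# at a time during a rolling deployment.
app_definition_fragment = {
    "instances": 1000,
    "upgradeStrategy": {
        "minimumHealthCapacity": 0.95,  # keep at least 95% of instances healthy
        "maximumOverCapacity": 0.0,     # do not start extra instances first
    },
}
```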

Since we run Arbitrage in a Marathon instance dedicated to the team operating this application, Marathon itself is a task launched on Mesos (we have a root Marathon launching dedicated Marathon instances for each of our largest users). It means Marathon instances, like every other Mesos task, are drained frequently to allow servers to undergo maintenance. This triggers more leader elections, some of which happen during the (long) Arbitrage deployments.

To work around that issue, we introduced a mechanism that pauses maintenance on our cluster whenever Arbitrage is being deployed. It bought us time while we fixed the bug. A first patch put us in the right direction but introduced a regression for some configurations. We then decided to change the way this part of Marathon is designed (reacting to events such as “beginning of deployment” or “a new instance is now passing its health check”) and use a safer model (making decisions based on a full view of instance states at a regular interval). This patch is being tested/deployed and will likely (🙏) fix this for good :).
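
Very loosely, the safer model looks like the sketch below: at a regular interval, recompute from the full state how many old instances may still be killed, instead of reacting to individual events. Everything here (state fields, callbacks, batch size) is simplified pseudologic, not Marathon’s actual code.

```python
import time
from dataclasses import dataclass

@dataclass
class DeploymentState:
    target_replacements: int   # total old instances that must be replaced
    old_running: int           # old instances still running
    new_healthy: int           # replacement instances passing health checks
    max_in_flight: int = 50    # illustrative batch size

def kills_allowed(state: DeploymentState) -> int:
    """How many additional old instances may be killed right now.

    The decision is derived from the full state at each tick, so a leader
    election (which loses in-memory event history) cannot cause an extra
    batch of kills: the new leader recomputes the same answer.
    """
    already_killed = state.target_replacements - state.old_running
    in_flight = already_killed - state.new_healthy
    return max(0, min(state.max_in_flight - in_flight, state.old_running))

def reconciliation_loop(fetch_state, kill_old_instances, interval: float = 30.0):
    while True:
        state = fetch_state()          # full view from the scheduler
        n = kills_allowed(state)
        if n > 0:
            kill_old_instances(n)
        time.sleep(interval)
```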

A non-issue: Mesos master scalability

As we scaled up our clusters, we realized Mesos masters were answering some requests quite slowly. For instance, every agent was fetching the large “/state.json” view of the whole cluster to make sure it was properly registered with the correct UUID in the master. This was a way to detect a rare race condition where a Mesos agent re-registers so fast (with a different UUID) that the Mesos master does not notice, leading to all offers concerning this agent being invalid by default.

Anyway, all our agents were fetching this state endpoint in JSON every 30 seconds, and the request timed out extremely frequently in our largest datacenters. After a bit of digging, we understood the Mesos master was spending a lot of time serializing JSON for that endpoint. We could have worked on that part, but we took a step back and reflected on the fact that Mesos is known to have been used on clusters of 10k agents. It made us reconsider the pattern: all agents fetching a very large document simply to check whether their UUID was known to the master. That, of course, could not scale correctly.

We eventually created a small static document with the list of all agent UUIDs, kept up to date by our beloved consul-templaterb (which can consume any JSON source, despite the Consul in its name). With fewer requests hitting the Mesos leader directly, the response time for all other requests immediately flattened. Job done!
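
The agent-side check then boils down to something like this sketch. The URL of the static document and the location of the agent’s persisted ID are assumptions reflecting common conventions, not our exact setup:

```python
import json
import os
import sys
import urllib.request

# Hypothetical locations: a small JSON document listing registered agent IDs,
# regenerated by consul-templaterb, and the agent's own ID as persisted by
# Mesos in its work directory.
REGISTERED_AGENTS_URL = "http://static-files.example.internal/mesos/agent-ids.json"
LOCAL_AGENT_ID_LINK = "/var/lib/mesos/meta/slaves/latest"

def local_agent_id() -> str:
    # The "latest" symlink in the Mesos work dir points at the agent ID directory.
    return os.path.basename(os.readlink(LOCAL_AGENT_ID_LINK))

def registered_agent_ids() -> set[str]:
    with urllib.request.urlopen(REGISTERED_AGENTS_URL, timeout=5) as resp:
        return set(json.load(resp))

if __name__ == "__main__":
    if local_agent_id() not in registered_agent_ids():
        print("agent ID unknown to the master", file=sys.stderr)
        sys.exit(1)   # alert or restart the agent, depending on the policy
```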

With this story, we actually learned that most HTTP request/response handling in Mesos masters is serialized. This means it’s not a good idea to make requests carelessly if they can be avoided (even though multiple improvements were made in some Mesos versions to decrease that effect).

Conclusion

Most, if not all, of the work we did here was needed because the Arbitrage application was developed as a monolith that grew to fill the available resources of a bare server. As such, its constraints were very different from what we used to expect from Marathon apps. So we took on the challenge of making our platform flexible enough to accommodate this kind of app, as rewriting the application was not pragmatic or doable in a reasonable time frame.

Carefully choosing compromises allowed us to significantly improve the platform itself, to the benefit of all applications. It also meant other .NET applications could be migrated to our platform (most of the time without our team even being aware of it), because significant difficulties had been resolved ahead of their migration.

Interested in constantly challenging the status quo with us? Check out our open positions!
