It all started in 2019
For more context on why we decided to migrate, please read Maxime’s blog post on the topic.
At the beginning of 2019, many teams had worked hard to make .NET Core a first-class citizen in our CI and CD. We were then asked to migrate one of the biggest apps we had: Arbitrage. After all, if this one got through, all the others could follow easily.
Arbitrage is the app that decides, for each ad display opportunity we have, which ad campaign is the best. To give you an idea of the scale, we’re talking about 5300 servers (full Windows machines, with 64GB of RAM and 16 hyperthreaded cores, giving 32 virtual cores). If this seems like a lot, consider that these CPUs evaluate 532 million campaigns every second to find the best fit for the 4 million users we also see every second. All that in less than 10ms on average. So yes, that’s a lot.
And the particular thing about Arbitrage is that it uses hundreds of prediction models, leading to a memory consumption of around 40 to 50 GB of RAM. So there’s really no way to run two instances on one of our servers.
So all that was left was to prove the system also worked at scale. What could go wrong, right?
As it turned out, several things could go wrong.
The first thing that went wrong was performance. Where a Windows server could muster ~1400 requests per second, on Mesos the app started timing out heavily at well below half that rate. Migrating as-is was not an option.
At that time, we had already migrated another small app to Mesos, and we had worked around the performance issue by setting up smaller instances instead of instances using all the resources of the physical server. With 10 CPUs and little traffic, the performance issue was not noticeable. As an added bonus, the smaller instances made our app easier to schedule.
But Arbitrage requires most of the RAM of our bare metal servers. This is the fixed cost of preloading many things in order to answer requests as fast as possible. So there is no way for Marathon/Mesos to schedule two “small” Arbitrage instances on a single server. Arbitrage takes the full machine. Period.
So we created “big instances” that would require 30 cores out of the 32 available on a bare metal server. The two remaining cores are dedicated to technical tasks (Mesos, Consul, etc.).
The performance team figured out a few reasons why the .NET Core runtime on Linux had trouble running our big app. Among them: CPU affinity for GC threads, a few APIs that were particularly slow, a few kernel optimizations, gen0 size, etc. Most of that required profiling and digging all the way down into the source code of the CLR. After a few months, they reported that Arbitrage was now running fine on Linux (read: with the same performance as on Windows).
More details can be found in Christophe’s blog post.
Alas, what we observed on Mesos wasn’t really on par with what we had on Windows, or with what they had on a bare metal Linux server. Same version of the kernel, of course. It all seemed related to the GC, but we did not understand how.
The aforementioned performance team gave us a hint: the cfs.bandwidth CPU isolation and the network isolation in place on our Mesos cluster at that time might be responsible. We asked for 10 machines that would be in the Mesos cluster, but on which these isolations would be disabled. We ran Arbitrage on them, along with other smaller apps, JVM and .NET Core alike. All apps showed the same behavior. With only one of the two isolations in place, things were marginally better. With neither, our apps were able to take the “normal” Windows-equivalent load, with the same response time.
The containers team decided to remove the network isolation on all of our machines that had 10Gbps network interfaces, because we had a decent margin. The CPU isolation was moved from cfs.bandwidth to the cpuset cgroup, which grants a number of cores dedicated to the process, instead of granting everything and throttling the process on a regular basis.
As it turned out, the garbage collector (in .NET or the JVM) is already something that will freeze your app from time to time. Adding something else that also freezes your process from time to time is not a good idea. Adding a third one is a very bad idea. Once all three of those things kick in at the same time, you can easily get pauses of over a second.
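To picture why bandwidth throttling hurts a bursty, heavily threaded process, here is a minimal sketch of a simplified model. It is not a measurement of our cluster: the function name and the thread counts are hypothetical, and the only external fact used is the default 100 ms CFS period on Linux. The model assumes every runnable thread burns CPU on its own core until the quota for the period is spent, after which the whole cgroup waits for the next period.

```python
CFS_PERIOD_MS = 100.0  # default CFS bandwidth period on Linux (cpu.cfs_period_us = 100000)


def throttle_stall_ms(runnable_threads, quota_cpus, period_ms=CFS_PERIOD_MS):
    """Wall-clock time per period during which the whole cgroup is frozen.

    Simplified model (an assumption, not kernel code): each runnable thread
    runs on its own core until the cgroup's quota for the period is used up,
    then every thread waits for the next period to begin.
    """
    quota_ms = quota_cpus * period_ms            # CPU time granted per period
    burn_time_ms = quota_ms / runnable_threads   # wall time needed to spend it
    return max(0.0, period_ms - burn_time_ms)


# Hypothetical burst: 32 busy threads inside a 10-CPU quota.
print(throttle_stall_ms(runnable_threads=32, quota_cpus=10))  # 68.75
# Same quota, but only 10 busy threads: the quota lasts the whole period.
print(throttle_stall_ms(runnable_threads=10, quota_cpus=10))  # 0.0
```

In this model, a burst of 32 threads under a 10-CPU quota leaves the process frozen for most of each period, on top of whatever pause the GC itself introduces. With a cpuset, the process is confined to its dedicated cores and is not frozen by a quota in this way.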
Now that the performance was up to snuff, we decided to run the test at scale in one of our datacenters. We would need 400 instances of Arbitrage to be able to assess that everything worked at scale. Alas, that’s when scheduling issues arose.
The way it works is that Mesos sends offers to schedulers, but schedulers never send their needs to Mesos. So with enough tiny to small apps to schedule in a cluster, your 400 servers all end up with at least one or two tiny apps on them, reserving at least one CPU. And you cannot schedule Arbitrage anymore, because it needs an entire machine.
A fix on the schedulers’ side would require “fixing” all of them so that they try to get as small an offer as possible whenever they have something to schedule. That would also be a constraint on any other scheduler we would like to add in the future.
We haven’t fixed it yet. What we did was work around it. So for now, there are a few hundred servers in each datacenter that are dedicated to Arbitrage. They are dedicated to our team’s Marathon, and flagged in a way that we can explicitly target or avoid them. This obviously has a number of drawbacks, such as having two pools of machines to handle instead of one. But it did allow us to move forward in the meantime.
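The fragmentation effect can be illustrated with a small simulation. This is only a sketch under stated assumptions: the number of tiny apps (1200), the one-CPU size of each, and the two placement policies are made up for illustration, and "whichever offer arrives first" is modeled as a random choice. Nothing here is Mesos or Marathon code.

```python
import random

USABLE_CORES = 30  # cores offered per server (32 minus the 2 kept for Mesos, Consul, ...)
rng = random.Random(42)  # fixed seed so the sketch is repeatable


def first_offer(candidates, free):
    """A scheduler that accepts whichever offer shows up (modeled as random)."""
    return rng.choice(candidates)


def smallest_offer(candidates, free):
    """A scheduler that always takes the offer with the fewest free cores."""
    return min(candidates, key=lambda i: free[i])


def free_servers_after(tiny_apps, servers, pick):
    """Place `tiny_apps` one-CPU apps, then count servers still entirely free."""
    free = [USABLE_CORES] * servers
    for _ in range(tiny_apps):
        candidates = [i for i, cores in enumerate(free) if cores >= 1]
        free[pick(candidates, free)] -= 1
    # Only an untouched server can host a full-machine app like Arbitrage.
    return sum(1 for cores in free if cores == USABLE_CORES)


print(free_servers_after(1200, 400, first_offer))     # few servers left untouched
print(free_servers_after(1200, 400, smallest_offer))  # 360
```

With the packing policy, the 1200 tiny apps fill 40 servers and leave 360 untouched. With the "take what comes" policy, the same apps are sprinkled across nearly the whole fleet, which is the situation described above.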
You can read more details in Gregoire’s blog post.
Since Arbitrage takes 10 minutes to start, and consumes enough resources that you cannot start more than 100 instances at once (see below), we once faced a “fun” side effect related to the election of a new Marathon leader.
One day, there was a maintenance in progress, so our bare metal servers rebooted one by one after being updated. Our Marathon leader got killed in the process and a spare one was elected leader. The new leader had not done any form of real work for a while, and the first thing the JVM (because Marathon is written in Java/Scala) decided to do was a full GC, to get off to a fresh start. However, most of the HTTP calls to check the health of the few hundred instances handled by Marathon had been sent before the collection started. I’m sure you see it coming now: the response time plus the GC time was longer than the Marathon threshold, so it decided that ALL instances were unhealthy, and chose to kill them all.
All of a sudden all instances of all of our apps were killed.
For most apps, which start in a handful of seconds, the incident was over really quickly. For Arbitrage it was another matter. It took a couple of hours but we got it back up, with a few gallons of elbow grease.
Another incident led to half of our instances being restarted. Unfortunately, the remaining instances were (obviously) unable to take the remaining load, so they stopped answering their health checks. What did Marathon do? Kill them, obviously, since they seemed to be in bad shape. Indeed, they were, but killing them made the problem worse.
The containers team was then tasked with finding a way to make sure no such thing would ever happen again. We called it the “snowball patch”. Whether it sees everything as unhealthy or not, Marathon now follows the app configuration, killing no more instances than is acceptable. This way, if everything becomes unhealthy, the system can recover by starting instances in batches, and if it was a false positive, it only respawns a handful of instances, leading to no downtime for the whole system.
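The idea behind the patch can be sketched as a simple kill budget. This is not the actual Marathon patch: the function name, the `max_down_percent` parameter and the 10% value are hypothetical stand-ins for whatever the app configuration allows.

```python
def instances_to_kill(unhealthy, total_instances, already_down, max_down_percent):
    """Return the subset of unhealthy instances that may be killed right now.

    The budget is the number of instances the app can tolerate being down
    (max_down_percent of the total), minus those that are already down.
    """
    allowed_down = total_instances * max_down_percent // 100
    budget = max(0, allowed_down - already_down)
    return unhealthy[:budget]


# Hypothetical false positive: all 400 instances look unhealthy at once.
all_unhealthy = list(range(400))
print(len(instances_to_kill(all_unhealthy, 400, already_down=0, max_down_percent=10)))   # 40
# Once 40 instances are down, nothing more is killed until some come back.
print(len(instances_to_kill(all_unhealthy, 400, already_down=40, max_down_percent=10)))  # 0
```

Capping the kills this way means a false positive costs a bounded number of restarts, and a real outage is handled in batches instead of snowballing.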
You can read more details in Gregoire’s blog post.
Do not start everything at once
The last constraint our application has is that, at startup, it mindlessly downloads many kinds of data from different places (Memcaches, remote file storage, SQLs, …). The goal is obviously to be up and ready with all caches in memory to answer requests as fast as possible. So if we start too many instances at once, we risk DDoSing our own data and file infrastructure. Hence, we need to throttle the startup to around 50 instances at a time. Unfortunately, Marathon doesn’t understand this kind of constraint.
For this one, we built a few safeguards here and there, but the issue is still there. However, we decided it was not the scheduler’s job to make sure the application would not DDoS its own infrastructure.
It’s hard to blame anyone other than the app for this. So we’re currently building a centralized system that lets our various applications (and subsystems) get a sense of a resource’s utilization before hammering it. The problem shows up during startup, but the same check applies at any other time in the application lifecycle.
To be honest, we’ve had a few incidents in recent years where thousands of instances would mindlessly hammer a single system. Not all of them happened during application startup. It is time we addressed this one separately.
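As a rough illustration of what this constraint implies, here is a minimal sketch. The function names are hypothetical, and it assumes each batch waits for the previous one to finish its 10-minute startup, which is a simplification.

```python
def startup_batches(instance_ids, max_concurrent=50):
    """Yield successive groups of at most `max_concurrent` instances to start together."""
    for start in range(0, len(instance_ids), max_concurrent):
        yield instance_ids[start:start + max_concurrent]


def minimum_restart_minutes(instance_count, max_concurrent=50, startup_minutes=10):
    """Lower bound on a full restart if each batch waits for the previous one."""
    batches = -(-instance_count // max_concurrent)  # ceiling division
    return batches * startup_minutes


batches = list(startup_batches(list(range(400))))
print(len(batches))                  # 8
print(minimum_restart_minutes(400))  # 80
```

Under these assumptions, bringing 400 instances back from scratch takes at least 80 minutes, which is why losing every instance at once is so much more painful for Arbitrage than for apps that start in seconds.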
We’re ready!
At some point, we were ready to fully migrate. Performance was good, scheduling was good. In fact, one datacenter was already running 95% of Arbitrage on Mesos. That was around Black Friday last year. Let’s ask for servers and pop the champagne bottle!
In many datacenters, we had enough spare servers to add them to Mesos, migrate the traffic and give back the now unused Windows machines. But not in all of them. A few meetings followed with the qapla team, the team responsible for capacity planning and, by extension, for giving out servers.
In our biggest datacenter, we needed 1700 servers, but only 100 were available in the flavor needed. (A flavor is a type of server. Some have lots of RAM, some have lots of HDD, some lots of SSD, …) It was obviously unthinkable to put those 100 servers in Mesos, migrate 5% of the traffic to them, grab 100 more Windows servers, put them in Mesos, and do it all over again 17 times. Each of those iterations takes roughly a week.
The solution was to get every single server of every flavor that was either available or already assigned to a project but unused. One of the strengths of Mesos is its ability to handle a motley fleet of servers. We could muster around 500 of them. We stuffed them all into Mesos and started shifting traffic. We then moved Windows machines to Mesos and started again. 4 iterations later, we were done.
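The arithmetic behind those iteration counts is a simple ceiling division. The sketch below uses only the figures quoted above (1700 servers needed, pools of 100 or about 500, roughly one week per iteration); the function name is hypothetical, and it assumes every iteration migrates exactly one pool's worth of capacity.

```python
def iterations_needed(servers_needed, pool_size):
    """Number of migrate-and-recycle rounds, assuming each round moves one pool's worth."""
    return -(-servers_needed // pool_size)  # ceiling division


WEEKS_PER_ITERATION = 1  # "roughly a week" per iteration

for pool in (100, 500):
    rounds = iterations_needed(1700, pool)
    print(pool, rounds, rounds * WEEKS_PER_ITERATION)
# 100 17 17
# 500 4 4
```

Going from a pool of 100 to a pool of about 500 cuts the plan from 17 rounds to 4.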
Retrospectively, this was a lot of work.
One way we could have made it simpler would have been to do this in three steps: migrate to .NET Core, migrate to dedicated Linux servers, and then finally migrate to Mesos. Doing the three migrations in one made it more difficult to figure out where the problems came from.
Another complicating factor was operating two Arbitrage pools for more than a year. Double monitoring, alerting, versioning, releasing… Basically double trouble.
But thanks to the amazing teams we worked with, containers, performance, qapla, we went through with it and the migration was a success.
Now we move on to the rest of our C# applications. The journey has barely begun.
Read more on this topic in our mini series!
Moving .NET to Linux at Scale
The story of a multi-year migration: How we changed Criteo’s whole foundation.
The .NET Core Journey at Criteo
This post shows the challenges we faced during the migration to .NET Core.
Migrating Arbitrage to Apache Mesos
Lessons learned from migrating our largest application to our container platform.