From 0 to 10’000 Jenkins builds a week — part 3: keeping the workload under control.

Stéphane Goetz
Swissquote Tech Blog
11 min read · Oct 6, 2023

In 2023, Swissquote runs 50 fully automated instances of Jenkins in Kubernetes, one per team. Each code push from a developer results in a build in the team’s instance. If the project doesn’t exist yet, it gets created automatically. The cluster performs approximately 10,000 builds per week.

As a kid, I loved to build LEGO Technic sets. The part I loved most wasn’t playing with them once they were done; it was improving on what I had built, again and again. Should I make this part stronger? Could this have bigger wheels? Can it go faster?

Unsurprisingly, I carried this urge to improve things into my professional life, which leaves me constantly wondering: can the build farm be improved somehow?

2019: Measuring and controlling workloads

Is the build farm healthy right now?

That isn’t easy to know, and we tried to get an answer through many different sources. To be fair, before 2023, we couldn’t answer this question with confidence.

One approach was to look at the ratio of failing builds, which we started to understand more precisely in January 2019 when we rewrote our build statistics collection script:

In February 2019, the build farm ran around 200 jobs per day

This approach wasn’t helpful, as the stats were collected once per hour, and the information was no longer relevant by the time we could consult it (a branch could have been created, built, merged, and deleted within that hour; that data point would never have been collected).

We also started collecting the number of failures per build machine, which didn’t help surface any relevant indicator.

One thing we were able to know for sure: if the 15-minute load average on a build node is higher than its number of logical CPU cores, builds are far more likely to fail.
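
That rule of thumb is easy to express in code. A minimal sketch in Python (not our actual monitoring code):

    # Minimal sketch of the rule of thumb above; not our actual monitoring code.
    import os

    _, _, load_15m = os.getloadavg()   # 1-, 5- and 15-minute load averages
    cores = os.cpu_count()

    if load_15m > cores:
        # Above this threshold, we observed builds failing far more often.
        print(f"Node overloaded: 15-minute load {load_15m:.1f} > {cores} logical cores")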

Hungry build: Build Farm Sentinel to the rescue

With more and more users of the Build Farm, we needed something to keep bad actors in line so that most engineers could benefit from the added value of a shared environment.

Our brilliant product naming skills helped a lot; we created the Build Farm Sentinel. It regularly checks for builds running on the cluster and kills any build running for over 90 minutes.
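
In essence, it is a scheduled task doing something like the following against each instance. The REST endpoints (api/json, <build>/stop) are standard Jenkins, but the URL, the credentials, and the flat job listing are illustrative simplifications, not the Sentinel’s actual code:

    # Hedged sketch of a "kill builds running for over 90 minutes" pass.
    import time

    import requests

    JENKINS = "https://jenkins-team-a.example.com"   # illustrative instance URL
    AUTH = ("sentinel", "api-token")                 # illustrative credentials
    MAX_DURATION_MS = 90 * 60 * 1000

    jobs = requests.get(
        f"{JENKINS}/api/json?tree=jobs[name,lastBuild[building,timestamp,url]]",
        auth=AUTH,
    ).json()["jobs"]

    now_ms = time.time() * 1000
    for job in jobs:
        build = job.get("lastBuild")
        if build and build["building"] and now_ms - build["timestamp"] > MAX_DURATION_MS:
            # Aborting a running build is a POST to <build url>/stop.
            requests.post(f"{build['url']}stop", auth=AUTH)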

2020: Sharing a limited set of resources among an ever-growing number of builds

In 2017, our team created and open-sourced Carnotzet, our Sandbox solution.
My colleague Manuel Ryan wrote an article about it:

This tool has one prominent feature, which came to be our worst nightmare: each application can define which other applications it depends on, and running mvn sandbox3:start resolves the dependencies recursively, creates a docker-compose.yml file, and starts all the applications.

Instead of creating mocks, we considered the actual applications more reliable for integration tests. This proved true for a long time, as we didn’t have to maintain mocks for most projects. In exchange, 32GB of RAM became a bottleneck for some teams whose sandboxes contained more than 60 Docker containers, with each project starting a JVM with Tomcat, plus an Oracle database with crawled data.

A Carnotzet is the Swiss-French name for a place, usually in the cellar, where you store wine bottles and drink a few glasses with friends. Image: myvaud.ch

What do a sandbox and a wine cellar have to do with Jenkins?

Now imagine that these giant sandboxes must be started to run your integration tests: you get a thundering herd of Docker containers starting at once and large spikes in memory usage.

To make things worse, these containers are started using the Docker socket, not through Kubernetes, which makes them very dangerous for a few reasons.

  • Kubernetes is unaware of the extra load on a machine because it monitors only the containers it started.
  • The extra load is added to the machine where the build is running and can’t be distributed to other nodes.
  • Sometimes, the build stops incorrectly, and the sandbox isn’t stopped, meaning the resources are still in use.
  • More recent Kubernetes installs don’t use Docker as a container runtime, making an update more difficult.

Many containers can be present on a Kubernetes node and add “phantom” workloads that Kubernetes is not aware of

We have an optional runtime for our sandbox that can start Kubernetes pods instead of Docker Compose. But it comes with some tradeoffs and a need to rewire many parts of the configuration, which meant it was adopted only by new projects.

Taming Sandboxes

We explored a number of solutions, both to prevent and to remediate issues with sandboxes.

Remediation measure 1: Sentinel v2

Our sentinel was doing a steady job of cleaning up builds once they overstayed their welcome on the build farm.

We created a new Sentinel that runs on every cluster node, lists all the running sandboxes, and can decide to evict them depending on some criteria.

  • If a sandbox runs for more than 90 minutes: the same rule as for builds. If the build is killed, the sandbox has no reason to stay.
  • If a sandbox’s containers together use more than 20GB of memory.
  • If the total amount of RAM used by sandboxes on a node gets dangerously high and threatens its stability, the biggest sandboxes are killed.

In 2022, we also added new rules to make the sentinel more efficient (a sketch combining these eviction checks follows the list):

  • If no builds are running on the node, all sandboxes can be flushed
  • Using a newly added annotation on the containers created for a sandbox, we record the name of the build that started them. The sandbox can be removed if no build by that name is running.
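
Here is a condensed sketch of the per-sandbox part of such an eviction pass, using the Docker SDK for Python. The label names, thresholds, and the is_build_running() lookup are illustrative placeholders, not the real Sentinel, and the node-wide rules (overall memory pressure, no builds left on the node) are omitted for brevity:

    # Hedged sketch of a per-node, per-sandbox eviction pass; label names, thresholds
    # and the is_build_running() lookup are placeholders, not the real Sentinel.
    from collections import defaultdict
    from datetime import datetime, timezone

    import docker  # Docker SDK for Python

    MAX_AGE_SECONDS = 90 * 60          # same 90-minute limit as for builds
    MAX_SANDBOX_BYTES = 20 * 1024**3   # 20GB of memory per sandbox

    client = docker.from_env()

    def is_build_running(build_name: str) -> bool:
        """Placeholder: would ask Jenkins whether that build is still running."""
        return False

    def age_seconds(container) -> float:
        created = datetime.fromisoformat(container.attrs["Created"][:19]).replace(tzinfo=timezone.utc)
        return (datetime.now(timezone.utc) - created).total_seconds()

    # Group the node's containers by sandbox, based on labels set when the sandbox starts.
    sandboxes = defaultdict(list)
    for container in client.containers.list():
        sandbox_name = container.labels.get("sandbox.name")
        if sandbox_name:
            sandboxes[sandbox_name].append(container)

    for sandbox_name, containers in sandboxes.items():
        memory = sum(c.stats(stream=False)["memory_stats"].get("usage", 0) for c in containers)
        build = containers[0].labels.get("sandbox.build")  # the build that started it
        evict = (
            any(age_seconds(c) > MAX_AGE_SECONDS for c in containers)
            or memory > MAX_SANDBOX_BYTES
            or (build is not None and not is_build_running(build))
        )
        if evict:
            for c in containers:
                c.remove(force=True)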

Apart from the cleanup, the sentinel would also send messages on a Rocket.Chat channel with the name of the sandbox, all of its containers, and their memory footprint.

We first ran the new sentinel only to capture data, so as not to disrupt our colleagues’ workflows. We added the most common occurrences to a list of allowed sandboxes, with a deadline for each entry after which it would no longer have any effect. In its first week, after fine-tuning, we evicted 61 sandboxes, but as you can see in the screenshot below, many sandboxes that were eating a lot of memory were left alone.

The situation is much better a few months later: Fewer evictions are needed.

The sentinel is a lifesaver for the build farm.

Preventive measure 1: Smaller sandboxes

As logical as it may sound, this isn’t easy to achieve. Carnotzet was designed to use Maven dependencies; adding a dependency to a sandbox means adding five lines of XML to a file, and the result is one or more containers spawned automatically when the sandbox starts.

On the other hand, reducing the size of a sandbox requires creating mocks or other simulation servers, and each team wants to approach this differently. Our team and the team in charge of the CI pipeline have tried to reduce the size of sandboxes, with limited success. Some teams have migrated to mocks entirely and would never want to return to giant sandboxes; others can’t afford to invest in this for now.

Since mocks usually also allow shorter build times, we’ll explore this again.

Preventive measure 2: Docker in Docker

As I’ve explained above, sandboxes started by builds aren’t visible to Kubernetes, and letting a container start sibling containers through the host’s Docker socket (called Docker out of Docker, or DooD for short) can be very harmful from a security point of view and is regarded as a bad practice.

Another approach exists and is called Docker in Docker (DinD).

The trick is to start a Docker daemon within your Docker container. This means that containers started during a build run inside the build container, which brings the following advantages (illustrated in the snippet after the list):

  • Kubernetes can track the workload of sandboxes
  • We can accurately know, for each build, what it starts and how many resources it needs
  • Once the build container is stopped, all sandboxes and other containers it may have started are destroyed and removed.
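
To make the distinction concrete: the only thing that changes between DooD and DinD is which daemon the build talks to, and therefore where its containers live. A purely illustrative snippet with the Docker SDK for Python, using the conventional addresses rather than a description of our setup:

    # Illustrative only: DooD vs DinD is a question of which daemon the build talks to.
    import docker

    # DooD: the build mounts the node's Docker socket; containers land on the node,
    # next to every other build's containers, and Kubernetes never accounts for them.
    node_daemon = docker.DockerClient(base_url="unix:///var/run/docker.sock")

    # DinD: a daemon runs inside the build pod (for example a docker:dind sidecar);
    # every container it starts lives inside the pod and is destroyed with it.
    nested_daemon = docker.DockerClient(base_url="tcp://localhost:2375")

    for label, client in (("DooD", node_daemon), ("DinD", nested_daemon)):
        print(label, [c.name for c in client.containers.list()])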

We were eager to try this solution as it ticked all the right boxes. Our fully automated configuration and update system allowed us to iterate quickly with a POC on one instance and even to deploy a first version to all instances for a few days. But we had to roll back to Docker out of Docker shortly after.

There was one big issue for which we had no good solution: Docker images

With a single daemon per machine, an image only needs to be downloaded the first time a container uses it, and many sandbox dependencies are shared across sandboxes. Usually, starting a sandbox therefore only downloads a few images.

A freshly started Docker daemon, however, has no image cache at all. The result was so bad for some builds that it took 40 minutes just to download the images and start the whole sandbox, which doesn’t even count the actual build time.

We had to shelve the idea for now but still thought it could be a way forward.

2022

In 2022, Renovate gained a lot of traction at Swissquote, which meant a lot of pull requests created in many repositories, which in turn meant a lot of builds running on Jenkins.

Renovate is a tool that helps keep your dependencies up to date on a defined schedule. It searches for software dependencies in your repositories and, if updates are found, creates a Pull Request on GitHub.

I wrote an article last summer about Swissquote’s journey in adopting Renovate:

We could sustain the increase in the number of builds with minor adjustments.

That was until…

20’000 builds in a week

In the last week of October 2022, an issue on our releases server (no disk space left) and a minor bug in our Renovate installation sent us on a bumpy ride: more than 5’800 builds on the Monday alone.

22’390 builds … that’s the most loaded week we have on record

We were able to fix both the disk space issue and the Renovate bug within a week to get back to our usual load.

We’ve since refined our statistics collection to determine whether Renovate was the cause for a build. Here is a typical week in February 2023:

8’561 builds in a week in February 2023

Note that we discovered that our statistics collection miscounted some builds: when a build was “in progress” on a first collection and finished with a “success” or “failed” state on a second collection, it was counted as two separate builds. Fixing this issue reduced the total build count by about 30%.
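
The fix boils down to keying each build on its job and build number and keeping only the latest observed state. A minimal sketch, with a record format invented for illustration rather than our actual collection schema:

    # Minimal sketch of the dedup fix; the record format is invented for illustration.
    records = [
        {"job": "team-a/service-x", "number": 42, "state": "in_progress"},
        {"job": "team-a/service-x", "number": 42, "state": "success"},  # same build, later snapshot
        {"job": "team-b/service-y", "number": 7, "state": "failed"},
    ]

    # Key each build by (job, number) so an "in progress" snapshot and its final
    # state count as one build, keeping the last state seen.
    latest = {}
    for record in records:
        latest[(record["job"], record["number"])] = record

    print(f"{len(records)} raw records -> {len(latest)} distinct builds")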

This week also helped us refine new rules to add to the Sentinel v2: Sandboxes that are not linked to any builds are evicted. This helped reduce the load on busier days.

Preventive measure 3: Duplicate build cleanup

Sometimes, a project changes owner. Due to the automated nature of our build job creation, the project gets added to the new owner’s Jenkins controller but won’t get removed from the previous owner’s instance. This means that it will build twice on each change.

We’ve known how to identify these duplicate builds for a long time but left it to the teams in charge of each instance to remove those jobs … with minimal success.

Since we already know which repositories are duplicated, we scheduled a task that removes a job from an instance when the repository is also scheduled on another instance that is its actual owner.

At the same time, we started automatically cleaning up jobs attached to archived repositories.
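
The removal itself is just a call to the Jenkins REST API. A hedged sketch, where the instance URL, the credentials, and the ownership lookup passed in are illustrative placeholders, not our setup:

    # Hedged sketch: delete a job from this instance when another instance owns the repository.
    import requests

    JENKINS = "https://jenkins-team-a.example.com"  # illustrative instance URL
    AUTH = ("cleanup-bot", "api-token")             # illustrative credentials

    def delete_duplicate_job(job_name: str, owner_instance: str) -> None:
        if owner_instance == JENKINS:
            return  # this instance is the rightful owner; keep the job
        # Jenkins deletes a job on a POST to <job url>/doDelete.
        response = requests.post(f"{JENKINS}/job/{job_name}/doDelete", auth=AUTH)
        response.raise_for_status()

    delete_duplicate_job("old-team-service", "https://jenkins-team-b.example.com")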

Preventive measure 4: Disable auto-merge building

Jenkins’ github-branch-source-plugin (the plugin in charge of discovering buildable branches and Pull Requests from a GitHub repository) has a feature to automatically build a merge’s “potential result” for Pull Requests.

Whenever a push is made to the “target branch” of a pull request, all related pull requests are rebuilt.

Some projects can have many open Pull Requests due to the chosen branching model, or many Renovate Pull Requests left open.

We decided to change this behavior and build the tip of the pull request’s branch directly, to avoid rebuilding branches too many times.

When investigating this, we discovered that the github-branch-source-plugin had recently changed its default behavior to no longer create this potential merge. This encouraged us to change the default for new projects, then go the extra mile and patch every existing project to apply the “HEAD” strategy instead of “MERGE”.
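
For existing projects, the change can be applied through the Jenkins config.xml API. A hedged sketch: the endpoints are standard Jenkins, but the strategyId mapping (1 = MERGE, 2 = HEAD in the plugin’s pull request discovery traits) is our reading of the plugin’s config format and should be verified against one of your own config.xml files; the URL and credentials are placeholders:

    # Hedged sketch of patching a multibranch job from the MERGE to the HEAD strategy.
    # Verify the strategyId values against your own config.xml before running anything like this.
    import requests

    JENKINS = "https://jenkins-team-a.example.com"  # illustrative instance URL
    AUTH = ("cleanup-bot", "api-token")             # illustrative credentials

    def switch_pr_strategy_to_head(job_name: str) -> None:
        url = f"{JENKINS}/job/{job_name}/config.xml"
        config = requests.get(url, auth=AUTH).text
        # Blunt replace: affects both origin and fork pull request discovery traits.
        patched = config.replace(
            "<strategyId>1</strategyId>",  # assumed: 1 = build the potential merge result
            "<strategyId>2</strategyId>",  # assumed: 2 = build the tip of the PR branch
        )
        if patched != config:
            requests.post(
                url, data=patched, auth=AUTH,
                headers={"Content-Type": "application/xml"},
            ).raise_for_status()

    switch_pr_strategy_to_head("some-repository")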

Fun fact: from that day on, I should have renamed the article “From 0 to 5’000 builds a week”. That certainly doesn’t sound as thrilling as 10’000, but in any case, we know we can sustain much more load than we currently use. Like Let’s Encrypt’s infrastructure, which is ready to re-issue all certificates within 24 hours if needed, we are prepared for massive spikes in the number of builds and releases.

With 4’870 builds in a single week, you can see that the load was significantly reduced.

Conclusion

As we’ve discovered, there is not much we can do once a build is scheduled, but we can do quite a few things to reduce the number of builds.

  • Remediation measure 1 — Sentinel v2: Helped clean up leftovers but didn’t stop machines from dying.
  • Preventive measure 1 — Smaller sandboxes: Tremendously helped the teams that adopted them but had a limited impact overall.
  • Preventive measure 2 — Docker in Docker: This experiment failed, but it helped us understand where we could go with the idea. Part 4 of this series will show you what we did to solve that issue.
  • Preventive measure 3 — Duplicate build cleanup: Helped remove a few builds, applicable when the organization changes, such as team splits.
  • Preventive measure 4 — Disable auto-merge building: Reduced the number of builds by ~50%.

All these measures brought significant improvements to the stability of our cluster, but the big issue remains: giant sandboxes.

Since teams have autonomy in their build and test setup, any improvement we bring to Carnotzet will take time to materialize as teams will need to update their projects first.

We can imagine more improvements over the current setup, such as Carnotzet automatically starting a timer and shutting itself down after a delay (similar to Testcontainers’ Ryuk). But the giant sandboxes are still present and will affect other builds as long as they’re unknown to Kubernetes.

Stay tuned for part 4

In Part 3, we ensured the cluster could survive the increasing load put on by a growing number of builds. Part 4 will cover how we’ve changed strategy by migrating to a new cluster and fixing the giant sandbox issue at the source.

Update 15/11: Part 4 is published:
