From 0 to 10’000 Jenkins builds a week — part 2: Automating a fleet of applications

Stéphane Goetz
Swissquote Tech Blog
10 min read · Jun 26, 2023

In 2023, Swissquote runs 50 fully automated instances of Jenkins in Kubernetes, one per team. Each code push from a developer results in a build in the team’s instance. If the project doesn’t exist yet, it gets created automatically. The cluster performs approximately 10,000 builds per week.

As we’ve seen in part 1, we’ve successfully created a build farm for all teams at Swissquote.

When I joined in 2011, we were around 350 employees; today we are more than 1’000. The number of employees is only the tip of the iceberg: the number of products, code repositories, lines of code, and teams all exploded.

This second part will walk you through how we planned for this constant growth, maintained and upgraded our hardware and software stack, and made it more manageable over time.

2017: From Vagrant to LXD

When starting our cluster, we had only four machines; installing them wasn’t too complicated.

But we kept adding machines (23 in the cluster at the time of writing). While bootstrapping a new machine is straightforward, keeping the configuration in sync on all of them became more complex.

That’s when a colleague from IT pointed us toward LXD.

LXD is a system container and virtual machine manager. It offers a unified user experience around entire Linux systems running inside containers or virtual machines. In the same way that Docker is a tool to put applications in containers, LXD is a tool to put an OS in a container.

Moving from bare-metal to LXD came with challenges, like mounting disks through multiple levels of Linux cgroups. We improved our bash scripts to run within a clean LXD container and migrated each node individually. The move was fully transparent to our users.

We now had fungible nodes: if a Kubernetes node behaved incorrectly, we could investigate, fix the configuration in our script and quickly apply it to the entire cluster. It also helps to know that a container restart will always bring us back to the same clean state.

Productivity and the Common pipeline library

A new team was created to help with the increased number of engineers, projects, and code repositories: Productivity.

Productivity’s roles include providing tooling to automate and industrialize the delivery of features up to production, and promoting DevOps principles inside the department.

One of the first initiatives of this new team was to take all custom pipeline scripts teams came up with and create a single, configurable build pipeline.

This pipeline is opt-in but suits most use cases: 90% of projects use it, and it has also evolved to support iOS and Android app builds.

For most projects, adding jenkinsPipeline() inside your Jenkinsfile is enough to get one-click releases, broken-build warnings by chat message, and SonarQube integration, all with a single line of code.
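The options the pipeline accepts are internal to Swissquote; as an illustration, a complete Jenkinsfile using it could look like this (the parameters in the comment are hypothetical):

    // A complete Jenkinsfile: the shared pipeline library does the rest.
    jenkinsPipeline()

    // Teams that need to deviate from the defaults can pass options,
    // for example (parameter names are hypothetical):
    // jenkinsPipeline(chatChannel: 'swe-dtg', skipSonar: true)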

2019: GitHub Enterprise

Mercurial, the version control system we used until 2019, left a lot to be desired at our scale. We weren’t using any code hosting platform and had no tooling for Pull Requests. Repositories were created over SSH, and permissions were managed through Apache configuration.

To improve on this situation, we migrated to GitHub Enterprise in 2019. My colleague David wrote an excellent article on the topic a while ago.

The challenges of managing a fleet of applications

By the time we reached 30 teams, each with its own Jenkins that it could customize in any way it wanted, such as adding plugins or experimenting with configuration changes, an update could work fine for most instances but break in subtle ways for a few. Or it could break in 30 different ways.

Updating Jenkins was just not a battle we could win

We needed a new approach to manage all these instances and get ready to add more.

Immutable Jenkins: trading some configuration possibilities for stability

Our goal was to create a Docker image for Jenkins that would provide solutions for the following problems:

  1. Contain the minimal set of configurations to start a new instance without copying files around.
  2. Update seamlessly, as sometimes updates would break some instances for half a day before we could figure out what was wrong.
  3. Still allow as many things as possible to be configurable, but not allow changes that would break the configuration or override company policies (like permissions or maximum build size).

This process took months and was our biggest challenge since the creation of the build farm itself. I will walk you through the five main steps we took to stabilize the situation:

  1. Use the same plugins in all Jenkins instances
  2. Align the Kubernetes configuration through a Helm chart
  3. Configuring Jenkins plugins automatically
  4. Building Jenkins on Jenkins
  5. Deploy changes automatically and regularly

Step 1: Use the same plugins in all Jenkins instances

When starting with Jenkins, teams needed the flexibility to download the plugins they needed to experiment with and improve their processes. But at this stage, providing this flexibility became a significant maintenance burden.

Jenkins offers a scripting console that can be used in the UI or through a REST API. It works like a poor man’s GraphQL: POST a script, process some data, and print the output as JSON. This allows us to make any query in a single request, no matter the complexity.

We extracted a list of Jenkins versions, installed plugins, and configuration values using this technique. This helped us create a dashboard to track who used which plugin at which version.
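The actual script we use collects more data, but a sketch in the same spirit, run through the script console UI or POSTed to the /scriptText endpoint, could look like this:

    // Prints the Jenkins version and the installed plugins as JSON.
    import groovy.json.JsonOutput
    import jenkins.model.Jenkins

    def result = [
      version: Jenkins.VERSION,
      plugins: Jenkins.get().pluginManager.plugins.collect {
        [id: it.shortName, version: it.version, enabled: it.isEnabled()]
      }
    ]
    println JsonOutput.toJson(result)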

Once the dust had settled, we compiled the list of all required plugins and added them to Jenkins’ standard plugins.txt file to pre-install plugins at build time. We then ensured that when the instance started, it would remove the old plugins and use the ones bundled with the Docker image. This shortened our feedback loop; errors were discovered during the build phase instead of at deployment.
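Our actual plugin list is internal; the plugins.txt format is simply one plugin ID (optionally pinned to a version) per line, which the helper bundled with the official Jenkins image (install-plugins.sh back then, jenkins-plugin-cli today) can pre-install when the Docker image is built:

    # plugins.txt - illustrative excerpt; the real list pins exact versions
    kubernetes
    workflow-aggregator
    git
    sonar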

Step 2: Align the Kubernetes configuration through a Helm chart

When we started, we knew very little about Kubernetes, and our process was more Wild West than GitOps. Having learned about Helm charts while deploying SonarQube, we applied the same principle to Jenkins and could now configure an instance with a small (~5 lines) YAML file per team, stored neatly in the GitHub repository next to the Dockerfile for Jenkins. Any change to those files was quickly and safely deployed to all Jenkins instances. This eliminated all “whoops, I forgot to apply the new config to these instances.”
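The real schema of those files is internal; a hypothetical per-team file of that size might look roughly like this:

    # teams/swe-dtg.yaml - hypothetical example, the actual fields differ
    team: swe-dtg
    url: https://jenkins-swe-dtg.example.internal
    diskQuota: 50Gi
    admins:
      - swe-dtg-developers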

The term GitOps can mean many things; for us, it means putting all configuration in a Git repository and automatically updating live systems’ configuration whenever the repository’s content changes.

Step 3: Configuring Jenkins plugins automatically

Now that all instances had the same Jenkins version and identical plugins, it was time to configure them. We first tried an up-and-coming option: Jenkins Configuration as Code (JCasC), a plugin where your whole configuration is held in a YAML file and applied to the instance when it starts. Bonus: you can export the configuration from an existing instance.

The plugin, however, did not fit our needs: errors were ignored, and if a value could not be set, it was silently dismissed. You either fully manage a plugin or you don’t: if you manage credentials, JCasC overrides all of them; you can’t have some credentials managed by JCasC and others managed by the team, because JCasC deletes the team-managed credentials on the next restart.

We tested this plugin in 2019; the plugin has probably evolved since.

This is why we took the alternative approach and configured Jenkins with Groovy: Jenkins can be configured by adding Groovy scripts to the jenkins.config.d/ directory, which are run on each start.

With lots of googling and reading other examples, we created our own set of Groovy scripts that automatically configure Jenkins global settings, the instance URL, our custom theme, the Kubernetes configuration and default pod templates, various credentials and SSH keys, the global Maven settings.xml, the Rocket.Chat configuration, user permissions, the LDAP configuration, and more.
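As an illustration, a minimal script in this style could set the instance URL and administrator address (the values below are made up):

    // Illustrative startup script: configure the instance location.
    import jenkins.model.Jenkins
    import jenkins.model.JenkinsLocationConfiguration

    def location = JenkinsLocationConfiguration.get()
    location.url = 'https://jenkins-swe-dtg.example.internal/' // made-up URL
    location.adminAddress = 'productivity@example.com'         // made-up address
    location.save()

    // Most global settings live on the Jenkins singleton, for example
    // forcing all builds to run on agents rather than on the controller:
    Jenkins.get().numExecutors = 0
    Jenkins.get().save()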

If you wish to do the same, here are a few links with guides and examples of Groovy scripts:

A quick tip if you wish to go this route: don’t be afraid to dig into each plugin’s source code, as every plugin is configured slightly differently, and edit configurations from the UI to check the result in the configuration files.

Teams could still experiment with their Jenkins instance by tweaking settings. If, by mistake, they changed something that prevented builds from running, a simple restart of their instance would bring it back online.

Step 3.5: Reorganization!

In the Summer of 2019, our two development departments were merged into one department, and many teams were re-shuffled.

All Jenkins instances had to be renamed, as some teams were split or merged and the department’s name changed. For example, my team’s Jenkins was renamed from swd-dtg to swe-dtg.

We took this opportunity to move all instances to more performant machines and to introduce disk quotas for each team. Instead of growing indefinitely, teams had a space limit set according to their current usage, with the possibility of asking for more if needed.

We did this by creating a disk partition for each team; Linux’s Logical Volume Manager (LVM) makes creating, resizing, and removing volumes a breeze. This initially caused lots of support requests, as teams would often hit their space limit. After analysis, we would either delete some builds or increase the available space.

To simplify the support process, we created a small self-service tool based on DUC, a tool to visualize disk usage, along with instructions on deleting files once the culprit was found. This drastically reduced the number of support requests about disk space.

DUC displays disk usage in a diagram, making it very easy to see what takes the most disk space in a Jenkins instance

Step 4: Building Jenkins on Jenkins

At this stage, creating a new Jenkins instance was easy as pie: create a YAML file, run helm install, wait a few minutes, and the instance is ready to run builds.

Our next goal was to deploy Jenkins automatically and regularly. To achieve this, we needed to automate the build of the Jenkins Docker image, start it, and ensure no errors appeared in the logs.
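Our actual pipeline does more, but the idea can be sketched as a scripted pipeline stage (image name, timing, and error patterns are illustrative):

    // Sketch of the smoke test: build the candidate image, start it, and
    // fail the build if the startup log contains errors.
    node {
      checkout scm
      sh 'docker build -t jenkins-candidate:${BUILD_NUMBER} .'
      sh 'docker run -d --name jenkins-smoke jenkins-candidate:${BUILD_NUMBER}'
      try {
        sh 'sleep 120' // give Jenkins time to start and run its Groovy init scripts
        sh '''
          if docker logs jenkins-smoke 2>&1 | grep -qiE "severe|exception"; then
            echo "Errors found in the Jenkins startup log"
            exit 1
          fi
        '''
      } finally {
        sh 'docker rm -f jenkins-smoke'
      }
    }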

The challenge we faced at the time was that the build farm itself didn’t have access to the internet, which made it impossible to install plugins. We created an app that acts as a caching proxy to the Jenkins Update Center. We could now enable Renovate (more on that in part 3) and automatically get updates for Jenkins.

Step 5: Deploy changes automatically and regularly

Applying an update to Jenkins was still a manual process, as we needed to run helm upgrade for each instance. Our SRE team had started maintaining its own Kubernetes clusters, and we could benefit from their tooling, in particular ArgoCD.

ArgoCD is a declarative, GitOps continuous delivery tool for Kubernetes. It automatically updates the Kubernetes configuration based on the content of our GitHub repositories, eliminating the need to run helm commands.

We configured ArgoCD to read our Helm charts and added a last step to our build process: updating the Docker image version of our own Jenkins instance to test the new version.
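An ArgoCD Application tying a team’s values file to the chart looks roughly like this (repository URL, paths, and namespace are made up):

    # Illustrative ArgoCD Application; repo URL, path and namespace are made up.
    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: jenkins-swe-dtg
    spec:
      project: default
      source:
        repoURL: https://github.example.internal/productivity/jenkins.git
        targetRevision: main
        path: chart
        helm:
          valueFiles:
            - teams/swe-dtg.yaml
      destination:
        server: https://kubernetes.default.svc
        namespace: jenkins
      syncPolicy:
        automated: {}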

Once we’re happy with the change, a single-line change in our GitHub repository updates Jenkins for all teams.

Takeaways

And that’s how we industrialized our Jenkins infrastructure. As mentioned, this process took months of minor improvements, but in the end, the effort was worth it. We got:

  • Stable instances: A simple restart of an instance restores the configuration to a stable state.
  • Easy updates: We’re happily running the latest Jenkins LTS and are not afraid to update when a new one comes out. We’re also able to roll out configuration changes within minutes.
  • Flexibility for teams: Adding pod templates, managed files, and other configuration tweaks is still possible and allows teams to experiment and implement their needs when the defaults don’t match.
  • GitOps ensures consistency across our whole fleet of Jenkins instances.

But we also learned some things the hard way:

  • Leave Kubernetes to professionals: don’t build your own cluster if you don’t have the resources to keep it up to date and the knowledge to operate a cluster of machines. Kubernetes doesn’t replace sysadmins; if anything, you will need more of them. (As mentioned in part 1, we’re not on the public cloud because of regulations.)
  • For every finite resource, you’ll reach its limits sooner or later: cache disks filling up at an alarming rate, RAM usage, disk usage of build reports, uncleaned processes hogging the CPU. You need a cleanup strategy for every one of them.

Stay tuned for Part 3

In Part 2, we made sure our configuration was robust and reproducible and got disk space under control. Part 3 will cover how we got all other resources under control, including fine-grained monitoring and updating to the latest version of Kubernetes.

Edit 06/10/2023: Part 3 is now published.
