From 0 to 10’000 Jenkins builds a week — part 4: migrating to a new environment

Stéphane Goetz
Swissquote Tech Blog
Nov 15, 2023 · 12 min read

In 2023, Swissquote runs 50 fully automated instances of Jenkins in Kubernetes, one per team. Each code push from a developer results in a build in the team’s instance. If the project doesn’t exist yet, it gets created automatically. The cluster performs approximately 10,000 builds per week.

For engineers, migrations are one of those topics that always trigger new debates. There is just so much literature on how to think about, plan, prepare, execute, and finish migrations. How do we turn the old/dirty/legacy/outdated system into a new/shiny/modern/fantastic one? Throw everything out and rewrite? Build a new system next to the old one and ask users to migrate at their own pace? Replace pieces of the old system until it becomes a shiny new system?

This article will walk you through not one, not two, but three consecutive successful migrations: how we moved all SonarQube instances, Jenkins Controllers, and Jenkins builds from an old and unmaintained cluster to a state-of-the-art Kubernetes cluster.

Steve Ballmer did not say that, but you get the idea.

2023: New year, new cluster, new rules

2023 was a significant milestone for our build farm, with two challenges:
1. We want to shut down our Kubernetes cluster on 31/12/2023.
2. We will hand over the care of the build farm to our Productivity team, since it better fits their mission.

We decided to separate both challenges into three steps:
1. Migrate all Jenkins-related workloads to a new cluster.
2. Hand Jenkins over to Productivity.
3. Shut down the rest of the cluster.

We split it this way to avoid having to teach another team how to operate our custom cluster.

To migrate all workloads, we needed to find a suitable replacement for our current cluster. The needs we have for Jenkins builds differ from those of the usual end-user applications we operate at Swissquote, and our Site Reliability Engineering (SRE) team helped tremendously with this exploration.

  1. Our builds need to start Docker containers, but our standard Kubernetes install doesn’t ship a Docker runtime, so our current setup leveraging Docker out of Docker is out of the question.
  2. Our Jenkins Controllers store their files, such as the Jenkins configuration and build results, on the Kubernetes nodes’ disks. We knew this was fragile: a maintenance or an outage on a node would leave its controllers either unavailable or rescheduled on another node without their data, but it was the only approach we knew at the time.

It took them a few months of research and exploration, but we settled on a stack that would suit our needs:

  • Ceph for storage: Distributed storage that Kubernetes can use.
  • Kubevirt: Virtual machines to run the builds in.

Now that we know what our final state looks like, we must form a plan to move all Jenkins Controllers (AKA, the Jenkins UI) and data to this new cluster without downtime or broken builds.

Step 1. Migrate SonarQube

When talking about Jenkins, we generally mean Jenkins+SonarQube, as most teams have one instance of each for their CI. We decided to start this migration with SonarQube since its data is stored in a database external to the cluster; at this stage we therefore only had to handle a stateless container.

Moving SonarQube to the new cluster successfully would help us validate critical aspects of the Jenkins migration as well:

  1. How to deploy and operate apps on the new cluster.
  2. Which network openings need to be requested.
  3. How to transfer the workload progressively, one instance at a time.
  4. Which changes our Helm charts need to accommodate the version jump from Kubernetes 1.13 to 1.24.
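
The article doesn’t list the exact chart changes; as one illustrative example of what such a version jump typically forces (an assumption, not necessarily one of our changes), resources like Ingress had to move off the long-removed beta APIs:

```yaml
# Illustrative only: charts written for Kubernetes 1.13 typically used
# extensions/v1beta1 for Ingress; that API was removed well before 1.24,
# so the chart must emit the networking.k8s.io/v1 form instead.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sonarqube            # hypothetical resource name
spec:
  rules:
    - http:
        paths:
          - path: /team      # each team lives under a sub-path
            pathType: Prefix # field required by the v1 API
            backend:
              service:
                name: sonarqube
                port:
                  number: 9000
```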

Starting with SonarQube also deferred the Jenkins Controller storage question and gave the SRE team extra time to properly set up the Ceph cluster it requires.

We were confident that the new cluster worked fine, as many apps were already running on it. The part we were not sure about was how our applications’ configuration, written for the old cluster, would behave on the new one.

Migrating all SonarQube instances went through four key steps

The process we applied to migrate them is essentially the following steps:

  1. Start from the initial setup in the old cluster: one container for SonarQube, one container for the database.
  2. Export the embedded database and import it into an external database set up by our DBA team, with one schema and one user per instance.
  3. Replace the old instance with a reverse proxy to the new cluster (details below).
  4. Once all instances are moved, we point our DNS to the new cluster and shut the reverse proxy down.

Each SonarQube instance is exposed on a sub-path of a single domain, like https://sonarqube.domain/team. Since a DNS change applies to the whole domain, we needed a way to move one sub-path at a time without touching the others.

We chose to use Nginx as a reverse proxy that forwards traffic to the new cluster. This meant we could start instances on the new cluster and control when to enable incoming traffic to each new instance.
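
As a rough sketch of that routing (hostnames, team names, and ports are made up, not our actual configuration), the Nginx in front of the old cluster forwards only the sub-paths of teams that have already moved:

```nginx
# Inside the existing server block for sonarqube.domain.

# Teams already migrated: forward their sub-path to the new cluster.
location /team-a/ {
    proxy_pass       https://sonarqube.new-cluster.internal/team-a/;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto $scheme;
}

# Teams not yet migrated keep hitting their instance in the old cluster.
location /team-b/ {
    proxy_pass http://sonarqube-team-b.old-cluster.svc:9000/team-b/;
}
```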

This took a few weeks of preparation and a few days of execution and was a resounding success. The average downtime per team was around 5 minutes, mainly due to the database export and switch.

Step 2. Migrate Jenkins Controllers

Once we finished the SonarQube migration, the Ceph cluster was ready for our Jenkins storage needs.

This step aimed to move all Jenkins Controllers to the new cluster while still running the builds on the old cluster.

We created a test instance on the new cluster to ensure the whole setup worked; this allowed us to adapt our Helm chart so that it works on both clusters. As we covered in Part 2 of this series, our setup was fully automated and repeatable. With the knowledge acquired in the previous migration, it only took us a few hours to get a test instance up and running on the new cluster.

Apart from slight changes in our Helm chart, the main challenge was the way Jenkins Agents (AKA builds) connect to the Jenkins Controller.

By default, Agents communicate with the Controller over TCP, which works fine within the same network. But once the Controller and the Agents were in separate clusters, they were also on separate networks, and within the company the network is only open for HTTP-based protocols.

Luckily, Jenkins Agents can also communicate with Jenkins Controllers over WebSocket; flipping a switch in our configuration was enough to get the Agents connecting to the Controller again.
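
For reference, the Kubernetes plugin exposes this as a single flag on the cloud definition. A minimal configuration-as-code sketch, with placeholder names and URLs:

```yaml
jenkins:
  clouds:
    - kubernetes:
        name: "old-cluster-builds"                 # hypothetical cloud name
        serverUrl: "https://old-cluster.internal"  # hypothetical API server URL
        webSocket: true   # agents dial back to the controller over HTTP(S)
                          # instead of the inbound TCP agent port
```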

We started the second migration once we were confident that the setup worked.

The logos have changed, but the migration strategy is similar

Here are the steps we took to perform this migration:

  1. Start from the initial setup in the old cluster: one container for Jenkins, with its data on the node’s local partition.
  2. Pre-copy all the data to one of the new machines with rsync while the old controller keeps running, so that the final stop-copy-start window stays short (see the sketch after this list).
  3. Replace the old instance with a reverse proxy to the new cluster.
  4. Once all instances are moved, we point our DNS to the new cluster and shut the reverse proxy down.
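
Here is a minimal sketch of the rsync step referenced above (paths and hostnames are illustrative): a first bulk pass runs while the controller is still up, and a second, much faster incremental pass runs during the short stop window.

```bash
# Bulk copy while the old controller is still running (can take a while):
rsync -aH --delete /var/jenkins_home/ new-node:/mnt/team-a-pv/jenkins_home/

# During the switch: stop the controller, then re-run the same command.
# Only files changed since the first pass are transferred, so this is quick.
rsync -aH --delete /var/jenkins_home/ new-node:/mnt/team-a-pv/jenkins_home/
```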

This migration was trickier since the source Jenkins data was spread across three machines and had to be redistributed, one dedicated Kubernetes PersistentVolume per team.

We did a test run by migrating our team’s controller, then four more once that went fine. Since an incremental rsync was quick, we decided to move all remaining instances over a single lunch break.

That’s when we got a surprise guest: inodes.

An inode holds the metadata about a file: each file on a partition is described by one inode and, little-known fact, there is a limit to how many inodes a partition can hold. That limit depends on the filesystem and the size of the partition.

You can hit a “no space left on device” error while df -h tells you that 50% of the disk is still free. Run df -i and you might see that you have reached the inode limit.
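
If you want to check for this yourself, the two commands side by side tell the story (the mount point is just an example):

```bash
df -h /var/jenkins_home   # bytes: plenty of free space reported
df -i /var/jenkins_home   # inodes: IUse% at 100% means no new file can be created
```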

Usually, inode issues appear when you have a large number of very small files: the inode table fills up faster than the disk space does.

Since Jenkins produces tons of small files and we had decided to use a small partition for each team, some teams with a large number of projects hit the inode limit while we were copying the data over.

To solve this, we recreated the affected volumes with a different filesystem that offers a better inode-to-disk-space ratio.
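
The article doesn’t name the filesystem we switched to; purely as an illustration of the knobs involved, ext4 fixes its inode count at creation time through a bytes-per-inode ratio, while XFS allocates inodes dynamically:

```bash
# ext4: reserve one inode per 4 KiB of space instead of the 16 KiB default,
# i.e. roughly four times more inodes for the same partition size.
mkfs.ext4 -i 4096 /dev/vdX

# XFS: inodes are allocated on demand, so many tiny files rarely exhaust them.
mkfs.xfs /dev/vdX
```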

We could have mitigated that risk by migrating one team at a time, but we were confident in our ability to move everything quickly and without downtime. That was true for some teams; others could not run builds for three hours. It wasn’t our most significant success, but on the upside, we migrated all the data without losing a single file.

Step 3. Migrate Jenkins Workload

The first two migrations were a piece of cake compared to the third one. We had an extensive list of challenges to solve to be able to perform this migration:

  • Sandbox each build’s workload so it cannot spill over into other builds.
  • Provide dashboards showing resource usage, from a single build up to the whole cluster.
  • Require no configuration changes for our 1’000 existing build jobs.
  • Use one virtual machine per build and start our builds quickly.
  • Run builds in independent environments but still share a cache of “commonly used” Docker images and artifacts.

Spoiler alert: we succeeded.

Step 3.1 Starting builds in Virtual Machines

To start, we configured a second “Cloud” in the Jenkins Kubernetes plugin to point to the new cluster.

But that immediately came with a first challenge: the Jenkins Kubernetes plugin can only start Pod resources, while Kubevirt works with VirtualMachine resources.

The usual “a picture is worth a thousand words” applies here; this is how the Jenkins Kubernetes plugin normally works.

Jenkins creates a pod with a Jenkins Agent Docker image. Once started, the agent connects to Jenkins to run the build.

To work around the plugin’s inability to start Virtual machines, we decided to create a pod that would start a VM:

The plugin starts a pod, which then starts a VM to run the build in.

This works like a charm; with some Kubernetes annotations, we can automatically shut down the VM once the pod is shut down, leaving no resources hanging.
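
As a hedged sketch of the general pattern (names, image, and sizes are made up, and we use an owner reference here where our actual setup relies on annotations), the launcher inside the agent pod can create a VirtualMachineInstance tied to the pod’s lifecycle, so deleting the pod also removes the VM:

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: build-vm-12345                  # hypothetical, one VM per build
  ownerReferences:                      # one possible cleanup mechanism:
    - apiVersion: v1                    # make the agent pod the owner so the
      kind: Pod                         # VM is garbage-collected with the pod
      name: jenkins-agent-12345         # hypothetical agent pod name
      uid: "<agent-pod-uid>"            # filled in by the launcher at runtime
spec:
  domain:
    resources:
      requests:
        cpu: "4"                        # illustrative sizing
        memory: 8Gi
    devices:
      disks:
        - name: rootdisk
          disk:
            bus: virtio
  volumes:
    - name: rootdisk
      containerDisk:
        image: registry.example.com/build-vm:latest   # hypothetical VM image
```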

Step 3.2 Testing with early teams: opt-in

While we only had 50 Jenkins Controllers, we had over 1’000 projects building, each with its own CPU/memory needs. We needed a more granular approach for the switch.

To run our tests, we started with a single controller able to build on Kubevirt; we then automated the configuration of this new cloud and rolled it out to all controllers, but left it behind a flag: every project had to opt in to build on the new cloud.

We then wrote a script to automate switching builds from one cloud to the other, using the Jenkins Kubernetes plugin’s label feature and an environment variable set when starting the controller.

Default labels move when changing the “DEFAULT_CLOUD” environment variable. Labels with the “-old” and “-new” suffix remain attached to their cloud.

Teams could use labels that would automatically switch from one cloud to another, such as “default” or “java,” and other labels that would force the use of a specific cloud, such as “default-old” or “java-new.”
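
A minimal configuration-as-code sketch of the idea (the variable names and structure are assumptions, not our actual Helm chart): the bare label is attached to whichever cloud the controller’s DEFAULT_CLOUD points at, while the suffixed labels stay pinned to their cloud.

```yaml
jenkins:
  clouds:
    - kubernetes:
        name: "old"
        serverUrl: "https://old-cluster.internal"   # hypothetical
        templates:
          - name: "java-old"
            # OLD_EXTRA_LABELS is set to " java" by the startup script
            # when DEFAULT_CLOUD=old, and left empty otherwise.
            label: "java-old${OLD_EXTRA_LABELS}"
    - kubernetes:
        name: "new"
        serverUrl: "https://new-cluster.internal"   # hypothetical
        templates:
          - name: "java-new"
            # NEW_EXTRA_LABELS is set to " java" when DEFAULT_CLOUD=new.
            label: "java-new${NEW_EXTRA_LABELS}"
```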

The cluster has a finite amount of resources (CPU/memory). Assigning the same amount of resources to every build would not let us schedule many builds in parallel and would risk underusing our resources; on the other hand, giving too few resources to a build risks making it very slow. We decided to provide a reasonable default amount of resources and the ability to override it with a set of presets: small/medium/large/xlarge. The default can be set per controller and changed per project.
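
The preset values below are purely illustrative, not our real numbers, but they show the shape of the trade-off: the smaller the reservation, the more builds fit on the cluster at once.

```yaml
# Hypothetical preset table; one entry is picked per build and translated into
# the CPU/memory request of the VM that runs it.
presets:
  small:  { cpu: "2",  memory: "4Gi"  }   # default for most builds
  medium: { cpu: "4",  memory: "8Gi"  }
  large:  { cpu: "8",  memory: "16Gi" }
  xlarge: { cpu: "16", memory: "32Gi" }   # used while observing a new team
```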

Teams interested in trying the new process could change the label in their repositories’ Jenkinsfile.
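
Opting in was as small as a one-line change; a hypothetical example, with label names following the scheme described above:

```groovy
pipeline {
    // "java-new" pins this project to the new Kubevirt cloud;
    // a bare "java" would follow whatever DEFAULT_CLOUD selects.
    agent { label 'java-new' }

    stages {
        stage('Build') {
            steps {
                sh './mvnw -B verify'   // hypothetical build command
            }
        }
    }
}
```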

Step 3.3 Dashboards and right-sizing builds

Since we provide presets for build size, we also needed to provide feedback on resource usage.

We added a link in the build logs that would bring our users to a Grafana Dashboard. This helps investigate issues when a build reaches the limits defined by the size presets.

Every build logs where it’s running and a link to a Grafana board with more details
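
A hedged sketch of what such a log line can look like in a pipeline (the Grafana URL and dashboard parameters are placeholders):

```groovy
// Inside a stage's steps block: print a per-build link so developers can
// inspect the CPU/memory usage of the VM that ran this build.
// NODE_NAME is the name of the agent the build ran on.
echo "Resource usage: https://grafana.example.com/d/build-vms?var-pod=${env.NODE_NAME}"
```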

The dashboard would then display all available information, including a heuristic that can recommend changing the VM size for a given build.

The dashboard can give valuable information to get started

With this dashboard and a few more with a higher-level view, we were ready to onboard more teams.

Step 3.4 Switch teams to the new cluster

This time, we decided to switch teams one after another, and we did that in two phases.

  1. Switch the pod template labels to use the new cluster by default, but use an “xlarge” default size.
  2. Let the builds run for a week, then analyze the statistics to add a configuration for some builds to move them to “medium,” “large,” or “xlarge” explicitly. Then, switch the default size for that team to “small.”

Using different size presets helps us better understand what kinds of builds run on the cluster, and the smaller reservation of “small” builds allows more of them to run in parallel.

Moving all teams took four weeks, and most builds were successfully migrated.

Step 3.5 Almost done

As with every migration, there is a long tail: about 60 repositories took a while to migrate.

The 60 repositories that couldn’t be migrated initially amount to 5% of builds still running on the old cluster

There are three main reasons why these builds don’t run on the new cluster:
1. Some custom configuration or tool is incompatible with the new setup.
2. Android builds don’t run inside VMs, because Android emulators are themselves virtual machines and nested virtualization is deliberately not enabled by default in our cluster.
3. The build didn’t work initially, and the team switched it back to the old cluster without checking how to get it to run on the new cluster.

Each of these builds requires manual investigation and sometimes even fixes in our internal libraries, which takes time.

The last blocker was fixed in mid-October. We then left three weeks for the few teams still running on the old cluster to migrate their jobs willingly before we shut down the access to the old cluster.

We shut down builds’ access to the old cluster in November 2023.
We can now remove all code and configuration for the old cluster, hand the keys of the build farm to the Productivity team, and share a nice meal with all the teams involved to celebrate the end of this migration and the handover.

A big thank you to the SRE and Productivity teams for their involvement: SRE for their solution-oriented approach and responsiveness, Productivity for their continuous support in help channels and for taking on this new responsibility.

Conclusion

Planning and executing these migrations was a considerable effort for our team, but it came with minimal disruption to our users and the following outcomes:

  • If a build fails, it’s because there is something broken in the build, not because another build next to it is suddenly hogging all resources on the build node.
  • Consistent build times.
  • 99% of projects needed no configuration change to run on the new infrastructure.

As operators of these applications, we also had the following outcomes:

  • No need to worry about the Kubernetes cluster; we know it works and is up-to-date.
  • A routine maintenance or outage on a single Kubernetes node will not be a problem for Jenkins Controllers anymore.
  • We have a good overview of resource usage for a team or build.
  • Builds are encapsulated; no leaks of containers, access, or resources out of VMs.

Now that the cluster is in the Productivity team’s hands, it is a bittersweet feeling to no longer be in charge of that critical piece of Swissquote’s software delivery infrastructure. On the one hand, it was a fascinating experience to work with all these technologies and build a product our teams love; on the other, handing it over frees up time for our team to refocus on our missions.

This is the end of my story for you; no part 5 is planned yet. But it’s not the end of Jenkins at Swissquote, and I’ll make sure the Productivity team tells you about our future challenges.
