Testing at Scale
By Guillaume Chenuet (DevOps Engineer Productivity)
The Engineer Productivity team at leboncoin is in charge of all CI/CD processes and tools, from code review and Git repositories to package management and delivery pipelines.
On our platform we run more than 20K builds per week on ~150 servers (on-premise and AWS). All builds are executed in Docker containers (150K+ containers per week) and the delivery processes are managed by homemade tools, Ansible and ConcourseCI pipelines.
But last year we hit the limits of our Continuous Integration platform, and we started to rethink it from A to Z based on our knowledge of CI and of leboncoin’s processes.
Now, many months later, it is time to sum up some of the good and bad CI practices we learned and implemented at leboncoin.
Some of them will seem obvious or familiar, but they may still be useful reminders.
Keep It Small
One of our first mistakes was managing all builds on a single Jenkins instance. Even though this server was quite powerful (24 vCPUs, 145 GB of RAM, etc.) and dedicated to Jenkins, we faced a lot of instability and failed tests, and could not offer an enjoyable user experience to our developers (~150 people).
The solution here was to split builds into separate Jenkins Master instances.
This may sound obvious, but when we started our CI platform with only a few builds, everything worked fine. Over time we added new builds for new projects, and one day we realised that our server was hanging and failing to schedule and execute new builds.
Another (temporary) solution would be to use more generic jobs or an SSD backend and keep going with a single server, but consider that a first warning.
By the way, if you’re interested in tuning the Jenkins garbage collector, this article is a very good start.
In our case, we decided to split our Jenkins Masters by feature team and to use another one for package and image builds.
One last thing: to manage and version your job configurations across master nodes, we highly recommend using the Jenkins Job Builder project.
The Jenkins Job Builder (JJB) takes simple descriptions of Jenkins jobs in YAML or JSON format and uses them to configure Jenkins. You can keep your job descriptions in human readable text format in a version control system to make changes and auditing easier. It also has a flexible template system, so creating many similarly configured jobs is easy.
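To give an idea of the format, here is a minimal sketch of a JJB description; the job name, node label and commands are hypothetical, not our actual configuration:

```yaml
# Minimal, hypothetical JJB description (names and commands are illustrative).
- defaults:
    name: global
    description: 'Managed by Jenkins Job Builder. Do not edit manually.'

- job:
    name: test-myproject-debian        # hypothetical job name
    defaults: global
    node: docker                       # label of the slave type this job runs on
    builders:
      - shell: 'make test'
    publishers:
      - junit:
          results: 'reports/*.xml'
```

Running jenkins-jobs update against the repository holding these files creates or updates the corresponding jobs on the target master, so every change can be reviewed and versioned like any other commit.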
Tip: disable the Weather column on Jenkins. When an instance grows very large and its folder structure has many levels, generating this column can seriously impair system performance. More info here.
Slave nodes
The same advice applies to Jenkins slave nodes too.
Let’s take our previous example: as our master hosts a lot of build jobs, we need to set up more and more slave nodes to meet the increasing demand.
But adding a significant number of slave nodes also increases the number of open files on the master server, produces more I/O on its disks and degrades response times.
It is therefore important to find the right ratio between slave nodes and executor slots: too many executor slots can produce a high load average or OOM kills on the nodes, while too many slave servers can freeze or hang your master.
There is no perfect answer; each CI is different and you need to analyze your jobs (execution time, resources, etc.) to find the correct setting.
Here we created five cloud templates of slave servers based on our job types (docker, go, tools, builds), each with a different amount of CPU/RAM and Jenkins executor slots. We spawn them on demand and keep the number attached to each master as small as possible.
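Purely as an illustration (this is not the syntax of any Jenkins cloud plugin, and the sizes and executor counts below are invented rather than our production values), the idea behind these templates can be summed up like this:

```yaml
# Illustrative sizing only: one template per job family, each trading
# CPU/RAM against a different number of executor slots.
slave_templates:
  docker: { vcpus: 8, ram_gb: 16, executors: 2 }   # image builds, I/O heavy
  go:     { vcpus: 4, ram_gb: 8,  executors: 4 }   # CPU-bound tests
  tools:  { vcpus: 2, ram_gb: 4,  executors: 2 }   # linters, small utilities
  builds: { vcpus: 8, ram_gb: 32, executors: 1 }   # heavy package builds
```

The point is that the executor count is chosen per template rather than globally, so a heavy job cannot starve a node it shares with lighter ones.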
Hire a Driver!
Having many masters adds new problems too.
Let’s see some examples:
- With a single master, you can run cascading jobs/pipelines: once a job completes, Jenkins can launch other jobs depending on the result of the parent job. This is very useful, for example, to build or deploy packages only when the tests succeed. With several masters, coordinating such chains is no longer straightforward.
- If you’re using Gerrit for code review, you already know that you can trigger Jenkins jobs on Git events (patchset-created, ref-updated) and report the results on your review through vote labels. But how do you make sure that the final vote (-1/+1) takes all of the jobs into account?
To address these issues, we chose to use Zuul, a project from OpenStack.
Zuul is a pipeline oriented project gating and automation system.
Zuul watches events in Gerrit (using the Gerrit “stream-events” command) and matches those events to pipelines. If a match is found, it adds the change to the pipeline and starts running related jobs.
The gate pipeline uses speculative execution to improve throughput. Changes are tested in parallel under the assumption that changes ahead in the queue will merge. If they do not, Zuul will abort and restart tests without the affected changes. This means that many changes may be tested in parallel while continuing to ensure that each commit is correctly tested.
Zuul is composed of three main components:
- zuul-server: scheduler daemon which communicates with Gerrit and Gearman. Handles receiving events, launching jobs, collecting results and posting reports.
- zuul-merger: speculative merger which communicates with Gearman. Prepares Git repositories for jobs to test against. This additionally requires a web server hosting the Git repositories which can be cloned by the jobs.
- zuul-cloner: client side script used to set up job workspace. It is used to clone the repositories prepared by the zuul-merger described previously.
Since Zuul is pipeline-oriented, you can define different types of pipelines based on different trigger events.
Let’s see some pipeline examples (a configuration sketch follows the list):
- check-build: Newly uploaded patchsets enter this pipeline to receive an initial +/-1 verified vote from build test jobs. Ex: build packages, docker images, etc.
- post: This pipeline runs jobs that operate after each change is merged.
- release: When a commit is tagged as a release, this pipeline runs jobs that publish archives and documentation.
- periodic-nightly: This pipeline has jobs triggered on a timer e.g. for testing environmental changes each night.
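To make this concrete, here is a minimal sketch of what two of these pipelines can look like in a Zuul v2 layout.yaml; the project and job names are hypothetical:

```yaml
# Minimal Zuul v2 layout sketch (project and job names are hypothetical).
pipelines:
  - name: check-build
    description: Newly uploaded patchsets enter this pipeline for an initial verification.
    manager: IndependentPipelineManager
    trigger:
      gerrit:
        - event: patchset-created
    success:
      gerrit:
        verified: 1
    failure:
      gerrit:
        verified: -1

  - name: post
    description: Jobs that run after a change has been merged.
    manager: IndependentPipelineManager
    trigger:
      gerrit:
        - event: ref-updated
          ref: ^(?!refs/).*$

projects:
  - name: myteam/myproject
    check-build:
      - build-myproject-debian
    post:
      - docs-myproject-publish
```

Since Zuul reports back to Gerrit only once every job attached to the pipeline has finished, the question of gathering all the results into a single -1/+1 vote is handled for free.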
Zuul also provides a web-based dashboard to follow execution runs: here is OpenStack’s dashboard, and you can see leboncoin’s below.
Zuul is used by some major Open Source projects such as OpenStack and Wikimedia and is very scalable and robust.
Tip: think about using templated names for your build jobs; they will be much easier to read and to map between Zuul, JJB, Gerrit and Jenkins.
Example: [type]-[project]-[distrib/purpose].
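As a sketch of how this plays out with JJB (the project name and values are invented), a job-template combined with a project block can generate such names automatically:

```yaml
# Hypothetical JJB template following the [type]-[project]-[distrib] scheme.
- job-template:
    name: '{type}-{name}-{distrib}'    # expands to e.g. build-myproject-debian
    node: '{type}'
    builders:
      - shell: 'make {type}'

- project:
    name: myproject                    # hypothetical project
    distrib: debian
    type:                              # one job is generated per value in this list
      - build
      - test
    jobs:
      - '{type}-{name}-{distrib}'
```

The same names can then be referenced as-is in the Zuul layout, which makes a failing job easy to trace across all four tools.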
Verify
Once the platform is deployed and running, the next big step is to monitor it to ensure everything is working as expected for developers.
At the host level, we use our Sensu + Uchiwa stack to perform basic checks (CPU, RAM, disk I/O, etc.) and CI checks (Jenkins SSH connections, service daemons, etc.).
We are also using active remediation actions triggered on specific events.
Active remediation reads configuration from a check definition and triggers appropriate remediation actions via the Sensu API when the occurrences and severities reach certain values.
For example, our Jenkins slave instances use AWS EBS volumes to isolate the Jenkins workspace and the Docker directory.
If the Docker partition reaches the warning level (~85%), Sensu executes a docker system prune -f command to delete all unused containers, images and networks, without involving us.
If this action isn’t sufficient, Sensu will raise an alert and ping us on PagerDuty or Slack.
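To illustrate the wiring, here is a rough sketch of a check and its remediation action, shown as YAML for readability (Sensu 1.x expects the equivalent JSON); the plugin name, thresholds and subscription are assumptions rather than our exact configuration:

```yaml
# Sketch of a Sensu check plus its remediation action. Values are illustrative.
checks:
  check-docker-disk:
    command: "check-disk-usage.rb -w 85 -c 95"   # warn at ~85%, critical at 95%
    subscribers: ["jenkins-slaves"]
    interval: 60
    handlers: ["remediator", "slack"]
    remediation:
      remediate-docker-disk:          # check to request via the Sensu API
        occurrences: [2]              # after two consecutive warnings
        severities: [1]               # 1 = warning
  remediate-docker-disk:
    command: "docker system prune -f" # reclaim space from unused containers/images
    subscribers: ["jenkins-slaves"]
    publish: false                    # only runs when requested by the remediator
```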
These remediation actions save us a lot of time on our ‘daily run tasks’ and provide a better service quality for our users.
To keep everyone informed, we also added a public status dashboard for our main services, based on Cachet and plugged into the Sensu API.
In addition, we configured all of our 500+ jobs to send their console logs to an Elasticsearch/Kibana cluster so we can analyze them and find failure patterns (one way to ship the logs is sketched below).
Having all logs in the same place is very useful to understand or improve things.
With 20K+ weekly builds and 150K running containers, a good example is to find which Jenkins Masters and Slaves are the most heavily used and to split jobs more fairly between them.
Another example is to gather all infrastructure problems (Docker errors, high CPU load, failed Git checkouts, etc.), create a Kibana view, sort them by type or host and look for patterns.
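As an illustration of the shipping part, here is a minimal sketch assuming Filebeat is used to tail the Jenkins build logs; the paths and field values are illustrative:

```yaml
# Minimal Filebeat sketch: tail Jenkins console logs and tag them so Kibana
# can aggregate failures per master. Paths and names are illustrative.
filebeat.inputs:
  - type: log
    paths:
      - /var/lib/jenkins/jobs/*/builds/*/log
    fields:
      ci_master: jenkins-team-a        # hypothetical master name
      log_type: jenkins-console
output.elasticsearch:
  hosts: ["elasticsearch.ci.internal:9200"]
```

With a field like ci_master attached to every log line, comparing the load across masters becomes a simple Kibana aggregation.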
We also built a lightweight dashboard for developers listing all executed jobs, with useful information such as the job name, Git project, Zuul pipeline, Gerrit patchset and build result.
The most important thing here, beyond the tools and technologies, is to find a good balance between the value of the information and the time it saves.
We always try to provide tools that save us time or add real value to our own or our developers’ daily routine.
Say Cheese!
Finally, a simplified summary of our current CI platform: