Scaling Jenkins

Jonathan Block
6 min read · Sep 2, 2018


Tips on scaling Jenkins capacity as your demand soars.

The following represents my opinions on scaling Jenkins based on my experience at both DoorDash and Lyft.

A Jenkins administrator has numerous ways to configure Jenkins and advise the rest of their company on how to use it. The points below are my playbook for scaling it.

  • The plain Jenkins community edition can be scaled horizontally much like any other service you run, so long as you follow some guidelines and conventions like those I have listed below.
  • With the custom techniques described in this post, I am able to launch as much Jenkins capacity as I want for my engineering organization.
  • Even if you are paying for a hosted enterprise Jenkins product, most of the following concepts still apply.
  • I have always run Jenkins on AWS and the subsequent bullet points will refer to AWS domain concepts. The underlying theory applies to other cloud providers.
  • Jenkins uses a master/slave architecture (I’ll subsequently refer to “slave” as “worker”). When you connect workers to a master, that’s a “cluster.” Assuming you run a c4.2xlarge master, you can comfortably connect 125 worker nodes per master.
  • The idea of using Kubernetes for Jenkins workers sounds good to my ear because you can bin-pack the Jenkins worker executors. In practice, however, I have found that Jenkins’ overall performance is much faster when I have plain EC2 hosts configured as workers. Plain EC2 workers also help simplify Docker builds because Docker-in-Docker build issues are avoided.
  • I recommend creating a worker AMI that has a minutely cron reporting into a DynamoDB table to announce its health. On the master instance, you can monitor that same table and attach and detach workers, allowing you to scale up and down easily. Scale-downs can break running jobs, so plan on coping with unexpected worker loss by writing idempotent Jenkinsfile steps and by wrapping Jenkins’ raw node() worker-acquisition step in retry logic that listens for node disconnects and automatically retries your work closure (see the first sketch after this list).
  • To allow yourself to safely upgrade or restart Jenkins with zero downtime, you need to have load balanced Jenkins clusters. That means at least two clusters behind some kind of balancing agent.
  • I use Terraform to define and deploy Jenkins infra. Avoid CloudFormation.
  • As a security policy, disallow your end users (developers at your tech company) from creating anything in Jenkins by default. Most engineers only need to trigger routines on pushes to GitHub. More on this later.
  • Though Jenkins has multiple options for security, supplement Jenkins’ own security model by blocking public internet traffic from reaching your Jenkins entirely; harden access by allowing only VPN users to reach it.
  • In your master configuration, set the executor count on your master to zero. The master should not be used to perform work, and this setting keeps everybody from doing things the wrong way.
  • I run a c4.2xlarge machine for Jenkins masters and m4.2xlarge machines for worker nodes. For each master, you can buy down the price with AWS Reserved Instance pricing. I run a low number of reserved-instance workers, and the rest I launch on the spot market for a much larger discount. This is all configured with Terraform.
  • Though Jenkins offers multiple options for defining jobs, write all of your Jenkins jobs as Jenkinsfile “pipelines” that are checked into Git alongside the source code in each individual service’s repository at your company. One of the best features of Jenkins is that it can pull pipeline files from your Git repos and execute them, letting your service owners define their own pipeline logic, version controlled with the rest of their code (a minimal example follows this list).
  • Do not use Jenkins inline bash scripts or inline pipelines.
  • Jenkinsfile pipelines look awesome with Jenkins Blue Ocean.
  • Jenkinsfile Groovy syntax has parallelization built in. Use it to do many things at the same time easily.
  • Communicate your pipeline steps back to GitHub’s pull request status API with deep links back to the relevant Jenkins pipeline or test-result pages (a sketch follows this list). At DoorDash, we wrote a Jenkins Groovy shared library with features like this in it.
  • Do not attempt to manage your Jenkins configuration by checking the Jenkins master XML files into source control and later deploying changes by updating those files and reloading Jenkins. (I’ve seen this attempted; it was very confusing, brittle, and required many Jenkins restarts, all of which were avoidable.)
  • Your Jenkins master should mount an EBS volume for the /jenkins_home directory, and you should use EBS snapshots to back that volume up each day. Even if you have to restore a backup, virtually nothing will be lost (other than some test artifacts), because the business logic is implemented as Jenkinsfile pipelines pulled from your various Git repositories at runtime.
  • To make load-balanced clusters work, do not use the Git polling features built into Jenkins, or else you’ll have multiple clusters triggering the same pipeline. Instead, start Jenkins pipelines from GitHub webhooks using the Jenkins REST API. This API is very clunky, but the integration I advise you to create only involves starting a single job, and starting pipelines from events rather than minutely polling is very fast. More on this below.
  • Develop a thin “webhook gateway” that catches all of your GitHub hooks and sends the interesting ones to each of your Jenkins clusters. This is the point at which you implement load balancing for clusters with multiple masters. I don’t use an ELB for this; I do the load balancing with a consistent hash of the Git branch name (see the routing sketch after this list).
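
The retry wrapper mentioned above isn’t spelled out in this post, so here is a minimal sketch, assuming a scripted pipeline. The step name retryNode and the worker label are hypothetical, and a production version would inspect the exception type so that only agent disconnects are retried:

```groovy
// Hypothetical shared-library step: wraps Jenkins' raw node() acquisition
// in retry logic so the pipeline survives a worker disappearing mid-build.
// The work closure must be idempotent, since it may run more than once.
def retryNode(String label, int maxAttempts, Closure work) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            node(label) {
                work()
            }
            return
        } catch (Exception e) {
            // A real version would only swallow agent-offline/disconnect
            // exceptions; anything else (or the last attempt) is rethrown.
            if (attempt == maxAttempts) {
                throw e
            }
            echo "Lost worker (${e.message}); retrying (${attempt + 1}/${maxAttempts})"
        }
    }
}

// Usage inside a Jenkinsfile:
retryNode('spot-worker', 3) {
    checkout scm
    sh './run_tests.sh'
}
```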
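
Here is a minimal scripted Jenkinsfile showing the built-in parallelization; the worker label and script paths are illustrative:

```groovy
// Jenkinsfile checked into the service's own repository.
stage('Test') {
    // Each branch below acquires its own worker, so the independent
    // suites run at the same time.
    parallel(
        'unit tests': {
            node('worker') {
                checkout scm
                sh './scripts/unit_tests.sh'
            }
        },
        'lint': {
            node('worker') {
                checkout scm
                sh './scripts/lint.sh'
            }
        }
    )
}
```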
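
And one way to report a pipeline step back to GitHub’s pull request status API. The repo path and credential ID are placeholders, and the call must run inside a node with the repo checked out; a shared library is a good home for a helper like this:

```groovy
// Posts a commit status with a deep link back to this Jenkins build.
def reportStatus(String state, String context, String description) {
    withCredentials([string(credentialsId: 'github-token', variable: 'GITHUB_TOKEN')]) {
        // \$GITHUB_TOKEN is expanded by the shell, not by Groovy, so the
        // secret never appears in the interpolated script.
        sh """
            curl -sf -X POST \\
              -H "Authorization: token \$GITHUB_TOKEN" \\
              -d '{"state":"${state}","context":"${context}","description":"${description}","target_url":"${env.BUILD_URL}"}' \\
              "https://api.github.com/repos/acme/my-service/statuses/\$(git rev-parse HEAD)"
        """
    }
}

reportStatus('success', 'ci/unit-tests', 'Unit tests passed')
```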
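
Finally, a sketch of the branch-based routing inside the webhook gateway. Rendezvous (highest-random-weight) hashing is one simple way to get a consistent hash; the master URLs are hypothetical:

```groovy
// Pick the master for a push: the same branch always routes to the same
// master, and removing a master only remaps the branches it owned.
List<String> masters = [
    'https://jenkins-1a.internal.example.com',
    'https://jenkins-1b.internal.example.com'
]

String masterFor(String branch, List<String> clusterMasters) {
    clusterMasters.max { m ->
        "${branch}|${m}".toString().hashCode() & Integer.MAX_VALUE
    }
}

// Every delivery for this branch goes to the same master.
assert masterFor('feature/payments', masters) == masterFor('feature/payments', masters)
```
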
[Diagram: GitHub sends webhooks to the webhook gateway, which forwards the event information to a master in each load-balanced Jenkins cluster.]
  • The webhook gateway communicates minimal information to each Jenkins master at the same time. The pipeline I start on every single cluster is called the “webhook firehose.”
  • Each master’s webhook firehose pipeline examines the webhook details and acts upon the hooks which it finds interesting.
  • When the webhook firehose detects a push event, I use Groovy scripting to idempotently build out a folder named after the GitHub repository, which then appears in my Jenkins root. The webhook firehose also builds out Jenkins pipeline entries under that folder called “Jenkinsfile-deploy.groovy” and “Jenkinsfile-nodeploy.groovy.” This technique is why I can disallow engineers from manually creating items in Jenkins: every folder and pipeline entry seen in the Jenkins interface is created by the webhook firehose pipeline (a sketch follows this list).
  • I like to run a “general purpose” load-balanced Jenkins cluster. New microservice pipelines initially live in the general-purpose cluster, but if one begins to generate enough load, I spin off a new cluster just for that microservice. A monolith codebase with many tests and CI demands may warrant its own cluster.
  • On pushes to any branch other than “master”, I start the nodeploy pipeline. On pushes to a branch named “master”, I start both the deploy and nodeploy pipelines. The service owners can implement any logic they see fit in the “Jenkinsfile-nodeploy.groovy” file or the “Jenkinsfile-deploy.groovy” file.
  • Monitor your Jenkins masters’ metrics. I use Prometheus and Grafana to monitor things like total executors, busy executors, the Jenkins queue, CPU across all of the machines, memory, and disk space.
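
Here is a sketch of that provisioning and routing. My assumptions: the Job DSL plugin (whose jobDsl step creates or updates items idempotently), a hypothetical handlePush helper, and a GitHub org named acme:

```groovy
// Hypothetical push handler inside the "webhook firehose" pipeline.
def handlePush(String repo, String branch) {
    // Idempotently (re)create the folder and both pipeline entries;
    // Job DSL updates existing items in place, so re-running on every
    // push is safe.
    jobDsl scriptText: """
        folder('${repo}')
        ['deploy', 'nodeploy'].each { flavor ->
            pipelineJob("${repo}/\${flavor}") {
                definition {
                    cpsScm {
                        scm {
                            git { remote { url('git@github.com:acme/${repo}.git') } }
                        }
                        scriptPath("Jenkinsfile-\${flavor}.groovy")
                    }
                }
            }
        }
    """

    // Routing rule from the post: every push runs nodeploy; pushes to
    // master additionally run deploy.
    build job: "${repo}/nodeploy", wait: false
    if (branch == 'master') {
        build job: "${repo}/deploy", wait: false
    }
}
```
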

If you are building out Jenkins at your company and need advice, feel free to ping me on Twitter: @blockjon.

Based on any questions posted to this article, I may revise the original post to clarify or add more notes on how or why I use the patterns mentioned.

If you love CI and CD pipelining, DoorDash is hiring.
