Cloud-native Jenkins on AWS

Alberto Alvarez
Bench Engineering
Jul 16, 2019 · 9 min read
“Butler in action” by Bernyce Hollingworth

We use Jenkins to drive our continuous delivery pipeline and like many other teams, we’ve built a lot of workflows and automation around it over the years. Jenkins is key for our team’s success, enabling us to ship to production 677 times during the last quarter with an average build & deploy time of 12 minutes.

The vast majority of our applications and infrastructure can be considered Cloud-native, but our Jenkins service didn't really fit into this category: it ran on a single server with many jobs executing directly on the master, and much of its configuration was handcrafted, including secrets, plugins, cronjobs and the general bloat of startup hacking that had accumulated since it was first built in 2014.

Not only had Jenkins become a monolithic service and a single point of failure, but the prospect of having to tear it down and rebuild it was a serious risk for the business.

We decided that this had to change. This blog explains how we made Jenkins Cloud-native using Terraform, Packer, Docker, Vault and AWS services such as ELB, ASG, ALB and EFS, and what we learned along the way.

Jenkins State

The key question we had to answer was: if we are to run the Jenkins service on a container or an auto-scaling instance, what state do we need to restore?

The answer is not simple, and it's worth mentioning that there is a Jenkins SIG that has identified all of the storage components which make up the Jenkins state. This was a great starting point: at the very least we had to ensure that all of the storage types listed in that article were accounted for.

The easy way out

This problem is not new. Many teams run Jenkins using a Docker container, and the official Jenkins Docker image is well maintained. As explained in the Jenkins Docker image docs:

docker run -p 8080:8080 -p 50000:50000 -v jenkins_home:/var/jenkins_home jenkins/jenkins:lts

This will store the workspace in /var/jenkins_home. All Jenkins data lives in there — including plugins and configuration. You will probably want to make that an explicit volume so you can manage it and attach to another container for upgrades.

The above example mounts the jenkins_home directory from the host, which contains all of Jenkins' state. This directory could then live on an external disk, for example a Kubernetes Persistent Volume or, if running Jenkins on EC2, an external EBS or EFS volume.

This is a valid approach, but we didn't think it met our standards, as jenkins_home includes not just state but also configuration. Block storage has great use cases, but having to do a snapshot restore operation for a small config change didn't seem like a great solution. In addition, we didn't just want to move the problem somewhere else: external storage doesn't protect us against handcrafted configuration, credentials being stored in the filesystem and so on.

SCM to the rescue

Historically, we've used the Jenkins Backup Plugin, which essentially backs up configuration changes into source control and allows for configuration restores. The idea behind this plugin is great, but we decided not to use it, as we couldn't easily control what data gets backed up and the plugin hasn't had any updates since 2011.

So, what if we make jenkins_home a private Git repo and commit any changes made to Jenkins automatically? The key here is to exclude any binaries, secrets or large files which we store separately (more on that later). Our .gitignore file looks like the following:

/.bash_history
/.java/
/.kube/
/.ssh/
/.viminfo
/identity.key.enc
/jobs/
/logs/
/caches/
# Track static workers and exclude ephemeral ones
/nodes/**
!nodes/static-node/config.xml
/org.jenkinsci.plugins.github_branch_source.GitHubSCMProbe.cache/
/plugins/
/saml-idp-metadata.xml
/saml-jenkins-keystore.jks
/saml-jenkins-keystore.xml
/saml-sp-metadata.xml
/scm-sync-configuration/
/scm-sync-configuration.success.log
/secret.key
/secret.key.not-so-secret
/secrets/
/updates/
/workspaces/

Pretty much all of our plain-text config is now persisted in Git, and all we need to do to provide this config to Jenkins is check out the repo on startup; things are starting to take shape.
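To give a rough idea of what this looks like, here is a minimal sketch of the startup checkout and the automatic commit step. The repo URL, jenkins_home path and commit cadence are placeholders, not our exact setup:

#!/bin/bash
# On instance startup, restore the plain-text config before Jenkins starts.
# JENKINS_HOME and the repo URL are hypothetical.
JENKINS_HOME=/var/lib/jenkins
git clone git@github.example.com:ops/jenkins-config.git "$JENKINS_HOME"

# A cron entry (or a small Jenkins job) then commits any config drift back,
# so the next instance boots with the latest configuration.
cd "$JENKINS_HOME"
git add -A
if ! git diff --cached --quiet; then
  git commit -m "Automatic Jenkins config sync"
  git push origin master
fi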

Secrets

Jenkins needs access to a lot of places, and that means we need secure secret storage. We are heavy users of HashiCorp Vault, so it seemed like the natural choice, but unfortunately Vault cannot cover all scenarios. For example, the scm-branch-source pipeline plugin requires auth credentials for SCM, and it defaults to the Jenkins Credentials plugin for this. Dynamically retrieving these from Vault every time we need to sync a repository could lead to errors and would require additional effort to maintain.

That’s why we went for a hybrid approach between Vault and Jenkins credentials:

  1. On instance startup, Jenkins authenticates with Vault using the IAM auth method.
  2. A bootstrapping script retrieves the Jenkins master.key as well as the other encryption keys used by the Credentials plugin (a sketch of this step follows the list). More details on these can be found in this post.
  3. Credentials stored in jenkins_home/credentials.xml can now be decrypted and accessed by Jenkins.
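Here is a rough sketch of what that bootstrapping script could look like. The Vault address, role name and secret paths are assumptions for illustration, not our actual layout:

# Authenticate to Vault with the instance's IAM role (AWS auth method),
# then restore the keys the Credentials plugin needs to decrypt credentials.xml.
export VAULT_ADDR="https://vault.example.internal"   # placeholder address
vault login -method=aws role=jenkins-master

# Secret paths and field names below are hypothetical
vault kv get -field=master_key secret/jenkins/keys > /var/lib/jenkins/secrets/master.key
vault kv get -field=hudson_util_secret secret/jenkins/keys \
  | base64 --decode > /var/lib/jenkins/secrets/hudson.util.Secret
chmod 600 /var/lib/jenkins/secrets/master.key /var/lib/jenkins/secrets/hudson.util.Secret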

Completely replacing the Credentials Plugin with Vault is something that we may look into in the future, but we are happy that this approach meets our security requirements while easily integrating with the rest of the Jenkins functionality.

Job and Workspace data

This is where things start to get tricky: both jenkins_home/jobs and jenkins_home/workspaces contain a mix of unstructured data, build artifacts and plain text. This information is valuable to us and can help us audit and understand our pipeline builds historically. Not only is their size considerable, but they wouldn't be a great fit for SCM sync either, hence both directories are excluded in the .gitignore above.

Where do we store these, then? We decided that a shared network file system was the best fit for this kind of data. Being heavy users of AWS, using EFS made sense, as it provides a scalable, highly-available, network-accessible file store that is very easy to use. We put together an AWS EFS resource using Terraform, as well as a periodic backup plan using the AWS Backup service.

On startup, we mount the EFS volume, symlink the jenkins_home/jobs and jenkins_home/workspaces directories to the ones in EFS, and then start the Jenkins service.
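As a rough sketch, with the EFS file system ID, mount point and jenkins_home path below being placeholders:

# Mount the EFS file system (assumes amazon-efs-utils is baked into the AMI)
EFS_ID="fs-12345678"            # placeholder file system ID
MOUNT_POINT="/mnt/jenkins-efs"
mkdir -p "$MOUNT_POINT"
mount -t efs -o tls "${EFS_ID}:/" "$MOUNT_POINT"

# Point the job and workspace directories at their durable copies on EFS
for dir in jobs workspaces; do
  mkdir -p "${MOUNT_POINT}/${dir}"
  rm -rf "/var/lib/jenkins/${dir}"
  ln -s "${MOUNT_POINT}/${dir}" "/var/lib/jenkins/${dir}"
done

systemctl start jenkins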

After that, the Jenkins service is the only interface that can read or write job and workspace data. It's worth noting that we have a Jenkins job that periodically deletes jobs and workspaces older than a few weeks, just so these don't keep growing.
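The cleanup itself can be as simple as a scheduled shell step; the retention window and paths here are illustrative, not our exact job:

# Run from a scheduled Jenkins job: drop workspace directories untouched for ~3 weeks
find /mnt/jenkins-efs/workspaces -mindepth 1 -maxdepth 1 -type d -mtime +21 -exec rm -rf {} +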

Packer and Terraform for a codified Jenkins

You are probably wondering how it all glues together. I didn't even tell you where we are running Jenkins! We use Kubernetes widely and spent a bit of time considering running Jenkins as a container, but instead we decided to use Packer and EC2 to run our Jenkins master, and ephemeral EC2 instances to run the jobs.

While the idea of running both master and workers as containers makes a lot of sense, we didn't find a place on our current Kubernetes clusters for Jenkins to live in, and creating a new cluster just for Jenkins seemed like overkill. In addition, we wanted to keep this critical piece of infrastructure decoupled from the rest of our services, so that if, for example, a Kubernetes upgrade affects our apps, we can still roll it back using Jenkins.

There is also the issue of running “Docker in Docker”, which is solvable but would still need to be addressed, as our builds use Docker commands often.

This is what the architecture looks like:

Being able to use EC2 instances made for a smoother transition: we were already using ephemeral worker nodes to run pipeline jobs via the Jenkins EC2 plugin, and this logic was already invoked from our declarative pipeline code, so not having to refactor it to use Docker agents was a plus. The rest of the effort went into Packer and Terraform code, which we were already familiar with.

Plugins

Because plugins are also state! One of the issues we wanted to solve in this project was to better audit and manage plugins. In a hand-crafted scenario, plugin management can get out of control, making it very difficult to understand when a plugin was installed and why.

Most of the Jenkins-level plugin configuration can be found in the general Jenkins configuration XML, but installing a plugin also results in jar artifacts, metadata, images and other files stored in the jenkins_home/plugins directory.

Storing plugins in EFS was an option, but we wanted to keep EFS usage to a minimum, and again, that wouldn't solve the problem, it would just move it somewhere else. That's why we went for “Packerizing” the plugin installation:

Basically, in our AMI definition we have a plugins file that lists each plugin and its version, something like:

# Datadog Plugin required to send build metrics to Datadog
datadog:0.7.1
# Slack Plugin required to send build notifications to Slack
slack:2.27

Then our AMI provisioning script parses this file and uses the Jenkins CLI to install each plugin at the selected version:

# Wrapper function for jenkins_cli; JENKINS_CLI_JAR, user, pw and the
# JENKINS_URL environment variable are assumed to be set earlier in the script
jenkins_cli() {
  java -jar "$JENKINS_CLI_JAR" -http -auth "${user}:${pw}" "$@"
}

# Build the plugins array from the plugins file shown above (path assumed),
# skipping comments and blank lines
mapfile -t plugins < <(grep -Ev '^\s*(#|$)' plugins)

for plugin in "${plugins[@]}"; do
  echo "Installing $plugin"
  jenkins_cli install-plugin "$plugin" -deploy
done

Any new plugin that needs to be installed, or any version upgrade to a currently installed one, now requires a GitHub pull request, which triggers a new AMI build. Great!

Installing other software

Jenkins, by its nature, requires a lot of software to be installed so it can build, test and deploy. First and foremost, we don't want the master node running any jobs, so we avoid installing any job-related software on it. The main task of the master is to provide an interface and orchestrate builds on the ephemeral worker nodes.

This means we could install the required tools on the worker nodes, but we decided to use docker run as much as possible. This is because we are a polyglot organization that uses Scala, Java, Node, Golang, Python and others; maintaining different build tools for all of these software stacks could make the worker node setup quite complex.

Using JavaScript as an example, we want Jenkins to run yarn commands against an app, such as install and test. We can do this by simply mounting our checked-out repo directory as a volume into a Docker container and running any commands from within the container. Here is an example using Groovy pipeline code:

def node(command, image) {
    def nodeCmd = [
        'docker run -i --rm',
        '-u 1000', // Run as non-root user
        '-v ~/.npmrc:/home/node/.npmrc:ro',
        '-v ~/.yarn:/home/node/.yarn',
        '-e YARN_CACHE_FOLDER=/home/node/.yarn/cache',
        "-v ${env.WORKSPACE}:/app",
        '--workdir /app',
        "${image}"
    ].join(' ')
    sh "${nodeCmd} ${command}"
}

Then we can just call this function after checking out our repository:

checkout scm
node('yarn install --frozen-lockfile', 'node:12.6.0-alpine')

This is great, as we don't have to install and maintain several versions of our tools on the worker machines, aside from the Docker daemon and kubectl, and we also have confidence that build commands are consistent between local and CI environments, since the same Docker image is used in both.

Something to keep in mind when building on ephemeral nodes is dependency caching. For example, our sbt cache was lost whenever a worker node was recreated, which caused slower build times while the cache was regenerated, or even failures if an external dependency was unavailable. We decided to keep relevant dependency caches on another external EFS volume for faster and more reliable builds.
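Continuing the docker run approach above, the cache volume can simply be bind-mounted into the build container. This is a sketch only; the mount point and image name are placeholders:

# Assumes the cache EFS volume is mounted at /mnt/build-cache on every worker;
# "our-sbt-build-image" is a hypothetical build image
docker run -i --rm \
  -v /mnt/build-cache/ivy2:/root/.ivy2 \
  -v /mnt/build-cache/sbt:/root/.sbt \
  -v "$WORKSPACE":/app --workdir /app \
  our-sbt-build-image sbt test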

Final words

Jenkins is a great tool, but it falls short when it comes to managing external state, which makes it hard to build in a Cloud-native manner. Our approach is not perfect, but we believe it combines the best of both worlds while still guaranteeing security, ease of use and resilience. We are happy to say that after we finished the project and migrated all production builds over to the new Jenkins service, we were able to terminate the master server and let auto-scaling rebuild it in a matter of minutes, with no impact on the previously stored state.

If you are interested in learning more about Bench Accounting or a career with our Engineering team, then please visit us at https://bench.co/careers/
