Don’t Let Me Down: Jenkins at Shapeways

Hans Wang
Published in Shapeways Tech · 4 min read · Nov 3, 2016

The Challenge

As we mentioned in our previous blog post on deployment tools at Shapeways, Jenkins is a key component in our software deployment tool chain. It’s responsible for building, testing, and deploying nearly every production service we offer to customers around the world — from our 3D modeling and checking software to our internal manufacturing application to, of course, Shapeways.com! And it doesn’t stop there: we also leverage Jenkins’ task execution ability to manage service reloads and restarts via Hubot. As you can see, Jenkins is mission critical for us, and we are heavily reliant on the old butler. Any downtime of our Jenkins master would leave us unable to deploy new code and bring our engineering team to a halt.

So how do we make sure this doesn’t happen? We can, and do, use monitoring services such as Datadog and Sentry to detect when something is wrong with our Jenkins master. But once a failure is noticed, we have to scramble to get a replacement master up as quickly as possible, and bringing up a Jenkins master replacement is often a manual process that easily takes more than a few hours, especially with the size of our Jenkins instance. On larger projects, the downtime from a single failure can add up to several days of lost project time as developers and engineers wait.
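At its core, detecting a dead master means asking it whether it still answers. Here is a minimal sketch in Python — not our actual Datadog/Sentry checks, which are external services — that probes Jenkins’ built-in JSON API at `/api/json`:

```python
import json
import urllib.request
import urllib.error


def is_master_healthy(base_url, timeout=5):
    """Return True if the Jenkins master at base_url answers its JSON API.

    Assumes anonymous read access to /api/json; add authentication if
    your master requires it.
    """
    try:
        with urllib.request.urlopen(base_url.rstrip("/") + "/api/json",
                                    timeout=timeout) as resp:
            payload = json.load(resp)
            # A live master reports its node mode, e.g. "NORMAL".
            return resp.status == 200 and "mode" in payload
    except (OSError, ValueError):
        # Connection refused, timeout, or a non-JSON error page: not healthy.
        return False
```

A real monitor would run this on an interval from somewhere outside the master’s own data center and alert after a few consecutive failures rather than on the first one.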

The Contenders

So, in engineering, how do we fix a single point of failure? We add redundancy, or high availability! Before I dive into how our HA Jenkins infrastructure is set up, I want to walk you through a couple of other solutions we evaluated before going with our own in-house setup. One option from the open source Jenkins community is the Gearman plugin, which provides a central server that coordinates work amongst multiple Jenkins masters. Using Gearman, we would simply bring up a couple of extra Jenkins master instances and have Gearman manage them in a pool. If one of the masters happens to go down — no problem — there’s no downtime, as its workload would just be picked up by the other masters in the pool. This almost gets us what we want: multiple masters and no downtime. The problem is that the Gearman service itself is a single point of failure! The server hosting the Gearman service could fail, and with no redundancy for the Gearman server itself, you’re SOL. We could add redundancy to the Gearman server, but that feels like more overhead than necessary.

Another option is to use (read: pay for) CloudBees Jenkins Enterprise and its High Availability Plugin. We have been using the open source version of Jenkins for over four years and it has served us well, so we were hesitant to switch to Enterprise just for the sake of this one plugin. At Shapeways, we wanted to see if we could devise our own solution. With Shapeways data centers across the world, why not host a Jenkins master instance in each one? We decided to go with our own active-passive high availability Jenkins setup.

The Solution

The idea is simple — we have only one active Jenkins master at a time, which accepts all requests and doles out tasks to our Jenkins slaves across all our data centers. If the active master ever goes down, failing over is as simple as enabling a new Jenkins master and updating DNS to point at the new master’s IP address. With our monitoring services in place, we can detect a failed Jenkins master, deactivate it, and bring up a synced replacement in just a few minutes. This active-passive HA configuration is not only simple to set up, but incredibly powerful. In fact, we’ve already “switched” Jenkins masters numerous times during infrastructure changes without a hitch.
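The failover decision itself is tiny: walk an ordered list of standby masters and promote the first healthy one, then repoint DNS. A sketch of that selection logic — the hostnames, IPs, and the DNS update mentioned in the comment are illustrative, not our production tooling:

```python
def pick_new_active(masters, is_healthy):
    """Return the (name, ip) of the first healthy master, in preference
    order, or None if every candidate is down.

    masters: list of (name, ip) tuples; is_healthy: callable(name) -> bool.
    """
    for name, ip in masters:
        if is_healthy(name):
            return name, ip
    return None


# The actual cut-over is then a single DNS change: point the Jenkins
# record (e.g. jenkins.example.com -- an illustrative name) at the chosen
# IP via your DNS provider's API or an nsupdate against your own zone.
```

Keeping the candidate list in preference order means the failover is deterministic — everyone on the team knows which standby takes over next.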

However, there are some challenges to this approach and I’ll address them individually.

How do we handle SSH configs and keys on the different Jenkins masters, since they all need to talk to the Jenkins slaves?

Puppet, of course, which manages all of our SSH configuration — thank you, config management.

How do we deal with latency when the Jenkins master is in northern Europe but the slaves are in the southwestern United States?

There’s only so much you can do about the speed at which bits travel across the Atlantic, but delegating as much work as possible to local Jenkins slaves and minimizing the amount of data sent between data centers helps immensely. We aggressively optimize our Jenkins jobs so that builds are distributed to slaves in the same data center as the servers they target. If you do have to send data across long distances, LFTP’s parallel download command can make large transfers much more manageable.
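LFTP’s `pget` command splits a single file into several ranged transfers that download in parallel, which masks a lot of per-connection latency on a transatlantic link. A small helper that builds the invocation — the URL below is a placeholder, not one of our artifact servers:

```python
import shlex


def lftp_pget_cmd(remote_url, segments=8):
    """Build an lftp command line that fetches remote_url in `segments`
    parallel chunks via pget, then exits."""
    script = f"pget -n {segments} {shlex.quote(remote_url)}; quit"
    return ["lftp", "-e", script]
```

Running the returned command is equivalent to typing `lftp -e 'pget -n 8 <url>; quit'` by hand; tune the segment count to your link rather than assuming more is always faster.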

How do we keep the masters in sync with each other?

On a regular interval, we take a snapshot of the active Jenkins master and replicate it to the inactive masters, so we can bring any of them up at a moment’s notice.
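Concretely, that replication can be as simple as an rsync of JENKINS_HOME on a cron interval. A sketch of building such an invocation — the paths, hostname, and exclude list are illustrative assumptions, not our exact job:

```python
def rsync_snapshot_cmd(dest_host,
                       src="/var/lib/jenkins/",
                       dest="/var/lib/jenkins/"):
    """Build an rsync invocation mirroring JENKINS_HOME to a standby master.

    Workspaces and caches are excluded: slaves rebuild them anyway, and
    skipping them keeps each sync small and fast. --delete keeps the
    standby from accumulating jobs removed on the active master.
    """
    return [
        "rsync", "-az", "--delete",
        "--exclude", "workspace/",
        "--exclude", "caches/",
        src, f"{dest_host}:{dest}",
    ]
```

The trailing slashes on both paths matter to rsync: they mean "sync the contents of this directory" rather than nesting one JENKINS_HOME inside another.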

This diagram gives a simple overview of our HA Jenkins setup.

Looking ahead, the next step for us is to get all of our Jenkins configuration into Git, and we’re using the Jenkins 2.0 Pipeline specifically for that. As the number of services we deploy and support grows here at Shapeways, our HA Jenkins infrastructure will need to evolve to keep up. We’re looking into a distributed Jenkins setup for each of the services we deploy. Stay tuned!
