It’s Tuesday… Jenkins is down

AbdulBasit Kabir
Published in FlexiSAF
4 min read · Jun 12, 2017

I woke up Tuesday morning to an email from AWS reporting malicious activity on one of our instances. The report described activity resembling “scanning remote hosts on the Internet…”. This confirmed our suspicion that something was wrong with the CI instance. The instance hosted our Jenkins (v2.32) server and some of our internal tools.

Over the weekend (the Monday was a public holiday) Jenkins had been misbehaving. On Sunday morning I had been on Jenkins, but by afternoon one of my colleagues complained that it was inaccessible. So I opened my browser to check, and it wasn’t reachable. After confirming that the other internet-facing applications running on the same instance were up, I ssh’d into the instance and started Jenkins, or so I thought. The browser showed “Jenkins is starting…” and I figured Jenkins was running. Little did I know that Jenkins would be “starting” for the rest of the weekend.

Back to Tuesday morning. Rolling off my bed, I refreshed my email to find the mail from AWS and the news that the CI instance had been stopped. Lemme take a moment to explain what that means. Without the CI instance, all committed code would sit on the repo without being built and shipped to our clients (continuous delivery and deployment had stopped), our databases wouldn’t be backed up, and the automated CI checks, tests, builds, and everything else that happens when a developer pushes code wouldn’t run. Our internal tool for simplifying the setup of new tenants for our products, among other things (more on this in another post), was also down.

Again, back to the Tuesday morning. The email, now a thread, had a reply from another colleague with details of his investigation. Having ssh’d into the instance, he was able to see all connections to it, identify the program behind the suspicious connections, find where the connections were initiated from with the help of an IP lookup (and as you can guess it’s…), and pinpoint where on the file system the program was executing from. All of this happened before the day even broke; the emails were exchanged around 5–6am and the instance was shut down. As the day progressed, we were able to identify the vulnerability on Jenkins (SECURITY-429 / CVE-2017-1000353), which upgrading to the latest version would have fixed.
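For the curious, that kind of triage looks roughly like the sketch below. The commands, PID, and IP here are illustrative placeholders, not the exact ones from that morning:

```bash
# List TCP connections along with the owning process (run as root to see everything)
ss -tnp          # or: netstat -tnp

# Say a suspicious process with PID 1234 shows up; see what it has open
lsof -p 1234

# Find where on the file system the binary is actually running from
readlink /proc/1234/exe

# Look up who owns the remote IP the connections point to
whois 203.0.113.7
```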

What actually happened: an attacker at a Russian IP exploited the Jenkins vulnerability above (not an Apache Struts flaw, as I first assumed, but a Java deserialization bug in the Jenkins CLI) to upload a malicious program. The program scans all ports on the network to try to replicate itself on other machines, and it uses all available resources on the machine to run kworker34 (bitcoin-mining malware). That was why Jenkins couldn’t get enough resources to start. Why it stopped in the first place, I don’t know.

And again, back to Tuesday: now the race was on to set up Jenkins with all the previous jobs. The first thing we did was to set up a new Jenkins, but this time on GKE (I’d been wanting to take Jenkins to Google Cloud because ephemeral EC2 slaves aren’t cost-effective, and using Kubernetes to spin up JNLP slaves seemed like a better idea). Then we set up user access control on Jenkins and started configuring the jobs. That was what we did the whole day(s), with one issue after another: the JNLP Docker slaves on Kubernetes weren’t building our Docker images (docker-in-docker), and they didn’t have Flyway, the AWS CLI, etc. So, unfortunately, as a temporary fix I had to set up an EC2 slave 😞. Then I built a Flyway Docker image FROM jenkinsci/jnlp-slave.
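For reference, a rough sketch of that kind of custom slave image is below. The package list, Flyway version and download URL, and image tag are placeholders; treat it as a starting point rather than our exact image:

```bash
# Write a Dockerfile that extends the official JNLP slave image
cat > Dockerfile <<'EOF'
FROM jenkinsci/jnlp-slave

USER root

# Tools the stock image lacked: the AWS CLI (plus pip/curl to install things)
RUN apt-get update \
 && apt-get install -y --no-install-recommends curl python-pip \
 && pip install awscli \
 && rm -rf /var/lib/apt/lists/*

# Flyway command-line for DB migrations (version pinned here is only an example)
RUN curl -fsSL https://repo1.maven.org/maven2/org/flywaydb/flyway-commandline/4.2.0/flyway-commandline-4.2.0.tar.gz \
    | tar xz -C /opt \
 && ln -s /opt/flyway-4.2.0/flyway /usr/local/bin/flyway

USER jenkins
EOF

# Build it under a placeholder name and use it as the agent image
docker build -t example-org/jnlp-slave-flyway .
```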

Again, I woke up to an email, but this time it was a week later. The email was an appreciation for removing the shackles this incident had placed on developers through the lack of an automation server and other internal tools. Having had issues properly configuring the new GCP Jenkins and our jobs, we had to switch back to an EC2 instance to deploy our CI tools. Fortunately, we were able to back up the Jenkins home directory. The new plan was to containerize Jenkins (mounting the Jenkins home backup, of course) and set up SSL (which was missing on the old Jenkins). A post on this coming up later, IsA.
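A minimal sketch of what that containerized setup looks like, assuming the backup was restored to /backups/jenkins_home (the path, ports, and image tag are placeholders, not our exact setup):

```bash
# Run the official Jenkins LTS image with the restored home directory mounted in.
# 8080 is the web UI (terminate SSL in a reverse proxy in front of it),
# 50000 is the JNLP agent port.
docker run -d --name jenkins \
  -p 8080:8080 \
  -p 50000:50000 \
  -v /backups/jenkins_home:/var/jenkins_home \
  jenkins/jenkins:lts
```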

Lessons learnt:
* GCP is hard to switch to from AWS.
* It’s very important to update Jenkins (or any software) as soon as a stable version is released, especially if it contains a security fix.
* Take security seriously; it could have been a lot worse. If the malware had infected all our instances, all our products would have been down.
* Also ensure that only those who NEED access get it. Don’t just share SSH keys, allow network access from all IPs, put multiple instances on the same subnet, etc.

Moving forward:
* I’ll still try to set up Jenkins on GKE, although my free credit is exhausted. (PS: Google, I could do with a hand-holding guide and some free credit ¯\_(ツ)_/¯ )
* Add SSL to all internet-facing applications (Let’s Encrypt gives out free certificates).
* Always set up backups to S3 for the home/data directories of all apps and services, something like the sketch after this list. (Another future post, maybe.)
* Ensure all deployments are reproducible at any time (Ansible or Terraform, maybe).
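As a taste of that backup item, here’s a minimal sketch of the kind of nightly job I have in mind. The bucket name, path, and schedule are made up for illustration:

```bash
# /etc/cron.d/jenkins-backup (illustrative): every night at 02:00, tar the Jenkins
# home directory and stream it straight to S3 with the AWS CLI.
0 2 * * * root tar czf - /var/jenkins_home | aws s3 cp - s3://example-ci-backups/jenkins_home-$(date +\%F).tar.gz
```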

Doing some research while writing this, I found out that the exploit was widespread around the same time it hit our server, and that it affects a lot of servers still running old versions of Jenkins.
