Automation of Jenkins Controller with Packer and Terraform

Published in SSENSE-TECH, Dec 8, 2023

Continuous integration (CI) is a software development practice that aims to improve the quality and speed of software delivery by encouraging development teams to frequently implement small code changes and merge them into a shared repository.

Jenkins is a self-contained, open-source automation server that can be used to automate various tasks related to building, testing, and delivering or deploying software. It is one of the most popular CI tools available today, and it is heavily used at SSENSE to:

Deploy microservices across multiple Amazon Elastic Kubernetes Service (EKS) clusters.
Deploy serverless applications on multiple AWS accounts.
Run jobs, routines, and pipelines for purposes such as database backups, migrations, data pipelines, reporting, and end-to-end testing.
Build and deploy our AWS infrastructure with tools like Terraform, Packer, Helm, and CloudFormation.

Back in 2017, we were using multiple Jenkins controllers, running on different EC2 machines, dedicated to automated tasks and jobs. As our teams grew and started adopting microservices, our development process became more structured. The definitions and conventions for continuous integration pipelines across the engineering department led us to adopt a new Jenkins controller for our venture into microservices. A manually created proof of concept was installed on an AWS EC2 machine and quickly evolved into a production tool.

By 2021, the use of Jenkins had become significant, highlighting the importance of keeping it up-to-date, secure, and reliable. But this wasn’t an easy feat since Jenkins was the only part of our infrastructure that had not yet been automated. So, the decision became clear: automate Jenkins’ infrastructure. This would not only save time and provide a better developer experience, but also drastically reduce our mean-time-to-recovery in case of disaster, from hours to minutes.

Initial state

1- Jenkins controller initial infrastructure

The entire system was manually created through the AWS console or Cloudflare admin dashboard, which involved the following steps:

Creating an EC2 machine from an Amazon Linux AMI with a user_data script to install a specific version of Jenkins.
Creating a security group with all the necessary permissions and attaching it to the EC2 machine.
Attaching an EBS volume to the EC2 machine within the console. This volume would then be connected to the machine and mounted to JENKINS_HOME: /var/lib/jenkins. It was essential to ensure that the volume contained all required files and configuration to enable recovery from the latest state of Jenkins.
Creating load balancers and assigning the EC2 machine as the target.
Creating CNAME DNS records that pointed to the load balancer.

To keep the system running smoothly, we needed to:

Upgrade Jenkins and maintain its configuration.
Upgrade EC2 images and apply security patches. This involved periodically moving to newer images, such as from Amazon Linux 1 to Amazon Linux 2, and applying security patches as they were released.
Recover in case of a problem or disaster. If Jenkins is down, we cannot deploy anything to production, which could prevent critical application bug fixes from going live.
Maintain Jenkins access to AWS resources through IAM permissions. Jenkins is used to create and update many AWS resources, such as Lambda functions, Step Functions, API gateways, databases, and more. This is done using multiple deployment tools, including CloudFormation, Terraform, Packer, the AWS CLI, and the AWS SDK, as well as ad-hoc requests such as performing database backups. Therefore, maintaining proper access to AWS resources was essential.
Keep access to Jenkins up to date. There are two ways to access Jenkins in our setup: human users access the web interface to check their pipelines, configure settings, and create jobs, while applications such as GitHub (through webhooks) or internal services trigger jobs through API calls. Both kinds of access need to be maintained and must evolve with our security compliance requirements.

All of these items required manual intervention, such as connecting to the server, applying the change, and restarting the service if needed. These changes were not testable, which sometimes led to long upgrade times, crashes, bugs, and rushed investigations. There was no change history, meaning that knowledge about configuration changes and dependencies was lost. It also meant that the mean time to recover was counted in hours. Even an expected failure, like the termination of the EC2 machine, meant we lost all changes made to the machine and had to re-create the instance manually.

Improvements

There was definitely room for improvement, so let’s start by increasing our availability.

Availability

With an Auto Scaling Group (ASG) in place, we no longer need to recreate the EC2 machine running Jenkins manually if we lose it. The ASG launches a new instance within a couple of seconds to serve as the new Jenkins controller. Since only one instance of Jenkins can serve at a time, both the minimum and maximum capacity are set to 1.
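As a rough illustration, a single-instance ASG could be expressed in Terraform (the tool adopted later in this article) along the following lines. This is a minimal sketch, not our actual module: the resource name and the three input variables are hypothetical, and the health check values are borrowed from the variable example shown further down.

```hcl
# Hypothetical inputs: these names are assumptions made for this sketch.
variable "subnet_id" {
  type = string
}

variable "target_group_arn" {
  type = string
}

variable "launch_template_id" {
  type = string
}

resource "aws_autoscaling_group" "jenkins" {
  name = "jenkins-controller" # hypothetical name

  # Only one Jenkins controller may serve at a time.
  min_size         = 1
  max_size         = 1
  desired_capacity = 1

  vpc_zone_identifier = [var.subnet_id]
  target_group_arns   = [var.target_group_arn]

  # Replace the instance when the load balancer reports it unhealthy,
  # after a grace period that leaves Jenkins time to start.
  health_check_type         = "ELB"
  health_check_grace_period = 10800

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }
}
```

With both sizes pinned to 1, the group never scales out; its only job is to replace the controller when the instance is terminated or fails its health check.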

2- Jenkins controller infrastructure reviewed

We have seen earlier that we were not able to test any changes before applying them to production. Let’s now look at how we can improve that.

Automation

Instead of making changes on a running Jenkins server, we introduced a golden Amazon Machine Image (AMI) for Jenkins, built with Packer by HashiCorp. To deploy the entire infrastructure, we opted for Terraform.

Let’s dive deeper into how these 2 implementations have been set up.

Amazon Machine Image

A Jenkins golden AMI provides the best solution for handling Jenkins controller updates, maintenance, and reliability. With this AMI, we set up a process to test and validate any updates and configuration changes related to Jenkins before applying them to production. This process is fully automated with Packer by HashiCorp, which offers many advantages:

Automated creation and update of the Jenkins golden AMI using HCL.
Enhanced stability as Packer installs and configures all relevant packages required for Jenkins, such as Java and Jenkins, during the image building process. This allows any bugs in these scripts to be identified early on, rather than after a machine is launched.
Improved testability after an AMI is built, since that image can be quickly launched and smoke tested to verify that everything is working.
Codified security and compliance baselines to ensure that Jenkins’ golden image is consistent.
Extended image management to allow provisioning workflows with Terraform integration.

4- The workflow of building Jenkins golden image

The “Run provisioners” stage is the step where we install and configure packages within shell scripts. In case of failure at this stage, the process fails and errors are reported. This ensures that only fully functioning golden AMIs will be created in the “Create and Register AMI” stage.

Here is a snippet of a Packer AMI definition:

https://gist.github.com/fdesouza-ssense/0b24483439e96ed7c88c665382b6e94e#file-ami-pkr-hcl

The ami.pkr.hcl file contains the Packer HCL2 template definition. The AMI is created with the Amazon EBS builder.
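For readers who cannot open the gist, a template of this kind generally has the shape below. This is a minimal sketch rather than the real ami.pkr.hcl: the base image filter, build instance type, SSH user, script paths (other than the 2-jenkins.sh name visible in the gist URL), and AMI naming scheme are all assumptions, and the region is inferred from the "us-west-2b" AZ used later in the article.

```hcl
packer {
  required_plugins {
    amazon = {
      version = ">= 1.2.0"
      source  = "github.com/hashicorp/amazon"
    }
  }
}

variable "jenkins_version" {
  type = string
}

variable "java_version" {
  type = string
}

source "amazon-ebs" "jenkins" {
  # Assumed naming scheme: AMI names must be unique, so a timestamp is appended.
  ami_name      = "jenkins-ami-${formatdate("YYYYMMDDhhmmss", timestamp())}"
  instance_type = "t3.medium" # assumed size for the temporary build instance
  region        = "us-west-2"
  ssh_username  = "ec2-user"

  # Assumed base image: the most recent Amazon Linux 2 AMI.
  source_ami_filter {
    filters = {
      name                = "amzn2-ami-hvm-*-x86_64-gp2"
      root-device-type    = "ebs"
      virtualization-type = "hvm"
    }
    most_recent = true
    owners      = ["amazon"]
  }

  tags = {
    JENKINS_VERSION = var.jenkins_version
    JAVA_VERSION    = var.java_version
  }
}

build {
  sources = ["source.amazon-ebs.jenkins"]

  # "Run provisioners" stage: a non-zero exit code from any script fails
  # the build, so no AMI is registered.
  provisioner "shell" {
    environment_vars = [
      "JENKINS_VERSION=${var.jenkins_version}",
      "JAVA_VERSION=${var.java_version}",
    ]
    scripts = [
      "scripts/1-java.sh", # hypothetical script name
      "scripts/2-jenkins.sh",
    ]
  }
}
```

Passing versions as environment variables keeps the shell scripts free of hard-coded version numbers, so an upgrade becomes a one-line change in the variables file.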

Variables are added to the variables.pkrvars.hcl file, which lets us define which version of Java, Jenkins, or other dependencies to install. The values are used by either the builder or the provisioners to perform the installation and configuration.

https://gist.github.com/fdesouza-ssense/0b24483439e96ed7c88c665382b6e94e#file-variables-pkrvars-hcl
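A variables file of this kind is just a list of assignments. The sketch below is illustrative only: the two variable names are assumptions (the template would need matching `variable` blocks), and the values are taken from the version tags shown in the Terraform example later in this article.

```hcl
# variables.pkrvars.hcl (illustrative content, variable names are assumed)
jenkins_version = "2.414.2"
java_version    = "11"
```

Such a file is typically passed to the build with `packer build -var-file=variables.pkrvars.hcl ami.pkr.hcl`, so upgrading Jenkins amounts to changing one value and rebuilding the image.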

Several scripts run in the “Run provisioners” stage to install and configure packages on the machine. This is the script related to the Jenkins installation:

https://gist.github.com/fdesouza-ssense/0b24483439e96ed7c88c665382b6e94e#file-2-jenkins-sh

Infrastructure

The infrastructure automation relies on Terraform. Whenever we need to update a component of the infrastructure (e.g. adding more permissions to the instance profile, allowing access to Jenkins from specific IPs, updating prefix lists through security groups, or updating the Jenkins controller with a new golden AMI), we update our Terraform code, which is built on Terraform modules, and a pipeline is triggered to apply the changes.
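To illustrate the last case, rolling out a new golden AMI, one common Terraform pattern is to look up the most recent matching image and reference it from the launch template. The sketch below assumes that pattern; it is not taken from our modules, the resource names and the template variables passed to the user data file are hypothetical, and the literal values come from the variable example further down.

```hcl
# Find the newest AMI in this account whose name starts with "jenkins-ami".
data "aws_ami" "jenkins" {
  most_recent = true
  owners      = ["self"]

  filter {
    name   = "name"
    values = ["jenkins-ami*"]
  }
}

resource "aws_launch_template" "jenkins" {
  name_prefix   = "jenkins-controller-" # hypothetical prefix
  image_id      = data.aws_ami.jenkins.id
  instance_type = "m6i.4xlarge"

  # Launch templates expect base64-encoded user data.
  user_data = base64encode(templatefile("${path.module}/jenkins_setup.sh.tpl", {
    jenkins_home_dir = "/var/lib/jenkins" # assumed template variable
    device           = "/dev/sdf"         # assumed template variable
  }))
}
```

With this arrangement, a newly built AMI shows up as a change to the launch template in the next `terraform plan`, which makes the rollout reviewable before it is applied.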

The entire architecture is deployed in a single availability zone. The benefit of infrastructure as code is that in case of unavailability of the current AZ, we can simply change the “az” variable in the Terraform code and redeploy the entire infrastructure to the newly selected AZ. Here’s an example of variable definitions in Terraform code:

az = "us-west-2b"
ebs = {
  disk_size = 1000
  iops      = "10000"
}
launch_template = {
  user_data_file   = "jenkins_setup.sh.tpl"
  jenkins_home_dir = "/var/lib/jenkins"
  device           = "/dev/sdf"
  sg = {
    sg_name = "jenkins"
    ...
    ...   
  }
  ami_name      = "jenkins-ami"
  instance_type = "m6i.4xlarge"
  tags          = { "JENKINS_VERSION" : "2.414.2", "JAVA_VERSION" : "11" }
}
asg = {
  max_size                  = 1
  min_size                  = 1
  health_check_type         = "ELB"
  health_check_grace_period = 10800
  private_subnets           = false
  protect_from_scale_in     = false
}
...
...
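To show how a single "az" value can drive the placement of the stack, here is a minimal sketch of the consuming side. It is an assumption about how such variables might be wired, not our module code: the `vpc_id` variable, the resource names, and the `io2` volume type are hypothetical, and `iops` is declared as a string only because it is quoted in the example above.

```hcl
variable "az" {
  type        = string
  description = "Availability zone hosting the whole Jenkins stack"
}

variable "ebs" {
  type = object({
    disk_size = number
    iops      = string
  })
}

# Hypothetical input: the VPC in which subnets are looked up.
variable "vpc_id" {
  type = string
}

# Select the subnets of the VPC that live in the chosen AZ.
data "aws_subnets" "selected" {
  filter {
    name   = "vpc-id"
    values = [var.vpc_id]
  }

  filter {
    name   = "availability-zone"
    values = [var.az]
  }
}

# The JENKINS_HOME volume is created in the same AZ as the instance.
resource "aws_ebs_volume" "jenkins_home" {
  availability_zone = var.az
  size              = var.ebs.disk_size
  iops              = tonumber(var.ebs.iops)
  type              = "io2" # assumed volume type
}
```

Because every AZ-dependent resource reads the same variable, changing "az" moves the subnets and the volume together. EBS volumes are tied to one AZ, so a move like this would also need the JENKINS_HOME data restored from a snapshot, which is not shown here.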

Conclusion

Overall, automating the Jenkins controller addressed the maintenance issues we encountered with our initial manual installation. By using Packer, we were able to build and test a Jenkins golden AMI, which contains the desired configuration of Jenkins, before deploying it into production. This has significantly decreased maintenance time and improved reliability. Additionally, using Terraform has allowed us to leverage the benefits of Infrastructure as Code (IaC), ensuring a consistent, reusable, and scalable Jenkins infrastructure.

In the next article, we will build on this automation foundation to run Jenkins jobs on EC2 spot instances and significantly reduce our costs.

Editorial reviews by Catherine Heim, Luba Mikhnovsky, Mario Bittencourt & Sam-Nicolai Johnston.
