No one enjoys making silly mistakes when it comes to debugging projects, and the pain felt by those mistakes can be amplified exponentially when it takes hours of building or compiling code to discover them.
The initial setup of DetectionLab was an absolutely brutal process. Packer builds might run successfully for 90 minutes and then randomly time out while running Sysprep at the last step. Sometimes Vagrant would intermittently lose connections to the hosts it was bootstrapping, or I might typo a command and it would be hours into the build process before the bug became apparent. Waiting hours for a build to get to the point where you can reproduce a particular bug is nobody’s idea of fun.
It didn’t take long for me to realize that this project would benefit immensely from having a continuous integration pipeline, but finding a workable solution evaded me for months. There were plenty of examples of people using CI to ensure Packer builds completed successfully, but that’s just a small piece of DetectionLab. In a nutshell, the build process consists of the following steps:
- Packer builds a Windows Server 2016 box from an evaluation ISO, downloads & installs Windows Updates, applies custom configurations, and syspreps.
- Packer builds a Windows 10 box from an evaluation ISO, downloads & installs Windows Updates, applies custom configurations, and syspreps.
- Vagrant downloads an Ubuntu 16.04 box and installs Splunk, Fleet, and Caldera.
- Vagrant brings up a Server 2016 Domain Controller, creates an AD forest, installs multiple GPOs, and configures security monitoring tooling.
- Vagrant brings up a Server 2016 host that joins the domain, collects forwarded Windows Event Logs and PowerShell logs, and forwards them to Splunk.
- Vagrant brings up a Windows 10 host that joins the domain and configures security monitoring tooling.
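As a rough sketch, the whole sequence boils down to a handful of Packer and Vagrant commands. The template and host names below follow the DetectionLab repository’s layout (windows_2016.json, windows_10.json, and the logger/dc/wef/win10 hosts) and may drift over time:

```shell
# Sketch of a manual build; check the DetectionLab repo for the current layout
cd Packer
packer build windows_2016.json   # Server 2016 base box from the eval ISO
packer build windows_10.json     # Windows 10 base box from the eval ISO

cd ../Vagrant
vagrant up logger                # Ubuntu 16.04: Splunk, Fleet, Caldera
vagrant up dc                    # Domain controller, AD forest, GPOs
vagrant up wef                   # Collects forwarded logs, sends to Splunk
vagrant up win10                 # Windows 10 host that joins the domain
```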
There are a LOT of moving parts in this build process, which translates to ample opportunities for things to go awry. I also wasn’t able to find a single existing example of a continuous integration setup that supported bringing multiple networked Vagrant instances online.
Narrowing Down Issues
You learn a lot when you open source a project. I was thrilled by all the positive reception DetectionLab garnered, but somewhat dismayed to see how many people were having issues building it. Instead of viewing each new issue as a personal failure, I used each one as an opportunity to improve the build process by making it more robust.
Triaging issues helped me identify which parts of the build were flakiest and caused the most problems. This process ironed out many issues that could only be discovered by having a wide variety of people attempt to build the lab in a wide variety of environments.
Designing the Pipeline
I had three requirements to meet while searching for the right continuous integration solution:
- It had to support nested virtualization (required for Packer & Vagrant)
- The system requirements needed to meet DetectionLab’s requirements
- I wasn’t going to host my own server to run the build automation software ($$$)
I was initially pleased to discover that Google Compute Engine supports nested virtualization, but for whatever reason the Packer builds would cause the entire instance to lock up during the sysprep phase. After many wasted hours attempting to debug locked-up GCE instances, I gave up. Not long after, I came across an enlightening blog post from cilium.io: they were using a baremetal provider called Packet to test Vagrantfiles in conjunction with Jenkins.
I started down the same road hoping I could emulate their setup, but as I previously mentioned, one of my requirements was that I was not going to host my own server with a build automation system: it would simply cost too much money to keep the server online all the time when I only needed to run a few builds per month. That fact, combined with the history of Jenkins security vulnerabilities, was enough to steer me away from that route. To a security engineer, the installation guide for Jenkins reads like a horror story.
CircleCI became a much more attractive build system at this point. It’s cloud-hosted, free* (for 1 container & 1,500 build minutes), and integrates easily with GitHub. Although it is now obvious in retrospect, the part I struggled with at the time was: “how do I use a Packet baremetal server to do the building and CircleCI to do the reporting on the build results?” A conversation with a coworker filled in the gap I needed: I could use CircleCI to call the Packet.net APIs, provision a baremetal server, and transfer the files needed for the build!
Putting It All Together
I can’t definitively say that what I’ve built is the most efficient or even an elegant way of accomplishing this goal, but it does work! Here’s how the continuous integration system currently operates:
- When code is pushed to the master branch, CircleCI detects this change via the GitHub integration and automatically spins up a test build instance.
- Instead of building DetectionLab on the Circle instance (it doesn’t support nested virtualization or meet DetectionLab’s hardware requirements), it runs the commands detailed in config.yml to call a few Packet APIs and provision a baremetal server with 32GB of RAM and 4 CPU cores. Because it’s a baremetal server, it natively supports virtualization.
- After the Packet server comes online, the Circle instance uses `scp` to copy the DetectionLab build script to the Packet server and `ssh` to execute it remotely.
- The Packet host begins the build process in a tmux session so that it’s easy to log in and view or debug the build. As part of the build process, the Packet server starts a webserver that serves a page reading “building”. Once the build is complete, it updates the page to read “success” or “failed”.
- While the build is occurring, the Circle host polls the webserver on the Packet host in a while loop. If the page still responds with “building”, it sleeps for a bit and checks again. If the page eventually reads “success”, the script exits cleanly and reports the results; otherwise it calls `exit 1` so that Circle knows the build was unsuccessful. Most importantly, the Circle host then calls the Packet API and tells it to destroy the instance so I don’t run up a very expensive bill!
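The Circle-side steps above can be sketched as a small shell script. This is a hypothetical reconstruction, not the actual config.yml: the endpoints are Packet’s v1 REST API, but the plan and facility values, the `/status` path, and the `build_detectionlab.sh` name are all illustrative.

```shell
#!/bin/sh
# Hypothetical sketch of the CircleCI-side flow. PACKET_AUTH_TOKEN,
# PROJECT_ID, and PACKET_HOST are assumed environment variables; the plan,
# facility, and script names are illustrative.

provision_server() {
  # Ask Packet's v1 API to create a baremetal device in the project
  curl -s -X POST "https://api.packet.net/projects/${PROJECT_ID}/devices" \
    -H "X-Auth-Token: ${PACKET_AUTH_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{"hostname": "detectionlab-ci", "plan": "baremetal_1",
         "facility": "ewr1", "operating_system": "ubuntu_16_04"}'
}

start_build() {
  # Copy the build script over and launch it inside a detached tmux session
  scp build_detectionlab.sh "root@${PACKET_HOST}:/root/"
  ssh "root@${PACKET_HOST}" \
    "tmux new-session -d -s build /root/build_detectionlab.sh"
}

poll_build() {
  # "$@" is the command that fetches the current status page text
  while true; do
    case "$("$@")" in
      building) sleep 60 ;;   # still in progress; check again shortly
      success)  return 0 ;;   # build finished cleanly
      *)        return 1 ;;   # "failed" (or anything unexpected)
    esac
  done
}

# In CI, the final step would be something like:
#   poll_build curl -s "http://${PACKET_HOST}/status" || exit 1
```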
Despite the progress I made, I know there are still many things that will need to be improved with this setup. My build process assumes a lot of things will succeed that could potentially fail:
- The API call to provision a server from Packet could fail
- The DetectionLab build process could terminate prematurely, leaving the Circle host in a permanent loop (and me with a large bill for Packet!)
- Dependencies could change, requiring updates to the actual build script
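One way to guard against the permanent-loop failure mode (and the bill that comes with it) is an exit trap on the CI side that destroys the device no matter how the script terminates. A minimal sketch, assuming hypothetical `DEVICE_ID` and `PACKET_AUTH_TOKEN` environment variables and Packet’s v1 delete endpoint:

```shell
#!/bin/sh
# Hedged sketch: destroy the Packet device on any exit. DEVICE_ID and
# PACKET_AUTH_TOKEN are assumed environment variables (hypothetical names);
# the delete endpoint is Packet's v1 API.

cleanup() {
  if [ -n "${DEVICE_ID:-}" ]; then
    # Destroy the device so a hung or failed build doesn't keep billing
    curl -s -X DELETE "https://api.packet.net/devices/${DEVICE_ID}" \
      -H "X-Auth-Token: ${PACKET_AUTH_TOKEN}"
  else
    echo "no device provisioned; nothing to clean up"
  fi
}

# Run cleanup whether the script completes, fails, or is interrupted
trap cleanup EXIT
```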
These types of events need to be planned for, and good code should account for as many failure cases as possible. However, for now I am going to celebrate the small victory that is a functional build pipeline for a networked and multi-instance Vagrant-based lab setup!