Building the Perfect CI/CD Solution

I’ve been bouncing the concept around in my head for a while now of the perfect CI solution: one that can scale infinitely, run tests in parallel, and automatically deploy and promote builds across environments at enterprise scale.

However, there are a lot of concepts to this, so I’m hoping that by fleshing it out in the form of a blog post I’ll be able to remember it clearly in my next contract, hit the ground running, and build the solution out. I’ve previously touched on this subject at the end of my post here, but from attending talks around London recently I’ve spotted that some companies are quite behind the curve on this.

Dealing with Code Changes

I’m generally acting under the assumption that most people accept “don’t break the build” as good practice. That’s why I have to insist at this point that we deal with these changes at the pull request level, and presume that when a change is merged into master, the build will not break.

I mean that to the point where I’m saying that with this proposed approach, we can run all of our acceptance and integration tests (you know, the ones that previously took 3 hours) within a 20-minute window. That may seem like a long period of time, but we can use GitHub’s commit status API, which allows multiple statuses per commit via separate contexts, to report back as each test suite passes.
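
To make that concrete, here’s roughly what reporting a status back to the pull request’s head commit looks like. This is a minimal sketch: the statuses endpoint and its fields are GitHub’s real API, but the org/repo names and the GITHUB_TOKEN environment variable are placeholder assumptions.

```python
import os
import requests

def report_status(sha, context, state, description, target_url=None):
    """Post a commit status back to GitHub for a given commit SHA.

    Each distinct `context` (e.g. "ci/lint", "ci/unit-tests") shows up as a
    separate check on the pull request, which is what lets every parallel
    job report its result independently.
    """
    url = f"https://api.github.com/repos/my-org/my-repo/statuses/{sha}"  # placeholder org/repo
    resp = requests.post(
        url,
        headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
        json={
            "state": state,            # "pending", "success", "failure" or "error"
            "context": context,
            "description": description,
            "target_url": target_url,  # link back to the Jenkins build
        },
    )
    resp.raise_for_status()

# e.g. report_status(commit_sha, "ci/lint", "success", "Lint passed", build_url)
```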

In order to do this, you’ll first need your Jenkins slaves configured in a neat, scalable way. In an ideal world, I think this works out best by creating the Jenkins slaves as Docker images and then launching them into a Marathon-Mesos / Kubernetes cluster on demand, allowing dynamic allocation of resources.
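
On the Marathon route, launching a slave is just a POST of an app definition to Marathon’s REST API. The sketch below is illustrative only: the /v2/apps endpoint is Marathon’s real API, but the Marathon address, image name, resource sizes and environment variables are assumptions, and in practice you’d likely let a Jenkins plugin or autoscaler drive this rather than a hand-rolled script.

```python
import requests

MARATHON_URL = "http://marathon.internal:8080"  # assumed internal address

def launch_jenkins_slave(slave_id: str, jenkins_master_url: str) -> None:
    """Ask Marathon to schedule one containerised Jenkins slave on the cluster."""
    app_definition = {
        "id": f"/ci/jenkins-slave-{slave_id}",
        "cpus": 1.0,
        "mem": 2048,
        "instances": 1,
        "container": {
            "type": "DOCKER",
            "docker": {"image": "my-org/jenkins-slave:latest"},  # placeholder image
        },
        "env": {"JENKINS_URL": jenkins_master_url},  # illustrative; depends on how your slaves connect
    }
    resp = requests.post(f"{MARATHON_URL}/v2/apps", json=app_definition)
    resp.raise_for_status()
```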

First we lint the project to check that the syntax is okay. If the developer has chosen not to follow the team’s accepted coding guidelines, Jenkins reports the failure back on the pull request and links the developer to the failed check.

In parallel to this we can run the unit tests — so essentially you have the following structure:

Running pull request subtasks in parallel
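
In script form the fan-out is nothing more exotic than running both checks concurrently and pushing one status per check. A sketch along those lines, assuming the report_status helper from earlier and placeholder lint/test commands:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_check(name: str, command: list[str], sha: str) -> bool:
    """Run one pull-request check and report it as its own GitHub status context."""
    report_status(sha, f"ci/{name}", "pending", f"{name} running")
    result = subprocess.run(command)
    state = "success" if result.returncode == 0 else "failure"
    report_status(sha, f"ci/{name}", state, f"{name} {state}")
    return result.returncode == 0

def pull_request_stage_one(sha: str) -> bool:
    """Lint and unit tests run side by side; both must pass before we build an image."""
    checks = {
        "lint": ["make", "lint"],        # placeholder commands
        "unit-tests": ["make", "test"],  # placeholder commands
    }
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        results = pool.map(lambda item: run_check(item[0], item[1], sha), checks.items())
    return all(results)
```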

So in this scenario, if either fails, we go no further. Each reports the status of its test back to GitHub. Next we’re going to create a build. If you’re not on the Docker bandwagon yet, now’s your chance! As part of this job, we’ll also push the Docker image into your registry, whether that’s Quay.io, Tutum, Docker Hub or your own S3/Hub repository.
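
The build-and-push step is standard Docker workflow. A minimal sketch, assuming the image is tagged with the commit SHA and that the registry and repository names are placeholders:

```python
import subprocess

def build_and_push(sha: str) -> str:
    """Build the application image for this commit and push it to the registry.

    Tagging by commit SHA means the exact image that passed the PR tests is the
    one later promoted through staging and production.
    """
    image = f"quay.io/my-org/my-service:{sha}"  # placeholder registry/repo
    subprocess.run(["docker", "build", "-t", image, "."], check=True)
    subprocess.run(["docker", "push", image], check=True)
    return image
```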

Now, this is roughly the same as the previous step, but just in case you’re not following, here’s a pretty picture that sums this up:

Now comes the next part of the tests. Remember those really fast tests I mentioned at the start of this section? Cutting those 3 hours of acceptance/integration tests down to 20 minutes — hold onto your hats.

For the projects that aren’t changing, you deploy them into Marathon at their current master versions, and then deploy your own Docker image into Marathon too, pointed at these dependencies. You would then receive the accessible URL of the deployed container and be able to tell your test suite where to target.
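
Finding that URL is a matter of asking Marathon where it scheduled the task. A sketch of that lookup, reusing the assumed MARATHON_URL from earlier; the /v2/apps/{id} endpoint and its tasks/host/ports fields are part of Marathon’s API, while the app id is a placeholder:

```python
import requests

def get_app_endpoint(app_id: str) -> str:
    """Return the host:port of the first running task for a Marathon app,
    so the test suite knows where to point itself."""
    resp = requests.get(f"{MARATHON_URL}/v2/apps/{app_id}")
    resp.raise_for_status()
    task = resp.json()["app"]["tasks"][0]
    return f"http://{task['host']}:{task['ports'][0]}"

# e.g. target = get_app_endpoint("/ci/my-service-pr-42")  # placeholder app id
```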

In this flow we’ve got the test suite checked out in a Jenkins job. We list out each of the feature files and pump these into RabbitMQ. There’s then a Jenkins job listening to a queue in RabbitMQ (many alternatives would be just as good; there just aren’t plugins already available for them). This Jenkins job can fire off as many slaves as it needs to deal with the queue in the fastest possible time, and each of these reports its results back into RabbitMQ.
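
The producer side is just enumerating feature files and publishing one message per file. A minimal sketch using pika; the RabbitMQ host, queue name and feature directory are assumptions:

```python
import glob
import pika

def enqueue_feature_files(features_dir: str = "features") -> None:
    """Publish one message per feature file so worker slaves can pull them off
    the queue and run them independently."""
    connection = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq.internal"))  # assumed host
    channel = connection.channel()
    channel.queue_declare(queue="acceptance.pending", durable=True)
    for path in glob.glob(f"{features_dir}/**/*.feature", recursive=True):
        channel.basic_publish(exchange="", routing_key="acceptance.pending", body=path)
    connection.close()
```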

The initial Jenkins job that fired off this test suite is now actively watching the results queue, waiting for all of the jobs it’s triggered to complete (it’s prudent to put a timeout on this process). Once we have the results, we concatenate the reports from these jobs (I’d suggest JSON as the report format) and decide whether or not the tests have passed to whatever standard is set.
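
Collecting the results is the mirror image of the producer: consume from a results queue until you’ve seen one report per feature file or the deadline passes, then aggregate. A sketch under the same pika and queue-name assumptions, where each worker is assumed to publish a small JSON report:

```python
import json
import time
import pika

def collect_results(expected: int, timeout_seconds: int = 20 * 60) -> bool:
    """Drain the results queue until every feature has reported or we time out,
    then decide whether the overall acceptance stage passed."""
    connection = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq.internal"))  # assumed host
    channel = connection.channel()
    channel.queue_declare(queue="acceptance.results", durable=True)

    reports = []
    deadline = time.time() + timeout_seconds
    while len(reports) < expected and time.time() < deadline:
        _method, _properties, body = channel.basic_get(queue="acceptance.results", auto_ack=True)
        if body is None:
            time.sleep(5)  # nothing waiting yet; poll again
            continue
        reports.append(json.loads(body))  # e.g. {"feature": "login.feature", "passed": true}

    connection.close()
    return len(reports) == expected and all(r["passed"] for r in reports)
```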

Continuous Deployment

Pushing the limits with automation

The last part comes to the holy grail: genuine continuous deployment. Once we’ve merged into master, we deploy the project to our staging environment, alongside the versions of whatever is live right now (a per-environment, per-project version tracking API may be useful at this stage, and is very simple to build). We can then run our tests against this staging environment (after all, they’re now super fast), and if they pass, and we receive no alerts from our monitoring solutions of increased error rates, then we can deploy this to production.
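
That version tracking API really can be tiny; something like the Flask sketch below, which just records which image tag is deployed to which environment. It’s an illustrative assumption rather than a prescribed design: in-memory storage, made-up routes, and no auth.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
deployed = {}  # {(environment, project): version} - swap for a real datastore in practice

@app.route("/versions/<environment>/<project>", methods=["GET"])
def get_version(environment, project):
    """What is live for this project in this environment right now?"""
    version = deployed.get((environment, project))
    return jsonify({"project": project, "environment": environment, "version": version})

@app.route("/versions/<environment>/<project>", methods=["PUT"])
def set_version(environment, project):
    """Record a deployment, e.g. PUT {"version": "<image tag>"} after promoting a build."""
    deployed[(environment, project)] = request.get_json()["version"]
    return "", 204

if __name__ == "__main__":
    app.run(port=5000)
```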

In case this staging step does fail, we’re now in a situation where master is not releasable. There’s an excellent Slack integration for Jenkins that’ll announce to the team that the pipeline is broken, so the team can take the appropriate action to resolve it.
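
If you’d rather not rely on the plugin, the same announcement is a one-liner against a Slack incoming webhook. A sketch; the webhook URL is a placeholder you’d generate in Slack’s integration settings:

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder webhook

def announce_broken_pipeline(project: str, build_url: str) -> None:
    """Tell the team that master is not currently releasable."""
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f":rotating_light: {project} pipeline is broken on master: {build_url}"},
    ).raise_for_status()
```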

  • Edit: Check out the ‘SmartThings’ plugin Henri Cook pointed me to, which’ll allow your notifications to go to any device: office speakers, lighting, etc.

Otherwise, if the tests do succeed, we continue to monitor the released software on production, and if the error rates do increase, we trigger a rollback.
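
The rollback itself can lean on the same pieces used everywhere else: look up the previously recorded production version and redeploy that image tag. The sketch below strings together the hypothetical helpers from earlier (MARATHON_URL and the tiny version API); the app id, image naming and version API address are all assumptions, and how your monitoring decides that error rates have increased is left to whatever alerting you already run.

```python
import requests

VERSION_API = "http://versions.internal:5000"  # the tracking API sketched above (assumed address)

def rollback(project: str, previous_version: str) -> None:
    """Redeploy the last known-good image tag and record it as live in production."""
    # Partial app update; Marathon merges this with the existing app definition.
    requests.put(
        f"{MARATHON_URL}/v2/apps/{project}",  # placeholder app id
        json={"container": {"type": "DOCKER", "docker": {"image": f"my-org/{project}:{previous_version}"}}},
    ).raise_for_status()
    # Keep the version tracking API in step with what is actually live.
    requests.put(
        f"{VERSION_API}/versions/production/{project}",
        json={"version": previous_version},
    ).raise_for_status()
```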

I’d be very interested to hear anyone’s feedback on this, and if they’ve implemented any of these components before with success/failure.

Thanks for reading!