Beyond Bare-Bones CI/CD: Refining the Developer Experience

Published in Driven by Code
10 min read · Jun 17, 2019

By: Kyler Stole

In our previous CI/CD and Spacepods posts, we gave high-level overviews of our process and tools, respectively. In this post, we dive into the details of how Spacepods enables and orchestrates CI/CD at TrueCar.

As a process, continuous integration/continuous deployment (CI/CD) is meant to ease the effort a developer needs to put into shipping code by removing manual steps. Developing sturdy, automated tests is key to that, but not the only requirement. A great process provides developers with the tools they need to keep things moving, exactly when they need them. As we pieced together our CI and CD workflows, we had to examine how users interact with the tools to deliver an intuitive, dev-friendly system. The investment we put into refining our CI/CD process gives developers better insight and allows us to ship code — from the smallest of bug fixes to the largest feature or refactor — with the utmost speed.

Foot on the Pedal: Continuous Integration

Continuous integration is the more user-focused of the two CI/CD components. The goal is to get code merged to the default branch quickly, and that happens by integrating with the development experience in such a way that the developer never has to struggle through something that could be automated. Our main developer interface into CI is GitHub, where we create Pull Requests (PRs), perform code reviews, and display results from the numerous automated checks that form the backbone of CI.

Keep the Gears Moving

Unlike the CD flow, which is mostly sequential, the CI flow involves a lot of interdependent pieces that run in parallel. In GitHub, devs can open or close PRs; add commits, labels, and comments; or edit PR information whenever they want. Our CI system is constantly reacting to changes in the PR while coordinating various components outside of GitHub like our build system. Showing correct values for each CI check is vital because a failure will delay merging and we want to avoid manual intervention as much as possible.

Messaging is Key

Our repos include checks for code style and quality, automated unit tests, and automated integration tests. Automated checks are great, but messaging that quickly conveys the intent and status of those checks is even better. The effort we put into refining the messaging around these checks to make them as precise as possible allows devs to extract the most value from this feature. The Details link on each check is also extremely useful for quickly reaching that check's log messages or results page.
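To illustrate, a check's result could be packaged for GitHub roughly as follows. This is a minimal Python sketch, not Spacepods code: the field names follow GitHub's commit status API, while `build_status`, the message text, and the example URL are hypothetical.

```python
# States accepted by GitHub's commit status API.
VALID_STATES = {"error", "failure", "pending", "success"}


def build_status(context: str, state: str, description: str, details_url: str) -> dict:
    """Build the JSON body for one commit status (one line in the PR's checks list)."""
    if state not in VALID_STATES:
        raise ValueError(f"unknown state: {state!r}")
    return {
        "context": context,          # name of the check, e.g. "Pod CI"
        "state": state,              # drives the icon shown next to the check
        "description": description,  # short message explaining the current status
        "target_url": details_url,   # where the Details link points
    }


# Hypothetical example: Gatekeeper is waiting on a Pod update.
status = build_status(
    context="Gatekeeper CI",
    state="pending",
    description="Waiting for Pod to update to build 42",
    details_url="https://spacepods.example.com/pods/123",
)
```

Keeping the description specific ("waiting for build 42" rather than "pending") is what lets a dev understand the state of a check without clicking through.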

Let’s break those down a bit…

Title Check CI

This validates the title of the PR. We parse it for a Jira ticket number and enforce that the number corresponds to a ticket on our Jira board. This is the simplest of all our checks, but it ensures that all work is tracked through our issue-tracking software.
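A title check along these lines can be sketched in a few lines of Python. The regular expression assumes the usual Jira key shape (uppercase project key, hyphen, number), and `ticket_exists` is a hypothetical callback standing in for a real Jira lookup; neither is taken from the Spacepods code.

```python
import re
from typing import Callable, Optional, Tuple

# Assumed Jira key shape: uppercase project key, hyphen, issue number (e.g. CAR-101).
JIRA_KEY = re.compile(r"\b([A-Z][A-Z0-9]+-\d+)\b")


def extract_ticket(title: str) -> Optional[str]:
    """Return the first Jira-style key found in the PR title, if any."""
    match = JIRA_KEY.search(title)
    return match.group(1) if match else None


def title_check(title: str, ticket_exists: Callable[[str], bool]) -> Tuple[str, str]:
    """Return a (state, message) pair for the title check."""
    ticket = extract_ticket(title)
    if ticket is None:
        return ("failure", "PR title must include a Jira ticket number")
    if not ticket_exists(ticket):
        return ("failure", f"{ticket} was not found in Jira")
    return ("success", f"Linked to {ticket}")


# Hypothetical usage with an in-memory set instead of a Jira API call.
known_tickets = {"CAR-101"}
result = title_check("CAR-101: Fix price rounding", known_tickets.__contains__)
```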

Build CI

This waits for the latest version of the PR’s branch to be built as a Docker container image. The image is uploaded to our internal build management and deployment tool, Spacepods, where it becomes a new version of the app. The version is assigned a build number, at which point it can be used in deployments. As new commits are added to the PR, additional builds kick off, which will all end up in Spacepods.

Pod CI

This is where Spacepods really shines in the CI process. When a build is available, Spacepods automatically deploys it to AWS along with its dependencies, such as backend APIs for frontend apps and databases for backend apps. The collection of provisioned resources is encapsulated in a Spacepods abstraction called a Pod. As new builds become available in Spacepods, the Pod is updated with the latest version of the code and the CI messaging changes to reflect that.

Note: We don’t deploy a Pod for the PR until someone applies a particular label (e.g. ready for pod). This is solely a cost-saving measure. Devs often create PRs to gather feedback well before they are ready to merge, so the Pod resources are not always needed right away.
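The gating logic described here can be summarized in a small decision function. This is only a sketch: the label text comes from the example above, while `pod_action` and its return values are hypothetical names chosen for illustration.

```python
from typing import Iterable, Optional

POD_LABEL = "ready for pod"


def pod_action(labels: Iterable[str],
               latest_build: Optional[int],
               pod_build: Optional[int]) -> str:
    """Decide what to do with a PR's Pod.

    latest_build: newest build number for the PR branch, or None if none exists yet.
    pod_build: build currently running in the Pod, or None if there is no Pod.
    """
    if POD_LABEL not in set(labels):
        return "skip"            # cost saving: no label, no Pod
    if latest_build is None:
        return "wait_for_build"  # label applied before the first build finished
    if pod_build is None:
        return "create"
    if pod_build != latest_build:
        return "update"          # new commits produced a newer build
    return "up_to_date"
```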

Gatekeeper CI

So what was the point of deploying a Pod? Well, in addition to providing a testing environment for test engineers and for devs reviewing the PR, it is also the target of automated testing. Gatekeeper is the suite of feature tests and integration tests that must pass. These Gatekeeper tests are rerun every time the Pod is updated. A failure at this stage may indicate that the code will not pass through the QA environment when it reaches the CD flow.

Additional CI Steps

In addition to the CI checks that Spacepods adds, each app repository can add its own steps to Pull Requests. Code Climate, linters, and extra automated test suites are all options that show up throughout our repos.

Let it Ride: Continuous Deployment

While the CI flow is supposed to provide tools to aid developers in getting their code merged, CD should be entirely hands-off. Once the code is merged, a build of the default branch is created and uploaded to Spacepods. The CD flow kicks off automatically with this new build.

Evolution of a Deploy Dashboard

Spacepods initially contained the base functionality to start deploys manually by selecting the app to deploy, the version, and the destination environment. It was up to developers to deploy their code to the higher-level environments in the proper order (QA, staging, production) and refresh the page to see if Gatekeeper tests passed on each environment before deploying to the next.

This progressed to the beginning of a dashboard to visualize the deploy flow, although still with manual steps throughout. This dashboard displays key information on each environment, like the currently deployed version and the latest deploy details. This is possible because Spacepods acts as the sole orchestrator for all deploys. A distinguishing feature of this dashboard is a link to a GitHub comparison with the previously deployed code and a list of contributors since the last deploy. It improves the deploy flow because deployers can easily see test results for each environment, and the “Promote” button for the next environment becomes active only when tests have passed. If this refreshed automatically, it would actually be a pretty decent start at continuous delivery.
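The rule behind the “Promote” button is simple enough to sketch. The environment order comes from the text; the function names are hypothetical and the real dashboard logic may differ.

```python
from typing import Optional

# Promotion order described above.
ENVIRONMENTS = ["qa", "staging", "production"]


def next_environment(env: str) -> Optional[str]:
    """Return the environment a build would be promoted to, or None at the end."""
    index = ENVIRONMENTS.index(env)
    return ENVIRONMENTS[index + 1] if index + 1 < len(ENVIRONMENTS) else None


def promote_enabled(env: str, gatekeeper_passed: bool) -> bool:
    """The button is active only when tests passed and there is somewhere to go."""
    return gatekeeper_passed and next_environment(env) is not None
```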

Okay, so changing continuous delivery to continuous deployment is just about automating the manual steps in the process, right? Essentially… But that does not come at the flip of a switch. It requires confidence in the deploy procedure and the automated tests. Whereas the continuous delivery path allows test engineers to validate results along the way, continuous deployment requires that any manual intervention be complete by the time the CD pipeline executes. It also requires that the process can cope with failures and unexpected behavior in an elegant way.

Visualize the Deploy Journey

Since developers are not actively deploying their code, it is important to change visualizations from an environment-by-environment view to a more all-encompassing dashboard. We completely rebuilt the deploys dashboard to focus on each app’s CD pipeline and make it clear to developers where their code is running. The best part? It refreshes automatically!

Technical Note: We implemented the automatic refresh functionality using HTML5 Server-Sent Events and Redis Pub/Sub on the server. This is much easier on the server than a polling design because we can react to changes rather than frequently querying the DB for all the different components of the dashboard. We considered Rails 5’s Action Cable, but it was still in its infancy at the time and WebSockets were not supported by the AWS Classic Load Balancers we were using (we now use Application Load Balancers).
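For readers unfamiliar with Server-Sent Events, the wire format is plain text: optional `id:` and `event:` fields, one or more `data:` lines, and a blank line to end each message. The Python sketch below shows that framing only. It is not the Rails implementation described above, and `queue.Queue` is a stand-in for a Redis Pub/Sub subscription; the event name and payload fields are made up for the example.

```python
import json
import queue
from typing import Iterator, Optional


def format_sse(data: dict, event: Optional[str] = None, event_id: Optional[str] = None) -> str:
    """Serialize one update as a Server-Sent Events message."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # Each line of the payload needs its own "data:" prefix.
    for chunk in json.dumps(data).splitlines():
        lines.append(f"data: {chunk}")
    # A blank line terminates the message.
    return "\n".join(lines) + "\n\n"


def stream(updates: "queue.Queue") -> Iterator[str]:
    """Yield an SSE message for every published update; None ends the stream."""
    while True:
        update = updates.get()  # blocks until something is published
        if update is None:
            break
        yield format_sse(update, event="deploy")


# Hypothetical usage: one update is published, then the stream is closed.
updates: "queue.Queue" = queue.Queue()
updates.put({"env": "qa", "status": "succeeded"})
updates.put(None)
messages = list(stream(updates))
```

Because the server only writes when something is published, an idle dashboard costs an open connection rather than a stream of repeated queries.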

On Deck / In Flight / On Prod

This is the highest-level information about an app. On Prod just shows the current build on production and On Deck shows the version that is waiting to go through the pipeline, each linking to a build detail page. In Flight is the most useful component because it shows the code currently deploying through the pipeline and the people who contributed (since the last production deploy).
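One way to picture how a build moves between these three slots is the sketch below. The transitions are an assumption inferred from the descriptions above (a new build waits On Deck until nothing is In Flight, and a newer build replaces one that is still waiting); the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AppStatus:
    on_prod: Optional[int] = None    # build currently on production
    in_flight: Optional[int] = None  # build moving through the pipeline
    on_deck: Optional[int] = None    # build waiting for the pipeline


def start_if_idle(status: AppStatus) -> None:
    """Move the waiting build into the pipeline when nothing else is deploying."""
    if status.in_flight is None and status.on_deck is not None:
        status.in_flight, status.on_deck = status.on_deck, None


def new_build(status: AppStatus, build: int) -> None:
    """A merge produced a new build; it replaces anything still waiting."""
    status.on_deck = build
    start_if_idle(status)


def finished_production(status: AppStatus) -> None:
    """The in-flight build reached production; start the next one, if any."""
    status.on_prod, status.in_flight = status.in_flight, None
    start_if_idle(status)


# Hypothetical sequence: three merges arrive while build 41 is deploying.
app = AppStatus(on_prod=40)
new_build(app, 41)
new_build(app, 42)
new_build(app, 43)
finished_production(app)
```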

Pipeline Controls

Remember that the process has to handle failures and unexpected behavior. This is one of the places where that really comes into play. Controls give privileged users the ability to pause and resume the pipeline at any time. If something unusual happens, someone can pause the pipeline right away while they investigate. This ensures that the code will not be deployed to the next environment. Additionally, the pipeline is paused automatically in the event of a deploy or Gatekeeper failure.

Additional controls allow users to take the correct action depending on the nature of the failure. An automatic pause essentially means the pipeline got into a bad state. There are a few reasons for this. It could be that a deploy or Gatekeeper had a one-off failure, in which case the user can redeploy or rerun the Gatekeeper. If something went wrong with a deploy, causing the Gatekeeper to fail, then a restart of the entire pipeline may be required. Finally, if the failure is due to bad code and a code change is needed to correct it, the user can flush that build from the pipeline and move on to the On Deck build.
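The pause, resume, and flush controls can be modeled as a small state holder. This is a sketch under assumptions, not the Spacepods implementation: the class and method names are hypothetical, flushing is assumed to leave the pipeline paused until someone resumes it, and redeploy, rerun, and restart are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Pipeline:
    in_flight: Optional[int] = None  # build currently moving through the pipeline
    on_deck: Optional[int] = None    # build waiting for its turn
    paused: bool = False
    history: List[str] = field(default_factory=list)

    def pause(self, reason: str = "manual") -> None:
        self.paused = True
        self.history.append(f"paused: {reason}")

    def resume(self) -> None:
        self.paused = False
        self.history.append("resumed")

    def report_failure(self, step: str) -> None:
        """A failed deploy or Gatekeeper run pauses the pipeline automatically."""
        self.pause(reason=f"{step} failed")

    def can_advance(self) -> bool:
        """Code only moves to the next environment while the pipeline is running."""
        return not self.paused and self.in_flight is not None

    def flush(self) -> None:
        """Drop the in-flight build and move on to the On Deck build.

        Assumption: the pipeline stays paused until someone resumes it.
        """
        self.history.append(f"flushed build {self.in_flight}")
        self.in_flight, self.on_deck = self.on_deck, None


# Hypothetical scenario: build 41 fails Gatekeeper because of bad code.
pipeline = Pipeline(in_flight=41, on_deck=42)
pipeline.report_failure("gatekeeper")  # automatic pause
pipeline.flush()                       # give up on build 41
pipeline.resume()                      # build 42 may now proceed
```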

Pipeline Visualization

The pipeline visualization is an interactive component that offers a great way to see the full picture of an app’s status on every environment. It shows the most important environment-specific information from the previous deploy dashboard, such as the currently deployed build and a URL for the app running on that environment. It also provides links to pre-filtered logs for easy debugging. Most importantly, it depicts each stage of the CD pipeline, showing when a deploy or Gatekeeper is in progress, failed, or succeeded.

Notice where this pipeline forks from QA to Prod and Staging? All our pipelines were sequential when we introduced our CD flow, with QA, staging, and production environments in that order. Once we gained confidence in our automated tests and updated configurations to make each environment analogous to the production setup, we no longer needed to run tests on two separate environments before deploying to production. Some apps have eliminated staging from their CD pipelines entirely, but others still require it to stay up to date for user acceptance testing, so it remains a non-blocking stage of their CD pipelines.
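The difference between the sequential and forked layouts can be expressed as a dependency map. The sketch below is illustrative only: the stage names follow the text, while the data layout and `runnable_stages` are hypothetical.

```python
from typing import Dict, List, Set

# Original layout: each environment waits for the one before it.
SEQUENTIAL: Dict[str, List[str]] = {
    "qa": [],
    "staging": ["qa"],
    "production": ["staging"],
}

# Forked layout: staging is still deployed, but production no longer waits for it.
FORKED: Dict[str, List[str]] = {
    "qa": [],
    "staging": ["qa"],
    "production": ["qa"],
}


def runnable_stages(depends_on: Dict[str, List[str]], passed: Set[str]) -> List[str]:
    """Return stages that have not passed yet and whose dependencies all have."""
    return [
        stage
        for stage, deps in depends_on.items()
        if stage not in passed and all(dep in passed for dep in deps)
    ]


after_qa_sequential = runnable_stages(SEQUENTIAL, {"qa"})
after_qa_forked = runnable_stages(FORKED, {"qa"})
```

With the forked map, a passing QA stage unblocks staging and production at the same time, which is what makes staging a non-blocking stage.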

Technical Note: This pipeline visualization is built with plain HTML/CSS, apart from the dashed lines connecting environment stages, which are drawn dynamically with custom JavaScript using Bézier curves in an SVG.
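As a rough idea of what such a connector involves, the sketch below builds the SVG path data for a cubic Bézier curve between two points. It uses Python only to show the geometry; the original is JavaScript, and the control-point placement, function names, and styling attributes here are assumptions.

```python
def _num(value: float) -> str:
    """Format a coordinate compactly, e.g. 50.0 -> '50' and 12.5 -> '12.5'."""
    return f"{value:.2f}".rstrip("0").rstrip(".")


def connector_path(x1: float, y1: float, x2: float, y2: float) -> str:
    """Path data for an S-shaped cubic Bézier from (x1, y1) to (x2, y2).

    Both control points sit at the horizontal midpoint, so the curve leaves
    and enters each stage horizontally.
    """
    mid_x = (x1 + x2) / 2
    return (
        f"M {_num(x1)} {_num(y1)} "
        f"C {_num(mid_x)} {_num(y1)}, {_num(mid_x)} {_num(y2)}, {_num(x2)} {_num(y2)}"
    )


def connector_element(x1: float, y1: float, x2: float, y2: float) -> str:
    """Wrap the path in a dashed, unfilled SVG <path> element."""
    d = connector_path(x1, y1, x2, y2)
    return f'<path d="{d}" fill="none" stroke="currentColor" stroke-dasharray="4 4" />'


# Hypothetical coordinates for a line from one stage to a lower one.
path = connector_path(0, 0, 100, 50)
```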

Past Pipeline Executions

Controls are good when issues are discovered in the moment, but what about dealing with issues that slip through? Unfortunately, even the best systems are occasionally infiltrated by uncaught mistakes. That is when the ability to go back and check historical deployments is very useful. This component has links to all the previous deploys and Gatekeeper results to fulfill that purpose. It also shows where executions were flushed or restarted.

Key Concepts for Usable CI/CD

We were a tad apprehensive when we first enabled CD for our apps, and we did have to smooth out some rough patches with our CI tools, but we avoided any major snafus in the early stages and we have deployed much more code as a result of our CI/CD process. Refinements along the way led us to a positive developer experience and informed a few key concepts that any CI/CD system should consider.

  • Rely on automated tests. CI and CD cannot function without this.
  • Failures in the process lead to delays. Individual test failures are helpful to developers and rarely require attention from others, but inadvertent failures in the CI/CD tools require infrastructure engineers to resolve and cause delays for all developers.
  • Visualize the process. Providing visibility to developers keeps them more involved with the process, so they will notice quickly if something goes wrong. We do this through GitHub, Spacepods, and Slack.

Future Enhancements

CI/CD is an evolving concept and we aim to improve it along the way. We have made speed improvements at several stages that impact both CI and CD — most notably, with regard to builds, deployments, and automated test execution. We also have efforts under way to increase stability in several areas. Our next major initiative is to improve automated monitoring, which is not strictly a CI/CD component but takes on more importance in a CI/CD world.

We are hiring! If you love solving problems, please apply here. We would love to have you join us!

Welcome to TrueCar’s technology blog, where we write about the interesting things we’re working on. Read, engage, and come work with us!