The Journey to CI/CD

By: Chad Rempp, David Giffin and Joshua Go

This is the first in a multi-part series explaining why and how TrueCar moved to true CI/CD and away from Sporadic (or whatever your favorite antonym of continuous is) Software Development.

It’s winding…but worth it.

For a while, it seemed like everyone was talking about moving fast and breaking things, but what should you do when you move slowly and still break things?

In modern software development, continuous integration and continuous deployment (CI/CD) are key pillars of moving fast, and there’s little debate that a well-executed CI/CD process will result in faster iteration and a more flexible development process. So why have so few companies fully completed the transition to a CI/CD process?

Because it’s hard. Not in the sense that CI/CD is complicated as a concept, but because the change involves a lot of work in different areas around process, tooling, and organizational alignment.

At TrueCar, we have finished transitioning from a burdensome weekly release cycle to deploying code up to 100 times per week. In this series of posts, we will walk through the problems we encountered and how we addressed them. We’ll cover the evolution of our release processes and development environments, as well as what we did when building, testing, and deploying releases. And when code starts running in production, it needs to be monitored, so we’ll discuss what we did to improve visibility.

The effort was a coordinated and cross-functional one that involved developers, QA, and infrastructure teams. But most importantly, the transition required buy-in across the organization, including the full support of our senior leadership team. We’ll also share the results of this change and where we plan to go from here.

The Beginning

In 2014, TrueCar went public. It was a big moment for us, the culmination of years of hard work and fast-paced decision making. All of the scrambling over the years, however, resulted in a significant amount of technical debt with accompanying processes that would not scale with the growth of the company.

Shipping code was a painful process. A release cycle took weeks to complete. Simple bug fixes took a laborious two-week route to production, and the releases themselves were fragile and inevitably introduced new bugs. On top of that, the manual process of deploying often failed.

We faced many foundational issues that prevented us from quickly and efficiently getting our code to production.

  • The accumulated complexity and fragility meant that excessive coordination was needed: 30 people in a room, eight days before each deployment, figuring out the right order of dependencies for 12 code bases, depending on the nature of what was contained in a given release.
  • There was a single shared integration environment for testing any changes, which meant that if a database or backend service broke in some way, testing would come to a screeching halt for all other teams.
  • Builds were packaged up as RPMs to make system-level changes in the virtual-machine-based deployment environment, even as the rest of the world moved on to containerization.
  • There were no automated integration tests covering our consumer experience, so testing was a manual process that took place in a single QA environment over the course of a few days.
  • Once QA signed off on a release, a member of the production engineering team would manually deploy the RPMs to production in the order determined during the coordination meeting.
  • Due to the lack of centralized monitoring, we’d often hear about problems in a production release from users, partners, or dealers before being alerted ourselves.

Given all of these problems, we knew that something had to change. Whenever possible, we addressed the issues in each area so that improving in one area would incrementally improve the overall situation. We implemented blameless post-mortems after every production issue to encourage everyone involved to clearly identify the actions leading up to an incident. We created remediation items to add more monitoring, adjust process, or fix issues in production code.

Coordination

Before CI/CD, the release cycle began with Change Management (CM) tickets, which each team was required to file eight days before the deploy. Product owners and development teams totaling about 30 people would meet to discuss the current set of releases every Wednesday at a CM coordination meeting. In this meeting, we would determine the order in which each of the 12 code bases would be deployed to avoid errors and dependency failures. Why the eight-day lead time? Each team needed time to determine the impact and dependencies of their work. Releases would only happen on Thursdays. If we needed a deployment outside of the normal cycle, we would have to file a CM Exception. The amount of overhead and complexity involved in this process meant that product managers would spend significant portions of their time project managing their releases. Building things was no fun.

The first thing we had to change was this process. It was more difficult than it may sound — in fact, it was the most difficult part of the whole endeavor. The challenge came from the culture that had developed around the process, and not the process itself. At that time, TrueCar didn’t understand why continuous deployment was needed. There was complacency with the status quo.

We needed to shift the culture from waterfall development to a more agile devops approach. We wanted developers and product owners in charge of their own destiny without having to slog through a heavyweight process. We wanted to live in a world where change management and software dependencies were baked into the deployment tooling. Developers and product owners shouldn’t have to spend their time worrying about deployment orchestration.
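
To make the idea of baking dependency ordering into tooling concrete, here is a minimal sketch in Python. It is not TrueCar's actual tooling: the code base names and the dependency map are hypothetical, and the ordering simply relies on the standard library's graphlib.TopologicalSorter. The point is that a deploy order once negotiated in a weekly meeting can be computed from declared dependencies.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical map: each code base -> the code bases that must be deployed before it.
DEPENDS_ON = {
    "consumer-web": {"pricing-api", "dealer-api"},
    "pricing-api": {"vehicle-data"},
    "dealer-api": {"vehicle-data"},
    "vehicle-data": set(),
}


def deploy_order(depends_on):
    """Return code bases in an order where every dependency is deployed first."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as err:
        # err.args[1] holds the nodes that form the cycle.
        raise ValueError(f"circular dependency: {err.args[1]}") from err
```

Calling deploy_order(DEPENDS_ON) yields a list in which "vehicle-data" comes before both APIs and "consumer-web" comes last. A circular dependency raises an error instead of surfacing as a failed deployment.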

CI/CD is our new coordination process. There are no more meetings to facilitate dependencies, no more deployment schedules, and no more lead time. In a future article we’ll discuss the CI/CD process at TrueCar in detail.

Development Environments

In 2014, TrueCar had a single shared QA environment that created a major bottleneck for deployment. QA and developers were often stomping on each other’s toes. If one person or team broke the database or a backend service, testing would come to a screeching halt for all the other teams. We had no fixture data in our environments, so we would do a manual, ad hoc pull of some data (like a subset of dealer data) from production, but inevitably miss pulling in something important. This caused data consistency problems and referential integrity issues: something as simple as knowing which dealers were active for testing was a manual process.

We wanted developers to be able to deploy code on their first day at TrueCar. We saw the need for ephemeral environments where a developer could get their own instance of TrueCar and developers wouldn’t be stepping on each other’s toes.

Spacepods became the internal product we used to provide ephemeral development environments to all of TrueCar. We call each environment a “Pod.” Developer Pods have all of their own AWS resources, which are created using HashiCorp’s Terraform. (We’ll dive into the details of Spacepods in a future series of posts.)

Within Spacepods, there’s the concept of Master Pods, which are long-lived instances of our pre-production and production AWS accounts corresponding to key steps (development, QA, staging, and production) in the deployment pipeline. As part of the deployment pipeline, Spacepods also handles deploying and testing our code on top of the AWS infrastructure.

In short, Spacepods was instrumental in eliminating the issue of developers getting in each other’s way.

Builds

Next, our focus turned to the build process. In the legacy systems, deployment began by building a Red Hat Package Manager (RPM) package. The RPMs allowed developers to add scripts with additional commands that could be run before and after deployment.

The purpose of this process was to prevent the production engineers who deployed the code from having to perform additional steps during deployment. However, many deployments still required production engineers to perform manual steps after the code was thrown over the wall for deployment.

One side effect of using RPMs we discovered was that all code was completely removed from the server during deployment. This meant that the service was required to completely shut down before applying the new code, making the deployment process even longer.

When we began our replatforming effort, we decided to evaluate the best way to package code in the new environment. We looked at various ways to do this, including creating custom Amazon Machine Images (AMIs) that could easily be run on EC2 instances in AWS. There was Docker, full of promise in how it would change the way code would be packaged, deployed, and run; but it was still new and had a few rough edges.

In 2014, we attended DockerCon. At the conference keynote, the speaker asked the entire audience for a show of hands: who was using Docker in production? Only two people in the audience raised their hands.

Still, we had a hunch that Docker and the container-based approach were the way to go, and that any minor issues that arose could be fixed. Later that same year, at re:Invent, AWS announced the preview of EC2 Container Service (ECS) to manage Docker-based applications. So Amazon was in, which was a meaningful show of support for this new technology.

And so, when we built Spacepods, we made the decision to build on top of both Docker and Amazon ECS. The Docker images we built would run exactly the same, byte-for-byte, in all of our environments: ephemeral dev, pre-production, and production. The ease with which a developer could pull the exact same Docker image that they’d run in production, and run it on their laptop with all the system dependencies already packaged up, was very much in line with how we wanted to empower our software engineers. Today, we happily run Docker in production, hosting many development stacks, including Java, Ruby, Python, and more.

Testing

Our next task was to build out robust, automated testing. Our previous method was manual and cumbersome. After the package was built, the release candidate would be handed to our QA team. Up to this point there had been little to no testing of the code. Developers had no unit tests, functional tests, or integration tests to verify that their code worked properly before it went to QA. The QA team (separate from application engineering teams) would deploy the package to our single QA environment and manually test the features for a few days. If something went wrong, QA would go back and forth with developers until a release candidate was built without defects. The QA process was completely manual at the time, clicking through various partner sites to verify changes. Around 15 teams were running through this cycle all the time on a single environment.

We realized that automated testing needed to be a part of our culture and development process.

When we decided to build Spacepods, we developed a hook that would spawn our new testing framework, Otto, which was created by our QA team to automate the manual testing processes that had been repeated over the past 11 years. This was handled through Gatekeeper, which managed running tests for each application that implemented its API. If an application’s tests failed, the build would not be able to progress to the next environment.
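
The gating behavior described above can be sketched as follows. This is an illustrative Python sketch, not Gatekeeper's real API: the promote function and the run_tests callback are hypothetical names, and the stage list only mirrors the pipeline steps named earlier in this post.

```python
from typing import Callable, Sequence

# Pipeline steps as named in this post, in promotion order.
STAGES = ("development", "qa", "staging", "production")


def promote(
    build_id: str,
    run_tests: Callable[[str, str], bool],
    stages: Sequence[str] = STAGES,
) -> list[str]:
    """Walk a build through the stages, stopping at the first test failure.

    run_tests(build_id, stage) is a hypothetical callback that returns True
    when the automated tests pass for that build in that stage.
    Returns the stages the build passed.
    """
    passed = []
    for stage in stages:
        if not run_tests(build_id, stage):
            break  # the gate: a failing build never reaches later stages
        passed.append(stage)
    return passed
```

For example, promote("abc123", lambda build, stage: stage != "staging") returns ["development", "qa"]: the build fails its staging tests and never reaches production.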

Deploy

Next we focused on our cumbersome deployment process. Previously, once a build was approved by QA, the release candidate for each code base was passed to the Production Engineering team for deployment on Thursday. The order of execution was determined in the CM coordination meetings (correctly, we hoped).

The production engineers would follow the steps outlined in the CM tickets, including any additional commands not included in the RPM required to complete the deployment. If everything went well, we would have a functioning site with new features and bug fixes. It often went wrong.

We decided to start automating the manual deploy process. We couldn’t do it without visibility into what was happening, so we created an internal tool called Viewmaster. Viewmaster started as a project to inventory all of the virtual machines that were powering TrueCar. At the time we had over 4,000 virtual machines across our development, QA, staging, UAT, and production environments.

We developed an agent that would report the version of software that we deployed to a virtual machine. The agent essentially served up a JSON file with the Git commit and RPM build information from the Jenkins job. Once we had an inventory of which software was running in the various environments, we created a web interface for Viewmaster to allow QA to take over deployments with the click of a button.
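
As a rough illustration of such an agent, here is a minimal Python sketch that serves build metadata as JSON over HTTP. It assumes a lot: the field names, the /version path, the port, and the placeholder values are all hypothetical, and the real agent may have worked quite differently.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical build metadata; in practice this would be written at build time.
BUILD_INFO = {
    "git_commit": "0000000",
    "rpm_build": "example-1.0.0-1",
    "jenkins_job": "example-job",
}


def version_payload(info=BUILD_INFO) -> bytes:
    """Serialize the build metadata as UTF-8 encoded JSON."""
    return json.dumps(info, sort_keys=True).encode("utf-8")


class VersionHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the (hypothetical) /version path is served.
        if self.path != "/version":
            self.send_error(404)
            return
        body = version_payload()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8099):
    """Run the agent until interrupted."""
    HTTPServer((host, port), VersionHandler).serve_forever()
```

Calling serve() starts the agent. An inventory service could then poll each machine's /version endpoint and record which commit is running where.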

As we began our migration to the cloud, Spacepods became our deployment tool for all things in AWS. Spacepods was built upon the many lessons we learned during our years of manual deployment, as well as the new insights gleaned from Viewmaster about providing a self-service interface for deployments. Eventually, Spacepods became the tool to enable continuous deployment at TrueCar.

Monitoring

Our final area of focus was monitoring. Originally, there was minimal oversight of a deployment, and even less monitoring once it made it to production. This was due to the siloed nature of the teams and a lack of clear accountability and shared responsibility.

There was no centralized logging, monitoring, or alerting, which meant that after a deployment, it was often hard to tell if things were working or not. Developers and product owners had no insight into how their code was working in production. Issues were typically caught on the weekend by our users, partners, or dealers. Due to the slow, complicated, and burdensome deployment process, the only option in these situations was to revert to the previous release of the code and start the same process over again.

Historically, monitoring and alerting were all done in-house. We ran our own Elasticsearch, Logstash, and Kibana (ELK) stack for logging, Sentry for exceptions, and Nagios for monitoring systems. All of these tools had to be managed in-house and would themselves have issues during outages. For instance, ELK often crashed from executing long-running queries in Kibana, which was hardly surprising given the nearly 500GB of logs produced by the data center each day.

Given the headaches associated with our in-house monitoring solutions, we started to look at third-party SaaS vendors to provide deeper insights into our development stack. We decided that monitoring was something we preferred not to build or manage ourselves. After all, we’re in the business of creating amazing car buying experiences, while there are many third-party vendors entirely focused on the messy business of application and infrastructure monitoring.

Summary

Implementing CI/CD the right way vastly improved our ability to deploy code safely and easily. Along the way, we learned a number of lessons that were critical to our ultimate success:

  • Aim to reduce manual coordination overhead.
  • Let developers experiment fearlessly, without worrying about messing things up for someone else.
  • Pick a well-supported build artifact format that meets your needs. (For us, it was Docker.)
  • Create automated tests to increase confidence in each release while reducing the amount of manual work. Find out sooner when something breaks.
  • Automate deployments.
  • Centralize monitoring and avoid running the monitoring infra/tooling yourself if you can. (That’s undifferentiated heavy lifting.) Again, find out sooner when something breaks.

Looking forward, our CI/CD series will dive deeper into the details of continuous integration and deployment at TrueCar.

We are hiring! If you love solving problems, please reach out. We would love to have you join us!