As we mentioned in our introductory post about our journey to CI/CD, in 2014 development velocity at TrueCar had a lot of room for improvement. Our heavy release coordination process and the need for manual deployments, in particular, created extra work for everyone and were major limiting factors in how quickly we could ship code. We knew there had to be a better way.
Before we could really address these foundational issues, however, we had to understand the realities of our environment and verify that our assumptions about key prerequisites were valid. Once we laid the groundwork, we brought a wider group into the early efforts and let people benefit from incremental improvements as soon as they were ready. This created a positive feedback loop that helped us further improve our tooling and processes.
In this post, we will cover our approach to the following areas:
- Identifying constraints. These were the constraints that limited not only what we could do, but also how we could do it. In our case, we identified two overarching factors: a very locked-down network architecture and a change management system made necessary by the accumulated technical complexity.
- Testing assumptions with incremental improvements. To streamline deployments the way we wanted to, we had just three basic requirements: the ability to communicate across our locked-down network, a way to know which applications were running, and the ability to integrate with our version control system.
- Building a deployment UI. With our basic prerequisites validated, we built our deployment UI. We chose a name, implemented automated change management, and required that basic tests pass for each application being deployed.
- Expanding our internal user base. Once the core tool was built, we socialized the effort. We opened up production deployment access to members of our QA team, created a chatbot to query build information from Slack, and made it easier for developers to integrate their applications.
Addressing the issues in this way led us to create various internal tools, which brought much needed transparency to the development and deployment processes. We learned a lot about the building blocks necessary to get to CI/CD. And although we never got to continuous deployment before leaving our data center, we did manage to make major strides toward it.
The Easter Egg
In the beginning, there was an Egg. A concept called the “Easter Egg Operating Environment” used stateful firewalls to segment internal and external applications and keep them from talking to each other. Normally, Easter Eggs are fun surprises; this was a surprise, but it definitely wasn’t fun.
The Easter Egg slowed down development by forcing application developers to request ACL rules so that their applications could talk to the resources they needed. Each environment — QA, staging, and production — had separate firewall rules. In lower environments (dev and QA), the rules were wide open, allowing any application to talk to any other application. Staging and production were more restrictive. As a result, developers would often push new code that talked to a new resource or application and, because of inconsistencies between environments, wouldn’t notice a problem until the code reached production. Tickets had to be created for all ACL rules, with each rule requiring executive approval.
The Easter Egg was simply a network made up of three zones, each a full class B network. The zone definitions were:
- WC (10.5.0.0/16): corporate zone in the western region
- WB (10.4.0.0/16): back end processing zone in the western region
- WR (10.3.0.0/16): front end web application zone in the western region
Each zone was broken down into VLANs, each of which was a class C. For example, in WC, 10.5.192.0/24 was called the 192 VLAN. Within a zone, ACLs permitted or denied specific class C VLANs from communicating with the others. For example, we might allow TCP port 50012 communication between VLAN 192 and VLAN 193, or allow TCP port 1433 for MSSQL so that VLANs 192 and 193 could talk to the database VLAN 17. WC was not allowed to initiate TCP or UDP packets to WB or WR. WR was not allowed to initiate TCP or UDP packets to WB or WC, except for WR VLANs 500 and 501. WB, however, was allowed to initiate TCP sessions to WR and WC.
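The inter-zone initiation rules can be sketched as a small lookup table. This is a hypothetical model for illustration only; the real rules lived in firewall configurations, and the method and constant names here are invented:

```ruby
# Hypothetical model of the Easter Egg's inter-zone initiation rules.
ZONES = {
  "WC" => "10.5.0.0/16", # corporate
  "WB" => "10.4.0.0/16", # back end processing
  "WR" => "10.3.0.0/16", # front end web
}.freeze

# WR VLANs 500 and 501 were the one exception allowed to initiate outward.
WR_EXCEPTION_VLANS = [500, 501].freeze

def can_initiate?(from_zone, to_zone, from_vlan: nil)
  return false unless ZONES.key?(from_zone) && ZONES.key?(to_zone)
  case from_zone
  when "WB" then true                                    # WB may reach WR and WC
  when "WR" then WR_EXCEPTION_VLANS.include?(from_vlan)  # only VLANs 500/501
  else false                                             # WC may not initiate at all
  end
end
```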
Now that you understand all of the rules of the Easter Egg, doesn’t it seem easy to build an application in this environment? We didn’t think so either.
Managing change in the data center
As TrueCar grew and developers created more applications and services, knowing what was running in the various environments became tribal knowledge. Coming into TrueCar as a new engineer, you had to understand the current state of the firewall, Puppet, and Nginx configurations. You also needed to understand the “Template Values” system. Template Values were environment variables that were injected into an application at runtime via the RPM service startup process; they also described the connection parameters from a given application to all of its dependencies.
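For illustration only, a hypothetical set of Template Values for one application might have looked something like this (the variable names and hosts are invented, loosely based on the ports and zones described elsewhere in this post):

```
# Hypothetical Template Values injected at RPM service startup
EXAMPLE_DB_HOST=db01.wb.internal
EXAMPLE_DB_PORT=1433
PARTNER_API_URL=http://partner-api.wr.internal:50012
```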
All of this tribal knowledge and the various system dependencies meant that TrueCar needed a process to help coordinate the effort to get many code bases into production on the same day (Thursday deploys). We created a change management process using Jira (CM tickets). Change management became a very heavy development process that extended to product owners, developers and even executives. Every day at 10:30am, we would have a CM coordination meeting to talk about all of the changes that were going to be deployed on Thursday.
We realized that we needed to simplify and automate the change management process, and to automate it, we had to be able to talk to our version control system.
Testing Assumptions with Incremental Improvements
With the complexity of our network architecture and the cumbersome change management process in mind, we asked ourselves what we would need, at the absolute minimum, to streamline our deployment process. It came down to three things: communicating across the Easter Egg, knowing which applications were running where, and pulling information out of our version control system.
PacMan and Lite-Brite
Our first attempt at communicating across the Easter Egg was a content management tool. TrueCar powers roughly six hundred white labeled car buying sites. We created an application called PacMan (Partner Asset Content Manager) as a content management system for partner data. PacMan needed a way to publish its data through the Easter Egg network into the various environments.
We took Etsy’s open source tool called deployinator and rebranded it internally as Lite-Brite. This was our first attempt at creating a user interface for deployment at TrueCar.
Lite-Brite was responsible for traversing the Egg to make this possible. Lite-Brite copied the partner configuration data from PacMan to the various environments: QA, Staging, UAT, and Production.
With Lite-Brite and PacMan, we simultaneously validated our assumption that it was possible to deploy something across the Easter Egg, and saved developers from having to check in and deploy code just to make simple changes to website copy.
Appinfo: creating a view into our data center
One of our new “Production Engineers” (devops, site reliability engineers, etc.), whom we hired in 2014, wanted to inventory all of the hosts running our production workloads. The concept was straightforward: create a very simple service that would run on every virtual machine and report the version of the software running there.
We developed a service written in Go that we called Appinfo. It read a JSON file from disk and served it via HTTP. This allowed developers to include in each RPM package a JSON file documenting the build details: the Git commit hash, which of our ten Jenkins build servers was responsible for building it, and so on. Here is an example of the results returned from the Appinfo service:
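The exact payload isn’t reproduced here. Based on the description above, a response might have looked something like the following, with entirely hypothetical field names and values:

```json
{
  "name": "example-app",
  "version": "1.4.2",
  "git_commit": "9f2c1ab",
  "build_server": "jenkins03",
  "build_number": 128
}
```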
Security is of the utmost importance for TrueCar, so setting up firewall rules to allow for HTTP connections to the Appinfo service for propagation across the Easter Egg network was a challenge. We needed to convince the executive team that having visibility and understanding around what ran in production was more important than upholding the Easter Egg rules.
The next problem was finding a place to store the results from each of the 3000 hosts that served TrueCar applications. We convinced the security folks at TrueCar that Redis could be used as a queueing service and data store to share data across the Egg.
We created a service to hit our internal DNS servers, grab all of the DNS entries, and store them in Redis. Then, we could traverse all of the DNS entries and look for the Appinfo service. When we encountered a host running Appinfo, we took its JSON payload and stored it in Redis. We also broke down the DNS information to determine which environment and region the code was running in, and appended that information to the JSON received from Appinfo.
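The crawl described above can be sketched roughly as follows. This is a simplified, hypothetical version: fetch_appinfo is stubbed out, the DNS naming scheme is assumed, and a plain Hash stands in for Redis:

```ruby
# Hypothetical sketch of the Appinfo crawl. In production this walked real
# DNS entries and stored results in Redis; here everything is stubbed out.
def fetch_appinfo(host)
  # Real version: an HTTP GET against the host's Appinfo port.
  return nil unless host.start_with?("web")   # pretend only web hosts run Appinfo
  { "app" => "example-app", "git_commit" => "9f2c1ab" }
end

def environment_for(host)
  # Derive the environment from the DNS name (the naming scheme is assumed).
  host.include?(".qa.") ? "qa" : "production"
end

def crawl(dns_entries, store)
  dns_entries.each do |host|
    info = fetch_appinfo(host)
    next unless info                           # skip hosts without Appinfo
    store[host] = info.merge("environment" => environment_for(host))
  end
  store
end

store = crawl(%w[web01.qa.example.com db01.prod.example.com web02.prod.example.com], {})
```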
Soon, we were able to build a Rails application around the Appinfo data that we had mined from all of the servers in the data center. With Appinfo, we not only improved visibility into what was running, but also had another useful tool that communicated across the Easter Egg.
Version control and change management
In 2014, TrueCar had many version control systems: Git, Subversion, Perforce and sometimes no version control at all. We needed to consolidate version control systems to provide better visibility into changes happening to the systems, streamline deployments, and automate the change management process.
We started by eliminating the version control systems that we didn’t want to support going forward. On December 1, 2014, our Perforce version control system was set to read-only, and all code bases were migrated out of Subversion and Perforce into Git.
Our assumption that it would be possible to talk to our version control system to build change management tickets in Jira held up; GitHub and Jira both have well-supported APIs. First, though, we needed everything in GitHub so that we wouldn’t have to write additional adapters for each version control system in use. By consolidating into one VCS, we improved the coordination process in the short term, even as we kept our eyes on more extensive automation.
Building a Deployment UI
After struggling with the deployinator code base to make it copy Redis data across environments and realizing that deployinator was built specifically around Etsy’s deployment model, we decided to build ViewMaster to replace Lite-Brite (our internally branded version of deployinator).
ViewMaster was designed to automate TrueCar’s deployment process. It cataloged all the hosts in our data center and then handed off the deployment process to the QA team. ViewMaster kept track of different versions across different environments so QA engineers could deploy code to a QA/staging environment, test the code changes, and deploy the same code to production with just one button. Another key feature of ViewMaster was the ability to automate our change management process by creating/updating change management tickets in Jira to track code changes.
At this point it was almost trivial to recreate the deployinator user interface with a deploy button for each environment.
Automated change management (Auto CM)
Auto CM worked by letting ViewMaster know about the application’s Jira project key, Slack channel, and product/development/test owner. In order for ViewMaster to automate our change management process, engineering teams had to adhere to TrueCar’s Gitflow. ViewMaster parsed the branch name of the latest release branch to populate the branch name, release number, and the list of pull requests on the CM ticket.
We built our own version of the Gitflow process. To summarize, we had two main branches: master and develop. The master branch was always clean and ready to be deployed at any time. The develop branch held the latest delivered development changes and was queued up for the next release. We also had release branches, which reflected a future production deploy; they were branched off of develop and consisted of a collection of feature branches. Once a release branch was tested and ready for release, it was merged into master and then tagged to be deployed to production.
ViewMaster had a cron job that polled GitHub for new branches. When a release branch was created, ViewMaster opened a change management ticket in Jira, transitioned it to a “release” status, and parsed the branch name to populate the fix version (a name and date to which we linked stories and bugs), branch name, build number, and so on. ViewMaster also parsed Git commits and added links to the pull requests on the CM ticket. This process was logged and displayed for anyone to see.
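The parsing step might have looked something like this sketch. The branch naming convention shown (release/&lt;fix-version&gt;-&lt;build-number&gt;) and all names here are assumptions for illustration, not necessarily TrueCar’s actual convention:

```ruby
# Hypothetical parser for release branch names. The convention
# "release/<fix-version>-<build-number>" is assumed for illustration.
ReleaseInfo = Struct.new(:branch, :fix_version, :build_number)

def parse_release_branch(branch)
  m = branch.match(%r{\Arelease/(?<version>[\w.]+)-(?<build>\d+)\z})
  return nil unless m                      # not a release branch
  ReleaseInfo.new(branch, m[:version], m[:build].to_i)
end

info = parse_release_branch("release/2015.06.18-42")
# info.fix_version => "2015.06.18", info.build_number => 42
```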
Once a deploy finished, ViewMaster updated the CM ticket with a link to the deploy log along with the status of the deployment. The CM ticket was closed after the release was verified in the production environment. Product owners and developers could also track the history of all CM tickets for their application in ViewMaster.
Here is a screenshot that shows a CM ticket with information populated by ViewMaster:
ViewMaster could also invite a Slack bot to the team’s channel. Engineers could ask the Slack bot for details on CM tickets that had been created on their behalf.
Using ViewMaster to enforce testing
We needed to augment the requirements for a given application or service. TrueCar historically required two endpoints: “/internal/health”, which was used by our load balancers (Nginx), and “/internal/app-health”, which verified that the external services a given application depended on were up and running.
We needed to add smoke tests to the required endpoints. However, “smoke tests” at TrueCar referred to processes that the QA team executed manually after each deploy. They weren’t a set of minimum tests that could be executed in production to verify a service would run successfully.
So, we internally branded smoke tests as “self tests” and got all of the engineering teams to implement these self tests before they could use ViewMaster to deploy code.
Self tests ended up being the first set of tests that were required to pass in order to deploy code. We used this as a forcing function. Here are the requirements for the self test endpoint:
- This endpoint should be located at “/internal/test”
- The total run time for all tests (passing or failing) must not exceed 60 seconds
- The endpoint will return HTTP status code 200 for success or 500 for failure
- Each test must be timed in milliseconds
- Each test must output “PASS:” or “FAIL:”
- Each test must output a description of the test
- Each failing test must output an error as part of the description
- The last line of the output must report the overall status and total run time
- If any test fails, an HTTP 500 status code is returned and the overall test suite status output is “FAIL:”
- If all tests pass, an HTTP 200 status code is returned and the overall test suite status output is “PASS:”
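Under these rules, a passing response body might have looked something like this (a hypothetical example; the exact wording of each line was up to the implementing team, within the requirements above):

```
PASS: (12ms) database connection established
PASS: (48ms) partner API reachable
PASS: total run time 60ms
```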
Expanding Our Internal User Base
Deploys taken over by QA
Production engineers were responsible for all deploys before ViewMaster. They would get a ticket with files to change and a list of commands to run. If something failed they had to work with developers to fix the problem. ViewMaster allowed our QA team to own the deployment process. They could press a button and deploy new code without production engineers.
ViewMaster integrated with Slack and allowed product owners and development teams to get updates when code was being pushed to production. We also invested in centralized logging and monitoring by building out Elasticsearch and Kibana for logging and StatsD for monitoring. These tools provided developers with insights into production issues.
We became comfortable with pushing code into production using our new tooling and decided to double our weekly deployment throughput by allowing development teams to release code twice a week, on Tuesdays and Thursdays. Eventually, by adding automated testing to our deploy process, that limitation went away and we were able to push code to production at all hours of the day.
We wanted to make the process of creating Self-Tests as easy as possible. To accomplish this, we created a Ruby gem called RUOK that allowed developers to write simple pass/fail tests that would be exposed on the “/internal/test” endpoint.
An example of defining some tests:
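The gem’s actual API isn’t reproduced here, but a DSL in its spirit might have looked like this sketch (the class, method names, and output formatting are assumptions):

```ruby
# Hypothetical sketch of an RUOK-style self-test DSL, illustrating the
# pass/fail idea and the "/internal/test" output rules described above.
class SelfTests
  Test = Struct.new(:description, :block)

  def self.tests
    @tests ||= []
  end

  # Register a test; the block should return true (pass) or false (fail).
  def self.test(description, &block)
    tests << Test.new(description, block)
  end

  # Runs every registered test and returns [http_status, body].
  def self.run
    total_start = Time.now
    lines = tests.map do |t|
      start = Time.now
      passed = begin
        t.block.call ? true : false
      rescue StandardError
        false                                   # an exception counts as a failure
      end
      ms = ((Time.now - start) * 1000).round
      "#{passed ? 'PASS' : 'FAIL'}: (#{ms}ms) #{t.description}"
    end
    ok = lines.none? { |l| l.start_with?("FAIL") }
    total_ms = ((Time.now - total_start) * 1000).round
    lines << "#{ok ? 'PASS' : 'FAIL'}: total run time #{total_ms}ms"
    [ok ? 200 : 500, lines.join("\n")]
  end
end

SelfTests.test("math still works") { 1 + 1 == 2 }
SelfTests.test("config present")   { true } # stand-in for a real check

status, body = SelfTests.run
```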
The “/internal/test” endpoint was polled after a deploy to verify that the Self-Tests passed. The response would include a description and status of each test as well as a status code of 200 if all tests were passing or 500 if any test failed.
RUOK allowed all development teams to quickly and easily implement the Self-Test requirement, which gave them the ability to start using ViewMaster for deployments.
TrueCar adopted Slack as our chat communication tool. We had heard about some of the interesting things that Etsy had done with deploy queues and IRC, so we decided to create a Ruby-based bot, called KITT, that would interface with ViewMaster. This was the start of chat ops at TrueCar.
KITT provided a command line into the data center. You could query by host and understand what versions of the software were running on those hosts.
KITT was a hit. It allowed engineers to understand what code was running in every environment. Before KITT, someone had to SSH to a server in an environment and tell others which version of the code was running, and most engineers didn’t even have SSH access.
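An exchange with KITT might have looked something like this (the command syntax and output shown are hypothetical):

```
you:  @kitt versions example-app
KITT: example-app | qa: 1.4.3 (build 129) | production: 1.4.2 (build 128)
```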
With KITT, someone could ask in a Slack channel and everyone was exposed to the same information.
We also integrated KITT with PagerDuty so that anyone could send a message to @oncall and it would notify the correct engineer who was on call.
Locking deploys after work hours
The freedom to deploy at any time posed a new problem for TrueCar: if a buggy deploy went out late at night or on a weekend, a production engineer had to roll it back. In an effort to respect everyone’s time and personal lives, we locked deploys down to between 9am and 6pm, Monday through Friday.
Looking to the Future
ViewMaster was a one stop location for developers to get deployment status, logs, currently deployed versions in different environments, and URLs for application monitoring and metrics. Anyone could go to an application page in ViewMaster and see the contact information for the application owners and to which Slack channel notifications were sent.
After experiencing push-button deployments, increased visibility into everything we ran, and broader access to tooling, we wanted to do even more. But there were still unsolved problems:
- Not all environments were available for deployment on ViewMaster. We had three separate QA hosts for each application but only one host was deployable through ViewMaster. If a QA engineer wanted to deploy to a QA host not in ViewMaster, they had to SSH into the host, get the RPM build information from Jenkins, and then manually run the deploy script with correct arguments.
- We also had problems with engineers sharing a limited number of hosts for testing. Engineers had to check the current version on the host they wanted to use, then ask in their team Slack channel if it was OK to deploy a new version.
Ultimately, these were not problems we could solve within the constraints posed by the limited hardware we had in our data center, and would have to be addressed by Spacepods and our move to the AWS cloud.
Many of ViewMaster’s useful features would carry over into Spacepods, but even the process of building and operating ViewMaster — coming to terms with constraints, testing important assumptions early on, creating usable UIs, and empowering the wider organization with new capabilities — would go on to inform how we built the tools that came after it.
Baby Step to CI/CD
ViewMaster showed that incremental improvements to our deployment process could be worthwhile, even as our network architecture and change management requirements remained in place. If we didn’t have full automated tests for legacy applications, we could at least create self test endpoints so there was some sort of check. If we still had to keep track of releases with change management tickets, then at least we could automate the process of creating and updating them. If we still had to have QA manually test and sign off on releases, we at least made it so they could deploy the releases after sign-off without having to wait for a production engineer to be available. When the savings in time and effort are obvious and substantial, being able to go only halfway doesn’t seem so bad.
Thanks to Eric Slick and Kyler Stole for reading drafts of this post.