Capsela: What Was It, and Why Did We Do It?
By: Joshua Go
The story is familiar to any young company fortunate enough to survive and thrive beyond its early years: build fast, find a way to get things done, and jump on opportunities before the market moves on.
This mode of operating works for years, until it doesn’t. And when it doesn’t, building isn’t fast anymore. Getting things done somehow feels much harder than it used to. All the opportunities that worked out and made you successful? Those became your core business, which means more systems to keep running, and it sure doesn’t help that you had to make a bit of a mess along the way. Meanwhile, the world continues to change, and you find it increasingly hard to keep up.
TrueCar was no different. From 2015 to 2018, we undertook a company-wide effort to solve this for ourselves. Our answer was Capsela, a technology replatforming initiative that massively increased our development velocity and significantly improved our ability to respond to change.
Before Capsela, we were severely limited by the following:
- Fixed capacity. We had our own physical servers set up in a datacenter colocation facility. Bringing up additional capacity was a months-long process involving purchase orders, rounds of sign-offs, sending someone in to physically rack up servers, and configuring the servers once they were up. The fixed capacity also meant we had only a small number of development and QA environments, which bottlenecked the development process by limiting the number of initiatives going on in parallel.
- Fragmented but intertwined codebases. Major feature releases often meant making code changes in at least four different codebases, written in different programming languages (Python, Java, Ruby) and different software frameworks (Flask, Django, Spring, Rails). This introduced considerable overhead in release planning, sequencing, and coordination. Not surprisingly, it was extremely difficult for developers to switch projects.
- Calcified business logic. Much of our core business logic was described in database stored procedures. This made website functionality and our data pipelines very difficult to understand, test, and extend with new features.
- Delays in data processing and publishing. Our image processing pipeline for refreshing the vehicle images shown on the website could run only once a week, and even enhancing window sticker data for our inventory of vehicles suffered a day-long lag. Publishing data about new models and styles was also a week-long process involving a team of five to seven people working in concert. In the week (and sometimes more) between when a new model was available for sale and when it showed up on our site, we would miss out on consumers looking for those new models.
- Frequent outages. During traffic spikes (and sometimes for reasons we could only guess at), we would suffer outages with little recourse but to ride out the surge or restart services. These were a result of both our fixed capacity and the difficulty of reasoning about the performance of legacy code and database stored procedures.
We felt the drag of these issues on our productivity every single day.
We addressed the issues we identified by moving our infrastructure to the Amazon Web Services (AWS) cloud, rebuilding our software using a standardized technology stack, adopting modern DevOps practices, and deprecating the legacy systems that had been built piecemeal over more than a decade.
Moving our infrastructure to AWS shortened project lead times by eliminating the need to order, approve, install, and configure physical hardware in a datacenter. What used to take several months could now be done in minutes. We also built tooling to eliminate the integration-environment bottleneck, letting more developers stage and test their changes in parallel without stepping on each other in shared environments. With our horizontal, scale-out cloud architecture, outages due to capacity constraints became a thing of the past.
Rewrite and Standardize
Rewriting our software in a standard, unified technology stack addressed the major fragmentation issues and increased organizational flexibility. We chose Ruby on Rails, React, Redis, and PostgreSQL to build our new experience. To make it much easier to reason about, plan, and sequence releases, we built a monolith for our core auto buying platform (ABP) where we could focus and direct the best of our development efforts. For our data warehousing and reporting needs, we chose Amazon Redshift.
We also overhauled our data pipelines and vehicle data publishing tools. To feed those pipelines, we standardized on Apache HBase, consolidating datasets that had previously been scattered across Hive, Avro files, and Microsoft SQL Server. To run faster, more responsive pipelines, we incorporated Apache Spark and Amazon Kinesis. This new, standardized data stack enabled significant new capabilities, such as 1) letting dealer customers adjust pricing on their inventory and reflecting those changes to consumers in near real time, and 2) updating a new inventory record with window sticker, options, and features data as soon as the vehicle was added, rather than waiting an entire day for a full pipeline run. We went from publishing new images and new vehicle data weekly to several times a day.
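To make the event-driven enrichment concrete, here is a minimal Ruby sketch of handling a new inventory record, with window sticker data merged in as soon as the record arrives rather than on the next batch run. Every class, field, and value here is invented for illustration; in production the lookup would be backed by a store like HBase rather than an in-memory lambda.

```ruby
# Hypothetical sketch of event-driven inventory enrichment: a new record
# is decorated with window sticker data the moment it arrives, instead of
# waiting for a daily batch run. All names and values are invented.

class InventoryEnricher
  def initialize(sticker_lookup)
    # sticker_lookup: a callable that maps a VIN to window sticker data
    @sticker_lookup = sticker_lookup
  end

  # Called once per incoming inventory event from the stream consumer.
  def enrich(record)
    sticker = @sticker_lookup.call(record[:vin])
    return record unless sticker

    record.merge(
      options:  sticker[:options],
      features: sticker[:features],
      msrp:     sticker[:msrp]
    )
  end
end

# In-memory stand-in for the real window sticker store.
lookup = lambda do |vin|
  return nil unless vin == "1HGCM82633A004352"
  { options: ["Sunroof"], features: ["Lane Assist"], msrp: 31_500 }
end

enricher = InventoryEnricher.new(lookup)
enriched = enricher.enrich(vin: "1HGCM82633A004352", dealer_id: 42)
# enriched now carries options, features, and msrp alongside the original fields
```

The same handler shape works whether events come from Kinesis, a queue, or a backfill job, which is what lets one code path replace both the real-time and the batch cases.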
As for the database stored procedures powering the business, we moved that core business logic into application code where it could benefit from automated tests, which broke the calcification and enabled future iterative development.
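As a sketch of what that migration looked like in spirit (the savings rule below is invented, not our actual pricing logic), a calculation that once lived in a stored procedure becomes a plain Ruby object that a unit test can exercise directly:

```ruby
# Hypothetical example of business logic pulled out of a database stored
# procedure and into application code. The savings rule itself is invented;
# the point is that it now lives where automated tests can reach it.

class SavingsCalculator
  # msrp and offer_price are whole dollars.
  def self.savings(msrp:, offer_price:)
    raise ArgumentError, "prices must be positive" unless msrp.positive? && offer_price.positive?

    # Never report negative savings, even when the offer exceeds MSRP.
    [msrp - offer_price, 0].max
  end
end
```

Because this is ordinary application code, an edge case like an offer above MSRP is a one-line unit test rather than a stored-procedure debugging session.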
Adopt DevOps and CI/CD
Adopting modern DevOps practices gave us the confidence we needed in the new platform. We introduced far more comprehensive and consistent automated test coverage, which enabled us to improve even further when we introduced continuous integration and continuous deployment (CI/CD). We also insisted on monitoring and alerting with modern tooling, and put a plan in place to incrementally refine our monitoring and tune our alerts.
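The kind of automated test that gates every deploy can be as small as the Minitest sketch below. Minitest ships with Ruby, but the class under test and its routing rule are invented for illustration:

```ruby
# A minimal sketch of an automated test of the sort run on every commit in
# a CI/CD pipeline. LeadRouter and its routing rule are hypothetical.
require "minitest/autorun"

class LeadRouter
  # Route a consumer lead to the dealer with the numerically closest ZIP code
  # (a deliberately naive stand-in for real geographic matching).
  def self.route(dealers, consumer_zip)
    dealers.min_by { |dealer| (dealer[:zip] - consumer_zip).abs }
  end
end

class LeadRouterTest < Minitest::Test
  def test_routes_to_nearest_dealer
    dealers = [{ id: 1, zip: 90_001 }, { id: 2, zip: 90_210 }]
    assert_equal 1, LeadRouter.route(dealers, 90_005)[:id]
  end
end
```

In a CI pipeline, a failing assertion like this blocks the deploy automatically; a large, consistent suite of such tests is where the confidence to ship continuously comes from.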
Retire Legacy Systems
None of the above would have been worth the effort if we still had to somehow prop up our legacy systems. Circular dependencies and shared data stores between API services had made it very hard to scope the level of development effort, assess the risk of a change, and determine the proper sequencing for a given release. So whenever we were finally able to replace the functionality of a legacy service, we quickly cut everything over to the new platform and shut the legacy service down. Retiring these systems reduced unneeded complexity and freed up vast amounts of organizational energy.
To put some numbers around all of this, many of the key benefits can be quantified. Among them:
- Number of production deployments per week (a direct measure of development velocity): from one release every two weeks to 80 deploys in a typical week.
- People needed to push a trivial code change to production: from 10 to 1.
- Codebases under consideration when evaluating a typical change to the ABP: from 7 to 2.
A lot of the benefits, however, go beyond the numbers and are deeply felt at TrueCar, where everyone works daily to enhance the car-buying experience.
- We have more confidence in our changes and less anxiety that a given code change will break something.
- We have the flexibility to quickly shift engineering focus in response to business needs.
- We don’t need to worry about database backups since we use managed services.
To read about all of this in more detail, including the reasoning behind many of the decisions and the challenges we faced along the way, feel free to explore these posts:
- The Journey to CI/CD
- Spacepods: Beyond the Cloud
- ViewMaster: Early Steps in Our Journey to CI/CD
- YAMVM — Yet Another Monolith vs. Microservices
- Beyond Bare-Bones CI/CD: Refining the Developer Experience
- Test Automation in CI/CD, Part 1
- Test Automation in CI/CD, Part 2
- Amazon Redshift for Data Warehousing: Migration, Implementation, and Improvements
We are hiring! If you love solving problems, please reach out. We would love to have you join us!