This is Part 2 of our series of blog posts about test automation at TrueCar. In the first post, we looked back at our old test frameworks and the testing process for our legacy applications. We also gave an overview of Otto, the test automation framework we built to support the transition from the legacy platform to Capsela.
In this article, we'll focus on how we integrated Otto into the CI/CD process. We'll go over the challenges we faced along the way, explain why we decided to move to Cypress, and show how that decision has improved our development and deployment process.
First steps to CI/CD
As mentioned, Otto is our test automation framework. It goes back to 2016, when we started our migration to the new technology stack, with Ruby on Rails for the backend apps and React for the frontend. Otto is a monolithic codebase that contains integration and end-to-end tests for the data layer, backend services, and UI components. Written in Ruby, it leverages Watir for the UI and Minitest for backend service testing. Its infrastructure includes a custom test runner that enables test parallelization on Jenkins, custom reporting, an analytics dashboard, and a command-line utility.
Otto was built before we even started discussing continuous integration and deployment. By 2016 we had a lot of automated tests, but most of them were organized in an ad hoc manner. The scrum teams and their dedicated test engineers created and maintained separate test suites for their projects. To run a full regression for a single release of the monolithic application, we needed every team to run all of its tests. Even then, it was hard to gauge the coverage that these suites provided.
Our goal for test automation in the CI world was to validate the state of the entire application for every single code change. The automation had to run fast and guarantee extensive test coverage at the same time. We had several hundred UI tests, and that number was continually growing, so it was expensive to run all of them for every commit. We also realized that most application changes were pretty isolated to a specific component, so there was no need to run the full regression testing every single time.
We came up with the idea of a small super-suite that would contain only the most critical tests. We named it Gatekeeper. The Gatekeeper tests define our baseline for quality and give an overall picture of the application health. These are the test cases for the most important user flows, and they cover all the components of the monolithic application. A failure of a Gatekeeper test should pause the CD pipeline, and a bug causing the failure should be considered a blocker. We would run the Gatekeeper tests (usually we refer to that suite as “The Gatekeeper”) often, for every single code change, so we wanted to keep its execution time under ten minutes.
The rest of the tests form the general regression: test cases that verify the state of specific features, components, and less popular user flows. A failure of such a test does not have a showstopper impact, as it should be relatively isolated.
There's also a separate suite of cross-browser tests that we run on Sauce Labs against the top ten OS/browser configurations, based on analytics of our user traffic. It contains a limited number of end-to-end scenarios and runs overnight along with the general regression.
New development workflow
The introduction of Gatekeeper and the organization of the regression jobs were the first steps in streamlining our automated testing process and integrating it into CI/CD. With Spacepods (our ephemeral development environments), we finally had the convenience of private, isolated test environments. That allowed us to try something new in our development and release process.
In the first post of this series, we mentioned that in the pre-Capsela development process, we started testing pretty late in the game, after all the features had been merged and deployed to the QA environment. We wanted to change that: to start testing every single code change independently, before integrating it into the main release branch. Basically, to begin testing as early as possible, when the cost of an error is the lowest.
To enable that, the development teams migrated to the new workflow. It started with a lot of manual steps, but eventually, we automated most of them.
Within that new workflow, the life cycle of a code change starts with a pull request that corresponds to a Jira ticket. Every PR has to include unit tests and pass peer code review. Once the author feels the PR is in a good state for testing, they send it to a test engineer. To validate a change, we need it built and deployed to a standalone ephemeral environment (a Pod).
Initially, we created the Pods manually through the Spacepods UI. Spinning up a new instance and deploying to it by hand was very time-consuming, sometimes taking up to an hour. At that point, it did not look like an improvement over the days of the shared QA environment. To solve this, the Spacepods team added a GitHub hook that automatically spins up a new Pod once a build is ready. The Pod gets the latest data snapshots and the most recent versions of the code dependencies (e.g., the backend app if we are deploying the frontend). It gets automatically rebuilt whenever there's a new build version, which is great for verifying bug fixes. Currently, there's a reserved pool of pre-created Pods that are always available for CI, so the hook only triggers the deploy.
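For illustration, the decision logic behind such a hook might look like the sketch below. The payload fields and the `podPool` shape are assumptions made for this example, not TrueCar's actual implementation.

```javascript
// Hypothetical sketch of a build-hook handler: given a webhook payload
// for a finished build, decide whether to deploy it to a Pod from the
// reserved pool. All names here are illustrative.
function planPodDeploy(payload, podPool) {
  // Only act on successful builds that belong to an open pull request
  if (payload.state !== 'success' || !payload.prNumber) {
    return { action: 'skip' };
  }
  // Reuse a Pod already assigned to this PR (rebuild on new pushes)...
  const existing = podPool.find((p) => p.prNumber === payload.prNumber);
  if (existing) {
    return { action: 'redeploy', pod: existing.name, build: payload.buildId };
  }
  // ...or claim a free Pod from the pre-created pool
  const free = podPool.find((p) => p.prNumber === null);
  if (!free) {
    return { action: 'wait' }; // pool exhausted; queue the deploy
  }
  free.prNumber = payload.prNumber;
  return { action: 'deploy', pod: free.name, build: payload.buildId };
}
```

The key property is that a rebuilt PR lands on the same Pod it already had, which is what makes verifying bug fixes cheap.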
Once the environment is ready, test engineers begin testing the code change. We add automated scripts for every new feature and update existing tests as needed. The GitHub hook that triggers creating the environment also kicks off the Gatekeeper tests (GK), which gives us quick feedback on the most critical generic application flows. To make sure the change does not break any of the existing (but less critical) functionality, we also need to run the general regression. With the Otto framework, this was a manual step: test engineers used personal test jobs on Jenkins, pointed them at the specific Pod URLs, and configured them to run either a full regression or a more targeted one.
With the new and the old automated tests passing, test engineers sign off on the PR. If GK is green, the author can merge the code change to the master branch, where it is picked up by the CD pipeline.
Full CI/CD was not possible until we completely migrated off the legacy platform hosted in the data center. The Pods are AWS environments where we can only validate the Capsela changes. The higher environments ("QA", Staging, and Production) were functioning in a hybrid mode, where services and pages were served by both old and new applications. Everything was very complicated. For example, we could have a UI test that validated submitting a lead, where the lead form triggered a call to the new Capsela API, which in turn called a legacy API, which pulled data from a new Capsela data source.
This setup meant that we could not automate the deployments to the integrated environments. So, until we finished migrating all the services and applications into the cloud, we built a semi-automated version of the CD.
Everything touching the legacy stack was super complex and fragile and thus required a lot of time to verify. It was not feasible to test and release every single change at that point. So we decided to batch the changes and go through the full release process once or twice a day, depending on the need.
A person who had the utmost interest in shipping their team's code to Production served as the release captain. To get a release out, they had to go through a series of five to ten semi-automated steps (the number of steps has been decreasing as we continue migrating off the DC). They started a release by "tagging" the latest commit as "in flight" with a Slack message.
While the tagged build was deployed and tested, the next in line was slated to be “on deck.” It was like a virtual queue, where everyone had to pay attention to that Slack channel to track the release progress.
Since the QA environment was in that hybrid mode where old and new applications co-existed, this is where all the integration issues revealed themselves. We could not rely on automated regression alone. That is why, for each release, we got every test engineer to sign off on the regression test suites that they "owned." The ownership definition was fairly arbitrary, mostly following the application components that the engineering teams worked on. We had teams focused on New/Used Cars, Post Prospect, Dealer Portal, etc., so the test engineers on those teams had the most context about the recent changes and could review and interpret the regression results. We used Slack polls, where a vote on the poll meant that that person had covered the regression for their area.
After testing on QA, we promoted the code change to Staging and Production. We got rid of the UAT environment and repurposed Staging for user acceptance testing. With the Gatekeeper passing on Staging, the build was finally cleared to go to Production. That might sound like the final step of a release, but it was not: the release captain still had to smoke test the newly shipped change and keep an eye on the monitoring systems to catch any anomalies.
This process was not optimal, and we knew that. It was a temporary compromise on the way to fully automated CD with Capsela.
As we got close to finishing the migration off the data center, it seemed like we could switch to full-scope CI/CD almost instantaneously. We'd automate the manual deployment steps and be able to ship every code change with little to no manual intervention.
Building Otto and integrating it with the rest of the CI infrastructure turned out to be only the first step. With the automated CI/CD, the development and deployment pace significantly increased, and test stability became essential.
Tests are supposed to fail only when they catch an application issue. In reality, we have often witnessed unrelated test failures that were “healed” on rerun. The most common reasons for UI test flakiness were:
- Missing or duplicated data-test attributes that we use for locating elements on the page
- A lot of Watir::Wait::TimeoutError issues due to using the Ruby-based Watir framework for testing async JS application features
- Continuous A/B testing across all the pages with no unified strategy of controlling the versions
- Data issues — we've been running all our UI tests against real data, with random inputs and selections, indirectly testing the data layer, which often had issues that had nothing to do with the application code changes we were trying to deploy
- Feature environment performance — with hundreds of concurrent requests to the same endpoint setting up the initial application state, several tests failed on every single run
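The first item deserves a small illustration. Centralizing the selector convention in one helper (a hypothetical helper sketched in JavaScript, not Otto's actual Ruby code) is one way to make missing or duplicated data-test attributes easier to track down:

```javascript
// Illustrative helper: build every element lookup from the single
// data-test naming convention, so the convention lives in one place
// and a missing or renamed attribute is a one-line fix.
function testSelector(name) {
  return `[data-test="${name}"]`;
}

// e.g. a test would locate the lead form's submit button with
// testSelector('lead-form-submit') -> '[data-test="lead-form-submit"]'
```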
The test engineering group took on the Make Otto Green Again (MOGA) initiative to review, clean up, and stabilize the tests and address the issues above. It produced excellent results, but with the Otto platform growing to include automated tests for close to 30 different applications, keeping it continuously green was time-consuming. Depending on the project and the size of its test suite, some test engineers had to spend up to 80% of their sprint time on test maintenance.
The second challenge was the test platform itself. Developers started creating PRs with hooks enabled for CI and CD, and one of those hooks runs the integration tests using our framework. Imagine hundreds of PRs, each with multiple updates, sending hundreds of simultaneous requests to our test Jenkins server to run the tests… Jenkins started failing. At times it would just go down, and that could happen multiple times per week.
The Test Platform team took on the initiative to scale Jenkins for CI/CD. Here is what they addressed to make Jenkins work as a test runner at that scale:
- Distributed test execution across jobs to reduce the load on any single job. Each application has separate jobs for CI and CD.
- Invested in better-performing hardware, balancing cost vs. performance: moved to larger AWS EC2 M5 instances with more CPU and memory.
- Started storing fewer build artifacts on Jenkins. Screenshots from UI test runs were taking up more and more disk space, so we moved them to AWS S3.
- Increased the EBS volume size for more IOPS on the instances.
- Aggressively pruned Docker containers inside the slaves.
- Reduced the long queues of jobs waiting to be executed by using auto-scaling slaves through the AWS EC2 plugin for Jenkins 2.0.
- Parallelized test execution.
- Implemented a "carpool lane" for CD jobs so that deployments can finish quickly: reserved separate slaves on Jenkins to run CD jobs only.
- Increased the number of instances in the dev environment to enable running tests in parallel.
And the third challenge was the fact that our test and application code lived in two separate codebases. Some of the basic tools, such as the dependency manager, were different, which created additional inconvenience for team members who wanted to contribute to both codebases. For example, the main backend repository used RVM, while Otto used rbenv. Unit and integration tests were split between the two repositories (unit tests with the app, integration tests in Otto). It was hard to get a good picture of the complete test coverage, and as a result, engineers created duplicate tests in both repositories. This also complicated the CI flow, since we now needed a way to time the merge of an application change with that of its corresponding Otto tests.
We have been actively discussing all these pain points at our CI/CD guild meetings. Some issues we fixed, but others required a much larger effort. Both MOGA and the Jenkins optimizations were temporary band-aids rather than long-term solutions. Overall, the testing pipeline worked, but the goal of a 100% stable test suite was not achievable without changing something in the test framework and test architecture.
From Otto to Cypress
Cypress is an open-source framework for UI testing that comes with an interactive test runner, a selector playground, and screenshots and playback of every action, which makes creating tests a breeze. Additionally, Cypress runs directly in the browser, which makes the tests fast. And it is very stable thanks to its built-in retry and automatic-waiting mechanisms. It currently does not have cross-browser testing support, but that is on the Cypress team's roadmap.
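To make the automatic waiting concrete, here is a minimal Cypress spec. The route and data-test names are invented for this sketch and are not from our actual suite.

```javascript
// cypress/integration/lead_form_spec.js (illustrative sketch; selectors
// and routes are made up for the example)
describe('Lead form', () => {
  it('submits a lead for the selected vehicle', () => {
    cy.visit('/used-cars-for-sale');

    // cy.get retries until the element appears: no manual sleeps, which
    // is what eliminates most explicit-wait timeout failures
    cy.get('[data-test="vehicle-card"]').first().click();

    cy.get('[data-test="lead-form-name"]').type('Test User');
    cy.get('[data-test="lead-form-submit"]').click();

    // the assertion is also retried until it passes or times out
    cy.get('[data-test="lead-confirmation"]').should('be.visible');
  });
});
```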
Working together with the developers, architects, and DevOps, we included the following criteria in our evaluation: ease of development and maintenance, speed of execution, and the level of effort to integrate the tool into the existing CI/CD pipeline. The agreement was to go with Cypress, and we started working on bringing it into our infrastructure.
We’ve been working with Cypress for only a few months, but we already see huge improvements in our development and testing process:
- Selecting a JS framework and co-locating it with the application code helps break down the barrier between developers and test engineers and encourages close collaboration.
- It significantly simplifies the CI/CD flow. The frontend apps use AWS CodeBuild, and UI tests are now an integral part of that build process. With that, we no longer need to support Jenkins as an additional continuous integration tool.
- We no longer need separate suites of automated tests; we always run everything that is checked in.
- We no longer create tickets to add test attributes to the page; the engineers are adding them as needed when writing the tests.
- We have complete visibility into the application code and the other test suites (unit and integration), which allows us to avoid test case duplication.
- The tests run 30% faster with Cypress.
- The tests are much more stable. We're addressing the flakiness at its root: by changing our approach to seeding data and by stubbing external and unstable requests. The tests no longer rely on the ever-changing data produced by the data feeds. Instead, they use an internal API that provides predefined test data. We have also set strict boundaries for data input and selection; e.g., where selecting a random make and model in every test used to be common practice, we now choose a fixed one provided by the abovementioned API. We still see value in random-data testing, but considering the failure rate of such tests, it should not block the CI/CD flow.
- With fewer failures and false positives, there is more trust in the test results. Each failed test is reviewed instead of being re-run.
- By using the Cypress Dashboard service, we save ourselves the time and effort of supporting custom parallelization and reporting services.
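The stubbing and seeding approach can be sketched roughly as follows. The endpoint and fixture names are invented for this example; cy.intercept is the network-stubbing API in current Cypress versions.

```javascript
// Illustrative sketch of stubbing an unstable dependency and reading
// predefined seed data from a hypothetical internal test-data endpoint.
describe('Vehicle details', () => {
  it('renders pricing from a stubbed, deterministic response', () => {
    // Stub the flaky external pricing call with a fixture, so the test
    // no longer depends on live data-feed content
    cy.intercept('GET', '/api/pricing/*', { fixture: 'pricing.json' });

    // Fetch a known, pre-seeded vehicle instead of a random one
    cy.request('/test-data/vehicles/seeded').then(({ body }) => {
      cy.visit(`/vehicles/${body.id}`);
      cy.get('[data-test="price"]').should('contain', body.expectedPrice);
    });
  });
});
```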
As for our monolithic backend application, which was already leveraging RSpec for unit testing, not much has changed. We dropped support for the separate Minitest suite in Otto and migrated the non-duplicated cases to the backend repository.
Test engineers organically shifted left in the process. We learned that a standalone test automation framework can turn developers away from engaging in test automation. Once we co-located the tests with the application code, everyone in the engineering group started contributing to the automated test coverage.
We are finishing porting the tests from Otto to Cypress, and we're planning to retire the Gatekeeper soon in favor of a fully integrated Cypress test suite. Finally, we're at the point where we run a full regression for every change in a reasonable amount of time. That allows us to deploy as many times a day as we need, with no hotfixes whatsoever.
There are still a few things to decide on. Currently, Cypress does not support cross-browser testing, so we are looking for a backup solution there.
Having added Cypress to our main monolithic frontend app, we still need to find a way to leverage it for our other, smaller internal projects.
And we are still experimenting with solutions to our test data challenge. Adding an internal endpoint to expose the seeded data helped us stabilize the tests. But we still see value in tests that leverage random data selection, as they have proved great at finding edge-case bugs that manual testing would not have caught.
Thanks to Eric Slick for reading drafts of this post.