Expedia Group Technology — Engineering

How failing faster could actually help your software team deliver more

An image of the left side of a keyboard
Photo by Michelle Ding on Unsplash

Introduction

This is the story of how a web app’s CI/CD workflow went from bad to worse, and how we reimagined the process to be more efficient and actually enjoyable for developers. A key part of this accomplishment was the DevOps practice of “shifting left”, which means testing application changes continuously, early, and often to tighten the developer feedback loop. I first encountered the concept of shifting left through a fantastic presentation that a colleague of mine at Expedia Group™ put together for the Browserstack “Breakpoint” summit in March 2021. That presentation inspired me to put the methodology into practice at a time when my team needed it most.

From Bad to Worse

Continuous Integration (CI)

The web app in question is “shopping-pwa”, a progressive web app that implements Expedia’s lodging shopping experience. Like most healthy applications, it had a series of automated checks that ran for each new pull request to the repository’s master branch. These checks included unit tests, code linting, schema validation, and performance tests, to name a few.

The team had been using Jenkins to run these PR checks since long before I started at Expedia. Developers harbored real frustration and animus towards Jenkins as a whole, and for good reason: jobs were, by all accounts, very slow, and most developers I know hate waiting. It took 2–3 minutes just to check out our repository, let alone run a test against it.

Since we were using an “on-premise” Jenkins instance, each job ran on a physical machine that was provisioned whenever we requested one. These machines were constantly reused, sometimes running several jobs at the same time. Because of this, we regularly hit the phenomenon of one job performing some filesystem “cleanup” that corrupted the execution of any jobs running simultaneously on the same machine, causing them to fail. These failures were “false negatives”: the jobs failed because of a limitation of the CI tooling, not because of the application itself.

Additionally, many of our jobs shared the same steps, yet each Jenkins job needed its own configuration, visible only inside Jenkins’ unintuitive user interface, and there was no way to distribute shared code or configuration between jobs. These settings were often buried and hard to find within each job. As a result, almost no developers had a clue what each job was actually doing under the hood, which made job failures hard to troubleshoot and resolve.

And if a developer wanted to change one of these Jenkins jobs, the change could be made without any peer review and would go unnoticed; you just edited the configuration and saved it. This lack of peer review caused an egregious issue where we thought a particular test had been passing correctly for several months, only to find out later that a harmless-looking config change had been causing the job to fail silently the whole time.

Continuous Delivery (CD)

One of my first projects at Expedia was UI automation: we built out a comprehensive automated test suite for shopping-pwa that mimics user interactions in the app. We wanted to run the same tests across many different configurations, such as large/small screen sizes and several experiment variants. We ended up placing these tests in our deployment pipeline, where they ran right before the release step, serving as a final guardrail before changes reached production. Because of our many configurations, we ran 8 instances of the same Jenkins job in parallel, which constantly chewed up resources on the shared physical machines and led to longer queue times for other Jenkins jobs.

Since these tests often caught bugs before they made it out to customers, this was a huge win! However, all these pipeline tests were slowly making the developer experience more cumbersome. To illustrate this, here is an image of our pipeline at this point:

An image showing a complex Spinnaker pipeline with many steps, including the “Build” stage which is very slow.

If someone took the time to get all of their PR checks passing and merged their PR to master, this was the journey they were embarking on. When everything ran smoothly, the total pipeline run time was typically 70–80 minutes. “Monitoring the pipeline” was a required task associated with merging a PR, because if one of the tests failed, you were responsible for retrying the job (in case it was flaky) and investigating the failure. This took even more time and blocked everyone else from merging their changes, since only one pipeline could run at a time. If one of the pipeline tests was legitimately failing due to some oversight on your part, you had to create a revert PR that undid your change and find someone to help you merge it. That would then trigger another pipeline run that had to finish before the next person in line could finally merge their change.

In total, this process could take up to 3 hours even with the most diligent monitoring. And as for the developer who was next in line? All they could do was sit and wait patiently. This blockage would usually cause a pile-up in the “merge queue”, which the team sometimes devoted entire days at a time to clearing out. All this just because of a tiny little test failure! And not only did these test failures occur past the “point of no return”, but it often took over 45 minutes for the failing test to even start running!

Reimagining Our DevOps Workflows

Starting the Shift Left

Towards the beginning of 2021, we started hearing about a new internal feature that allowed teams to deploy branches to an isolated test environment with a unique internal URL. We call this environment an “island”. The island had obvious uses for user acceptance testing (UAT), but pretty soon folks figured out that they could run automated tests against it, just as they would against a canary environment in the deployment pipeline. As more and more teams onboarded to this island deployment process, our team finally hopped on the bandwagon. We set it up so that each new commit to a PR would trigger a job that packaged up our app (just as we would do for a production deployment) and deployed it.
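To make this concrete, here is a minimal sketch of what a per-commit island deployment could look like, expressed as a GitHub Actions workflow (the tool we adopt later in this story). The actual deployment went through Expedia-internal tooling, so the deploy script, image name, and trigger details below are illustrative assumptions rather than our real configuration.

```yaml
# Hypothetical sketch only: deploy every new PR commit to an "island" environment.
# The real deployment used internal tooling; deploy-to-island.sh is a placeholder.
name: island-deploy

on:
  pull_request:
    branches: [master]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      # Package the app the same way a production build would
      - run: docker build -t shopping-pwa:${{ github.sha }} .

      # Hand the image off to the island deployer (placeholder for internal tooling)
      - run: ./scripts/deploy-to-island.sh shopping-pwa:${{ github.sha }}
```

The important property is that every commit to a PR produces a fully deployed, individually addressable copy of the app that later checks can run against.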

We also had a performance PR check in place that would build our app in Jenkins and run Lighthouse tests against localhost:8080 within the same job. This took up a lot of memory and often took 20–30 minutes to complete. We figured that if we were going to deploy to the island with each new commit for UAT purposes anyway, why not run these performance tests against the island? So we scrapped that monstrous Jenkins job and replaced it with a much simpler one that exclusively ran Lighthouse tests against the island URL, triggered on GitHub’s “deployment status” success event so that it would start running as soon as the island was ready. As a result, the island deployment became a required step for merging a PR to master.
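As a rough sketch, a workflow like the following could listen for the deployment success event and point Lighthouse at the island. The Lighthouse CI invocation and the payload field used for the island URL are assumptions, not the exact job we ran.

```yaml
# Hypothetical sketch only: run Lighthouse once the island deployment reports success.
name: island-performance

on:
  deployment_status:

jobs:
  lighthouse:
    # Only react to successful deployments
    if: github.event.deployment_status.state == 'success'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: npm install -g @lhci/cli
      # The island URL is assumed to arrive in the deployment status payload
      - run: lhci collect --url="${{ github.event.deployment_status.target_url }}"
```

Compared with building the app inside the CI job and testing localhost, this keeps the performance check small: it only needs a browser and a URL.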

Soon after, someone pointed out that one of our pipeline steps was taking over half an hour to complete, and this made them very unhappy. This was the pipeline’s first step, the “master build” step, which packaged up the application and pushed the resulting Docker image to our Artifactory repository. I noticed that there wasn’t much difference between this master build job and the island build job we were now running for each PR. I thought, “Why are we building and pushing an image for each PR, only to build and push an image to the exact same place after we merge? Why are we doing double work?” After a few minor tweaks, we matched the way we built the island Docker image with the way the master build job built it, and removed that step from the pipeline entirely! This was a milestone because it marked a leftward shift: we had moved a release step closer to the initial development stage so that it ran earlier in the process. And as a bonus, our pipeline now finished 20–30 minutes faster!
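The underlying idea is simply “build once, reuse the same artifact”. Sketched in the same hypothetical style as above, the island build could tag its image with the commit SHA and push it to the shared registry, so the deploy pipeline pulls that exact tag instead of rebuilding; the registry host and secret names below are placeholders, not our real setup.

```yaml
# Hypothetical sketch only: push the PR-built image so the deploy pipeline can reuse it.
name: island-build-and-push

on:
  pull_request:
    branches: [master]

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      # Log in to the registry (placeholder host and secret names)
      - run: echo "${{ secrets.ARTIFACTORY_TOKEN }}" | docker login artifactory.example.com -u "${{ secrets.ARTIFACTORY_USER }}" --password-stdin
      # Tag with the commit SHA so the pipeline can pull this exact image later
      - run: docker build -t artifactory.example.com/shopping-pwa:${{ github.sha }} .
      - run: docker push artifactory.example.com/shopping-pwa:${{ github.sha }}
```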

An image of the same pipeline, with the slow “Build” step removed. Now, the next steps in the pipeline run sooner, and the entire pipeline finishes faster.

See Ya Later, Jenkins!

Around the same time, developers were becoming increasingly fed up with Jenkins. After countless guild meetings full of complaints about our developer experience, I decided to research alternative CI tooling. I had used Circle CI before and had been impressed by how easy it was to build CI/CD workflows that speed up the development lifecycle. I came across AWS CodeBuild and ran a successful proof of concept (POC) that showed drastic improvements in job runtime and solved the configuration nightmares we had in Jenkins. After pitching this to some folks on our delivery platform team, I was told that while they weren’t going to support CodeBuild, we would soon have access to GitHub Actions, GitHub’s new integrated CI/CD solution. This greatly excited me and prompted me to do a POC with Actions as well. I discovered that Actions was just as powerful as CodeBuild, except it was even easier to set up and offered more capabilities for our use cases. With GitHub Actions, each job is fully containerized, checking out a repository takes 2–3 seconds instead of 2–3 minutes, the job configuration lives right in the repository instead of in an obfuscated location, and you have access to GitHub events, which let you be creative about how and when you trigger CI workflows. I instantly knew that this was a tool we desperately needed.

Once we finally got access to GitHub Actions, I started migrating all of our PR check jobs from Jenkins over to Actions. I used the “paths” feature, which triggers a workflow only when relevant files change in a given PR; for example, we probably don’t need to run our 5-minute unit test job if we’re simply changing a markdown file. I also aimed to make each workflow as simple and understandable as possible, doing only exactly what the check needed. Pretty soon, our PR checks were running much faster, running only when they needed to, and becoming more reliable. Other developers showed interest and offered suggestions as well, which improved these checks even further. Our continuous integration process was finally getting some love.
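For illustration, a path-filtered unit test workflow could look something like the sketch below; the file globs, Node version, and commands are assumptions about a typical layout rather than our real check.

```yaml
# Hypothetical sketch only: skip the unit-test job for docs-only changes.
name: unit-tests

on:
  pull_request:
    paths:
      - 'src/**'
      - 'test/**'
      - 'package.json'

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: 16
      - run: npm ci
      - run: npm test
```

A markdown-only PR never matches the paths filter, so the workflow simply doesn’t run and doesn’t cost the author five minutes of waiting.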

Making our Continuous Deployment Pipeline Continuously Deploy

Now that our CI was in a much better place, I decided to reevaluate things on the CD side and started asking more questions. Why were we doing so much in our deployment pipeline? Shouldn’t our pipeline tests run at the PR level so that we learn about failures faster? I quickly made it a goal of mine to make our pipeline do one thing: deploy.

Armed with the awesomeness of GitHub Actions, I went to work redesigning our UI automation jobs to run against the island environment (at the same time as our performance checks). Unfortunately, the tests needed some tweaking to make them compatible with the nuances of the test environment. This disappointed me, because ideally we should be testing against the most production-like environment possible so that our tests are actually meaningful! Still, I was satisfied enough with the resulting test jobs to feel we could iterate going forward. I was able to use the “matrix strategy” feature of Actions, which lets you define a single workflow and run it with many different configurations in parallel. This made the configuration very declarative, which made it far more readable.
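As a sketch of the matrix idea (with made-up dimension values and commands, not our real configuration), a single workflow can fan out across screen sizes and experiment variants like this:

```yaml
# Hypothetical sketch only: one UI-automation workflow, fanned out in parallel.
name: ui-automation

on:
  deployment_status:

jobs:
  ui-tests:
    if: github.event.deployment_status.state == 'success'
    runs-on: ubuntu-latest
    strategy:
      matrix:
        screen: [large, small]
        variant: [control, variant-a, variant-b, variant-c]
    steps:
      - uses: actions/checkout@v3
      - run: npm ci
      # Each screen/variant combination becomes its own parallel job
      - run: npm run ui-tests -- --screen=${{ matrix.screen }} --variant=${{ matrix.variant }}
```

Each combination shows up as its own check on the PR, so a failure immediately tells you which configuration broke.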

Once these tests were running consistently for each PR, I was able to remove all those testing steps from the deploy pipeline. I also combed through each of the pipeline steps and either removed them if they were unnecessary or separated them out into their own pipeline if they were irrelevant to deploying a change to production. The result is what our pipeline looks like today:

The same pipeline, but simplified greatly by removing everything that isn’t doing deployment. All the branches and operations that were running tests were moved to other GitHub actions that run earlier in the development process.

Reaping the Rewards

Our new pipeline runs in around 15 minutes, which is night and day from where we started. Since all of the work has been pushed fully “left”, our team now knows right away when something has gone wrong with a change they are about to merge. Instead of merging code and waiting to learn about a test failure, our team finds out very early in a change’s lifetime. If a PR is first in line to merge and uncovers a test failure, that PR can simply be taken out of the queue, allowing the next person in line (whose tests have all passed) to merge right away and see their change in production 15 minutes later. We are no longer afraid to hit that “squash and merge” button, because that step is now a trivial piece of the larger CI/CD workflow we have in place. We even recently got access to GitHub’s “auto-merge” feature, which automatically merges PRs once their requirements are fulfilled. Instead of letting fear or manual intervention control our lives, we now have a seamless yet powerful workflow that eases our worries about releasing new features in quick succession.

I would strongly recommend that all teams take a good look at their CI/CD pipelines and explore whether deploying in 15 minutes or less is possible. For my team, leveraging GitHub Actions was simply an implementation detail. Looking back at this incredible adventure, I have learned that the secret to releasing quickly with quality is failing fast in a way developers can easily understand.
