Daily Deploys Done Right

Graham Pederson
Published in Upstart Tech
Sep 2, 2022

I want to tell a tale of my journey at Upstart, but through the lens of deploying code.

When I started at Upstart, my first feature was experimenting with auto-approving small dollar loans. I finished my work and wanted to get it deployed. I asked our team manager how that happened, and he said that once it was merged, someone would take it to production for me.

The deployment process wasn’t time bound, and it didn’t clarify who was responsible for the deployment.

At the time we had an all-engineering pager duty rotation, where all of engineering rotated through on-call for issues that were not assigned to squads. If the pager went off, you triaged the issue. Deploying to production was also part of this rotation. We were triaging issues from tools like Bugsnag, Sumologic, Datadog, PagerDuty, and others.

As Upstart grew, so did the burden of daily pager duty. The person on the PagerDuty rotation rarely had time to actually do the deployment.

Engineers rarely had time to deploy the code that was queued up.

We had a painful process for releasing code, so it wasn’t done consistently. We needed to deploy consistently.

When something is painful, do it more.

We went through a process to figure out what was wrong, and fix the issues with deployment.

As with every new problem, we wanted to lead with data, so we gathered data around the current state.

After gathering the data we determined we wanted to make the system stable, make the system fast, and then clean it up and save money.

General Build and Deploy Information

Deployments in general are part of an overall process, where you validate an artifact, build the artifact and then ship that artifact to an environment.

The validation and build steps are CI. The shipping is CD.

Continuous Integration (CI) — “Continuous integration (CI) is the practice of automating the integration of code changes from multiple contributors into a single software project” (Atlassian)

Continuous Delivery (CD) — “is a software development practice where code changes are automatically prepared for a release to production” (Amazon)

Continuous Deployment (CD) — “is a software engineering approach in which software functionalities are delivered frequently through automated deployments” (Wikipedia)

Gathering Data

With CI and CD it’s hard to figure out why something failed without digging into the specific error message. If the pipeline fails in a certain step like running tests, there can be a number of reasons why it failed.

  • The test itself is bad, or the code under test is broken
  • The code running the tests could be bad (we were running tests in parallel, with custom code)
  • The infrastructure running the tests could fail
  • 3rd party integrations can go down

After manually gathering this data for almost a year, we produced the chart pictured below.

This isn’t super clear about which failures to dedicate time to, so let’s break it down by ‘type of failure’.

Aha! We can see the infrastructure failures, inconsistent tests, and implicit merge conflicts happen the most often.

Stability was our main concern

We also knew that our pipeline was way too slow: it took over an hour to run through CI.

We built Datadog dashboards to show timings by parallel stage.

This helped inform us which steps to improve.

In this example, improving the build step would not yield an overall reduction in the pipeline time, because it’s handled in parallel.

Again, the prep_env step:

Improving the client cache creation would not yield an overall reduction in time. It could, however, save us money, which was not a concern in the beginning.
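To feed dashboards like these, each pipeline stage has to report how long it took. Here is a minimal sketch of what that reporting could look like with the dogstatsd-ruby client; the metric and tag names are hypothetical, not the ones we actually use.

```ruby
# Sketch: report a stage's duration to Datadog so it shows up on a
# timings-by-stage dashboard. Metric and tag names are hypothetical.
require 'datadog/statsd'

statsd = Datadog::Statsd.new('localhost', 8125)

stage = 'prep_env'
started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

# ... run the stage ...

elapsed_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round
statsd.timing('ci.stage.duration', elapsed_ms, tags: ["stage:#{stage}", 'pipeline:monolith'])
statsd.flush
```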

Slow CI = Grumpy Developers

We also knew our CI and CD spend was way too high. No one really directed us to change the spend, because we were comfortable spending more money to go faster. As we wrapped up our project we wanted to see what our overspend was.

What to improve

We grouped the improvements into 3 big groups.

  1. Stability
  2. Speed
  3. Cost Savings

Make it stable

It doesn’t matter how fast you go unless you know the results are valid. A slow test suite that works consistently is not great, but it’s better than a fast, broken pipeline.

From the chart above we focused on:

  • Infrastructure failures
  • Inconsistent tests
  • Implicit merge conflicts

Infrastructure Failures

If the systems you use to run the pipeline are not stable, you can’t blame the code running on those systems for the stability issues.

Systems we worked with

Jenkins

Jenkins was out of date, so we upgraded the version of Jenkins first.

We still suffered from jobs queuing in Jenkins, so we scaled Jenkins out horizontally and moved non-monolith builds to a separate server. We also split the deploying branch from the feature branches.

Then we reduced the number of concurrent jobs running so Jenkins didn’t fall over anymore.

Upstart treated each infrastructure failure as a high priority failure of the system. We root caused each failure and fixed it so it wouldn’t happen again.

Pipeline Code

When the project started we were all in on Jenkins. The pipeline used Groovy and a mash of bash scripts, OpenShift templates, and a shared library… none of which had any automated tests.

We moved code out of the shared library and into the monolith, because it wasn’t really shared.

We also created a number of python scripts with a non-zero amount of tests. We never fully converted the pipeline over, but we converted a few critical parts.

Inconsistent tests

We made some tough decisions around making the test suite more stable. A lot of these I would not do again.

The colors indicate whether you should do it: green = go, red = avoid, yellow = use caution.

Removing Randomness (Don’t do this)

Randomness helps you find unexpected issues in your code. Because of randomness, we regularly found issues with our tests rather than with the code under test.

Rspec randomness

We set the kernel seed in spec_helper

Kernel.srand config.seed

As well as the Faker seed

Faker::Config.random = Random.new(config.seed)

In the command to call rspec, we also set

--seed 123456789
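Putting those pieces together, the relevant spec_helper configuration looked roughly like this (a sketch reconstructed from the snippets above):

```ruby
# spec/spec_helper.rb (sketch)
require 'faker'

RSpec.configure do |config|
  # Pin every source of randomness to the seed RSpec was invoked with
  # (e.g. `rspec --seed 123456789`) so runs are reproducible.
  Kernel.srand config.seed
  Faker::Config.random = Random.new(config.seed)
end
```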

We also changed our test splitting so it was no longer random. We had a lot of issues with tests impacting the tests that ran after them. There were a number of reasons why:

  • Class variables
  • allow_any_instance_of / expect_any_instance_of stubs in a global scope
  • Database reset issues / contention issues

Adding retries (Do this!)

Cucumber allows you to set retries in the command:

bundle exec cucumber --retry 2 --no-strict-flaky --strict

We added rspec-retry to our fast, non-parallel rspec tests.
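A minimal sketch of how rspec-retry can be wired up in spec_helper; the retry count here is illustrative, not the value we settled on:

```ruby
# spec/spec_helper.rb (sketch; assumes the rspec-retry gem is in the Gemfile)
require 'rspec/retry'

RSpec.configure do |config|
  config.verbose_retry = true                 # log when an example is retried
  config.display_try_failure_messages = true  # show the failure from each attempt

  # Run each example up to 3 times before marking it failed (illustrative count).
  config.around :each do |example|
    example.run_with_retry retry: 3
  end
end
```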

Quarantine Tests (Be careful!)

Tests with non-deterministic outcomes were flagged and sent to their owning teams to quarantine; the teams added them back once they had figured out the root cause of the flakiness and rewritten the test.
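One lightweight way to quarantine is RSpec metadata filtering; here is a sketch, with a hypothetical tag name and example:

```ruby
# Tag a known-flaky example (hypothetical tag name and spec).
RSpec.describe 'small dollar loan auto-approval' do
  it 'auto-approves under the threshold', quarantine: true do
    # ...
  end
end

# spec/spec_helper.rb: exclude quarantined examples from the default run.
# A separate, non-blocking job can still run them via `rspec --tag quarantine`.
RSpec.configure do |config|
  config.filter_run_excluding quarantine: true
end
```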

Implicit Merge Conflicts

We defined these as follows: your code works in your feature branch, someone else’s code works in theirs, both PRs get approved and merged at roughly the same time, and the combination of the two merges causes one of the PRs to no longer function.

We didn’t block on being up to date with our default branch. The CI at the time was ~ 1 hour, so being up to date would be basically impossible.

Enter the merge queue.

Merge Queue (Use one!)

We worked to add a merge queue to solve the Implicit Merge Conflict problem. The vendor we selected wasn’t able to handle our scale, and after working hard with the vendor we decided to move off their solution.

During a hack week session we built a queue that runs batches of code together (all, halves, and quarters) and merges the largest successful batch. We still run this today, and the cost is much lower than the one-by-one or batched queue from our previous vendor.
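A rough sketch of the batching idea, with hypothetical names; the real queue also handles requeueing, scoring, and reporting:

```ruby
# Sketch of the "all, halves, quarters" batching strategy: test candidate
# batches as merged units and merge the largest one that passes CI.
def split_into(prs, parts)
  return [] if prs.empty?
  prs.each_slice((prs.size / parts.to_f).ceil).to_a
end

def largest_passing_batch(queued_prs, &passes_ci)
  candidates = [queued_prs] + split_into(queued_prs, 2) + split_into(queued_prs, 4)
  # Candidates are ordered largest-first, so `find` returns the biggest winner.
  candidates.reject(&:empty?).find { |batch| passes_ci.call(batch) }
end

# Usage (passes_ci? would merge the batch into a throwaway branch and run CI):
# batch = largest_passing_batch(open_prs) { |prs| passes_ci?(prs) }
```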

This type of custom queue also allows us to do things like:

  • If a PR fails in a batch, assign it a score; if the score gets too low, sync the PR with the default branch
  • Visualize the risk of the current deployment / batch
  • Track the flakiness of the test suite

GitHub has a merge queue that we are planning to pilot and switch to for our non-monolith repos. Start with your vendor’s supported solution if you can.

Learnings

Random is good!

Randomness is valuable to the long-term health of your test suite and code base. Removing the randomness from the ordering and the seeds impacts your ability to detect real issues before they hit production. Don’t even consider it. Fix the tests and code instead.

We moved our randomly split tests to a pseudo-random split based on how long each test took to run, sharing the load across the pods. The goal was to bring down the overall pod time and the variance between pod run times.
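The split itself is a simple greedy balancing problem. Here is a sketch; the timing data is hypothetical and would come from previous runs:

```ruby
# Assign spec files to pods by historical runtime: take the slowest file first
# and always give it to the currently-lightest pod ("longest job first").
def split_by_runtime(spec_files, pod_count, timings)
  pods = Array.new(pod_count) { { files: [], total: 0.0 } }
  spec_files.sort_by { |file| -timings.fetch(file, 0.0) }.each do |file|
    lightest = pods.min_by { |pod| pod[:total] }
    lightest[:files] << file
    lightest[:total] += timings.fetch(file, 0.0)
  end
  pods.map { |pod| pod[:files] }
end

# split_by_runtime(Dir['spec/**/*_spec.rb'], 40, timings_from_last_run)
```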

Retries are good, if not masking major issues

Retries allowed us to mask issues with database contention. Once we figured out why our tests were flaky at a macro level, we experimented with expanding the number of databases the tests interacted with, and largely solved the retry problem. We kept the retries anyway.

Quarantine with care

If tests don’t get backfilled, you leave a coverage gap. We didn’t have any issues from this, but if you don’t have good buy-in from squads, be really careful.

Merge queue is good

As we grew, this made life much better for developers; they were no longer being asked to triage code that broke because of someone else’s changes.

Make it fast

CI time at the start of the initiative: ~80 minutes
CI time nearing the end of our efforts: <17 minutes

We had a stable build system. It took you over an hour to get things through CI, and another hour to get it ready to deploy. But it worked, finally.

An hour is an incredibly long time to wait for feedback about your changes. ~17 minutes is also a really long time, but it meant developers got feedback roughly 4x faster.

Steph wrote a much more in-depth article about this; please take the time to read it if you want to go deeper.

Code Improvements

Cache Images for testing and build

In our CI system we have the concept of builder and runner images. The builder images are what’s used to build from zero. The runner images are pre-built images with libraries already installed to reduce our setup time. In Ruby, when we bundle, it should take next to zero time to install gems.

We were not rebuilding these runner images nightly, so we transitioned toward a more frequent build cycle.

Fast track our data change Pipeline

We had data changes as part of our deployment pipeline. We extracted the data-only changes to a ‘scripts’ folder, and allowed them to be reviewed and run outside a full deployment. This cut our PR count in half for deployments.
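A hypothetical example of what one of these data-only scripts could look like; the model and file name are made up:

```ruby
# scripts/backfill_loan_purpose.rb (hypothetical data-only change)
# Reviewed like any other PR, but run on its own with
# `rails runner scripts/backfill_loan_purpose.rb` instead of riding a full deployment.
Loan.where(purpose: nil).in_batches(of: 1_000) do |batch|
  batch.update_all(purpose: 'unknown')
end
```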

Skip stages on certain changes

Some folders contained code that didn’t require certain stages to run, so we started skipping those stages based on the files changed in the PR. This also saved a non-trivial amount of time.
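A sketch of the idea; the paths and stage names here are hypothetical:

```ruby
# Decide which stages a PR actually needs, based on the files it changes.
# Paths and stage names are hypothetical.
changed_files = `git diff --name-only origin/master...HEAD`.split("\n")

docs_only      = changed_files.all? { |path| path.start_with?('docs/') }
touches_assets = changed_files.any? { |path| path.start_with?('app/assets/', 'package.json') }

skip_stages = []
skip_stages << 'asset_build' unless touches_assets
skip_stages << 'rspec'       if docs_only

puts "Skipping stages: #{skip_stages.join(', ')}"
```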

Scale Pods

The tests took ~28 hours to run locally. We ran the tests in parallel across pods, balanced the pods, and built reports around the time each step in the process took.

Include others

We worked with our talented software engineers to understand what they found valuable, and removed everything that no longer made sense to run.

End state

Our scope was over 12,000 builds a month, and we cut the overall CI time to less than half. That’s over 6,000 hours of CI time saved: developers not waiting for their changes, and a massive productivity boost for the SWE organization.

We saved over 6000 hours of CI time

Process Improvements / Deploy Predictably

Our process when we started looked like this…

I highlighted the people involved in the process.

We would run CI, deploy to staging — then have a ton of people validate things, get sign offs, then go to production… where a person would manually watch the system.

There were a few things wrong here:

  • Manual testing was a primary sign-off
  • SWEs were required to manually test every PR
  • E2E tests were automated and caught things, but finding issues in Staging was a huge derailment
  • People were looking at the release with their cave-people-eyes

Release Branches

Upstart has hundreds of developers, all trying to merge to master and get code to production. When we hit issues in staging, we need to address them and try again.

We needed a way to modify the known-good code plus today’s release without picking up all the PRs merging throughout that process.

We also had issues with high priority releases derailing the daily release.

We added release branches to put a “bookmark” on what we were working on. We could create as many as we wanted to handle deployments independently.

Cost of Detection

You’ve almost certainly stumbled upon this graph, or one like it, in your software engineering career. The Y axis is sometimes different, but the exponential curve is always there.

Detecting things later in the process is always harder to recover from. Our detection in staging caused a ~4 hour blocker to our release (slow CI, triaging issues, etc.). This was killing our ability to deliver on time.

Manual Testing

Our Quality Engineering team was manually testing every release. We worked with them and stakeholders to talk through the risks that this brings. They automated most of their ‘every day’ checks, and now do simple exploratory testing.

End to End tests

Our End to End (E2E) tests are acceptance tests in our staging environment. They check things like “is this other service running?” and that we get the expected responses back. They also implicitly check the full health of the environment.

We use a lot of experiments, and a healthy dose of LaunchDarkly flags. Engineers test in Staging-1 as well.

All of this makes our E2E tests “fail” even when we could have easily gone to production without issue.

We continue to work with our SDET team to move our tests into the CI pipeline, and to think about validating the same code paths with mocks and then validating the mocks in staging (see the sketch after this list). This would give us the same assurances that a full acceptance test does:

  • It runs the same code paths, but mocks out the services
  • The services should have a contract with our monolith
  • We validate the contract
  • A = B = C
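Here is a sketch of what that could look like in RSpec; the service, URL, and fields are hypothetical:

```ruby
require 'webmock/rspec'
require 'net/http'
require 'json'

# The agreed-upon response shape for a hypothetical decision service.
DECISION_CONTRACT_KEYS = %w[approved amount_cents].freeze

# A: in CI, the monolith's code path runs against a mock that honors the contract.
RSpec.describe 'decision client (CI, mocked)', :unit do
  it 'handles an approved decision' do
    stub_request(:post, 'https://decisions.example.internal/decide')
      .to_return(status: 200, body: { approved: true, amount_cents: 50_000 }.to_json)

    body = JSON.parse(Net::HTTP.post(URI('https://decisions.example.internal/decide'), '{}').body)
    expect(body.keys).to include(*DECISION_CONTRACT_KEYS)
  end
end

# B: in staging, a small check confirms the real service still matches the contract
# (run in a separate job, without WebMock intercepting requests).
RSpec.describe 'decision contract (staging)', :staging do
  it 'returns the agreed-upon fields' do
    body = JSON.parse(Net::HTTP.post(URI('https://decisions.staging.example.internal/decide'), '{}').body)
    expect(body.keys).to include(*DECISION_CONTRACT_KEYS)
  end
end
```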

Software Engineers (SWEs) were required to manually validate their changes

The SWE organization required its SWEs to validate each change as it went to staging. This was mostly a huge cat-wrangling exercise for my team, and in general people were giving an approval without doing anything.

We ran a poll, and gathered data. This was presented to engineering leadership, and we stopped requiring everyone to validate their changes in a staging environment.

Visually Validating Releases

When code hit production we used Bugsnag to monitor for unwanted behavior. This was done over a period of time, and issues were triaged. Sometimes we rolled back.

One of our Directors in parallel was running “Engineering Excellence” reviews. This helped teams think about how their applications were being monitored, and which endpoints required a rollback if they encountered an uptick in 500 error codes. There is also space to talk about variance from the norm, and how that’s captured and reacted to.

One of these is automatable, and one of them is not. Which one do you like better: teams self-reporting issues, or someone hoping there are no uncaught exceptions?

Our End State Process

So many fewer people, so much more automation!

Cleanup and Save Money

Lastly, we dug into the cost. I gave a presentation about the ~6,000 hours of time we saved, and we had numbers around cost reduction. We took it a step further and reduced the wasted infrastructure allocation.

Cleanup

We also noticed that in GitHub we had thousands of dead PRs. We added the stale bot (https://github.com/marketplace/stale) to our monolith, which helped clean out the cruft.

We audited all branches in GitHub and automatically closed anything older than 6 months with no changes on it. We audited everything else and closed unused branches developers had left around.

This didn’t directly affect CI time, but it appeared to improve the Jenkins webhook response times. It was done originally for our merge queue solution, but we continued the practice afterwards.
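A sketch of that kind of branch audit using the Octokit gem; the repo name is hypothetical, and this version only reports candidates rather than deleting anything:

```ruby
# List branches with no commits in the last 6 months (dry run; hypothetical repo).
require 'octokit'

client = Octokit::Client.new(access_token: ENV.fetch('GITHUB_TOKEN'))
client.auto_paginate = true

repo   = 'upstart/monolith' # hypothetical
cutoff = Time.now - (6 * 30 * 24 * 60 * 60)

client.branches(repo).each do |branch|
  last_commit = client.commit(repo, branch.commit.sha)
  next if last_commit.commit.author.date > cutoff

  puts "Stale: #{branch.name} (last commit #{last_commit.commit.author.date})"
end
```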

Scale down

We used tools in AWS (Cost Explorer) to find gaps in the usage of our CI pods, specifically the pods running tests.

We cut down our memory usage and the number of CPUs we requested.

We made adjustments like this throughout the pipeline, resulting in over a million dollars a year in cost savings in our CI system. No one noticed the changes, except our finance department!

Gathering More Data (End to End Pipeline)

Breakdown of failures in the pipeline

The chart might be hard to read, so here’s an easier-to-consume set of data.

As a table:

We re-ran the data at the end of our improvements. Implicit merge conflicts still show up, because we were measuring while turning the queue on and off and on again.

But more than 50% of the failures in the default branch are related to code issues. That number trends up closer to 80% when you include all the various lower-percentage categories. And that 80% was reduced even further when we upgraded our k8s cluster; we’re now much closer to 95% of deployment issues being related to code.

Deploying is no longer a Pipeline Bottleneck

What’s next?

At Upstart we have a monolith, and removing or isolating code is challenging. It’s also the path of least resistance to add code to the monolith, even if it’s a net-new service.

We are going to fix that problem and make the easy thing the right thing: get your code out of the monolith.


Graham Pederson
Upstart Tech

I am currently an Engineering Manager at Upstart, working under the Platform Organization.