Continuous Delivery at Grailed

Jose Rosello
Grailed Engineering
6 min read · Jul 2, 2019

“CI is taking too long”

“Don’t deploy now, I’m about to deploy my change”

“Taking over our QA environment, let me know if you need it”

These were the typical messages one would read while scrolling down the Grailed engineering Slack channel on any given day last summer. Although our development process worked fine when Grailed was only a handful of engineers, it started to become a bottleneck as the team grew. After a few frustrating months, we decided it was time to reevaluate our existing process.

Although every system is unique, we had a common list of pain points when we decided to review our development workflow — long-running builds, long-running test suites, confusing CI configuration, and QA environment contention all worked in concert to create a less than ideal experience.

Before diving deeper, it’s important to understand that Grailed is a monolithic Rails API with various decoupled clients (web browser, iOS, and Android). The process outlined below refers to the building of our Rails API and web client code; iOS and Android are built and tested differently and are a topic for another blog post.

That being said, this is what our development workflow looked like at the time:

  • The developer’s feature branch was pushed to Github, kicking off a build and running our test suite on Travis CI
  • Once ready, the developer opened a PR on Github
  • Once the PR was reviewed by a peer and the Travis CI build finished successfully, the developer would let others know over chat that they would be merging their change and deploying the latest version of master
  • The developer would merge their PR on Github and immediately push master to Heroku through a script on their machine (a sketch of such a script follows this list)
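
A minimal, hypothetical sketch of that kind of deploy script (the remote name, app name, and migration step are assumptions for illustration, not our actual setup):

```ruby
#!/usr/bin/env ruby
# bin/deploy -- hypothetical sketch of a "push master from your machine" deploy.
# Remote name, app name, and the migration step are illustrative assumptions.

def run!(cmd)
  puts "==> #{cmd}"
  system(cmd) || abort("Command failed: #{cmd}")
end

run! "git checkout master"
run! "git pull origin master"
run! "git push heroku master"                    # deploy the latest master to Heroku
run! "heroku run rails db:migrate --app my-app"  # app name is a placeholder
```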

Three of our big pain points came alive during that flow: the 8-minute build, the 20-minute test suite (roughly 28 minutes of waiting in total), and the herculean coordination effort necessary for a developer to release their changes.

Improving our build and test suite times

We identified the test suite time as one of the largest contributors to developer unhappiness and as a bottleneck to the modernization of our deployment process. We felt that having a long testing cycle would make the feedback loop between automated deployments and code merges too long to be safe; someone may commit a change at 11am and then have it land at 12:30pm when they’re out to lunch and unable to inspect any potential issues. It would also discourage developers from iterating and making revisions to their PRs once all the Github ✅ appeared, since that meant coming back to them half an hour later (and if one of the tests was flaky, another half hour!).

A lot of the improvements we made to our build time are specific to our technologies, and the first large wins were low-hanging fruit:

  • Replacing sass-rails with sassc-rails to speed up our Sass compilation step (see the Gemfile sketch below)
  • Skipping Webpack’s JavaScript minification step during testing (perhaps controversially, since it doesn’t match Production, but the time savings were too good to pass up)
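
For reference, the sassc-rails change is a one-line Gemfile swap; the sketch below assumes a standard Rails asset pipeline, and the version constraints are illustrative only (the Webpack change lives in the JavaScript build configuration and isn’t shown here).

```ruby
# Gemfile (sketch; version constraints are illustrative)

# Before: the pure-Ruby Sass compiler
# gem "sass-rails", "~> 5.0"

# After: sassc-rails wraps libsass, a C implementation that compiles
# stylesheets considerably faster in CI.
gem "sassc-rails", "~> 2.1"
```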

Using sassc-rails and skipping the minification step in Webpack bought us back 4 minutes on Travis CI, but we wanted to do better than 16 minutes. Getting there meant taking advantage of parallelization and better caching, and that’s where we ran into trouble.

Migrating our CI environment and Docker

Although we liked working with Travis CI, we had accumulated some tech debt in the form of scripts that built our environment (such as installing the right version of PostgreSQL), which we had to run on every build and were unable to cache. When we tried to parallelize our build, we also started to run into hardware limitations and were left wondering how to make proper use of the hardware at our disposal.

In searching for alternative CI vendors to address the limitations above, we found Codeship to be the best suited: it supported Docker and its own flavor of Docker Compose out of the box, so we had complete control over our build and how it was cached. It also let us choose and profile the kind of hardware we needed to run our test suite. Codeship allowed us to:

  • Parallelize our Cypress integration tests
  • Parallelize our RSpec tests using Knapsack (a sketch of the setup follows this list)
  • Rely on Docker to ensure our builds were aggressively cached with minimal, declarative configuration
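
On the RSpec side, Knapsack’s setup is small: bind its adapter in the spec helper, load its rake tasks, and tell each parallel CI node which slice of the suite it owns. The snippet below is a minimal sketch based on Knapsack’s documented API rather than our exact configuration.

```ruby
# spec/spec_helper.rb
require "knapsack"
# Tracks per-file timings so Knapsack can split spec files evenly across nodes.
Knapsack::Adapters::RSpecAdapter.bind

# Rakefile
# Exposes the knapsack:rspec task (guarded so the gem isn't required in production).
Knapsack.load_tasks if defined?(Knapsack)

# Each CI node then runs its slice, e.g. (with CI_NODE_TOTAL matching the parallelism):
#   CI_NODE_TOTAL=4 CI_NODE_INDEX=0 bundle exec rake knapsack:rspec
# A one-off run with KNAPSACK_GENERATE_REPORT=true generates the timing report
# that the split is based on.
```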

After doing all this work, we could consistently run our build in less than 8 minutes, reducing our build time by roughly 60%.

Improving our QA process

With the test suite time under control, we decided to focus on the next biggest pain point: contention over our QA environment.

QA is Production-like — it has a recent copy of our Production data and interacts with all the services Production would. It’s imperative that we test all but the most trivial changes on it. Having a single QA environment meant that developers had to negotiate whose branch would be deployed at a given time and added a layer of overhead and orchestration that would certainly not scale with the growth of the team.

One proposed alternative was to set up a handful of static QA environments to “load balance” our needs, but this proved unnecessary with the introduction of Heroku Pipelines. All Heroku applications can benefit from this feature, which includes a way to directly “promote” binaries and built applications from QA to Production environments, as well as Review Apps, a feature we used to remove environment contention entirely.

With Review Apps, one is able to automatically create a new application for each PR opened on Github. It copies the environment variables used by QA, and can optionally set up different dependencies for each app. In our case, we chose to connect all these apps to the shared QA PostgreSQL instance while instantiating other services for each (Redis, memcached, hooks for logging, etc.). This made sense for us because we still wanted developers to test their changes against a Production-like data set on their review apps. We have to be mindful of QA’ing migrations since they’re introduced across all review apps as a result, but it’s rare for us to introduce breaking migrations in a way that would force a developer to rebase their branch and rebuild their review app.
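
To make “breaking” concrete: with a shared database, removing or renaming a column is the classic trap, because the schema change lands for every review app while only one branch has the matching code. A common Rails pattern for staying backwards compatible is sketched below, assuming a Rails 5+ codebase; the model and column names are made up for illustration.

```ruby
# Step 1 (ships first): stop reading and writing the column, so code running
# on every review app keeps working against the shared QA database.
class Listing < ApplicationRecord
  self.ignored_columns = ["legacy_shipping_code"]  # hypothetical column
end

# Step 2 (a later deploy): drop the column once nothing references it.
class RemoveLegacyShippingCodeFromListings < ActiveRecord::Migration[5.2]
  def change
    remove_column :listings, :legacy_shipping_code, :string
  end
end
```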

Since developers no longer needed to deploy their branches to a single QA environment, we decided to have our original QA environment track master, which made it very useful for mobile clients to test their applications against the latest codebase.

Improving our deployment process

Things were looking much better by this point — our test suite was fast and we no longer had to coordinate QA deployments — but we still deployed our changes to Production one at a time, coordinating over Slack and running scripts on our machines.

The first improvement was removing the need to run the script by once again taking advantage of Heroku Pipelines to “promote” our QA master build to Production with the click of a button.

The second improvement came quickly afterward, and it was the last step before we moved from performing Continuous Integration to Continuous Deployment. We automated the Production build in the same way we had automated our QA one — Production was changed to automatically track master, so merging a PR on Github would trigger a CI build on Codeship of the master branch and, upon passing, trigger a build on both QA and Production environments on Heroku. We felt some psychological discomfort when going through this transition, but we had a good deal of confidence in our test suite and peer review / QA process and thought this kind of frictionless deployment was worth the risk.

Aftermath

Although it may be hard to directly measure the consequences of having less friction in our development workflow, it suffices to say that it would have been impossible to keep our old approach today with almost twice the number of engineers working on the codebase (a number which is only getting larger and could include you).

It’s been almost a year since we made the change to Continuous Deployment and we haven’t looked back. Our heavier days see around 30 smooth deployments. It’s encouraged folks to make smaller PRs and deploy more often, which has made it very easy to identify any issues we do end up introducing. Feedback from old and new Grailed engineers has also been very positive, thanks in large part to the reduction in deployment complexity and risk, the fast test suite, and the ability to test in isolated QA environments.
