Switching to Heroku CI
Since April 2018, the Doctolib test suite has been running on Heroku CI. This post shares the motivation behind the switch, as well as the insights we collected along the way. Before the changeover, the suite ran on a self-hosted Jenkins in combination with Browserstack for our end-to-end integration tests, which make up a decent portion of our test suite.
We made the switch for several reasons:
- The Jenkins instance had a non-negligible maintenance cost; it needed regular capacity upgrades, tune-ups, and random restarts.
- Browserstack had some stability issues that could affect us for an entire day every other month.
- Most importantly, both Browserstack and our Jenkins imposed limits on the number of concurrent builds we could run. We had a fixed number of test nodes on Jenkins that could start a build, and our Browserstack plan also had its limits: a fixed number of parallel tests, plus a wait limit for pending tests. This caused serious delays during periods of peak developer activity. Sometimes the developer on duty, responsible for the daily production rollout, had to resort to extreme measures to skip the build queue; they could be heard saying something like, “sorry folks, I’m gonna have to kill your builds, I need a green one on the production branch for deployment”. With the growth of our developer team, this became a recurrent issue.
For regulatory reasons, we cannot host our application on Heroku. However, as some of us are big fans of the developer experience Heroku provides, we were aware of the Heroku CI release in May 2017.
Unlike other CI services such as CircleCI or Travis, Heroku CI doesn’t put any limit on the number of concurrent builds you can run. As long as you are willing to pay for them, Heroku will spin up dynos to run your builds. When Parallel Test Runs were announced in beta in January 2018, we thought it would be worth trying to run our tests there in an attempt to:
- Reduce the maintenance burden of our CI setup
- Increase its stability
- Provide our developers with a zero queue time build experience, and thus shorter build times
Our first step was to set up the stack so our tests could run on a Heroku dyno. Our stack is pretty standard at Doctolib; the original proof-of-concept application ran on Heroku and the architecture hasn’t massively changed since then, so we knew right off the bat that it would most likely be feasible.
Our main datastore is PostgreSQL, and we use Redis for background job processing with Resque. As we wanted to parallelize our tests, we opted for the in-dyno versions of the official Heroku add-ons for these datastores. Elasticsearch is also part of our stack and required an in-dyno add-on too, but unfortunately none existed. However, it turns out that the in-dyno add-ons are nothing more than buildpacks, which are not that complicated to tinker with.
We forked the official Java buildpack from Heroku and a few commits later we had our own heroku-buildpack-ci-elasticsearch. We just recently submitted it to the Heroku Buildpacks registry (this was not an option at the time).
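Concretely, this kind of test stack is declared in the `test` environment of `app.json`. A minimal sketch of what such a configuration looks like — the `:in-dyno` plan names follow Heroku's convention, but the buildpack URL and the exact list shown here are illustrative, not our actual file:

```json
{
  "environments": {
    "test": {
      "addons": [
        "heroku-postgresql:in-dyno",
        "heroku-redis:in-dyno"
      ],
      "buildpacks": [
        { "url": "https://github.com/doctolib/heroku-buildpack-ci-elasticsearch" },
        { "url": "heroku/ruby" }
      ]
    }
  }
}
```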
We wanted to cut our dependency on Browserstack, so we followed the instructions for browser testing via headless Google Chrome, made possible by yet another buildpack.
Headless Chrome, unlike Browserstack, is obviously limited to a single browser. Yet even with Browserstack we had never taken the time to set up a multi-browser build of some sort and were only testing with the latest Chrome. Moreover, this switch was necessary to obtain the zero queue time promised by the Heroku CI architecture, and we were hoping it would increase the stability of our resulting CI setup.
Finally, to cut down the setup time, we also added a custom-made buildpack to cache our generated assets and reuse them whenever possible.
Once the stack was properly set up, we then worked on achieving the right level of parallelization in order to have our tests run in a reasonable amount of time. On the existing CI, with an empty queue, our build lasted 20 minutes so this was set as our initial target.
On top of the per-dyno parallelization offered by the Heroku CI Parallel Test Runs feature, we added another level of parallelization with the parallel_tests gem which, according to the README, “splits tests into even groups (by number of lines or runtime) and runs each group in a single process with its own database”.
Thanks to the in-dyno add-on, we can create multiple databases. However, the Heroku Ruby buildpack has a built-in mechanism that prevents creating and dropping databases from rake tasks; the trick is to remove the file added by the buildpack to restore the original behaviour. Once we have the necessary databases in our test dyno, we use the ParallelTests::Test::Runner.tests_in_groups method from the gem, combined with the CI_NODE_TOTAL and CI_NODE_INDEX environment variables provided by Heroku CI, to split the test files once. We then feed the files selected for the given dyno to the regular parallel_test command for another round of splitting per process.
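In spirit, the first level of that split looks like the sketch below. A naive round-robin stands in for `ParallelTests::Test::Runner.tests_in_groups`, which additionally balances groups by recorded runtime; the environment variable names and the zero-based node index match what Heroku CI provides.

```ruby
# Level 1: pick this dyno's share of the test files across all dynos.
# (Simplified stand-in for the gem's runtime-aware tests_in_groups.)
def node_slice(files, node_total, node_index)
  # Dyno i takes files at positions i, i + node_total, i + 2 * node_total, ...
  files.select.with_index { |_, i| i % node_total == node_index }
end

all_files  = (1..10).map { |i| "test/feature_#{i}_test.rb" }
node_total = Integer(ENV.fetch("CI_NODE_TOTAL", "4"))
node_index = Integer(ENV.fetch("CI_NODE_INDEX", "0")) # zero-based on Heroku CI

my_files = node_slice(all_files, node_total, node_index)
puts my_files.inspect

# Level 2: this dyno's share then goes to the gem for per-process splitting:
#   bundle exec parallel_test #{my_files.join(" ")}
```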
Heroku CI currently limits the number of dynos running in parallel for a build to sixteen and this is the setting we currently use. Ideally, we would like to keep decreasing our build time, however, the Heroku CI parallelization architecture comes at a cost. As noted in the Heroku documentation about building applications, “buildpacks lie behind the slug compilation process. Buildpacks take your application, its dependencies, and the language runtime, and produce slugs… a slug is a bundle of your source, fetched dependencies, the language runtime, and compiled/generated output of the build system — ready for execution”.
The documentation about running applications on dynos then says, “Heroku executes applications by running a command you specified in the Procfile, on a dyno that’s been preloaded with your prepared slug”. Unfortunately, Heroku CI doesn’t work exactly that way: the slug preparation doesn’t happen once before parallelization, but on each of the sixteen dynos. Our test setup phase takes about four minutes, and we are billed sixteen times for these identical minutes.
Getting to green
Finally, all that was left was to get a green build. With the change of setup came a few randomly failing tests. It seemed that tests which were merely flaky with our previous setup now ended up inevitably failing, most likely due to the difference in latency between Jenkins and Browserstack compared to headless Chrome running inside the dyno. Other than that, we had the usual timezone-dependent bugs and a few adjustments to make to stabilize our Elasticsearch buildpack.
We didn’t actually cut off our dependency on Browserstack entirely, because a small fraction of our tests cover a Chrome extension, and headless Chrome doesn’t support loading extensions.
In the process, we also took a step back in terms of error-reporting UI. On Jenkins we used the Blue Ocean UI, and had even tweaked it with a homemade Chrome extension, whereas the Heroku CI UI is not quite as elaborate. The documentation recommends outputting test results in TAP format, but we couldn’t manage to achieve this in our Ruby tests. To remedy the issue, we had to build a remote test reporter.
Heroku CI is not cheap and given the growth of the tech team at Doctolib, our monthly bill keeps increasing at a steady pace. It’s difficult to make an apples-to-apples comparison between the different CI solutions in terms of pricing though. The self-hosted Jenkins solution involved human costs that we never really quantified and the other hosted CI services have a completely different pricing model than Heroku CI.
On Heroku CI, one build costs us around $2 (sixteen dynos, times the build time, times the per-minute price of a Performance-L dyno). This price will increase as we add more tests, and the number of builds will increase as more developers join the team. On the other hand, TravisCI and CircleCI, for instance, offer plans with unlimited build minutes and collaborators. However, those plans are limited in terms of concurrency; the most expensive listed price for CircleCI, at $3,100 per month, can only run three concurrent jobs with a parallelism of sixteen containers. To attain the same level of concurrency that we use today on Heroku CI would definitely cost us more. To keep the cost down, we would have to let our builds queue — although, if you believe the CircleCI blog: “Letting builds queue for more than a minute is like valuing your developers’ time at less than a dollar per hour.”
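As a sanity check on that $2 figure, the arithmetic works out roughly as follows. The dyno price and build duration below are assumptions based on the numbers in this post (Performance-L was listed at $500/month at the time; Heroku actually prorates billing per second, so this is only an approximation):

```ruby
# Back-of-the-envelope cost of one Heroku CI build.
PRICE_PER_MONTH   = 500.0          # assumed Performance-L list price, USD
MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200 minutes in a 30-day month
price_per_minute  = PRICE_PER_MONTH / MINUTES_PER_MONTH

dynos         = 16                 # Heroku CI's current parallelism cap
build_minutes = 10                 # our typical post-switch build time

cost = dynos * build_minutes * price_per_minute
puts format("~$%.2f per build", cost)   # close to the ~$2 quoted above
```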
Overall, we are rather happy with the switch. We went from a build which took a minimum of twenty minutes, and as much as fifty minutes during peak developer activity, to a rather stable ten-to-twelve-minute build, all thanks to the zero queue time. Gone are the days of provisioning new machines for additional Jenkins nodes as developers arrived, and the overall stability of our CI setup has improved.