Continuous Improvements: 5 ways we’ve improved our build process at Treehouse
At Treehouse we don’t have a dedicated QA team. That’s something we’d like to add in the future, but for now we rely on continuous integration (CI) to help us avoid mistakes.
Every time we push changes to GitHub, a new build is started on CI. Each build goes through several setup steps and then runs the entire suite of RSpec and Jasmine tests. Finally, when all the tests finish, we get a nice summary of the results. This feedback is one check that we use before deploying code to production.
The feedback that CI provides needs to be fast and, more importantly, reliable. Engineers rely on this information and need to be able to trust it. When feedback is slow, engineers waste time waiting on CI to pass before they can deploy. If builds fail randomly, more time is wasted re-running the test suite. Even worse, after a while people simply start to ignore build failures before deploying. This is a dangerous habit to get into and inevitably leads to errors making it through to production. Build failures should be the exception, not the rule.
Earlier this year we found ourselves in exactly this situation: our build process was slow and tests would fail at random. While we focused hard on adding new features, our build started to suffer. As a result, our build times increased from under 5 minutes to almost 20 minutes. This added at least 15 minutes to our deploy process. We deploy to production several times a day, and this really started to add up — 4 hours per week, 16 hours per month, or 24 working days over the course of a year. Every developer and designer felt this pain directly, so we took a step back and spent some time on our build process.
Here are a few of the approaches we used to improve our build process.
Compile + cache assets
Now we precompile assets once per build and cache tmp/cache/assets between builds:
$ ln -s ~/cache/assets/test tmp/cache/assets
$ RAILS_ENV=test rake assets:precompile
We’ve since upgraded to libsass, and the time it takes to compile our assets went from around 5 minutes to under 1 minute! We might be able to get away without compiling assets during builds at all, but doing it lets us keep an eye out for performance issues. If we notice a spike in compilation times, it’s a good indication that we have a performance regression somewhere.
We have used this in the past to detect regressions in our Sass styles. At one point, we added some keyframe animations that resulted in a huge amount of additional generated CSS. By using git bisect and timing the execution of rake assets:precompile, we were able to find the specific commit that introduced the regression and then address it.
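A search like that can be automated with git bisect run, using a timeout as the pass/fail signal. This is a sketch under assumptions: the 300-second threshold is illustrative, and good-sha stands in for the last commit known to compile quickly.

```shell
# Mark the endpoints of the search (good-sha is a placeholder).
git bisect start HEAD good-sha

# git bisect run treats a non-zero exit as "bad"; `timeout` exits non-zero
# when precompilation takes longer than 300 seconds, so slow commits fail.
git bisect run sh -c 'RAILS_ENV=test timeout 300 bundle exec rake assets:precompile'

# Return to the original checkout when done.
git bisect reset
```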
Retry intermittently failing tests
We occasionally have tests that intermittently fail on CI but pass locally and prove to be especially hard to diagnose.
For our first attempt at fixing these failing tests, we made a list and started fixing them one by one. We would fix one randomly failing test and another would crop up soon after. This process took a good amount of time and effort and was extremely frustrating.
Instead of trying to fix failures like this, we accept that some tests will fail intermittently and simply retry them. It turns out that other teams have run into similar issues before. The rspec-retry gem handles retrying intermittently failing tests seamlessly.
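A minimal spec/spec_helper.rb setup for rspec-retry might look like the following. The retry count and the CI-only guard are our assumptions, not prescriptions from the gem:

```ruby
# spec/spec_helper.rb
require 'rspec/retry'

RSpec.configure do |config|
  # Print a message each time an example is retried.
  config.verbose_retry = true

  # Retry flaky examples up to 3 times on CI, but fail fast locally
  # so intermittent failures can still be investigated at a desk.
  config.default_retry_count = ENV['CI'] ? 3 : 1
end
```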
Profile ActiveRecord queries
Unit tests that perform a lot of database queries are unnecessarily slow. We’ve found factories to be really great because they make it easy to create test data, but they can also hide test setups that result in a high number of ActiveRecord queries. Factories aren’t the only source of ActiveRecord queries, of course, but they’re a good place to start looking. When we profiled our code, we found that our worst offenders were performing over 1,000 queries per test. Yikes!
One approach is to use ActiveSupport::Notifications instrumentation to profile ActiveRecord queries in our tests.
Add to spec/spec_helper.rb:
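A sketch of such a subscriber follows; the 100-query warning threshold and the output format are assumptions:

```ruby
# spec/spec_helper.rb
RSpec.configure do |config|
  config.around(:each) do |example|
    query_count = 0

    # Notification callbacks receive (name, start, finish, id, payload);
    # we only need the payload here.
    counter = lambda do |*_args, payload|
      # Skip cached queries and schema loads -- they don't hit the database.
      query_count += 1 unless %w[CACHE SCHEMA].include?(payload[:name])
    end

    # Subscribe only for the duration of this example.
    ActiveSupport::Notifications.subscribed(counter, 'sql.active_record') do
      example.run
    end

    # Flag examples that look like they're doing too much database work.
    warn "#{example.full_description}: #{query_count} queries" if query_count > 100
  end
end
```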
Run tests in parallel
RSpec typically runs in a single Ruby process. Most computers today have multiple cores, so a single process only takes advantage of ¼ or ⅛ of the total processing power. We use the parallel_tests gem to distribute individual tests across multiple processes, taking advantage of all cores.
The really neat thing about this gem is that it can be configured to write a log of how long each test takes to complete. The next time around it will use this log to group tests into similar sized chunks. Each group runs in a separate process and this evenly distributes tests across all available processors.
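Concretely, parallel_tests records per-file runtimes through an RSpec formatter and can then group files by those runtimes. The invocation below is a sketch; the log is written by the gem's ParallelTests::RSpec::RuntimeLogger formatter, configured in a .rspec_parallel file:

```shell
# Group spec files into evenly sized chunks using the recorded runtimes,
# then run one group per process.
bundle exec parallel_rspec --group-by runtime spec/
```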
Running tests in parallel requires that they be isolated from each other. This means that, in addition to running in its own process, each test group gets its own MySQL database. For Elasticsearch, Redis, and memcache we use namespaces to isolate everything.
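parallel_tests exposes each process's index through the TEST_ENV_NUMBER environment variable — empty for the first process, "2", "3", and so on for the rest — which makes per-process database names and namespaces easy to derive. A sketch, with hypothetical helper names:

```ruby
# TEST_ENV_NUMBER is unset/"" for process 1 and "2", "3", ... afterwards,
# so the first process keeps the plain database name.
def test_database_name(base = "app_test")
  "#{base}#{ENV['TEST_ENV_NUMBER']}"
end

# The same suffix works as a key namespace for Redis or memcache.
def test_namespace(base = "app_test")
  "#{base}:#{ENV.fetch('TEST_ENV_NUMBER', '')}"
end
```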
Use a hosted CI
All our builds run on Codeship, and we’ve found several of their features to be useful. ParallelCI is one that allows us to run our test suite in parallel. Along the same lines, they also allow multiple concurrent builds, which lets teams build completely different branches at the same time. Codeship also caches dependencies between builds, and we take advantage of that to cache assets. By using a hosted CI we spend less time configuring and monitoring a self-hosted system, which means we have more time to focus on creating a great product for our students.
Our build is in much better shape now: times are back down to around 5 minutes and we no longer have randomly failing tests. Like most performance work, our build process is something we’ve had to improve continually. We’ve tried a lot of different approaches and gone through many iterations. This is something we’re never really going to be done with, and we’ll need to stay proactive as our product evolves.