Increasing productivity with Drone
In this post, Laura Martin, our Site Reliability Engineer, explains how we used an open source tool to improve the developer experience.
In modern software practices, Continuous Integration is a key component of ensuring fast feedback loops for developers writing and shipping code. The quicker the feedback, the quicker something of value can be shipped to users.
The main application that runs FutureLearn has a comprehensive RSpec test suite to provide confidence to developers that what they’re shipping is safe. When I started here, the test suite was taking anywhere from 10 to 20 minutes to run, which we wanted to improve.
The variation in time was caused by contention on the shared SaaS Continuous Integration platform we were using. If other customers were busy, our test times increased. On top of that, the platform was ageing, so we were tied to older dependencies and could not update parts of our application.
We determined that our goals would be to improve reliability, performance and flexibility. While not the main motivator, cost was also a factor.
We tested several CI products, and ultimately decided to go with Drone. Drone is an open source CI product written in Go and based on Docker, with simplicity at the core of its implementation. Rather than compare the different products we tried, in this post I’d like to focus on the benefits of using Drone and the steps we took to bring build times down to between 5 and 7 minutes.
On the SaaS platform, we had a package that allowed for 6 parallel builds for 6 parallel jobs. This meant that each job could run a single build on 6 different instances. To make the most of this setup, the test suite used a Ruby Gem called Knapsack, which split the test suite across the set of instances. While we were able to split tests across different instances, these instances were shared, so each build was limited by whatever resources happened to be available on its instance. Without splitting the job up into multiple builds, the test suite ran for ~40 minutes; splitting it into 6 builds brought build times down to ~15 minutes.
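To illustrate the idea of splitting a suite by timing, here is a minimal Python sketch. It is not Knapsack's code or its actual algorithm; it uses a common greedy heuristic (slowest file first, always onto the least loaded node) as a stand-in, and the file names and timings are invented for the example.

```python
import heapq


def split_by_duration(durations, node_count):
    """Assign each spec file to whichever node currently has the least work."""
    # Min-heap of (total_seconds, node_index) so the lightest node pops first.
    heap = [(0.0, index) for index in range(node_count)]
    heapq.heapify(heap)
    buckets = [[] for _ in range(node_count)]
    # Place the slowest files first; ties are broken by path for determinism.
    ordered = sorted(durations.items(), key=lambda item: (-item[1], item[0]))
    for path, seconds in ordered:
        total, index = heapq.heappop(heap)
        buckets[index].append(path)
        heapq.heappush(heap, (total + seconds, index))
    return buckets


# Hypothetical timings in seconds, invented for illustration only.
timings = {
    "spec/models/course_spec.rb": 120.0,
    "spec/models/user_spec.rb": 90.0,
    "spec/features/enrol_spec.rb": 60.0,
    "spec/lib/search_spec.rb": 30.0,
}
buckets = split_by_duration(timings, 2)
totals = [sum(timings[path] for path in bucket) for bucket in buckets]
```

Even with a perfectly balanced split, each node still pays its own setup cost and competes with other tenants for resources, which is one plausible reason why 6 builds gave us ~15 minutes rather than a sixth of ~40.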
We decided performance might be better if we used a single larger instance without contention, rather than having it split across several smaller contended instances. To be able to configure RSpec to make full use of a server’s resources, we looked into a Ruby Gem called Parallel Tests, which is exactly what it sounds like.
We run our infrastructure using Amazon Web Services, so we chose a large instance with 16 vCPUs and 32GB of memory (c5.4xlarge) as a Drone agent to test against. The Parallel Tests Gem mostly worked, but there were a few problems to solve. For example, we had to increase the number of databases Redis supports by default; we kept running out of Amazon EBS burst balance credits, which we solved by running Elasticsearch on tmpfs; and several places where parallel test processes conflicted over shared resources, such as hardcoded database or Elasticsearch index names, had to be updated. We could see the clear improvement in performance and the end goal, so we just had to get stuck in fixing the bugs, and eventually, with great celebration, we got a clean test run.
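The naming conflicts come from every parallel process pointing at the same database, index or Redis database. Parallel Tests gives each process a TEST_ENV_NUMBER environment variable (blank for the first process, then "2", "3" and so on), and the usual fix is to derive resource names from it. The sketch below shows the idea in Python rather than Ruby; the names app_test and courses_test and the mapping onto Redis database numbers are hypothetical, not taken from our application.

```python
import os


def worker_suffix(env=None):
    """Return the per-process suffix: '' for the first worker, then '2', '3', ..."""
    env = os.environ if env is None else env
    return env.get("TEST_ENV_NUMBER", "")


def resource_names(env=None):
    """Build resource names that are unique to one parallel test process."""
    suffix = worker_suffix(env)
    # Redis databases are numbered from 0, so worker "2" gets database 1.
    # A stock Redis configuration allows 16 databases; anything needing more
    # means raising the `databases` setting.
    redis_db = int(suffix) - 1 if suffix else 0
    return {
        "database": f"app_test{suffix}",          # hypothetical name
        "search_index": f"courses_test{suffix}",  # hypothetical name
        "redis_db": redis_db,
    }


first = resource_names({})
third = resource_names({"TEST_ENV_NUMBER": "3"})
```

With names derived this way, no two processes touch the same database or index, so hardcoded names are the only thing left to hunt down.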
Drone itself runs in Docker, in either “server” or “agent” mode. The Drone server takes care of queueing and scheduling jobs, authenticating users and providing a web frontend so users can view any logs or job statuses. It natively hooks into GitHub, and once you have Docker installed on the host, installing and setting up takes no time at all. State is stored in a backend database (in our case MySQL), which means if the Drone server goes away we shouldn’t have too many problems picking up where we left off.
Drone agents run the jobs themselves, and I made use of Amazon Autoscaling Groups to run the agents. To keep things simple, I set up a cronjob which scales the autoscaling group up to 3 instances each morning and back down to 0 overnight, to make sure we’re not spending money when we’re not using the system. If we need more flexibility in the future, we should be able to use running and pending job metrics to scale up and down as required.
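As a rough sketch of the two scaling approaches, the Python below computes a desired agent count from the time of day, and alternatively from job counts. The working hours, the one-job-per-agent ratio and the function names are assumptions made for illustration; only the limits of 3 and 0 come from our setup, and the code deliberately stops short of calling any AWS API.

```python
import math
from datetime import datetime

WORK_START_HOUR = 8   # assumption: agents come up at 08:00
WORK_END_HOUR = 19    # assumption: agents go away at 19:00
MAX_AGENTS = 3        # the group size we scale up to


def scheduled_capacity(now):
    """Time-based rule: full capacity during working hours, nothing overnight."""
    if WORK_START_HOUR <= now.hour < WORK_END_HOUR:
        return MAX_AGENTS
    return 0


def metric_capacity(running_jobs, pending_jobs, jobs_per_agent=1):
    """Demand-based rule: enough agents for all known jobs, capped at the maximum."""
    needed = math.ceil((running_jobs + pending_jobs) / jobs_per_agent)
    return min(MAX_AGENTS, needed)


daytime = scheduled_capacity(datetime(2024, 1, 8, 10, 0))
overnight = scheduled_capacity(datetime(2024, 1, 8, 23, 0))
busy = metric_capacity(running_jobs=1, pending_jobs=4)
idle = metric_capacity(running_jobs=0, pending_jobs=0)
```

The schedule is simpler to reason about, while the demand-based rule avoids paying for idle agents on quiet days at the cost of waiting for an instance to boot when the first job arrives.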
Drone configuration is stored in a “.drone.yml” file, which gives us the benefit of storing our configuration in version control. The configuration is a superset of a Docker Compose configuration file, which means you can configure “services” to start when the pipeline begins. One of our problems was upgrading dependent services like MySQL and Redis to later versions, but with this Drone configuration this upgrade path became significantly easier.
One thing I really like about Drone is being able to easily write plugins. Custom plugins can be used at any step, and we already make use of a couple: one to send a nicely formatted notification to Slack, and one to maintain a compressed file cache in S3. This gives us great flexibility in how best to run our pipeline, without being constrained by any tools.
Drone also ships with the ability to run pipeline steps in parallel (further decreasing the overall build time), and matrix builds which allow you to test against different versions of software, for example Ruby.
So far, Drone has been a very stable tool. Along with the performance gains, it gives us the flexibility we need to squeeze every last drop of performance out of our test suite.
Using a combination of tools, we have managed to decrease the time developers wait for feedback on their code changes. Running self-hosted, open source software means we get the best value in terms of running costs, and we have the flexibility to scale up and down when we require it.
I am a huge fan of Drone. I love the simplicity of the product, and the ease and minimal resources it takes to operate. I am excited to see what improvements are on the horizon, and will be a strong advocate for its use when self-hosted products make sense.