Saving 60 engineer hours a day

Pinterest, at least until 2014, had one giant repository containing its web, API and a few other core services. In it existed a solitary tests/ directory which contained thousands of unit tests. Many of the tests were fast, but altogether it could take upwards of 30 minutes to run through the unit tests.

I experimented with a few approaches to make this faster. We used nosetests, a popular test runner for Python, to find and run our unit tests. It had a multiprocessing plugin, but I wasn’t happy with the results and felt I could do better.

I also experimented with tmpfs, but running things in RAM instead of on disk didn’t save much time. I’d consider revisiting this if I needed to eke out more performance gains.

A strategy that I did like was to have a master process (in Jenkins) farm out work to a bunch of worker jobs and then aggregate the results. The master took the files under tests/, sorted them, and split them into 20–25 subjobs. When this worked it was fast, but eventually we’d break our GitHub Enterprise server because each running job triggered 25 simultaneous git operations.
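To make the sort-and-split step concrete, here is a minimal sketch in Python. It is not the code we ran at Pinterest: the function names, the `test_*.py` filename pattern and the round-robin way of dealing files into shards are all assumptions made for illustration.

```python
from pathlib import Path


def discover_test_files(root="tests"):
    """Return every test file under root as a sorted list of path strings.

    The "test_*.py" pattern is an assumption; use whatever your runner collects.
    """
    return sorted(str(path) for path in Path(root).rglob("test_*.py"))


def split_into_shards(paths, num_shards):
    """Sort the paths, then deal them round-robin into num_shards lists.

    Sorting first means every worker computes the same split from the same
    checkout, so worker k can simply pick shard k.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    ordered = sorted(paths)
    return [ordered[i::num_shards] for i in range(num_shards)]


if __name__ == "__main__":
    # Hypothetical usage: print the files assigned to each of 25 subjobs.
    for index, shard in enumerate(split_into_shards(discover_test_files(), 25)):
        print(index, " ".join(shard))
```

Dealing files round-robin rather than cutting the sorted list into contiguous chunks tends to spread a slow directory across several workers, but either split satisfies "sort, then divide."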

We had a handful of Jenkins servers so these jobs could be run on any number of servers and then brought back to the master job to be aggregated. I learned that you could pin (this was Pinterest after all) a worker job to the same server as the master job. In order to run so many jobs I used my favorite Amazon EC2 instance, the c3.8xlarge. It had an SSD and a whopping 32 cores. Each of those 32 “CPUs” could effectively run a Jenkins executor. By doing this, and then using a shared workspace, we were back in the realm of every job taking exactly one git pull/clone operation.

A lot of the implementation was handled in Makefile tasks. I had a lint task, and then I had a “run some of the tests” task. This made light work of telling all the workers what to do. If any worker failed, the master job would fail too, which in turn told the engineer that her code had some issues.
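The fail-if-any-worker-fails rule can be sketched in a few lines of Python. This is only a local stand-in for what Jenkins did for us: the workers here are plain subprocesses rather than Jenkins jobs, and `run_worker` and `run_all` are hypothetical names.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_worker(command):
    """Run one worker command (a list of arguments) and return its exit code."""
    return subprocess.run(command).returncode


def run_all(commands, max_workers=4):
    """Run every worker command in parallel and aggregate the results.

    Returns 0 only if every worker exited with 0; otherwise returns 1,
    mirroring a master job that fails when any worker job fails.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        exit_codes = list(pool.map(run_worker, commands))
    return 1 if any(code != 0 for code in exit_codes) else 0


if __name__ == "__main__":
    # Hypothetical workers: two trivial Python processes, one of which fails.
    workers = [
        [sys.executable, "-c", "raise SystemExit(0)"],
        [sys.executable, "-c", "raise SystemExit(1)"],
    ]
    sys.exit(run_all(workers))
```

In practice each command would be whatever invokes a single shard of tests, such as a Makefile task given a shard number.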

As with all dev-tools work, the best part of this task was everyone’s reactions. I started by whitelisting a few engineers to use the newer, faster process, which helped us collect constructive feedback instead of angry grumblings. Eventually everyone wanted to use it, so I opened it up to all engineers. When I left Pinterest most tests still ran quickly. If you run your own testing infrastructure and have the flexibility to throw more machines at a problem, do it. Your engineers will love you.



DadOps 24/7 and DevOps Consultant. Formerly @Pinterest and @Mozilla
