Why you should be A/B testing your infrastructure
John Egan & Andrea Burbank | Pinterest engineers, Data & Data Science
The benefits of using a data-driven approach to product development are widely known. Most companies understand the benefits of running an A/B experiment when adding a new feature or redesigning a page. While engineers and product managers have embraced a data-driven approach to product development, few think to apply it to backend development. We’ve applied A/B testing to major infrastructural changes at Pinterest and have found it extremely helpful in validating those changes have no negative user-facing impact.
Bugs are simply unavoidable when it comes to developing complex software. It’s often hard to prove you’ve covered all possible edge cases, all possible error cases and all possible performance issues. However, when replacing or re-architecting an existing system, you have the unique opportunity to prove that the new system is at least as good as the one it’s replacing. For rapidly growing companies like Pinterest, the necessity to re-architect or replace one component of our infrastructure happens relatively frequently. We rely heavily on logging, monitoring and unit tests to ensure we’re creating quality code. However, we also run A/B experiments whenever possible as a final step of validation to ensure there’s no unintended impact on Pinners. The way we run the experiment is pretty simple: half of Pinners are sent down the old code path and hit the old system and the other half use the new system. We then monitor the results to make sure there’s no impact across all our key metrics for Pinners in the treatment group. Here are the results of three such experiments.
2013: A new web framework
Our commitment to A/B testing infrastructural changes was forged in early 2013 when we rewrote our web framework. Our legacy code had grown increasingly unwieldy over time, and its functionality was beginning to diverge from that of our mobile apps, because it ran through completely independent code paths. So we built a new web framework (code-named Denzel) that was modular and composable and consumed the same API as our mobile clients. At the same time we redesigned the look and feel of the website.
When it came time to launch, we debated extensively whether we should run an experiment at all, since we were fully organizationally committed to the change and hadn’t yet run many experiments on changes of this magnitude. But when we ran the experiment on a small fraction of our traffic, we discovered not only major bugs in some clients we hadn’t fully tested but also that some minor features we hadn’t ported over to the new framework were in fact driving significant user engagement. We reinstated these features and fixed the bugs before fully rolling out the new website, which gave Pinners a better experience and allowed us to understand our product better at the same time.
This first trial by fire helped us establish a broad culture of experimentation and data-driven decision-making, as well as learn to break down big changes into individually testable components.
We rely on an open-source library called pyapns for sending push notifications via Apple’s servers. The project was written several years ago and wasn’t well maintained. Based on our data and what we’d heard from other companies, we had concerns about its reliability. We decided to test out using a different library called PyAPNs, which seemed better written and better maintained. We set up an A/B experiment, monitored the results and found that there was a 1 percent decrease in our visitors with PyAPNs. We did some digging and couldn’t determine the cause for the drop, so we eventually decided to roll back and stick with pyapns.
We’ve slowly been moving towards a more service-oriented architecture. Recently we extracted a lot of our code for managing users and encapsulated it into our new UserService. We took an iterative approach to building the service, extracting one piece of functionality at a time. With such a major refactor of how we handle all user-related data, we wanted to ensure nothing broke. We set up an experiment for each major piece of functionality that was extracted, for a total of three experiments. Each experiment completed successfully showing no drop in any metrics. The results have given us strong confidence that this new UserService is at parity with the previous code.
We’ve had a lot of success with A/B testing our infrastructure. It’s helped us identify when changes have caused a serious negative impact that we probably wouldn’t have noticed. When they go well, they also give us the confidence that a new system is performing as expected. If you’re not A/B testing your infrastructure changes, you really should be.
John Egan is a growth engineer and Andrea Burbank is a data scientist
Acknowledgements: Dan Feng, Josh Inkenbrandt, Nadine Harik and Vy Phan for their help in running the experiments covered in this post.