Stealthy shipping with atomic deploys
Nick Taylor | Pinterest engineer, SRE
Here I’ll discuss the atomic deploy system, a solution to some of the challenges that occur during deployment, and a path to successful and ongoing deployments.
Introducing the atomic deploy system
We deploy software at Pinterest in waves, taking 10% of our servers offline in a batch, replacing the software, and putting them back into service. This allows us to have continuous service during our deployments, but it also means we’re running a mixed fleet of old- and new-version servers.
The atomic deploy system is our answer to that mixed fleet. It grew out of our desire to balance rapid innovation with a consistent, seamless user experience. We aim to develop our technology as quickly as possible, so this system was designed to avoid the intricate dance of backward-compatible updates. At the same time, we wanted to avoid the jarring experience of a page reloading by itself (potentially even losing the user’s context), or of making the user click something to reload the site.
We hadn’t heard of other cases of doing deploys this way, which meant it could be a terrible idea, or a great one. We set out to find out for ourselves.
Managing “flip flops”
Say, for example, you visit the website on a Monday morning. You’re likely to view a page generated by one version of our front-end software, version “A”. If you hit reload, you’ll get pages generated by version “A”. Furthermore, we use XHR, so when you interact with the web app, dozens of requests are made in the background on your behalf. All of these requests are served by version “A”.
Later, we might wish to deploy “B”. Our standard model for deployments is to roll through our fleet 10–15% at a time, slowly converting web servers from serving “A” to serving “B”.
Now a request has a chance of being served by “A” or “B”, with no guarantees. With standard page reloads this is not a problem, but much of Pinterest is XHR-based, meaning only part of a page reloads when a link is clicked. Our web framework can detect when it expects one version and gets a response from another, in which case it’ll often force a reload.
For example, if you go to www.pinterest.com and the page is served by “A”, then you click a Pin and the XHR is served by “B”, you’ll get a page reload. At that point you might click on another Pin, which might be served by a mismatched version, which will cause another reload. In fact, you can’t escape the chance of reloads until the deploy is complete. We call this the “flip-flopping effect”. Rather than browsing Pinterest smoothly with nice clean interactions, you’ll get a number of full-page reloads.
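To get a feel for how often this can happen, here is a minimal sketch in Python. It is an illustration under a simplified model, not our production code: every request is assumed to land on a random server, a fraction `fraction_on_b` of the fleet has already been converted to “B”, and the browser reloads whenever an XHR is answered by a version other than the one that rendered the current page. The function name and its parameters are hypothetical.

```python
import random


def count_reloads(clicks, fraction_on_b, seed=0):
    """Count forced reloads for one user during a rolling deploy.

    Assumed model (for illustration only): each request is served by
    "B" with probability `fraction_on_b`, otherwise by "A".
    """
    rng = random.Random(seed)

    def serving_version():
        # Version of the randomly chosen server that answers a request.
        return "B" if rng.random() < fraction_on_b else "A"

    page_version = serving_version()  # initial full page load
    reloads = 0
    for _ in range(clicks):
        # Each click triggers an XHR that may land on either version.
        if serving_version() != page_version:
            reloads += 1
            # The reload is itself a new request, so it can land on
            # either version again.
            page_version = serving_version()
    return reloads
```

Calling `count_reloads(50, 0.5)` models a user clicking 50 times at the midpoint of a deploy. With `fraction_on_b` at `0.0` or `1.0` the fleet is uniform and the function returns zero, which matches the observation that reloads only stop once the deploy is complete.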
Our architecture changes
When you visit the site, you talk to a load balancer, which chooses a Varnish front-end, which in turn talks to our web front-ends, each of which used to run nine Python processes. Every process on a given web front-end served the exact same version.
What we really wanted was a system that would force a user to reload at most once during a deploy. The best way to guarantee this was atomicity, meaning that if we’re running version “A” and we’re deploying “B”, we flip a switch to version “B” and all users are on “B”.
We decided the best way to achieve this was to support serving two different versions of Pinterest and have Varnish intelligently decide which version to use. We created beefier web front-end boxes (c3.8xlarge instances, up from c1.xlarge), which could not only handle more load but also easily run 64 Python processes, with half serving the current version of Pinterest and the other half serving the previous one. The new and old versions sat behind nginx, with a unique port for each version of the site being served. For example, port 8000 might serve version “A” on one host, and port 8001 might serve version “B”.
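The port-per-version layout can be pictured as a small lookup table. The sketch below is illustrative only: the host names, the port assignments, and the `upstreams_for` helper are assumptions made for the example, not our actual configuration.

```python
# Hypothetical host names and ports, for illustration only. Each host
# runs two versions, and each version listens on its own nginx port.
INSTALLED = {
    "web-001": {"A": 8000, "B": 8001},
    "web-002": {"A": 8001, "B": 8000},
}


def upstreams_for(version, installed=INSTALLED):
    """List the (host, port) pairs where nginx serves `version`."""
    return sorted(
        (host, ports[version])
        for host, ports in installed.items()
        if version in ports
    )
```

Here `upstreams_for("A")` yields every host and port that can answer a request for version “A”, and an unknown version yields an empty list. Keeping the port mapping per host means a given port does not have to mean the same version everywhere.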
Coordination and deploys
To tell Varnish what to do, we developed a series of barriers, which tell Varnish which version to serve and when. Additionally, we created “server sets” in ZooKeeper that let Varnish know which upstream nginx instances are serving each version.
Let’s imagine a steady state where “A” is our previous version and “B” is our current version. Users can reach either version “A” or “B”, and within a page load, they will always stay on either “A” or “B” and not switch unless they reload their browser. If they reload their browser, they will get version “B”.
If we decide to roll out version “C”, we do the following:
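The steady-state rule can be sketched as a small decision function. This is a sketch under assumptions, not our Varnish configuration: it assumes each background request carries the version that rendered the current page (how that is communicated is not covered here), and that a fresh page load or browser reload carries no version at all. The name `choose_version` and its parameters are hypothetical.

```python
def choose_version(requested, enabled, default):
    """Pick the version to serve for one request.

    requested: version that rendered the current page, or None for a
               fresh page load or a browser reload (an assumption).
    enabled:   set of versions currently allowed to be served.
    default:   version given to requests with no usable `requested`.
    """
    if default not in enabled:
        raise ValueError("the default version must be enabled")
    if requested is not None and requested in enabled:
        # Stay on the version that rendered the page, so there is no
        # mismatch and no forced reload.
        return requested
    # Fresh loads, reloads, and retired versions all get the default.
    return default
```

With `enabled={"A", "B"}` and `default="B"`, a request from a page rendered by “A” stays on “A”, while a reload gets “B”. Once “A” is removed from `enabled`, the same request falls through to “B”.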
- Through ZooKeeper we tell Varnish to no longer serve version “A”.
- Varnish responds when it’s no longer serving version “A”.
- We roll through our web fleet and uninstall “A” and install “C” in its place.
- When every web front-end has “C” available, we let Varnish know that it’s OK to serve it.
- Varnish responds when all the varnish nodes can serve “C”.
- We switch the default version from “B” to “C”.
With these barriers, people who were on “A” are not forced onto “B” until the second step. At step 6, we allow new users to be on “C” by default, and users who were on “B” stay on “B” until the next deploy.
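The six steps above can be sketched as a small in-memory simulation. This is an illustration under stated assumptions rather than our deploy tooling: `VarnishNode`, `barrier`, and `deploy` are hypothetical names, the web fleet is modeled as a dictionary from host name to installed versions, and each barrier is a synchronous check, whereas the real system waits on acknowledgements coordinated through ZooKeeper.

```python
class VarnishNode:
    """Minimal stand-in for one Varnish node's routing state."""

    def __init__(self, enabled, default):
        self.enabled = set(enabled)
        self.default = default


def barrier(condition, message):
    # Stand-in for waiting until every participant has confirmed.
    if not condition:
        raise RuntimeError("barrier not reached: " + message)


def deploy(varnish_nodes, web_hosts, retiring, incoming):
    """Replace `retiring` with `incoming` across the fleet.

    web_hosts maps a host name to the set of versions installed on it.
    """
    # Steps 1-2: stop serving the retiring version, then confirm that
    # every Varnish node has stopped.
    for node in varnish_nodes:
        node.enabled.discard(retiring)
    barrier(all(retiring not in n.enabled for n in varnish_nodes),
            "some Varnish node still serves the retiring version")

    # Step 3: swap the retiring version for the incoming one per host.
    for versions in web_hosts.values():
        versions.discard(retiring)
        versions.add(incoming)

    # Steps 4-5: once every host has the incoming version, allow
    # Varnish to serve it and confirm that every node can.
    barrier(all(incoming in v for v in web_hosts.values()),
            "some web host is missing the incoming version")
    for node in varnish_nodes:
        node.enabled.add(incoming)
    barrier(all(incoming in n.enabled for n in varnish_nodes),
            "some Varnish node cannot serve the incoming version")

    # Step 6: flip the default, so new page loads get the new version.
    for node in varnish_nodes:
        node.default = incoming
```

Starting from nodes that serve “A” and “B” with “B” as the default, `deploy(nodes, hosts, retiring="A", incoming="C")` leaves every node serving “B” and “C” with “C” as the default. The ordering is the point of the sketch: a version is never routed to before every host has it, and never uninstalled while a node may still route to it.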
A look at the findings
The absolute values are redacted, but you can see the relative effect. Note the dips correspond with weekends, which is when we tend not to deploy our web app. In mid-April, we switched completely to the new atomic deploy system.
We found that the new atomic deployments reduced annoyances for Pinners and contributed to an overall improved user experience. This ultimately means that deploys are stealthier and that we can reasonably do more deploys throughout the day, or as the business might require.
Nick Taylor is a software engineer at Pinterest.
Acknowledgements: Jeremy Stanley and Dave Dash, whose contributions helped make this technology a reality.