Pitfalls of Progressive Delivery
Testing in production, or the next step to safe releases?
Progressive Delivery is the technique of rolling out software to users incrementally to improve the safety and reliability of the deployment process. Relying on automated testing to catch every issue before it hits production is like abstinence-only sex education: both are doomed to fail in the real world. Progressive Delivery acknowledges that mistakes will happen no matter how hard we try to prevent them. If we’re going to accidentally release bad software, let’s at least do it carefully.
When done right, Progressive Delivery can reduce the risk of a software deployment while keeping release velocity high. When done poorly, it can complicate your rollout and rollback processes, degrade user experience and slow down your development team.
I’ve been working in the deployment space for about ten years now, and have seen software delivery fail in more ways than I can remember. This post contains some lessons I’ve learned over the last decade, so you don’t have to make the same mistakes I did.
Testing in production poses real risks to your infrastructure and users, so the typical strategy is to do it slowly on a small percentage of traffic, then increase this gradually as confidence builds. Having a few requests fail is much better than going down for everyone.
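To make the "small percentage first, then increase" strategy concrete, here is a minimal sketch of a staged rollout loop. The names `progressive_rollout`, `set_traffic_percent` and `is_healthy` are hypothetical stand-ins for whatever your traffic router and monitoring actually expose, and the stage percentages are only an example.

```python
from typing import Callable, Sequence


def progressive_rollout(
    stages: Sequence[int],
    set_traffic_percent: Callable[[int], None],
    is_healthy: Callable[[int], bool],
) -> bool:
    """Walk through traffic stages, backing out at the first unhealthy one.

    `set_traffic_percent` and `is_healthy` are assumed hooks into your own
    router and metrics; they are not part of any real library.
    Returns True if the final stage was reached, False if traffic was reverted.
    """
    for percent in stages:
        set_traffic_percent(percent)
        if not is_healthy(percent):
            # Send all traffic back to the old version and stop.
            set_traffic_percent(0)
            return False
    return True


if __name__ == "__main__":
    history = []
    # Pretend the new version starts failing once it sees 25% of traffic.
    finished = progressive_rollout([1, 5, 25, 100], history.append, lambda p: p < 25)
    print(finished, history)
```

In a real pipeline, `is_healthy` is where the waiting happens: it would block until enough data has been collected at that stage before answering.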
This works great, but there’s one big downside: it can take a long time to become confident in your release. If you don’t have much traffic, it can take hours to gather a statistically significant data set. Even if you have a complex, high-QPS application, it can still take a while to make sure all code paths have been exercised.
This means your fast release cadence has to slow down. If you could push a release minutes after the commit was merged before, you might be looking at hours or days to roll it out progressively. This can complicate your entire release pipeline. Single-change deployments are the easiest to debug and roll back: there’s only one change, so it’s easy to find the culprit. But with slower deployments, you probably need to batch together multiple changes to still get things out in time.
Faster deployments lead to more productive teams. You’ll need to carefully balance how quickly to progress your rollouts with the velocity of your development teams. Pick what works best for your organization.
An alternate dimension
When monitoring a progressive rollout, you typically watch metrics that fall into three categories:
Stability. These signals let you know your application is still working and stable. This includes things like error reporting services, crash logs, restart rates, and other signals that show you if your application is up and running or crash-looping.
Performance. These let you know how fast your application is and how many resources it is using. If memory usage doubles between releases and you don’t know why, you should probably stop and investigate before continuing.
Business. These let you know if your application is still meeting the business goals. If you’re serving ads, have click-through rates changed significantly? If it’s an end-user app, have completed actions stopped? These can build on top of stability and performance metrics, and represent end-user success.
Measuring each of these correctly in a small percentage of traffic can be non-intuitive. For example, if you’re tracking performance, caching might skew your results.
Many applications follow a pattern where the first request performs an expensive operation to calculate a value and cache it for future requests. This amortizes the expensive operation across all the requests that use it. If you’re only sending a small percentage of traffic to the new version, each expensive operation is spread across fewer requests, making it look like your performance is worse than it is.
You can resolve this by progressing the rollout on the same dimension you perform caching on. If you cache at a user level, turn something on for a percentage of users instead of a percentage of requests. Your cache hit rates should remain steady throughout the rollout.
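One way to roll out by user instead of by request is to hash a stable user identifier into a bucket, so the same user always gets the same answer. This is a minimal sketch under my own assumptions: the function name `in_rollout`, the `salt` value and the 10,000-bucket granularity are all illustrative choices, not a standard.

```python
import hashlib


def in_rollout(user_id: str, percent: float, salt: str = "my-release") -> bool:
    """Return True if this user falls inside the first `percent` of buckets.

    `percent` is on a 0-100 scale. `salt` is a hypothetical per-release
    string, so different releases pick different groups of users.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


if __name__ == "__main__":
    for user in ["alice", "bob", "carol"]:
        print(user, in_rollout(user, 10))
```

Because the bucket only depends on the salt and the user ID, a user who is included at 5% is still included at 25%. Their cache entries stay warm on the new version as the rollout widens.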
You also need to make sure you’re comparing apples to apples. Many applications crash loop at the start, or show a higher error rate during initialization. This isn’t ideal, but usually isn’t a big enough problem to be worth fixing. If you’re comparing the error rate of an application instance that is still “warming up” to one that has been running for a long time, you might be looking at bad data. Let your new instances warm up fully before comparing. Even better, start up a new instance of your old version at the same time as your new one for a perfect comparison.
The goal of progressive delivery is to give your changes some validation in the most realistic environment you have: production. If you’re going to be testing in prod, then you should at least be looking at the results, right? Unfortunately this form of testing does not give you a binary pass/fail answer, so it can take some interpretation to decide if your rollout is ready to continue. You have to gather metrics and then compare them to the old version. Both steps here can be tricky to get right.
One common pitfall is to try to progress before you have enough data to compare. Scientists are familiar with the concept of a “statistically significant sample size”, and now DevOps engineers need to be as well. Depending on how much traffic you have and how variable your metrics are, you may not be waiting long enough to make your decision.
The goal here should be to wait the minimum amount of time needed to gather a statistically significant sample, which keeps productivity high while minimizing false positives. There are various techniques for doing this analysis depending on your data format and shape. Spinnaker uses a technique called the Mann-Whitney U test, which tells you how likely it is that a difference as large as the one you observed would show up if both samples came from the same underlying distribution.
However you evaluate your data, make sure you’re collecting enough of it to be meaningful. If you skip this step, it’s like running tests and ignoring the results. You’d be better off dropping the complexity of progressive delivery altogether.
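To give a feel for what a Mann-Whitney U comparison involves, here is a small standard-library sketch. It is not Spinnaker’s implementation: it uses the textbook normal approximation with a tie correction and no continuity correction, so treat the p-value as rough for small samples. The function name `mann_whitney_u` and the sample latencies are my own illustrative choices.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(baseline: Sequence[float], canary: Sequence[float]) -> Tuple[float, float]:
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U statistic for `baseline`, approximate p-value).
    """
    n1, n2 = len(baseline), len(canary)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")

    # Rank the pooled values; tied values share the average of their ranks.
    pooled = sorted((v, i) for i, v in enumerate(list(baseline) + list(canary)))
    ranks = [0.0] * (n1 + n2)
    tie_term = 0.0
    start = 0
    while start < len(pooled):
        end = start
        while end + 1 < len(pooled) and pooled[end + 1][0] == pooled[start][0]:
            end += 1
        avg_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[pooled[k][1]] = avg_rank
        t = end - start + 1
        tie_term += t**3 - t
        start = end + 1

    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0  # every value is identical, so there is nothing to tell apart
    z = (u1 - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal
    return u1, p_value


if __name__ == "__main__":
    old_latency_ms = [102, 98, 110, 105, 99, 101, 97, 104, 103, 100]
    new_latency_ms = [120, 131, 118, 125, 140, 122, 119, 128, 133, 127]
    print(mann_whitney_u(old_latency_ms, new_latency_ms))
```

A small p-value suggests the two versions really do behave differently on that metric. A large one only means you have not seen evidence of a difference yet, which may simply mean you need more data.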
Stay off the hammock
One very common mistake teams make after successfully adopting Progressive Delivery is to rely too heavily on it. Teams realize that their release process is capable of detecting issues before they hit too many customers, so they get lazy. Integration tests are hard to write and boring, let’s just try it out on a few users and see if it works.
This will come back to bite you. Progressive Delivery is a safety net, not a hammock. Cancelling an in-progress deployment is better than rolling one back later, but it’s much more expensive than catching the issue before the release is cut. The breaking change isn’t the only one that needs to be rolled back — every change in the release is now on pause because something was missed in testing.
As the book The Mythical Man-Month pointed out more than 40 years ago, the cost of fixing a software defect increases dramatically the later it is found. Extra validation in production is not an excuse to cut corners on testing during the rest of your lifecycle!
Progressive delivery is a backup parachute, not a set of bumpers at a kid’s bowling birthday party.
Every issue that causes a deployment to halt should be treated as a production issue where you got lucky. That means a full (blameless) postmortem should be conducted to prevent this same class of issue from happening again.
Walk before you run
Progressive delivery is complicated and not for everyone. If your production environment and release process are not already in great shape, trying to implement progressive delivery might cause more harm than good.
Remember that during a progressive deployment, you have multiple versions of your application running simultaneously — potentially for hours or even days. If your app is completely stateless this isn’t too big a deal, but as soon as state becomes involved this gets tricky fast. Your code and database must not only be forward and backward compatible, but you have to consider “race conditions” in which data is read or written by multiple versions of your app at the same time.
Problems like this one are difficult to debug in production, and rollouts are very hard to simulate in testing. You need to make sure your prod environment is observable enough to help you detect these problems as they happen. Can you trace a single RPC all the way through your production stack? If not, you have no idea which version triggered the bug your user hit.
Rushing into a complex progressive delivery setup before you’ve set up the fundamentals is a recipe for disaster. In the hierarchy of DevOps needs, progressive delivery sits relatively far up.
At the bottom you have production observability. If you do not have the ability to debug production issues, stop reading this and go fix that immediately. This is the most important thing any DevOps engineer or team needs.
In a very close second place, you need good release and development hygiene. Unit testing, code review and integration testing should prevent most bugs from reaching a production environment. A healthy postmortem culture will help prevent the ones that do from recurring. The only reason this is in second place is that no process can prevent all bugs. Some will eventually make it to prod no matter how hard you try.
Finally, after you have these two down solid, you can consider implementing progressive delivery and other advanced techniques. As a litmus test, you should answer yes to the following before trying out progressive delivery:
- Are you comfortable releasing on a Friday?
- Can you trace a request through your entire production environment?
- Do you have defined SLOs? Do you know if you’re meeting them?
- Are you confident in your rollback mechanisms? Do you prefer rolling back and then debugging over rolling forward?
This last pitfall is the reverse of the previous one. I’ve seen many teams or organizations with great production hygiene struggle and ultimately give up when trying to implement progressive delivery in non-traditional environments.
What do I mean by non-traditional? Anything but backend, server software. It’s true that things like mobile apps, client-side tools, desktop applications and even browser-based single-page-apps can be more challenging than server software, but that just means you need to get a little creative.
One common technique is to use feature flags to simulate progressive delivery. In this approach, the code for a new feature is released globally — perhaps in a new version of software your users install locally. But the new features are disabled by default. During initialization, the software “phones home” to check what features should be enabled for this installation. Then, your server can decide to turn things on one user at a time.
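Here is a rough sketch of the client side of that “phone home” step. The flag name `new_checkout`, the `load_flags` helper and the `fetch_remote` callable are hypothetical. In a real app, `fetch_remote` would wrap whatever HTTP call your flag service needs.

```python
from typing import Callable, Dict

# Every new feature ships disabled by default. "new_checkout" is a made-up flag.
DEFAULT_FLAGS: Dict[str, bool] = {"new_checkout": False}


def load_flags(fetch_remote: Callable[[], Dict[str, bool]]) -> Dict[str, bool]:
    """Merge server-provided flags over safe defaults.

    If the phone-home call fails for any reason, keep the defaults so new
    features stay off.
    """
    flags = dict(DEFAULT_FLAGS)
    try:
        remote = fetch_remote()
    except Exception:
        return flags  # offline or server error: fall back to defaults
    for name, enabled in remote.items():
        if name in flags:  # ignore flags this build does not know about
            flags[name] = bool(enabled)
    return flags


if __name__ == "__main__":
    print(load_flags(lambda: {"new_checkout": True}))
```

Falling back to the defaults on failure is a deliberate choice here: an installation that can’t reach your server behaves like the previous release, which is usually the safer outcome.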
If this doesn’t work, other pieces of software make the process even more manual with different “release tracks”. Some of your more risk-tolerant users get added to a “beta” release track, and get access to features before the rest of your user base. This is not a new technique by any means, but is still a form of progressive delivery! You might even want to give your risk-wary customers access to these programs, so they can perform their own testing against your new releases.
I hope this post covered some of the benefits of progressive delivery, while also providing realistic guidance on how to implement it successfully. If you pay attention and plan for the pitfalls outlined above, your organization should be able to keep release velocity high while improving your SLOs by catching issues before they hit all users.
If you’d like to learn more about these techniques, this excellent talk by Carlos Sanchez at the Continuous Delivery Summit contains a lot more guidance. James Governor from RedMonk also gave a great talk with lessons learned at QCon.
I would like to thank Kim Lewandowski and Rajeev Dayal for reading drafts of this.
Let me know if you liked this post or have any questions on Twitter!