Test In Production

Sven Hans Knecht
6 min read · Aug 21, 2023


Stop pretending non-production environments are production.

[Image: a person repairing a plane while flying in it]

At every company I’ve worked at in my 10+ year career, developers have made some variation of the following statement: “The development environment sucks! It doesn’t represent production! We can’t test there!”

For a development environment to be successful, the following things need to be true:

  • The shape of data flowing through it matches production
  • The volume of data flowing through it matches production
  • Every third-party integration exists or is 100% mocked

These are never true. Ever. Non-prod environments never match production. It may be that your company has solved this; if so, let me know how, because I’d love to hear about it. But I have never seen it solved. The closest I’ve ever seen to a true production match was at Mission Lane, where some teams were pushing continuous traffic through a staging environment using k6. Even then, it was only a subset of teams, a subset of calls, and not the same volume as production.

I have watched company after company spend entire teams' time and energy on “improving test data quality.” I’ve yet to see them be happy with their test environment at the end of the day.

So instead, we as software engineers should give up the ghost. We should stop pretending that lower-environment testing is good enough, and focus instead on production engineering (note: I don’t strictly mean Meta’s definition of production engineering). Testing in lower environments mostly serves as a smoke test; anything more than that is a false sense of security.

What Is Production Engineering?

Production Engineering means engineering how you release code to production. It is about building systems and tools that allow teams to release software to production safely, without worrying about breaking something. It lets engineers get code out easily, without impacting customers and without fear of their changes wreaking havoc on your bottom line.

Software Engineering is a tug of war between wanting to release new features to make money and not wanting to change anything because it works and makes you money. Production Engineering stops that from being a choice.

Production Engineering should be a key focus of an internal developer platform. Or your Cloud Platform team. Or your SRE team. Whoever is in charge of writing developer tooling/maintaining your developer platform.

Instead of having a “testing team” or a “test data quality team” or a “testing environment maintenance team,” you should spend the energy and effort of your engineers on building automated release processes that make it safe to release code to production.

Examples Of Production Engineering

It’s easy for me to sit here and tell you to test in production. It’s far harder to make a convincing case to your business leadership. Testing in lower environments feels safer. And it makes intuitive sense to leadership. Testing in production? To quote a former software architect: “You don’t engineer the plane while you are in the air en route to your destination.”

Here are some concrete examples of how to test in production:

  • Canary releases. A canary release slowly shifts an increasing share of your traffic to a new version of your app. The shift can be based on raw request-traffic percentages, cookies/headers, etc. If you are running in Kubernetes, there is a fantastic tool called Flagger which will do automatic canary releases based on metrics. It routes a percentage of traffic to a new version/deployment of your service, steadily increases that percentage until it reaches a pre-defined threshold, and then promotes the new version to take 100%. An alternative to Flagger is Argo Rollouts. If you aren’t using Kubernetes, tools like Spinnaker or an API gateway can help with canary releases. (There’s a minimal sketch of the promotion loop after this list.)
  • Tiered Releases. A tiered release means releasing a new version of your software only to specific instances. It is similar to a canary release, but on a broader scale: think per-tenant or per-region tiers. This is particularly helpful for SaaS companies. Being able to roll out your software first to customers in Group A, then Group B, and so on limits the blast radius of a change going wrong. Combining canary releases and tiered releases means you can protect each individual client with a canary release and protect the entire SaaS environment by not having to respond to failures across all customers at the same time.
  • Feature Flagging. Feature flags allow you to hide code paths until they are ready. More importantly, feature flags allow you to separate the code release from the business release. Your engineers can release the code, smoke test it, test it against production, and then slowly ramp up usage. Code can also be shipped when it is complete and activated when the business decides it is ready, decoupling software deadlines from marketing or business deadlines. Feature flags also force you to write backward-compatible code, especially database code: by writing a new code path rather than replacing the old one, you ensure continued support and keep breaking changes out of production. (A minimal flag-check sketch follows this list.)
  • Automated Rollbacks. Whenever a feature flag is enabled or a tiered or canary release happens, something should be measuring known SLIs to ensure they don’t drop below acceptable levels. If they do, the system should automatically fail the release and roll it back. A human should still have the option of manually rolling back a release, but an automated system should be watching how your software performs and rolling it back when there are regressions. (See the rollback-gate sketch after this list.)
  • PR Automation. Ensuring that bad code never makes it to master is by far the easiest way to prevent issues. Most companies have some form of CI and linting for PRs, mostly focused on code style or perhaps requiring unit tests. But you can expand this much further. Does your SQL get linted? It should. Use a tool like Squawk or SQLFluff to ensure bad migrations never run against your database (a small CI gate for this is sketched after this list). This can be extended as far as you’d like: most rules and checks a human applies to code in a PR should be automated. Let humans check for intent rather than work through a checklist.
  • Known Good Production Traffic. Use a tool like k6 to ship known good traffic through production. This ensures traffic levels are always high enough to measure against, forces engineers to know how their code is being used, lets you write monitors against known good traffic, and forces you to think about API design by dogfooding the API you wrote. (A synthetic-traffic sketch follows this list.)
  • Outcome-Based Alerting. Unlike outcome-based decision-making (which we shouldn’t do), outcome-based alerting lets you define general alerts that engineers respond to only when it matters. The business doesn’t care if the number of exceptions increases as long as it doesn’t impact the customer experience. Customers don’t care if you see OOMKills if it doesn’t affect their user experience. Customers don’t even care if you have rollbacks. Alert on latency or Apdex, alert on failed requests, use the golden signals. Everything other than the golden signals, or your SLIs, is a cause rather than a symptom; you should have metrics and dashboards for causes, but only alert when it is actionable. This keeps engineers from artificially stressing about releases.
  • Contract Tests. Ensure your APIs are backward compatible without running expensive integration tests. Ensure engineers are notified when they break something before it gets to any environment. Don’t rely on others testing your code in a lower environment to tell you whether you broke your contract. (A minimal contract-test sketch follows this list.)
  • Mirroring Production Traffic To A Lower Environment. Instead of trying to generate your own traffic, mirror the traffic at the network layer to a separate environment that can be used for testing changes. This lets you see how your code performs in the real world without impacting your customers. The downside is cost; however, my experience is that dev/staging costs more than prod anyway, so this shouldn’t be a significant factor.
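
Flagger and Argo Rollouts do all of this declaratively; purely as an illustration of what they automate, here is a minimal sketch of a canary promotion loop in Python. The weight steps, interval, and the callables are hypothetical stand-ins for your mesh or gateway API and your metrics checks, not any tool’s actual interface.

```python
import time
from typing import Callable

def run_canary(
    set_weight: Callable[[int], None],   # hypothetical: update the mesh/gateway traffic split
    slis_healthy: Callable[[], bool],    # hypothetical: error rate and p99 latency within bounds
    promote: Callable[[], None],         # hypothetical: make the new version take 100%
    rollback: Callable[[], None],        # hypothetical: revert to the stable version
    step: int = 10,
    max_weight: int = 50,
    interval_s: int = 60,
) -> None:
    """Shift traffic to the canary in steps; promote if SLIs hold, roll back if not."""
    weight = 0
    while weight < max_weight:
        weight += step
        set_weight(weight)
        time.sleep(interval_s)       # let enough traffic flow to judge the metrics
        if not slis_healthy():
            rollback()               # any SLI breach aborts the release
            return
    promote()                        # canary held at max_weight, so it takes 100%
```

The point is that the promotion decision is made by metrics, not by a human watching a dashboard.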
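
Feature flags don’t need a vendor to get started. Below is a minimal Python sketch of the pattern, with a hypothetical FlagClient standing in for LaunchDarkly, Unleash, or a homegrown config table; the pricing functions are made up for illustration.

```python
class FlagClient:
    """Hypothetical flag client; real providers expose a similar boolean check."""
    def __init__(self, enabled: dict[str, bool]):
        self._enabled = enabled

    def is_enabled(self, flag: str, default: bool = False) -> bool:
        return self._enabled.get(flag, default)

flags = FlagClient({"new-pricing-engine": False})   # dark until the business says go

def price_order_v1(order: dict) -> int:
    return order["subtotal_cents"]                  # existing behaviour, left untouched

def price_order_v2(order: dict) -> int:
    return order["subtotal_cents"] - order.get("discount_cents", 0)   # new code path

def price_order(order: dict) -> int:
    # The code release shipped both paths; the flag controls the business release.
    if flags.is_enabled("new-pricing-engine"):
        return price_order_v2(order)
    return price_order_v1(order)
```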
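
A rollback gate can be as small as a job that queries your metrics backend after a release and reverts when an SLI breaches its threshold. The sketch below assumes Prometheus for metrics and Helm for deployments; the metric names, service label, and threshold are placeholders, not a prescribed stack.

```python
import subprocess
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"   # assumed in-cluster Prometheus
ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{job="checkout",code=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total{job="checkout"}[5m]))'
)

def error_rate() -> float:
    """Fetch the current 5xx ratio for the service from Prometheus."""
    resp = requests.get(PROM_URL, params={"query": ERROR_RATE_QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def guard_release(release: str, threshold: float = 0.01) -> None:
    """Roll the Helm release back if the error-rate SLI breaches its threshold."""
    if error_rate() > threshold:
        subprocess.run(["helm", "rollback", release], check=True)
```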
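
As one example of extending PR automation, here is a small Python gate that lints any migration files a branch touches using SQLFluff. The migrations/ path, the base branch, and the postgres dialect are assumptions about your repo.

```python
import subprocess
import sys

def changed_migrations(base: str = "origin/master") -> list[str]:
    """List SQL files touched by this branch (the migrations/ layout is an assumption)."""
    out = subprocess.run(
        ["git", "diff", "--name-only", base, "--", "migrations/"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [f for f in out.splitlines() if f.endswith(".sql")]

def main() -> int:
    files = changed_migrations()
    if not files:
        return 0
    # sqlfluff exits non-zero when it finds violations, which fails the PR check.
    result = subprocess.run(["sqlfluff", "lint", "--dialect", "postgres", *files])
    return result.returncode

if __name__ == "__main__":
    sys.exit(main())
```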
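
k6 scripts are written in JavaScript; to keep these sketches in one language, here is the same idea as a plain Python loop: continuously exercise a known-good user journey against production at a steady rate so your monitors always have a baseline. The endpoints and the journey are invented for illustration.

```python
import time
import requests

BASE_URL = "https://api.example.com"      # hypothetical production endpoint

def known_good_journey() -> bool:
    """Exercise one well-understood user journey end to end."""
    r = requests.get(f"{BASE_URL}/catalog/items", timeout=5)
    if r.status_code != 200:
        return False
    r = requests.post(f"{BASE_URL}/cart", json={"item_id": "demo-item"}, timeout=5)
    return r.status_code == 201

def main() -> None:
    # Run at a steady, known rate so dashboards and monitors always have a baseline.
    while True:
        ok = known_good_journey()
        print(f"synthetic_journey_success={int(ok)}")   # scrape or ship this as a metric
        time.sleep(30)

if __name__ == "__main__":
    main()
```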
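
Dedicated tools like Pact handle contract testing across teams, but the core idea fits in a few lines: pin the response shape your consumers depend on and fail CI the moment it changes. The sketch below uses the jsonschema library; the endpoint and fields are assumptions.

```python
import requests
from jsonschema import validate   # pip install jsonschema

# The consumer-facing contract: fields existing clients depend on.
ORDER_SCHEMA = {
    "type": "object",
    "required": ["id", "status", "total_cents"],
    "properties": {
        "id": {"type": "string"},
        "status": {"type": "string"},
        "total_cents": {"type": "integer"},
    },
}

def test_order_contract():
    """Fails in CI when a change drops or retypes a field consumers rely on."""
    resp = requests.get("http://localhost:8000/orders/42")   # hypothetical local instance
    assert resp.status_code == 200
    validate(instance=resp.json(), schema=ORDER_SCHEMA)
```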

Conclusion

Test in production. Stop being afraid. There is no environment like production; that’s where your software delivers value. I’m not advocating for the removal of unit tests, and I’m not arguing that we should eliminate dev/staging environments. Those can help show UI changes, ensure the software boots properly, and give people a place to comment on design and user experience. There are valuable outcomes from non-production environments. But they aren’t testing whether your software is safe for production; that’s a statement that can only be made after the software has been released to production. So instead of trying to improve test data quality or make your non-prod environments more production-like, spend that effort on improving your release pipeline and production tooling. Use community tools to automate your releases. It will pay off far more when you have to scale.

