Parallel Test in Production

Why put your system in production early

--

I know, I know. You must be jumping out of your seat now. 😂 A decade or two ago, testing in production was a running joke, something many would never do. “It will be a disaster!” they said. 💀💀💀 Even today, some think putting a system into production early is unnecessary and unrealistic, especially if the system will not go live until much later.

Here are some objections you might encounter:

  • Project Managers may say, “Do more testing in test environments; the system will not be used by anyone yet. There’s no need to deploy to production early; let’s wait.”
  • Product folks may say, “Let’s focus on building more features and creating more tests to validate. There’s no need for additional overheads and loss in productivity.”
  • IT Directors may say, “You can sanitise the production data before using it in test environments. The effort is minimal; no need to take on additional risks.”
  • Security folks may say, “You could expose undiscovered vulnerabilities to the public, making it easier for potential hackers to compromise your whole system.”
  • Operations folks may say, “You could leak sensitive production data.”

I could go on about these legitimate concerns, but I believe you get the point by now. 😂 While I won’t delve into these concerns in this article, I want to emphasise the significance of conducting parallel tests during legacy system migration.

Testing is critical in any modernisation endeavour and typically accounts for over 50% of the effort. Although key business logic is retained, the underlying applications are rebuilt to work in a new operating environment, so detailed validation is necessary, and thorough testing is the best way to achieve it. Traditional testing methods are essential, but they can fall short, so we have to devise other ways to ensure our systems are reliable. These other ways carry their own risks, but everything is a trade-off.

Again, this article is not about:

  • using production data in test environments
  • doing a parallel run (operating old and new systems at the same time)
  • testing in a live production environment

It is about:

  • dogfooding — you can generalise it to using a pre-release product in a production environment
  • doing parallel tests in production where your system is designed to work against actual production data

Now picture this: You’ve invested a significant amount of time, resources, and money into the products, services, and processes to replace legacy systems. You tackled the high technical risk of data transfer and migration from legacy systems to new open systems. You discovered, learned, and interpreted the requirements and designs from the legacy system with sheer effort. You managed to rebuild and modernise the system and set everything up, and you even got help from the staff who developed and maintained the legacy systems, but ALAS! The system isn’t running as it should. Yikes!

How could you have prevented this?

1. Identify meaningful milestones and invest in structuring interim outputs in your development plan.

From the beginning, a robust delivery plan should have milestones that indicate layers of fidelity for each significant block of work, such as demo-ready, fishfood-ready, dogfood-ready, and launch-ready.

Good plans should also include interim outputs. The work is unavoidable anyway — face problems head-on early; otherwise, incur the mad rush of firefighting in the later stages of development. Discovering problems too late increases the risk of late delivery exponentially.

“If you’re not willing to eat your own dog food, why should your customers or users try it?”

2. Put these interim outputs into production and validate them as early as possible.

The development team can read decades of past documentation and invest lots of time in requirements studies, but these requirements can never be exhaustive.

Testing can never be flawless. By the law of diminishing returns, beyond a certain point, the significant effort put into identifying every potential bug yields less impactful results. Additionally, it is impossible to predict every way production data may be unclean or contain edge cases. Testing in production is crucial to ensure the system is functioning as intended. Once deployed, you’re testing complex systems of users, code, environment, infrastructure and more. These systems have unpredictable interactions, ordering, and emergent properties that defy your ability to test deterministically.

You may run into some silly low-hanging fruit when you first put your system into production. You can tackle it in a controlled environment during regular work hours, and you’ll know exactly what you just did to the system. Do this frequently, and your team will learn and your systems will become more resilient.

Why is it so crucial for legacy systems?

Below are some discoveries from the parallel test of our new system against the legacy systems in production, where we compared and validated the claims outputs and results.

A very simplified sketch of the parallel test of our systems
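
To make the idea concrete, here is a minimal, hypothetical sketch of such a parallel-test harness in Python: run the same claim through both systems and record any field-level differences. The `legacy_process` and `new_process` functions below are illustrative stand-ins, not our actual systems.

```python
# Hypothetical parallel-test harness: run one claim through both
# systems and report field-level differences. The two process
# functions are illustrative placeholders, not real system calls.

def legacy_process(claim: dict) -> dict:
    # Stand-in for the legacy claims pipeline.
    return {"status": "approved", "payout": round(claim["amount"] * 0.8, 2)}

def new_process(claim: dict) -> dict:
    # Stand-in for the new claims pipeline.
    return {"status": "approved", "payout": round(claim["amount"] * 0.8, 2)}

def compare_claim(claim: dict) -> dict:
    old, new = legacy_process(claim), new_process(claim)
    # Collect every field whose value differs between the two outputs.
    diff = {
        key: (old.get(key), new.get(key))
        for key in old.keys() | new.keys()
        if old.get(key) != new.get(key)
    }
    return {"claim_id": claim["id"], "match": not diff, "diff": diff}

result = compare_claim({"id": "C-001", "amount": 125.00})
print(result)  # {'claim_id': 'C-001', 'match': True, 'diff': {}}
```

Every mismatch the harness surfaces becomes a candidate discrepancy to investigate and categorise.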

1. Legacy systems do not have up-to-date documentation.
Our new system rejected numerous claims due to validation issues because the format of the claims submitted by downstream integration partners differed from the specifications provided. As a result, much time was spent investigating and clarifying the cause of the issues. We used timeboxing to avoid getting stuck on decisions for too long.

2. Legacy systems could only confirm unique system handling via the existing code base.
When you realise the results seem wrong, don’t be surprised if nobody can point you in the right direction and you have to rely on the legacy system’s source code as the source of truth. Again, this could be due to outdated or non-existent documentation, data manipulation, or system modifications and patches accumulated over time. These clarifications take time and may result in significant downstream rework that could have been avoided if identified early.

3. Data used in legacy systems could be of poor quality or mismatched with the new system’s data.
When migrating old data from legacy systems, you may encounter invalid data, which creates compatibility issues with the new system and increases the risk of integration problems. Correcting this poor-quality or mismatched data can be time-consuming and may even require special-case handling and changes to the new system’s design.
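
One way to surface this early is to screen migrated records against the new system’s expectations before loading them. A minimal sketch, with illustrative field names and rules (not our actual schema):

```python
# Hypothetical sketch: screen migrated legacy records against the new
# system's expectations before loading them. Field names and rules
# are illustrative only.
from datetime import date

def validate_record(record: dict) -> list:
    problems = []
    if not record.get("policy_id"):
        problems.append("missing policy_id")
    dob = record.get("date_of_birth")
    if isinstance(dob, date) and dob > date.today():
        problems.append("date_of_birth is in the future")
    if record.get("claim_amount", 0) < 0:
        problems.append("negative claim_amount")
    return problems

# A clean record passes; a dirty legacy record is flagged for triage.
print(validate_record({"policy_id": "P-42", "claim_amount": 150.0}))  # []
print(validate_record({"policy_id": "", "claim_amount": -5.0}))
# ['missing policy_id', 'negative claim_amount']
```

Records that fail the screen can then be triaged: cleaned at the source, transformed during migration, or accommodated by a design change in the new system.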

4. We discover unknown unknowns.
The teams maintaining the legacy system may know what it does but lack an understanding of its inner workings and purpose. Despite our confidence in our new system’s capabilities, we encountered unexpected issues that required new requirements and alterations to our system’s design.

5. Some issues cannot be simulated with mocked data in test environments.
In some cases, even when we attempted to replicate issues from production, certain elements had inexact data, different configuration options, or unexpected data from partnering systems. It is also hard to mock out parts of a legacy system that has poor or no code encapsulation.

Therefore, we want to identify these issues and discrepancies as early as possible! Each discrepancy should be noted, highlighted, and then categorised by the cause of the error. Common causes of discrepancies include the following:

  • Input entry/setup errors
  • Errors attributed to … (e.g. rules, network flakiness)
  • Errors from the old legacy systems
  • Explainable differences (e.g. rounding errors)
  • Unexplainable errors

It’s better to test risky things in small, frequent chunks that can be contained rather than avoid them altogether.

Testing in production is part and parcel of embracing the unknowns and shifting our mindset to a world where things will break, and that’s alright. In fact, it’s beneficial. Detecting issues early and occasionally experiencing minor setbacks helps us level up, stay sharp and remain agile. Errors and failures serve as humble teachers, enabling us to learn quickly and fix fast.

--


Sylvia Ng
Government Digital Services, Singapore

Adventure lover in the realm of design and tech. Talk to me about #agile, #productmgmt, #ux, #people, #process, #culture. sylvia.substack.com