What to do if your test environment is sick

Steffen Jäckel
Published in Cloud Workers
May 11, 2023 · 9 min read

Photo by Kelly Sikkema on Unsplash

If the test environment of your IT delivery is not doing well, this leads to unstable releases and delayed deliveries. Overdue delivery dates lead to more stress, deadline pressure, and overtime for your employees.

The first colleagues leave the project, and essential business and technical knowledge has to be rebuilt. A downward spiral develops and spins faster and faster. Escalation up to the management level becomes inevitable.

What could be the reason for this? Which problems repeatedly appear in teams, departments, and organizations? Which immediate and medium-term solutions can help our IT delivery?

In this article, I will give three solutions to simplify life for you and your team.

Once upon a time, there was a well-designed Testland!

Figure 1: At least three pre-productive systems should be part of your build, test and deployment pipeline

In an ideal world, there are at least three different pre-productive environments and one production environment. Each environment serves its own purpose:

The development environment is used for unit testing, component testing, and rapid exploratory development testing. Likewise, the development environment can be used to introduce new features to internal stakeholders.

If tests pass successfully, the system under test can be moved to the next environment to process further tests, like integration, load, or performance tests. The closer a release gets to the production environment, the more mature and hardened it is.

Unfortunately, we often do not live in an ideal world

Monday morning! An error in our production environment has occurred! Nothing works anymore! The phones at our first-level support are ringing off the hook! After a short analysis, the error is found, and a hotfix is developed. This hotfix must now be tested in the test environment before we can move it to the production environment.

Unfortunately, we only have one test environment, and in it the next production release is currently in its testing and maturing phase. This blocks the deployment of our hotfix.

What to do now?

Figure 2: If only one deployment can be installed on the test environment, important hotfixes for production cannot be applied. This creates a dangerous bottleneck.

Rescheduling is necessary. Organizational overhead arises, and the test of the new release has to be canceled altogether. This shakes up the test planning and makes bottlenecks inevitable: an important factor that can lead to delayed delivery dates!

There can be several reasons for having only one test environment in the build and deployment pipeline. For example, only a single, historically grown test system may be fully usable and meet the conditions of test readiness.

It may also be due to the organizational structure or to processes that have become deeply established among your team members. Their work methodologies and practices are firmly entrenched, and of course you have surely heard statements like “We’ve always done it that way” or “Never change a running system.”

When you hear such statements, you should listen more closely for the real reasons. These can also include a lack of work capacity or a lack of technical knowledge within the team.

It may also be the case that another test environment has already been set up, but it lacks enough plausible data for testing purposes and is not connected to all important surrounding systems. In summary, this additional test system does not meet our test readiness requirements.

Again, if only one test environment is available, only one system installation can be run under test (see Figure 2).

Your installation is too slooooow!

In past projects, I have experienced installation processes that sometimes took more than a day. Especially if the installation of a complex system has to be done mostly manually, it takes a long time. You can expect even more downtime when the new installation does not work and throws errors.

If the system is one of many within the call hierarchy of a request, its failure leads to problems and errors in the surrounding systems. This can lead to extensive error analysis and evaluation as well as increased organizational and administrative effort. In the worst case, the whole test environment becomes unusable for several systems.

All this leads to increased costs, not only during the outage and error analysis. At the same time, the DevOps team is tied up with these tasks, which can delay the further development of important new features of your product.

If such processes take too long, it can be a sign of a lack of automation.

The development environment is irrelevant

One of the main reasons for an unusable development environment is the lack of plausible data or surrounding systems.

De facto, the development team deploys new changes directly to the test environment. This leads to continuous deployments and the unstable availability of a supposedly stable system.

Figure 3: One of the dependent peripheral systems is not accessible (see Dependency E) and leads to a cascading effect. Parts of the test chain cannot be executed and the system under test must wait.

As a result, the system is less available for dependent surrounding systems (see Figure 3). When it comes to the actual test, it cannot be performed or outright fails. Valuable time, nerves, and energy must then be invested to bring the test environment back into a testable state.

If one of the systems within a complex landscape of dependent applications fails again and again, chaos, stress, and frustration are inevitable for all participating teams. These unpredictable disruptions can also be an important reason why team members leave the project or the organization.

Which strategies should we use to get out of this dilemma?

The following three solutions can lead to greater stability and availability of your own systems, significantly simplifying the lives of your team and of the teams behind your surrounding systems.

1. Heal your inner self

On top of the one and only test environment, a separate, additional deployment of the system is installed. This deployment must be a more stable and hardened version than the one currently under test. In the best case, the current release from the production environment is used here (see version 1.0.0 in Figure 4).

Figure 4: Your consumers can now choose between alpha and stable

Versioning via API paths or DNS entries can be used to route requests to either the stable or the unstable version. This simple trick allows peripheral systems to decide which version they want to use. In parallel, you can always install and deploy new releases and put them under test, exposed as different versions.

API endpoints of different versions in the test environment could look like this:

# Example via endpoint path
GET https://www.my-testenvironment.com/1.1.0/products
GET https://www.my-testenvironment.com/1.0.0/products

# Example via DNS entry
GET https://alpha-1-1-0.my-testenvironment.com/products
GET https://stable-1-0-0.my-testenvironment.com/products

This significantly improves the stability of your system. It also increases reliability and availability for the surrounding systems. This will make the life of your consuming systems a lot easier.
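
To make this routing idea more concrete, here is a minimal Python sketch of a version-based dispatch. It simply maps the version prefix of an incoming request path to the base URL of the matching deployment; the hostnames and version numbers are taken from the example above, while the file name and function are purely illustrative assumptions, not part of any real setup.

# version_routing_sketch.py - illustrative only, not production code
UPSTREAMS = {
    "1.0.0": "https://stable-1-0-0.my-testenvironment.com",  # hardened, production-like release
    "1.1.0": "https://alpha-1-1-0.my-testenvironment.com",   # release currently under test
}

def resolve_upstream(path: str) -> str:
    """Return the upstream URL for a versioned path such as '/1.0.0/products'."""
    version, _, rest = path.lstrip("/").partition("/")
    base_url = UPSTREAMS.get(version)
    if base_url is None:
        raise ValueError(f"unknown API version: {version!r}")
    return f"{base_url}/{rest}"

# A consumer that pins itself to the stable version is routed to the hardened deployment:
print(resolve_upstream("/1.0.0/products"))  # https://stable-1-0-0.my-testenvironment.com/products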

2. The list of digital surroundings

Numerous surrounding systems are usually required for integration or final tests to be accepted by the business department. It is essential that these systems are available during the test period planned for this purpose, and, of course, to get the final go for staging in production.

Figure 5: Part of the test readiness check can be achieved automatically via the digital surroundings list

One way of checking the availability of your surroundings is to create a list of peripheral systems. In the first step, this can be a simple list in Excel in which the following aspects are stored:

  • name of the system and its identifier,
  • important contacts in the event of an emergency,
  • criticality in case of failure,
  • information about the stored data,
  • availability and maintenance times,
  • information on integration (incoming or outgoing system, data format, communication type/protocol, standard and peak times of the data flow).

Shortly before each test phase starts, we run through the list and check our peripheral systems. If some systems are unavailable, the test stops at that point, depending on their criticality, and the responsible persons are contacted directly afterward. Then we wait for feedback.
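
As a rough sketch of how this check could be semi-automated, the following Python snippet probes a hypothetical surroundings list before a test phase. The system names, health-check URLs, contacts, and fields are all made-up assumptions for illustration.

# surroundings_check.py - illustrative sketch, all systems and URLs are made up
from dataclasses import dataclass
from urllib.error import URLError
from urllib.request import urlopen

@dataclass
class PeripheralSystem:
    name: str
    health_url: str   # endpoint used to probe availability
    critical: bool    # stop the test phase if this system is down?
    contact: str      # responsible person to notify in an emergency

SURROUNDINGS = [
    PeripheralSystem("Billing", "https://billing.test.example.com/health", True, "billing-team@example.com"),
    PeripheralSystem("Reporting", "https://reporting.test.example.com/health", False, "reporting-team@example.com"),
]

def is_available(system: PeripheralSystem, timeout: float = 5.0) -> bool:
    """Probe the health endpoint; any HTTP error or timeout counts as unavailable."""
    try:
        with urlopen(system.health_url, timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        return False

def check_test_readiness() -> bool:
    """Return True if all critical peripheral systems are reachable."""
    ready = True
    for system in SURROUNDINGS:
        if is_available(system):
            continue
        print(f"{system.name} is unavailable -> contact {system.contact}")
        if system.critical:
            ready = False  # a critical dependency is down, the test phase must wait
    return ready

if __name__ == "__main__":
    print("Test readiness:", check_test_readiness())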

The functionality of the list can be further expanded:

For example, it would be helpful if the responsible parties were automatically contacted in such a case. The message should also contain a “call to action” to obtain quick feedback.

Possible responses could be: “The system will be available again in 5 minutes” or “We are working on a bug. The system will be available again in XYZ minutes/hours.” This workflow can be set up via modern communication tools, such as Microsoft Teams, or, more classically, via email.
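
As a minimal sketch of such a notification, assuming a plain e-mail workflow and an internal SMTP relay (the hostname and addresses below are made up), the contact of an unavailable system could be asked for an estimate like this:

# notify_contact_sketch.py - hypothetical example using only the standard library
import smtplib
from email.message import EmailMessage

def notify_contact(contact: str, system_name: str) -> None:
    """Send a "call to action" mail asking when the system will be available again."""
    msg = EmailMessage()
    msg["From"] = "test-readiness@example.com"  # made-up sender address
    msg["To"] = contact
    msg["Subject"] = f"[Action required] {system_name} unavailable in the test environment"
    msg.set_content(
        f"{system_name} is currently not reachable from the test environment.\n"
        "Please reply with an estimate: when will the system be available again?"
    )
    with smtplib.SMTP("smtp.example.com") as smtp:  # made-up internal mail relay
        smtp.send_message(msg)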

A faster time to market is essential to responding to the needs of our customers!

Photo by Rachael Crowe on Unsplash

Generally, it is always good to continuously minimize the time and effort of all involved parties. Such a semi-automated workflow should only be the first step in this direction.

Another feature could be to inform the teams of your surrounding systems at an early stage about the test phases of your system in the hope that the availability of their systems is guaranteed on the day of test execution.

Once such a digital list has been set up, it should become a fixed part of your test readiness check and be automatically integrated into your build, test, and deployment pipeline.
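
One possible way to wire this into the pipeline, building on the hypothetical check sketched above, is a small gate script whose non-zero exit code stops the pipeline stage whenever critical surroundings are down:

# readiness_gate_sketch.py - hypothetical pipeline step, reusing the earlier sketch
import sys

from surroundings_check import check_test_readiness  # hypothetical module from the sketch above

if __name__ == "__main__":
    if not check_test_readiness():
        print("Critical peripheral systems are unavailable; stopping this pipeline stage.")
        sys.exit(1)  # a non-zero exit code makes the build/test stage fail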

At a certain point, you should consider the use of tools such as PACT Contract Testing¹.

3. The supreme discipline

Once the build, test, and deployment pipeline for new releases has been fully automated, you can consider switching to feature-based deployments of your system.

Figure 6: Just a couple of minutes can pass between myFeature and myNextFeature until both are deployed to our production environment.

Instead of big-bang releases every 3–4 months that require intense organizational and administrative effort, releasing new functionality in small features can lead to significant improvement and relief for all stakeholders.

Regression tests cover the existing functionality of the previous release; they represent the safety net of our system. The test pipeline is now expanded to include additional, important feature-based test scenarios.

If these run successfully, nothing stands in the way of a release to production. In production itself, canary releases² can then be used to switch over to the new feature successively, for example by incrementally shifting 20% of incoming requests to our new release. If the number of errors increases, it is possible to switch back to the previous, more stable version without much effort.
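
The traffic shifting itself is normally handled by a load balancer, ingress, or service mesh rather than by application code, but as a toy Python sketch of the idea (with made-up backend URLs), weighted canary routing can look like this:

# canary_routing_sketch.py - toy illustration of weighted canary routing, not a real load balancer
import random

STABLE_BACKEND = "https://stable.shop.example.com"  # previous, proven release (made-up URL)
CANARY_BACKEND = "https://canary.shop.example.com"  # new release (made-up URL)
CANARY_SHARE = 0.20  # start by shifting 20% of incoming requests to the new release

def pick_backend() -> str:
    """Route roughly CANARY_SHARE of the requests to the canary, the rest to stable."""
    return CANARY_BACKEND if random.random() < CANARY_SHARE else STABLE_BACKEND

# If the error rate of the canary rises, setting CANARY_SHARE back to 0.0 sends all
# traffic to the previous, more stable version again.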

If the new feature has to wait for surrounding systems that are not yet in production, feature toggles³ can be used. Using these, individual features can be activated or deactivated. In this way, all existing functionalities can be run through our test pipeline and, if successful, released for production.
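
A feature toggle can be as simple as a configuration flag evaluated at runtime. The following minimal Python sketch (using the myFeature name from Figure 6 and a made-up environment-variable convention) illustrates the idea:

# feature_toggle_sketch.py - minimal illustration; the environment variable naming is an assumption
import os

def is_enabled(feature: str) -> bool:
    """A feature is on if its environment variable is set to "1", e.g. FEATURE_MYFEATURE=1."""
    return os.getenv(f"FEATURE_{feature.upper()}") == "1"

def handle_request() -> str:
    if is_enabled("myFeature"):
        return "response from the new, toggled-on code path"
    return "response from the existing, stable code path"  # default while dependencies are missing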

Another advantage is the very short timeframe between implementation, running the test pipeline, and going into production. If errors occur, they can be fixed quickly since the knowledge about the implementation is still very fresh within the development team.

The term “feature-based delivery” is similar to “release on demand” in the SAFe context. In summary, it means that a team should be able to deliver a new version of its application any time there is a customer or market requirement.⁴

Focus on the value side of life

To enable rapid automated deployment of feature-based releases, your test pipeline must be fully integrated into the build and deployment pipeline. Effort and duration can vary depending on the situation and the project context on the customer side.

It also requires a mindset shift within the team and an active desire to continuously change, challenge, and improve. Furthermore, handover processes between teams and departments in the organization should also be critically questioned and discussed.

To stay fit, recurring inspections and adaptations are required!

Photo by Big Dodzy on Unsplash

The change is worth it!

It leads to less chaos, stress, anxiety, and frustration in your department: a win-win situation for everybody within the team, with a positive impact on the customer. With continuous improvements, your team has more space and time to focus on the actual value of the system.

SJ

¹For more information, please visit https://docs.pact.io/
²For a detailed description of canary releases, please visit https://martinfowler.com/bliki/CanaryRelease.html or read the book Continuous Delivery by Jez Humble and David Farley.
³For a detailed description of feature toggles, please visit https://martinfowler.com/articles/feature-toggles.html or read the book Complete Guide to Test Automation by Arnon Axelrod.
⁴The following page provides more information on this topic: https://scaledagileframework.com/release-on-demand/
