Bring Chaos Engineering to Your Legacy CD/CI Pipeline

Published in The Humans of DevOps, May 18, 2020 (4 min read)

In an increasingly distributed world, we often ask ourselves whether continuous delivery can accommodate legacy software systems, and whether we can use chaos engineering to improve reliability in these environments. There is frequently an assumption that legacy can’t be agile, or an attitude that its future is uncertain but sure to be short. But nobody knows how short, and it could be years or even decades before this type of technical debt can be worked away. Legacy is heritage. It is cherished and often represents the core business systems of an organization: cash cows that the business depends on, that have grown highly complex over their lifetime, and that are therefore incredibly difficult to replace and cannot simply be abandoned.

Can Chaos Engineering Help with Legacy Systems?

Not everyone is a unicorn born on the web like the FANG companies, and even those organizations carry some legacy, although they may not find it as crippling as many older enterprises do. Legacy applications bring with them legacy databases built on older technologies, and there’s a good chance that any organization has to deal with a massive, old-style RDBMS. Whilst most of the world is moving from monoliths to microservices, there is always a transition period in which legacy applications and components need to operate at close to the speed of the new world. These legacy monoliths typically have a tightly coupled architecture, which needs to be loosened to allow incremental, small-batch change, test and release in order to enhance velocity.

Applying the best approaches in isolation is like looking at a single tree to assess the impact on the whole rainforest.

There is a proven approach for building consistency into software development: continuous delivery, which is here to stay. On the IT operations side we have chaos engineering, which can swiftly uncover software failures that teams aren’t aware of but that have the potential to ruin a business. Chaos engineering carries a very clear message: it is preferable to practice small failures constantly than to increase the risk of a catastrophic public failure, which can seriously damage a business’s reputation.

The idea is to consider the overall ecosystem and, when it comes to legacy, to choose your battles appropriately, making sure they include all your critical legacy systems.

Let me clarify a few things about resilience assessment approaches (Disaster Recovery versus Game Days) before we proceed further.

What doesn’t matter for Chaos:

Size of your organization and team
Languages, technologies and development methodology (Waterfall/Agile)

Bringing Chaos to Your Legacy CI/CD Pipelines

The ultimate goal of your CI/CD pipeline is to automate the software build process to enhance velocity. Once it is set up, it makes sense to integrate chaos into it. As part of the deployment pipeline, you can push your chaos files to start a disruption in a specific environment. Here are a few scenarios:

Make legacy dependencies unavailable when you push a deployment
Introduce a failure in key code paths and orchestrate a canary deployment
Reduce the capacity and run the load test just after deployment

Remember, when you include chaos in your CI/CD pipeline, to continuously validate key hypotheses, for example that a deployment should always succeed, even if capacity is low.
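The scenarios above can be sketched as a small chaos gate that a pipeline stage might run. This is a minimal illustration under stated assumptions, not a real tool: `Environment`, `deploy`, `chaos_gate` and the dependency names `legacy-db` and `artifact-store` are all hypothetical stand-ins for whatever your pipeline and platform actually provide.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Environment:
    """Toy stand-in for a target environment (hypothetical, not a real platform API)."""
    capacity: int = 4                                # number of healthy instances
    unavailable: set = field(default_factory=set)    # dependencies currently down


@contextmanager
def reduced_capacity(env, instances):
    """Temporarily lower capacity, restoring it even if the check fails."""
    original = env.capacity
    env.capacity = instances
    try:
        yield env
    finally:
        env.capacity = original


@contextmanager
def dependency_outage(env, name):
    """Temporarily mark a dependency as unavailable, then bring it back."""
    env.unavailable.add(name)
    try:
        yield env
    finally:
        env.unavailable.discard(name)


def deploy(env, required=("artifact-store",)):
    """Hypothetical deployment: needs one healthy instance and its required dependencies."""
    has_capacity = env.capacity >= 1
    deps_ok = not any(dep in env.unavailable for dep in required)
    return has_capacity and deps_ok


def chaos_gate(env):
    """Run each disruption in turn and record whether the hypothesis held."""
    results = {}
    with reduced_capacity(env, 1):
        results["deploy_with_low_capacity"] = deploy(env)
    with dependency_outage(env, "legacy-db"):
        results["deploy_with_legacy_db_down"] = deploy(env)
    return results
```

A pipeline stage could call `chaos_gate(Environment())` and fail the stage if any recorded value is `False`. The context managers always undo the disruption, so a failed hypothesis does not leave the environment degraded.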

Useful Experiments on Your Build Cycle

For legacy pipelines, let’s take the example of the mainframe. The pipeline starts with version control tools such as ISPW and ChangeMAN, continues through build, release and deploy (e.g. Topaz and IBM tools) and operations (manual or automated), and ends with monitoring (BMC, Splunk, etc.). Here are two kinds of chaos experiment that help to assess your pipeline:

Application-specific experiment: A specific idea or test design is applied to check reliability; this can be a one-off experiment. It can be used during development, build, test, deployment, operation and monitoring.
GameDays: These are closer to real time, with responsibilities shared across the team and a specific focus.

The idea here is to speed up deployment and find issues before they hit production.

Which Are the Best Monkeys for Game Days?

Latency Monkey: When you move away from legacy, you spend most of your time in a transition state, so it’s crucial to verify that old and new integrate properly. You need something to challenge your latency.
Big Iron King Kong (Legacy Monkey): This monkey should let you experience the following:
Start and stop a key database automatically
Introduce a highly localized failure in a legacy system, or make the system slow
Terminate an entire database or disconnect from the data center
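As a rough illustration of what a Latency Monkey does, the sketch below slows down a call to a legacy system and checks that the caller copes by falling back within its time budget. Everything here is hypothetical: `query_legacy_system` is a stub, and `with_latency` and `call_with_timeout` are illustrative helpers rather than part of any chaos tool.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def query_legacy_system(account_id):
    """Hypothetical call into a legacy system (stubbed for illustration)."""
    return {"account": account_id, "source": "legacy"}


def with_latency(func, delay_seconds):
    """Wrap func so every call is delayed, simulating a slow legacy dependency."""
    def slowed(*args, **kwargs):
        time.sleep(delay_seconds)
        return func(*args, **kwargs)
    return slowed


def call_with_timeout(func, timeout_seconds, fallback, *args):
    """Call func in a worker thread; return fallback if it does not answer in time."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, *args)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block the caller waiting for the slow call to finish.
        pool.shutdown(wait=False)
```

During a Game Day, the injected delay would be raised step by step, for example `call_with_timeout(with_latency(query_legacy_system, 0.3), 0.05, {"source": "cache"}, "42")`, to find the point at which the new services stop degrading gracefully.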

For build pipelines, the golden spot remains in the middle because, usually, the software itself plays a role in responding to the failure. For example, the software might include an automated restart, throttling, failover, etc. If those are software functions, then the software can either work or not work, and the build should be able to uncover that.
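To make this concrete, here is a minimal sketch of a build-time check for one such software function, failover. It assumes a hypothetical `fetch_with_failover` helper and stubbed data sources; the point is only that the build injects the failure and observes whether the software's own response works.

```python
class DependencyDown(Exception):
    """Raised by a stubbed dependency to simulate an injected failure."""


def fetch_with_failover(primary, secondary):
    """Try the primary source; fail over to the secondary if the primary is down."""
    try:
        return primary()
    except DependencyDown:
        return secondary()


def failing_primary():
    """Stub for a primary database with a failure injected by the experiment."""
    raise DependencyDown("injected failure")


def healthy_replica():
    """Stub for a replica that is still reachable."""
    return "balance=100 (replica)"


def test_failover_survives_primary_outage():
    """The build fails if the failover path does not return data."""
    assert fetch_with_failover(failing_primary, healthy_replica) == "balance=100 (replica)"
```

A test runner in the build stage would pick up `test_failover_survives_primary_outage`. If both sources are down, `DependencyDown` propagates, which is the signal that the software cannot absorb that failure on its own.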

What truly differentiates the best from the rest is a growing focus on the reliability of the entire ecosystem: how effectively you test the resilience of your system from build all the way through production. Chaos engineering and continuous delivery, the future of releases, are two of the best practices for doing so; use them effectively to get the maximum value from them.

DevOps Institute is dedicated to advancing the human elements of DevOps success through the SKIL Framework: Skills, Knowledge, Ideas, and Learning.

This blog was originally posted on DevOpsInstitute.com.

Written by Advancing the Humans of DevOps