Reliable software through war games
At nine o’clock on a Wednesday morning the team arrived in a secret location. Two hundred metres from the birthplace of the river Mersey an unassuming café would play host to our first ever war games, a chance to discover how well we deal with production issues, to evaluate our documentation, and to work out the blind spots in our testing (and nothing to do with the 1983 film about nuclear warfare).
There are six developers in our team plus a tech lead and a project manager. We’re responsible for two different systems because, like the Mersey, this team was formed by the confluence of two smaller teams. It’s important that I mention this because when we combined those teams we were faced with a difficulty: each team knew how to support their own system, but knew little about the software written by their counterparts in the other team.
We had talked about the idea of war gaming for a long time but had yet to try it. Finally we made a decision: on the 4th of July, while our American colleagues celebrated Independence Day, we would take the team off-site and practise some realistic production support scenarios.
So how does it work? In essence you need somebody to break the system, and somebody else to fix it. The fixer should refer to documentation as much as possible, even if they already know how to fix it; that way we can learn how useful the documentation really is. The breaker shouldn’t give away what they broke, but instead say something vague, for instance “I can’t log in”.
We split the engineers up into three pairs, teams A, B, and C. Each pair would participate in two scenarios, and spend one scenario observing another team. We also scheduled time for discussion between scenarios.
Having six scenarios and three teams meant that each team could participate in one scenario for each of the two projects we’re responsible for, in addition to spending one turn observing. That’s a special case for us, but similar numbers should work well in general. As the tech lead, I came up with the scenarios before the games and occasionally used members of the observing team to help set up a scenario. The team members were given the following brief:
Although each of us knows a lot about our systems and can likely diagnose issues very quickly, try to follow the runbooks instead of your instincts
Runbooks are a kind of documentation that tells engineers how to perform common operations, such as scaling the Kafka cluster up or down or rolling back a deployment.
The team was also asked to take note of anything that we need to improve, for instance documentation, unit or integration tests, monitoring, and logging. We would talk about some of these during the discussions.
The implementation of our scenarios included removing access to a database, removing Kafka nodes, and misconfiguring some of the microservices.
We asked the participants to take notes covering:
- Suggestions: any improvements that need to be made to runbooks, documentation, logging, tests or monitoring
- Experience: Were the scenarios realistic? What scenarios would you add? What’s the most important thing you learned from the games? What would you do differently in future?
There were a number of interesting discoveries, of which I’ll describe just a few:
- One project’s runbook didn’t cover DynamoDB
- Common processes that users would perform were not fully documented, which sometimes made it difficult for developers to reproduce the actions of the ‘user’
- The runbooks missed out a crucial edge case when describing how to perform a full redeploy
- Broken links, incorrect names, and outdated information
- One scenario didn’t go as planned. The intention had been to misconfigure a service, but it turned out that in the development environment, that component was mocked, and the misconfiguration had no effect. This turned out to be a good thing because it made us think about our integration tests a little more
In preparing the scenarios two possibilities worried me: that the scenarios would be too easy, or that they wouldn’t work. As it turned out, one scenario didn’t go the way it was supposed to, but that actually allowed us to uncover some things we wouldn’t have seen otherwise. The scenarios were more difficult than I expected, but there were valid concerns from the team about how realistic they were.
It’s difficult to come up with realistic failures that remain isolated to a dev environment. Many of the real-life issues that we’ve encountered involved multiple systems going wrong. While our scenarios focused on ‘low-hanging fruit’ like bringing down the database, something more interesting like random packet loss over the network link, making the database effectively unavailable, would have been better.
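To make the packet loss idea a little more concrete, here is a minimal sketch of how a lossy link could be imitated at the application level. It is not something we ran on the day: `FlakyLink` and `query_database` are hypothetical names invented for this illustration, and a real exercise would more likely inject the fault at the network layer rather than in code.

```python
import random


class FlakyLink:
    """Wrap a callable and fail a fraction of calls, imitating packet loss.

    Hypothetical helper for illustration only.
    """

    def __init__(self, call, loss_rate, seed=None):
        if not 0.0 <= loss_rate <= 1.0:
            raise ValueError("loss_rate must be between 0.0 and 1.0")
        self._call = call
        self._loss_rate = loss_rate
        # A private generator lets a scenario be replayed by reusing the seed.
        self._rng = random.Random(seed)
        self.dropped = 0
        self.delivered = 0

    def __call__(self, *args, **kwargs):
        # random() returns a value in [0.0, 1.0), so a loss_rate of 1.0
        # drops every call and a loss_rate of 0.0 drops none.
        if self._rng.random() < self._loss_rate:
            self.dropped += 1
            raise ConnectionError("simulated packet loss")
        self.delivered += 1
        return self._call(*args, **kwargs)


def query_database(sql):
    """Hypothetical stand-in for a real database call."""
    return f"result of {sql}"


# With most calls failing, the database is up but effectively unavailable.
flaky_query = FlakyLink(query_database, loss_rate=0.8, seed=42)
```

The point of a fault like this is that nothing is cleanly ‘down’: some calls still succeed, which is closer to the messy failures we have seen in production and a harder test of the runbooks, monitoring, and logging.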
Another situation to watch out for is when more damage is done by a scenario than intended. Because all of our scenarios were executed in dev environments we weren’t worried about really breaking things. Still, in one case when things got out of hand we were forced to clear down the environment and redeploy. Even that taught us something useful: there was a certain detail to redeployment that the runbook didn’t explain.
We had productive discussions during and after the event. Here’s a selection of quotes from the team:
I didn’t expect it to be as much fun to observe as well as participate
On future projects we should build runbooks as we go along, rather than waiting until later on in the project
I would like to have seen the wifi fail so that people would be forced to try tethering on the support phones
We can improve some of the logging. You probably shouldn’t even be seeing a stack trace in the log; or at the very least, it should be a conscious decision to show one
Overall it was good. It definitely puts you on the spot
Your own war games
Running a war gaming session is difficult but rewarding. As well as being a valuable team-building exercise, it leaves you with a better understanding of the strengths and weaknesses of your production systems, particularly when it comes to monitoring, logging, and documentation.
My advice to tech leads who want to try this is to test the scenarios beforehand, and make sure there’s an easy way to reset the environment quickly so it’s ready for the next scenario.
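One way to rehearse a scenario ahead of time is a small dry run that breaks the environment, checks that the symptom actually shows up, and always resets afterwards. The sketch below is an assumption about how that could look, not the tooling we used: `Scenario`, `dry_run`, and the three callables are hypothetical names, and in practice each callable would wrap whatever scripts you already have for your environment.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scenario:
    """A rehearsable war game scenario (hypothetical structure)."""

    name: str
    inject: Callable[[], None]           # breaks something in the dev environment
    symptom_visible: Callable[[], bool]  # True if the fault can be observed
    reset: Callable[[], None]            # restores the environment


def dry_run(scenario: Scenario) -> str:
    """Rehearse a scenario and report whether it is safe to use on the day."""
    if scenario.symptom_visible():
        return "environment already unhealthy"
    try:
        scenario.inject()
        took_effect = scenario.symptom_visible()
    finally:
        # Reset even if injecting or checking raised an exception.
        scenario.reset()
    if scenario.symptom_visible():
        return "reset failed"
    return "ok" if took_effect else "no effect"
```

A result of "no effect" is the case we ran into when the component we misconfigured turned out to be mocked in the development environment; a rehearsal like this would have flagged it before the games began.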
It’s important to get the team to buy in to this activity. Involve everybody in the planning process and take suggestions as to how the session will run. Set out clear deliverables, such as ‘add runbook improvements to Jira backlog’, and be prepared to improvise on the day, because things won’t go according to plan.
Running the session off-site isn’t mandatory. When surveyed, many of our developers were unconvinced about whether it was worthwhile going to a café. For what it’s worth, my own argument is that offices come with unwanted distractions, real production issues don’t always come up while you’re in the office, and apart from all that, the change of environment is conducive to team-building.