Debugs and Diagnostics — Training for trouble with role-playing!

Richard Makepeace
Published in ASOS Tech Blog
4 min read · Jan 18, 2024
Image by Nika Benedictova

Within the payments department at ASOS, we have an on-call support rota for out-of-hours problems, which people can join to earn some extra money and experience. Sometimes engineers want to join but are put off by the thought of having to deal with a problem in the middle of the night when there is only limited support available to them.

While we can help them gain some confidence and experience by getting them to deal with problems that crop up in the non-production environments, it doesn’t make for a particularly convenient way to train people as the non-production environments at ASOS are relatively stable. Additionally, the types of problems that crop up in non-production don’t always match those that we tend to see in production.

One of the ways we’ve tried to solve this problem in the past is by taking problems that have previously happened in production, reproducing them in a non-production environment and giving engineers the chance to solve them. This works very well but there are two significant problems with this approach. The first is that setting up a specific failure can often be a lot of work and the kinds of failures we see can’t always be created artificially. The second is that it usually impairs the function of the integrated test environment for other teams and prevents them from using the service while the exercise is in progress.

In the last couple of years, I’ve grown to love tabletop role-playing games such as Dungeons and Dragons, Wanderhome, and Call of Cthulhu. In these games, entire worlds, characters and situations are described by the person running the game and players interact with the world simply by telling this person what they’d like their character to do. I had an idea that this approach could potentially be used as a training method for diagnosing problems in the production environment and our principal engineer encouraged me to try it out with some engineers ahead of our peak trading period.

After a bit of thought, I came up with “Debugs and Diagnostics”, a system of role-playing technical problems and finding solutions. So how does it work? The person running the scenario (hereafter referred to as the coordinator) will describe a problem along the lines of “An alert has fired”, either giving the players the alert itself or describing its contents. At this point, the players can either ask questions or try to take direct actions to resolve the problem. For instance, they might say “I look at the logs for the affected service” and the coordinator will tell them what they can see in the logs.

The level of detail can be tuned by the coordinator according to what skills need to be practiced. This can be as simple as making sure that the players know where the logs they’re checking are located or as complicated as getting them to write the log query they want to execute and then telling them what it would return. It can also be used as a means of training people about processes and procedures by making it clear that they can speak to other people, raise support tickets or escalate problems as part of the exercise. In the event of the players getting stuck, it’s fine for the coordinator to drop hints as it’s meant to be a learning experience rather than a test. I’ve put a very small sample dialogue below as a quick example.

Coordinator: You get a call saying the message freshness alert has triggered for the payment processor endpoint.
Participant: I check the status of the payment provider to see if it’s experiencing an issue.
Coordinator: OK, how would you check the status?
Participant: I could search for it?
Coordinator: You could but it’s also listed in the runbook for the service.
Participant: Oh right yes, I go to the link on that page.
Coordinator: The page shows that they are experiencing latency issues at the moment.
Participant: In that case, I will raise a ticket with the provider and inform the manager on duty of the situation.

Training this way can be much easier to set up than trying to create issues in test environments but it’s not without some drawbacks. The main drawback is how much it depends on the person running the scenario to simulate the state of the logs and services in their head in response to questions from the players. This typically requires time up front to think about how various logs and data sources will look in this situation, as well as the ability to answer unexpected questions in an ad-hoc fashion during the session. For this reason, the role of coordinator should be taken by engineers who have experience with the systems or problem in question.
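One way a coordinator might capture that up-front thinking is to jot the scenario down as a small prep sheet before the session. The sketch below is only an illustration of the idea, not something we use at ASOS: the `Scenario` class, its field names and the entries in it are all hypothetical, loosely based on the sample dialogue above.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """A coordinator's prep sheet for one role-played incident (hypothetical)."""

    opening: str                  # what the coordinator reads out first
    observations: dict[str, str]  # data source -> what the players find there
    hints: list[str] = field(default_factory=list)  # nudges if players get stuck

    def look_at(self, source: str) -> str:
        """Return the prepared answer for a data source, or flag an ad-hoc one."""
        return self.observations.get(
            source.lower(),
            "Not prepared: the coordinator improvises an answer.",
        )


# Illustrative content only, loosely based on the sample dialogue.
freshness = Scenario(
    opening="The message freshness alert has triggered for the payment processor endpoint.",
    observations={
        "runbook": "Lists a link to the payment provider's status page.",
        "provider status page": "The provider is experiencing latency issues.",
    },
    hints=["Is there anything in the runbook for this service?"],
)

print(freshness.opening)
print(freshness.look_at("Provider status page"))
print(freshness.look_at("service logs"))  # nothing prepared, so improvise
```

Anything the players ask about that isn't on the sheet falls through to improvisation, which is where the coordinator's experience with the real systems matters most.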

We recently held a one-hour session with 3 different coordinators from different teams running scenarios for 3 groups of about 5 engineers. Each scenario took about 15 to 20 minutes to complete, so all 3 groups were able to practice all 3 scenarios. The feedback was very positive and we’re looking to run more sessions in the future.
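The timing works because each group can move on to a different scenario every round. As a rough sketch of one such rotation (the group and scenario names are placeholders, and this is an assumed way of arranging the rounds rather than a record of how we scheduled ours):

```python
def rotation(groups, scenarios):
    """Return one {group: scenario} mapping per round, shifting by one each round."""
    n = len(scenarios)
    return [
        {group: scenarios[(i + r) % n] for i, group in enumerate(groups)}
        for r in range(n)
    ]


# Placeholder names; the real session had 3 groups and 3 scenarios.
groups = ["Group A", "Group B", "Group C"]
scenarios = ["Scenario 1", "Scenario 2", "Scenario 3"]

schedule = rotation(groups, scenarios)
for round_number, assignment in enumerate(schedule, start=1):
    print(round_number, assignment)

# Three rounds of 15 to 20 minutes take 45 to 60 minutes in total.
print(len(schedule) * 15, len(schedule) * 20)
```

With equal numbers of groups and scenarios, no scenario is needed by two groups in the same round, so each coordinator only ever runs one table at a time.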

Despite needing a little work up front, this can be an easy way to train new team members, or those not confident in their ability to handle unexpected problems, and potentially even have some fun along the way.

Do you have any interesting strategies for training people on support or have you done anything similar with your team? Let us know in the comments!

Hi, I’m Richard. When I’m not role-playing, gaming or playing the guitar, I’m a Lead Software Engineer in the payments department at ASOS.
