Times have changed. Not five years ago, we used to leave the job of being “on call” to the mysterious “DevOps team”, who would ensure our APIs and websites ran shipshape.
At SEEK it is of utmost importance that each developer in a team knows not just how to build code, but how to ship and support it.
How can you ensure that all developers, from associate to senior principal, know how to effectively support the systems that they are building?
The confidence gap
In a recent Agile team retrospective it became clear that each developer lacked confidence in supporting our production system. Each developer has individual strengths and weaknesses: areas of the code they are intimate with, and code they have never seen before. There was a lack of understanding about how to support the systems they were responsible for. I felt the same way early in my career, when I was terrified I wouldn’t know what to do when an incident occurred.
From a young age, I was taught a fundamental behaviour to increase confidence in something: practice makes perfect. But there’s a problem.
The only way to learn about production support is to be thrown in the deep end and do it. That’s how I learnt it. Seriously.
We’re in 2019 and we can do things better.
The support workshop
Following our retrospective about production support, I decided to experiment with a simple support workshop.
My criteria for success were to:
- Increase the confidence of each person
- Have a list of what was crappy about the process of production support
- Have another list of what was awesome about the process of production support
- Have a set of actions that the team could take to make the crappy stuff awesome
Because let’s admit it, developers aren’t perfect, and the tools, logging and fancy dashboards we set up for production support might never be realistically useful or accessible.
With those criteria in mind, let’s prepare for the workshop. Ensure you have a meeting room booked for at least an hour.
Build your scenarios
With the workshop meeting booked, the first thing you should do to prepare yourself for this support workshop is build your scenarios. Scenarios are relevant and realistic examples of real production incidents that can happen or have happened in your team before. This must be done by someone with rich experience supporting the system you’re workshopping, such as the team’s tech lead.
Get a set of cards, the same ones you’d put on a Kanban wall, and start writing scenarios of production incidents that have happened before. Aim for twice as many scenarios as there are developers in your team. For our team it was ten.
An alarm for DynamoDB write capacity has triggered, indicating a sudden spike in traffic
<An external team> report they are no longer receiving data from our system.
<An external team> report that users are seeing an error when loading the web page. There hasn’t been an alarm.
The user API is alarming, returning HTTP 500 errors
There are many ways you can write scenarios. I will leave that up to your creative brains.
Set the scene
Notice that I’ve been generic about what’s happening in each scenario. The reason is that whenever there is an incident there will be information or evidence provided to you in some way. It could be a PagerDuty incident, a Slack message or a tap on the shoulder from a concerned workmate.
With that in mind, open your word processor, dust off the Slack archive, screenshot some real information from real production incidents and print them off. Cut out each piece of information and stick it on the back of the relevant card with a glue stick or some tape.
If you don’t have any recent examples, make some up! I used my nearest image editing tool to create some “real fake” information about one of our services, like this one below.
Now that you have your scenarios, throw them all into a small cardboard box or a hat.
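If you would like a digital copy of the deck alongside the physical cards, the preparation so far could be sketched roughly as below. This is only an illustration: the `Scenario` class, `scenarios_needed` and `fill_hat` are hypothetical names, the evidence strings are placeholders for your own screenshots, and the default of two scenarios per developer is my reading of the sizing rule above.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """One card: the incident on the front, the evidence stuck on the back."""
    description: str
    evidence: list = field(default_factory=list)


def scenarios_needed(developer_count, per_developer=2):
    """Deck size, assuming two scenarios per developer (an assumption)."""
    return developer_count * per_developer


def fill_hat(scenarios, seed=None):
    """Return a shuffled copy of the deck, like throwing the cards into a hat."""
    hat = list(scenarios)
    random.Random(seed).shuffle(hat)
    return hat


# Two of the example scenarios from this article; the evidence is a placeholder.
deck = [
    Scenario(
        "An alarm for DynamoDB write capacity has triggered, "
        "indicating a sudden spike in traffic",
        evidence=["PagerDuty incident (placeholder for a real screenshot)"],
    ),
    Scenario(
        "The user API is alarming, returning HTTP 500 errors",
        evidence=["Slack message (placeholder for a real screenshot)"],
    ),
]

hat = fill_hat(deck)
```

Keeping the evidence next to each scenario mirrors the physical card, so whoever draws it sees the same information they would get in a real incident.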
Set up the meeting room
Like a retrospective, you want to set up the meeting room so the team can aggregate everyone’s thoughts whilst the meeting is running.
I chose to put up three sheets of paper on the wall with the titles “What’s Crappy”, “What’s Awesome”, and “Actions”, from the criteria I spoke about above. Each developer in the room should be armed with a sharpie and Post-it notes as this is a collaborative process.
When you’re ready, pass the hat around the room and get each developer to take one scenario each. Have a laptop on hand for everyone to share.
Start your investigation
Starting with you, read out the scenario, followed by the information you’ve been given. Use your laptop and take the steps you would follow to investigate the problem. This means logging in to your cloud provider to check dead letter queues, performing a log search or querying a record from your database. You can even send a fake message to someone on your work’s messaging platform, or draft a fake email, to gather more information. Pretend there’s been a dead letter and take steps to replay it. It’s important you set the standard for how much detail you want each person in the room to go into when discovering information about the incident.
While you’re doing all this, you need a scribe. The scribe will write down all the steps you take on Post-it notes, so you can refer to them afterwards. A developer should do this, but it’s okay to have your Delivery Manager, Production Manager, or UX Researcher volunteer for the role.
The scribe writes down some of the actions you took and sticks them onto the scenario.
Once you’re done, go through with the scribe’s help why you took the steps you did to resolve the issue. Talk about what was crappy about that process: how bad the alarm was, why the incident happened without an alarm, why the dashboard was inaccessible, or why the log search took ages to tell you what was wrong.
Put all the crappy things up on the wall in the What’s Crappy section as Post-it notes.
Similarly, talk about what was awesome. The hope is that all the awesome stuff will be things you can replicate across other services or systems.
Pass the torch (or laptop in this case) around and let each person in the room read out their scenarios and attempt to investigate the issue behind the scenario. It is important that you make it clear that it’s okay if they don’t know what to do. That’s the point of the meeting.
Finish off as many scenarios as you can and ensure you leave at least 15 minutes to reflect on the exercise — so set a timer.
Once the timer is up, you should have a whole heap of crappy and awesome Post-its on the wall.
Now, discuss what actions your team might take. These are commitments your team will make to improve the process.
And we’re done! Your team can prioritise these actions and help improve the production support process.
I would recommend running a support workshop every month. If you have a lot of actions to take or a lot of crappy items to fix, run the workshop more frequently until your team is confident in its support procedures. Running it regularly lets your team revisit the actions it has taken and stay aware of process improvement around production support. It also means they’ll gradually get more and more confident.
A great way to ensure you’re making progress towards process improvement and team confidence is by sending out a monthly survey to your developers. A tool such as Polly, used through Slack, can be set up to automatically send out a poll with the responses collected anonymously. You can ask one simple question:
On a scale of 1 to 10, how confident do you feel in supporting <insert system or team name>?
This will provide you with great ammunition for conversations about process improvement during each support workshop.
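To see whether confidence is actually moving, the monthly answers can be summarised with a few lines of Python. The sketch below is a rough illustration only: the numbers are made up, the responses are assumed to have been copied in by hand rather than pulled from any survey tool’s API, and `summarise` and `monthly_summary` are hypothetical helper names.

```python
from statistics import mean

# Made-up anonymous answers to the 1-to-10 confidence question, per month.
responses_by_month = {
    "month 1": [3, 4, 5, 2, 6],
    "month 2": [5, 5, 6, 4, 7],
    "month 3": [6, 7, 7, 5, 8],
}


def summarise(responses):
    """Average and lowest score for one month's responses."""
    return {"average": round(mean(responses), 1), "lowest": min(responses)}


def monthly_summary(by_month):
    """Summarise every month, keeping the months in the order given."""
    return {month: summarise(scores) for month, scores in by_month.items()}


for month, summary in monthly_summary(responses_by_month).items():
    print(month, summary)
```

Tracking the lowest score as well as the average is a deliberate choice: the goal is to increase the confidence of each person, and an average on its own can hide one developer who still feels lost.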
You’ve seen how to run a simple recurring support workshop to build your team’s production support capability. What methods do you use to improve confidence in your team? Share your ideas and feedback in the comments below.