Documentation Day: How the FT.com team improved our documentation to 95% usefulness in 7 hours*

* Plus heaps of preparation and lots of help!

Jennifer Johnson
Jul 10 · 12 min read
Friendly runbook trophies up for grabs

In June a group of FT developers came together for a magical day of sugar- and sun-fuelled writing in an effort to improve our documentation. By the time we were done we’d improved the quality of the runbooks of our most critical (Platinum tier) systems by an average of 25%, achieved at least 90% ‘usefulness’ on all of them, and shared some hotly contested trophies. We called it ‘Documentation Day 2019’, and I’m going to tell you how we did it.

Platinum tier system: a brand critical system that is required to be available 99.9% of the time and is supported 24/7, for example, the FT.com homepage.

Why now?

The drive to improve our documentation was born of one of our Objectives and Key Results (OKRs) for 2019: continuously increasing the maintainability and sustainability of the FT.com and FT App technology stack.

Part of the motivation for this OKR, and for our decision to focus first and foremost on our runbooks, was that they were not working effectively for either of their two audiences: the Operations team, who provide first line support, and FT.com developers, who provide second line support. This was having a negative effect on both.

First Line Support: a group within the Operations team, working 24/7 to support 230 platinum systems at the FT, performing monitoring and basic support tasks for all of them

Second Line Support: Experts on individual systems (usually developers in the team that built them), called to provide fixes when Ops are unable to

When something goes wrong with one of our Platinum systems, the Ops team try to diagnose the issue and fix it based on the information in the runbook. However, that information is often vague, incorrect, or missing entirely, which leaves them unable to perform that function effectively.

When Ops are unable to fix an issue they call the Second Line Support team, that is, the developers from the FT.com team who are on the support rota, to try to fix the issue instead. The problem is that developers on FT.com know that the runbooks don’t provide the correct information for fixing the associated systems and so they, understandably, don’t want to put themselves forward to be on the rota in the first place.

This has led to the Ops team being unable to fix issues on FT.com Platinum apps and relying instead on a small group of expert developers in the FT.com team, who have deep knowledge of the systems but often hold it only in their heads.

We hoped that improving runbooks all round would enable Ops to better diagnose and fix issues, eliminating the need to call Second Line Support in as many cases as possible. We also hoped that, when Second Line Support are called, they too would have the information they need to fix issues, giving a wider group of FT developers the confidence to volunteer for the Second Line Support rota.

Part of an existing FT.com runbook

How?

Conceptually, we knew what we wanted to do: improve our runbooks, but we still had much to decide about how to do it.

To avoid moving effort away from BAU (business as usual) work for too long, and to keep the focus on the issue at hand, we had already agreed to attack the problem with the whole group working on runbooks for a single day. Maintaining and updating documentation would be ongoing, of course, but this initial push for improvement would happen once.

That decided, we still had three big questions to answer:

Measuring success

Luckily for us, we weren’t the only team at the FT focusing on runbooks in the run-up to Documentation Day, and this project came together as a huge collaborative effort between the FT.com team and the Operations and Reliability team.

The Operations and Reliability team’s focus is on transforming the way we support delivery teams and our products at the FT by improving the way we execute our monitoring and reliability. As part of their own OKRs they had come up with a metric by which the quality of runbooks could be measured: the System Operability Score (SOS). The System Operability Score deserves a blog post of its own, but for now it’s enough to say that this is what we’d use to quantitatively measure success: how much could we improve this score for each runbook?

System Operability Score: a score created to provide tech teams at the FT with clear guidance on what they can do to improve the operability of their systems, with critical issues having a higher negative impact on the score

SOS score in action

This was a great start, but metrics can often be misleading. The score could measure whether a field on a runbook was filled in, but there was only a certain extent to which it could validate the data within. We would have to do more than use the score to ensure we reached the level of quality we wanted.
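To make the idea of a severity-weighted score concrete, here is a minimal Python sketch. It is not the FT's actual SOS calculation: the field names, the severity weights, and the `operability_score` function are all hypothetical. It also shows the limitation described above, because a field earns its points as soon as it is non-empty, whether or not the content is useful.

```python
# Hypothetical severity weights: a missing critical field costs more
# than a missing minor one. These values are assumptions, not the FT's.
SEVERITY_WEIGHTS = {"critical": 5, "major": 3, "minor": 1}

# Hypothetical runbook fields and the severity of leaving each one empty.
FIELD_SEVERITY = {
    "troubleshooting": "critical",
    "failover_process": "critical",
    "release_process": "major",
    "architecture": "minor",
}


def operability_score(runbook: dict[str, str]) -> float:
    """Return a 0-100 score: the weighted share of fields that are filled in."""
    total = sum(SEVERITY_WEIGHTS[sev] for sev in FIELD_SEVERITY.values())
    earned = sum(
        SEVERITY_WEIGHTS[sev]
        for field, sev in FIELD_SEVERITY.items()
        if runbook.get(field, "").strip()
    )
    return round(100 * earned / total, 1)


# Illustrative runbook content, invented for this example.
runbook = {
    "troubleshooting": "TODO",  # non-empty, so it scores despite being useless
    "failover_process": "",     # empty critical field: the biggest penalty
    "release_process": "See the shared release guide.",
    "architecture": "See diagram.",
}
score = operability_score(runbook)
print(score)
```

The "TODO" entry is why a score like this needs to be paired with feedback from the people who actually read the runbooks.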

Focusing on the right content

As well as a quantitative measure, we also wanted qualitative feedback, and for this we worked with the First Line Support team — the primary audience for our runbooks. We wanted to understand what a great runbook looked like for them — what sections did they jump to first when there was an issue, and how could the content of these sections be optimised?

We were given invaluable advice by the Ops team, which stemmed from the way their team works: they perform monitoring and basic support tasks for hundreds of Platinum systems at the FT, so their technical skills are broad rather than specialised in any one system. As a result, they need runbooks to be easy to navigate and straightforward to understand. Some key points they asked us to consider were:

We were also helped by Jen Lambourne, the Head of Technical Writing at Government Digital Service, who delivered a brilliant talk for us in the week leading up to our Documentation Day. The talk was full of juicy content, but one thing it solidified for us was the value of task-oriented content. In the case of our runbooks this was the troubleshooting section: a set of actionable, unambiguous steps for the Ops team to follow for each of the issues most likely to occur on a system, including what to do if those steps didn’t work.

On the back of all of this wonderful feedback we went away and worked on improving a single runbook to the Ops team’s specifications (if you’re in the FT you can see that runbook here). Happily, we got very positive feedback on our work and an SOS score for the runbook of 98%. We were getting there!

How to work most efficiently — the process

Now that we knew what success looked like, we just had to make it as easy as possible for people to achieve in one day. We spent a lot of time on this in the run-up to Documentation Day and that work was essential to the day being a success.

An early realisation we had was that some of the important fields in our runbooks would be the same for many of our systems (for example, Release Process and Failover Process). We spent time in advance writing shared documentation for these sections which could be linked to from all of the runbooks, saving people from individually writing content for these sections for each runbook on the day. Most importantly, abstracting this content away created a single source of truth which is easy to keep up to date.

We also created a detailed guide as a reference document for people on the day, explaining the sort of content that should be present in each runbook field, which fields are most important, and tips and tools for populating them. This, along with the example runbook we’d already written, meant that people had lots of guidance on what a great runbook should look like and how to get there.

Finally, we pulled all of this work together using Runbook.md.

Runbook.md

Runbook.md is a tool recently built by the Operations and Reliability team. It takes a runbook written in a Markdown file, extracts the relevant data, validates it, and imports it into the runbook database for presentation in the UI in the format used by the Ops team.
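The extract-and-validate steps described above can be illustrated with a short Python sketch. This is not how Runbook.md itself is implemented: the heading convention, the required field names, and the `extract_fields` and `validate` functions are assumptions made for illustration, and the import into the runbook database is left out.

```python
import re

# Hypothetical required fields; the real tool's schema may differ.
REQUIRED_FIELDS = ("Primary URL", "Troubleshooting", "Failover Process")


def extract_fields(markdown: str) -> dict[str, str]:
    """Map each '## Heading' to the text that follows it."""
    fields: dict[str, str] = {}
    current = None
    for line in markdown.splitlines():
        match = re.match(r"^##\s+(.+?)\s*$", line)
        if match:
            current = match.group(1)
            fields[current] = ""
        elif current is not None:
            fields[current] += line + "\n"
    return {name: body.strip() for name, body in fields.items()}


def validate(fields: dict[str, str]) -> list[str]:
    """Return one error message per required field that is missing or empty."""
    return [
        f"'{name}' is missing or empty"
        for name in REQUIRED_FIELDS
        if not fields.get(name)
    ]


# Illustrative runbook content, invented for this example.
SAMPLE = """# Example System

## Primary URL
https://example.com

## Troubleshooting

## Failover Process
Follow the shared failover guide.
"""

fields = extract_fields(SAMPLE)
errors = validate(fields)
print(errors)
```

Here the empty Troubleshooting section is the only field reported, which is the kind of feedback an author can act on straight away.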

The primary benefit of Runbook.md to our teams was that it allowed us to move the contents of our runbooks from a standalone database to the repositories where the code for each system was stored. We’d had feedback from developers that having runbooks sit alongside code, like a README, would make them easier to find and maintain.

The huge benefit of this system for Documentation Day was that we were able to create a partly populated runbook file, in Markdown, for each Platinum system for people to edit on the day. These files could be updated by developers as the day progressed, and their content easily validated using the Runbook.md UI:

Runbook.md validation tool

We pre-populated each runbook Markdown file with:

A runbook in markdown, with pre-populated content and tips

Creating this pre-filled runbook file for each system enabled people to start thinking about content (rather than process) as soon as Documentation Day began.

And providing all of these reference points within the file allowed people to start with the hardest questions on the day, i.e. those that need a bit of research and can’t be easily guessed, rather than spending time and cognitive effort on the more straightforward content that we had already automated away with the work above.
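A pre-population step of this kind could be sketched roughly as follows. This is an illustration only, not the script we used: the section names, the tips, the placeholder URLs, and the `render_runbook_template` function are all hypothetical. Shared sections are filled with a link to common documentation, while system-specific sections are left empty with a tip for the author.

```python
# Hypothetical links to shared documentation; the URLs are placeholders.
SHARED_SECTIONS = {
    "Release Process": "https://example.com/shared/release-process",
    "Failover Process": "https://example.com/shared/failover-process",
}

# Hypothetical system-specific sections, each with a tip for the author.
RESEARCH_SECTIONS = {
    "Troubleshooting": "List the most likely issues and the steps to fix each one.",
    "Monitoring": "Link to the dashboards and status pages for this system.",
}


def render_runbook_template(system_name: str) -> str:
    """Build a partly populated Markdown runbook for one system."""
    lines = [f"# {system_name}", ""]
    for heading, url in SHARED_SECTIONS.items():
        lines += [f"## {heading}", "", f"See the shared guide: {url}", ""]
    for heading, tip in RESEARCH_SECTIONS.items():
        # Tips are written as HTML comments, which most Markdown
        # renderers do not display.
        lines += [f"## {heading}", "", f"<!-- Tip: {tip} -->", ""]
    return "\n".join(lines)


template = render_runbook_template("example-system")
print(template)
```

Keeping the shared links in one dictionary mirrors the single source of truth idea: if a shared guide moves, only one value needs to change before the templates are regenerated.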

Making it all work on the day

And so to the big day itself! By now we were confident we had the infrastructure in place for people to create great runbooks, but we also wanted participants to feel excited about the work, learn something new, and enjoy it. We put a few things in place on the day to make this happen:

The Dogumentation-o-meter

Everyone really got stuck in, and it was uplifting seeing people pair up with colleagues they didn’t know very well and learn new things about our systems. I asked some of the prize winners on the day what they felt they got out of it, and here’s what they had to say:

“I spend most of my time with systems I know fairly well so I really enjoyed delving into something I didn’t have much familiarity with and discovering quite how bad the runbook for it was — we split up tasks, worked out how various things like the backup process worked and documented as we went.”

“The tooling around runbook.md meant we could start actual work within about five minutes, and the ability to submit our content and see our scores creep up as we improved the docs kept me motivated throughout the day (gamification irritates me but it does work!)”.

Conversely(!)… “Updates we made ranged from drawing a new architecture diagram to hunting down monitoring status pages. All really interesting stuff, made all the more exhilarating by the SOS score — gamified documentation is the best kind of documentation.”

“I think we both learnt a hell of a lot about how the Operations team works and its definitions. One standout was the difference between changes and rollbacks vs failing over and failing back, failing back was not something I’d really considered as something to document but it makes so much sense in hindsight!”

“Initially it was a bit intimidating documenting an unknown part of FT.com, but working with someone else really made the experience a lot less frightening… A nice side effect of the day was that it led to updates of other code and READMEs to make the system as a whole more understandable. If you’re in the mindset of tidying, it kind of spreads to other things you touch.”

The outcome

We had a great day and our runbooks are in a much better state as a result — we’d increased our runbook SOS scores by an average of 25%, and all of the runbooks we worked on ended the day with an SOS score of 90% or higher. And there were some really nice secondary benefits that we found came out of the activity too.

Participants weren’t just documenting but learning how some of our poorly understood systems work, in many cases becoming the new experts on that service, even putting in fixes and making cost savings along the way. Put another way, Documentation Day was a great way to transfer knowledge to a wider group of people away from the few people that had a deep understanding of our systems, and because we were working in pairs the knowledge was shared even further.

We’ve also started sharing our learning with other programmes so they can replicate what we’ve done, and it’s great to see our preparation work being reused and the enthusiasm for improving documentation spreading throughout the wider Technology team.

SOS scores for runbooks we worked on before and after Documentation Day

One thing that remains to be seen is whether we achieved our original aim of increasing the confidence of First and Second Line Support in fixing issues on our critical systems, as well as increasing the number of people volunteering to provide Second Line support. But it has only been a week so watch this space!


TL;DR — Takeaways to make documentation improvement a success:


Thanks to Rhys Evans, Tatiana Stantonian, and Alice Bartlett for their help with this post.

And thanks to Rowan Beentje, Sam Parkinson, Tak Tran, Umberto Babini, and Rob Squires for providing their feedback from the day.

FT Product & Technology

A blog by the Financial Times Product & Technology department.

Written by Jennifer Johnson, Developer at the Financial Times
