By Geoff Thorpe
The FT’s first-line support/operations department rely on consistent documentation to provide details on how to operate and recover our production systems. The recent introduction of GDPR and the global need for more secure systems have highlighted many failings in those documents.
This is how we encouraged the engineers to produce and improve the documentation by appealing to their competitiveness.
Back in late 2017 our Operations Manager realised that he had no measure of the quality of the documentation his team were required to use. He just had a really bad feeling about the number of critical unpopulated sections. He proposed we introduce a scoring mechanism to:
- Prove he was right on the lack of completeness
- Focus on the worst sections
- Measure improvement
His aim was for all systems to have a stated owner, a known service level and to describe their failover and key rotation processes. Without those critical aspects we were in danger of extended outages or security issues.
Runbooks — A Good Basis For Improvement
The issue highlighted by the Operations Manager was that some runbooks for platinum systems are just a list of empty or very poorly completed sections.
If we could nudge the engineering teams to populate the most important sections for the most important systems, we could achieve a huge jump in supportability and substantially reduce operational risk.
In 2018 the FT created a Reliability Engineering team whose focus is to provide tools to allow Operations and Engineering to work together to deliver and support systems more efficiently. That team undertook the task of producing a score for each runbook. This automated score would then form the foundation for a subsequent qualitative manual review of the content by the Operations Team.
Phase 1 — Proof of concept
Runbooks are produced by an AWS Lambda function which writes an HTML document to an S3 bucket every time anyone changes system details. That logic was extended to trigger a call to a second Lambda that produced scores by counting the number of blank or inadequately completed fields in a JSON representation of the runbook. A quick format of those scores as an HTML table gave the Operations Manager his evidence: many unowned and incomplete runbooks.
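The counting approach described above can be illustrated with a minimal sketch. It is not the FT's actual Lambda code: the field names, the list of placeholder values and the percentage formula are all assumptions made for illustration.

```python
import json

# Hypothetical placeholder values treated as "inadequately completed".
PLACEHOLDERS = {"", "tbc", "tbd", "n/a", "unknown", "todo"}


def is_incomplete(value):
    """Return True when a runbook field is blank or holds only a placeholder."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip().lower() in PLACEHOLDERS
    if isinstance(value, (list, dict)):
        return len(value) == 0
    return False


def score_runbook(runbook_json):
    """Score a runbook as the percentage of its fields that are completed."""
    fields = json.loads(runbook_json)
    if not fields:
        return 0.0
    incomplete = sum(1 for value in fields.values() if is_incomplete(value))
    return round(100 * (len(fields) - incomplete) / len(fields), 1)


# Hypothetical runbook: two of the four fields are blank or placeholders.
example = json.dumps({
    "systemCode": "example-system",
    "primaryOwner": "",
    "serviceTier": "platinum",
    "failoverDetails": "TBC",
})
print(score_runbook(example))
```

Treating placeholder text such as "TBC" as incomplete is one simple way to catch fields that are technically populated but tell the support team nothing.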
Phase 2 — Focus
There are over 50 fields in a runbook. That is far too many to nudge engineering about effectively, so we focused on just those fields associated with ownership and recovery. There was also concern that the raw nature of the HTML table would be met with distaste by the engineers, who are used to much higher quality output; after all, they do provide the design that you see every day on the FT website.
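Narrowing the score to a handful of fields might look something like the sketch below. The grouping into ownership and recovery follows the text, but the individual field names are hypothetical and the real rule set is not described in this article.

```python
# Hypothetical field names, grouped by the two areas the score focused on.
FOCUS_FIELDS = {
    "ownership": ["primaryOwner", "supportContact"],
    "recovery": ["failoverDetails", "keyRotationProcess"],
}


def focused_score(runbook):
    """Return (score, missing) using only the focus fields; others are ignored."""
    names = [name for group in FOCUS_FIELDS.values() for name in group]
    missing = [name for name in names if not str(runbook.get(name) or "").strip()]
    score = round(100 * (len(names) - len(missing)) / len(names))
    return score, missing


runbook = {
    "primaryOwner": "Team A",
    "supportContact": "",
    "failoverDetails": "Manual failover to the secondary region",
    "architectureDiagram": "",  # blank, but outside the focus set
}
score, missing = focused_score(runbook)
print(score, missing)
```

Returning the list of missing fields alongside the score gives engineers something specific to fix rather than just a number.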
We also needed a brand name for these scores. System Operability Score — SOS.
Phase 3 — Dashboard Design
We introduced our designers to the concept of scoring and requested help to format an eye-catching dashboard. They quickly came up with a multi-layered approach, each layer of which incorporated dials and colours to focus the viewer on the runbook areas in their control which required the most work.
Phase 4 — Engagement
We now had the power to engage with our engineering teams. We were offering a dashboard for every level of seniority. The senior managers were keen to see how their division/group was scored; team leaders were made aware of their team’s average and each engineer could see the scores for the systems they had worked on. Clear error messages highlight what needs to be fixed and how.
Our Customer Products Engineering team jumped at the chance to use these dashboards and promptly set up a Documentation Day — you can see their blog here.
It’s worth noting at this point we realised that the power of engagement would allow us to highlight more than just blank fields; we could also highlight inconsistencies, missing dependencies and orphaned systems.
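A check for missing dependencies and orphaned systems could be sketched as follows. The definitions here are assumptions: an orphaned system is taken to mean one with no owning team, and a missing dependency is one that refers to a system code not present in the catalogue. The record layout is also hypothetical.

```python
def find_issues(catalogue):
    """Return (system code, issue) pairs for a catalogue of system records.

    catalogue maps a system code to a dict with 'team' and 'dependencies'.
    """
    issues = []
    for code, record in sorted(catalogue.items()):
        if not record.get("team"):
            issues.append((code, "orphaned: no owning team"))
        for dependency in record.get("dependencies", []):
            if dependency not in catalogue:
                issues.append((code, f"missing dependency: {dependency}"))
    return issues


# Hypothetical catalogue: 'alpha' depends on an unknown system, 'beta' has no team.
catalogue = {
    "alpha": {"team": "team-a", "dependencies": ["beta", "gamma"]},
    "beta": {"team": "", "dependencies": []},
}
for code, issue in find_issues(catalogue):
    print(code, issue)
```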
Let’s Get Competitive
There is something about being shown a score that gets people going, particularly when that score is shown in comparison to their peers. Within just a few hours of the formal release of SOS into production (via an announcement in our CTO’s weekly report), the scores started to rise and the complaints started.
- “I can’t get to 100%”
- “Why does it require a healthcheck?”
- “I’ve supplied monitoring but it is still reporting an error”
Yes, we did have some errors but we had captured the interest of engineering and they had started to improve the documentation. We now had a set of rules we could improve and a set of scores from which we could derive a threshold for ‘good’.
Our initial rules had targeted areas that needed to be improved but we had been over-zealous to ensure nothing fell through the cracks, e.g. insisting on healthchecks and dependencies for all systems including those managed by our third parties. A good reminder to us in Engineering to revisit our understanding of the various types of third party involvement.
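One way to relax such a rule is to make the required fields depend on how the system is managed. The sketch below is only an illustration of that idea: the `managedByThirdParty` flag and the field names are assumptions, not the rules SOS actually applies.

```python
def required_fields(system):
    """Return the fields a system must populate, given how it is managed."""
    required = ["primaryOwner", "serviceTier"]
    # Assumption: healthchecks and dependencies only apply to systems we run ourselves.
    if not system.get("managedByThirdParty", False):
        required += ["healthcheckUrl", "dependencies"]
    return required


def rule_errors(system):
    """List a clear error message for each required field that is empty."""
    return [f"{name} is missing" for name in required_fields(system) if not system.get(name)]


in_house = {"primaryOwner": "Team A", "serviceTier": "gold"}
third_party = {"primaryOwner": "Team B", "serviceTier": "gold", "managedByThirdParty": True}
print(rule_errors(in_house))
print(rule_errors(third_party))
```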
The teams had also become ultra competitive, with our Slack channel being used to remind everyone who is top of the team leaderboard (fortunately it’s the Origami team, who are part of Operations and Reliability).
However, we have seen evidence of foul play. One team leader realised that whilst fixing the documentation for his set of systems he had found a common failing which, when fixed, jumped his score by a couple of percent. It wasn’t quite enough to get to the top of his division’s scoring — so he edited a competitor’s system to reduce their score. His elation was tempered when he was shown how our audit trail pinpointed him as the last person to edit his colleague’s system.
It has now been a couple of months since SOS went live and we have corrected/improved the rules. Engagement is still very high so we have moved to a 70% ‘good’ threshold with 90% as the goal:
- 23% of our Platinum systems have reached the goal; 88% are ‘good’
- 12% of all our systems have reached the goal; 60% are ‘good’
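The thresholds above can be expressed as a short sketch. It assumes that a system at the goal also counts as ‘good’ (the Platinum figures only add up on that reading) and that the thresholds are inclusive; the sample scores are invented.

```python
GOOD_THRESHOLD = 70  # percentage at which a runbook counts as 'good'
GOAL_THRESHOLD = 90  # percentage we are aiming for


def classify(score):
    """Label a single score against the two thresholds."""
    if score >= GOAL_THRESHOLD:
        return "goal"
    if score >= GOOD_THRESHOLD:
        return "good"
    return "needs work"


def summarise(scores):
    """Return the percentage of systems at the goal and at 'good' or better."""
    total = len(scores)
    if total == 0:
        return {"goal_pct": 0, "good_pct": 0}
    at_goal = sum(1 for s in scores if s >= GOAL_THRESHOLD)
    good = sum(1 for s in scores if s >= GOOD_THRESHOLD)
    return {"goal_pct": round(100 * at_goal / total), "good_pct": round(100 * good / total)}


sample_scores = [95, 75, 72, 40]  # invented scores for four systems
print([classify(s) for s in sample_scores])
print(summarise(sample_scores))
```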
A huge thank you to all involved and to those who provided these quotes:
- “I’ve never seen people so interested in documentation” — Operations Manager
- “Whoever designed that page …it’s really good both visually, and in terms of usability its great” — Principal Engineer
- “It took me ten years of trying and I didn’t get that far” — Previous Operations Manager
- “I want this!” — a visiting technical leader from another company
If your systems are in danger, use an SOS.