Listo: Failing Safely with Checklists and RFCs

Julian Berton

Published in SEEK blog, Feb 6, 2020

Use questionnaires and checklists to make it easy to do the right thing when you’re building software.

Introduction

To enable SEEK to build products quickly in a competitive landscape, we must accept failure. The question then becomes: how do we fail without exposing SEEK or our customers to unnecessary security and reliability risks? At SEEK, we call this concept “failing safely.”

There are many ways we can reduce security and reliability risks when developing products for our customers, thereby allowing us to fail safely. Initiatives such as our Bug Bounty Program, automated rotation of secrets, dependency scanning, tech inductions for new engineers, security training programs, and capture the flag challenges run throughout the year have helped educate our engineers, reduce the number of bugs, and improve the culture and attitude towards risk.

Unfortunately, some of these controls and ad-hoc education sessions don’t scale as we grow, which leaves us with a question: how do we consistently distribute our rapidly changing security, architecture and reliability advice to a growing engineering community, and prevent issues by giving teams tailored guidance when they need it?

RFCs & Paved Road Tooling

Our attempt to solve this problem was to turn towards standards. But we knew that traditional lengthy, hard-to-read standards were not going to cut it at SEEK. We borrowed the idea of an RFC process from this article and the corresponding AWS re:Invent talk by Riot Games. In this talk, they discuss writing standards in an RFC-style format and putting them through a review and approval process with their engineering community. We currently have 28 RFCs, and counting, representing a collection of the essential requirements engineering teams should consider when building products at SEEK.

Listo’s draft RFC that has been reviewed by several engineering community members

Even though these RFCs are more concise than a typical standard, we still found that around 60% of all RFCs don’t apply to a specific project or feature. Asking teams to remember which requirements apply to their tasks is a significant and unreasonable ask. This burden compounds with each additional product requirement under consideration.

To make matters worse, engineering teams vastly outnumber the teams that have this expertise (i.e. architecture and security). This imbalance leads to best-effort engagement, often late in the software development lifecycle, when the product design decisions have already been made and the product is ready to be deployed. Consequently, knowledge and essential requirements are not getting into the hands of the engineering team at the right time.

So how do we enable developers to move fast and fail safely? More importantly, how do we achieve this as we scale our product development capabilities?

To answer this question, we looked at the aviation and medical professions, which face severe consequences when failures occur. Both employ a straightforward process to improve safety in a complex environment: checklists.

Life-saving Checklists

Despite the considerable increase in air travel, rates of incidents and related deaths have steadily decreased since checklists were introduced in the late 1930s. Atul Gawande, surgeon and author of The Checklist Manifesto, devised a simple checklist for operating theatres and introduced it across eight hospitals. The checklist cut death rates in half and reduced related complications by 36%.

In the tech industry, several companies have adopted checklists with great success. Slack has published an article and delivered a talk which describe their success using checklists to improve the quality and security of their products. Slack also open-sourced a tool called goSDL, which served as the foundation for Listo.

Introducing Listo

Checklists are at the heart of Listo, empowering engineering teams to perform a web-based self-assessment, which results in a Trello board containing the essential security, reliability and architecture requirements from our RFCs, tailored to a project’s objectives.

Today, we are excited to open source Listo, a tool we use internally to prevent risks when building software by making it easy to do the right thing.

All of the questions and checklists within Listo are specific to internal processes, tools and RFCs within SEEK. These checklists make the guidance useful, relevant and specific to our engineers. We have included a sample of the data here so that you can get an idea of how to write your own.

The Listo Self-Assessment

Although it’s never too late within the Software Development Life Cycle (SDLC) to fill out an assessment, we recommend performing an assessment during the design phase as a Read-Do checklist (where you perform each item and check it off, like baking a cake). This is our recommendation because it’s easier to change and adapt during the early stages of the SDLC. However, if teams are confident with the requirements or the project is low risk, they might decide to fill it out when it’s ready for release as a Do-Confirm checklist (where you carry out an activity, and then review what you have done).

We also recommend performing an assessment whenever starting a new project or making a significant change to an existing product. Our aim is to keep the assessment to 10–15 minutes, excluding triage of the resulting Trello cards.

The assessment begins by collecting metadata about the project. It then asks questions to help assess the business risk. For SEEK, risk is primarily defined from a security perspective.
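The article does not describe how the risk answers are turned into a result, so the following is only a minimal sketch of one possible approach. It assumes each risk answer carries a numeric weight and that the total is bucketed into a rating. The names `RiskOption` and `scoreRisk`, the weights and the thresholds are all hypothetical and are not taken from Listo.

```typescript
// Hypothetical model: each risk answer carries a weight chosen by the security team.
interface RiskOption {
  label: string;
  weight: number;
}

type RiskRating = "Low" | "Medium" | "High";

// Sum the weights of the chosen answers and bucket the total.
// The thresholds (5 and 10) are illustrative assumptions only.
function scoreRisk(chosen: RiskOption[]): { score: number; rating: RiskRating } {
  const score = chosen.reduce((total, option) => total + option.weight, 0);
  const rating: RiskRating = score >= 10 ? "High" : score >= 5 ? "Medium" : "Low";
  return { score, rating };
}

// Example answers (made up for illustration).
const exampleAnswers: RiskOption[] = [
  { label: "Handles personal information", weight: 6 },
  { label: "Internet facing", weight: 5 },
];

const exampleRisk = scoreRisk(exampleAnswers);
console.log(exampleRisk.score, exampleRisk.rating);
```

A simple additive score like this is easy for teams to reason about, although a real scheme might weight or combine answers differently.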

The tools section is where Listo shines. We provide a list of internal paved road tools (e.g. auth sidecars, build and deploy tools, dependency management services) designed to help teams write, build and deploy secure products faster. These tools are the preferred choice at SEEK, cater for most product use cases and have many requirements built in. Examples include patching and hardening requirements, zero trust build pipelines, and central logging and monitoring. Thus, our engineers have fewer non-functional requirements to consider and can focus on building their product.

However, providing the tools alone does not signal to an engineering team which requirements within the standards still need to be considered during product development. Listo solves this by mapping tools to requirements. When a tool is selected, Listo automatically checks off the requirements satisfied by that tool within the resulting Trello board. This automatic selection enables engineering teams to distinguish which requirements are covered by the tooling, and which requirements they need to cover themselves.
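The tool-to-requirement mapping described above can be sketched roughly as follows. This is an illustration only: the `Requirement`, `PavedRoadTool` and `BoardItem` shapes, the `applyTools` function and the sample IDs are hypothetical, and Listo's actual data format may look quite different.

```typescript
// Hypothetical data shapes for illustration; Listo's real schema may differ.
interface Requirement {
  id: string;
  text: string;
}

interface PavedRoadTool {
  title: string;
  // IDs of the requirements this tool satisfies out of the box.
  satisfies: string[];
}

interface BoardItem extends Requirement {
  checked: boolean;
}

// Mark every requirement covered by at least one selected tool as checked.
function applyTools(all: Requirement[], selected: PavedRoadTool[]): BoardItem[] {
  const covered: Record<string, boolean> = {};
  for (const tool of selected) {
    for (const id of tool.satisfies) {
      covered[id] = true;
    }
  }
  return all.map((req) => ({ ...req, checked: covered[req.id] === true }));
}

// Sample data (made up for illustration).
const sampleRequirements: Requirement[] = [
  { id: "logging.central", text: "Logs are forwarded to central logging" },
  { id: "auth.session", text: "Sessions expire after inactivity" },
  { id: "patching.base", text: "Base images are patched regularly" },
];
const authSidecar: PavedRoadTool = { title: "Auth sidecar", satisfies: ["auth.session"] };

const boardItems = applyTools(sampleRequirements, [authSidecar]);
// Requirements the team still has to cover themselves.
const stillToDo = boardItems.filter((item) => !item.checked).map((item) => item.id);
console.log(stillToDo);
```

Keeping the mapping on the tool, rather than on each requirement, means a new paved road tool can declare what it covers without touching the checklists.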

Once the tools are selected, Listo presents questions related to the product or feature, for example, “Will you require authentication?” or “Are you using a database?”. Each question selected translates into a Trello card. Additionally, a set of mandatory cards, containing generic checklists that apply to all projects, is included. For example, “We understand our commitments of being in scope of the SEEK’s Bug Bounty Program”.

A Trello board specific to the project is then created, containing the checklists a team should complete. We chose Trello because it’s easy for engineering teams to move cards around, add comments, tick off checklist items and track progress. In the future, we would like to extend Listo to support other types of project management tools.
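As a rough illustration of the flow above, the sketch below turns a set of answers into the list of cards for a board, always including the mandatory cards. The `Card` and `Question` types, the `buildCards` function and the checklist wording are hypothetical. Creating the board in Trello itself is deliberately left out.

```typescript
// Hypothetical shapes for illustration; not Listo's actual data model.
interface Card {
  title: string;
  checklist: string[];
}

interface Question {
  id: string;
  prompt: string;
  // The card added to the board when this question is selected.
  card: Card;
}

// Mandatory cards always come first, followed by one card per selected question.
function buildCards(
  questions: Question[],
  answers: Record<string, boolean>,
  mandatory: Card[]
): Card[] {
  const selected = questions
    .filter((question) => answers[question.id] === true)
    .map((question) => question.card);
  return mandatory.concat(selected);
}

// Sample data: the prompts come from the article, the checklists are made up.
const sampleQuestions: Question[] = [
  {
    id: "auth",
    prompt: "Will you require authentication?",
    card: { title: "Authentication", checklist: ["Use the authentication sidecar"] },
  },
  {
    id: "database",
    prompt: "Are you using a database?",
    card: { title: "Database", checklist: ["Encrypt data at rest"] },
  },
];
const mandatoryCards: Card[] = [
  { title: "Bug Bounty Program", checklist: ["Understand our Bug Bounty commitments"] },
];

const projectCards = buildCards(sampleQuestions, { auth: true, database: false }, mandatoryCards);
console.log(projectCards.map((card) => card.title));
```

Because `buildCards` is a pure function, the same output could later feed a different project management tool without changing the selection logic.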

On-boarding Guidance

We designed Listo as a self-service tool to empower teams to assess their projects. Therefore, the questions and checklists must be easy to understand. To help onboard teams, we:

Added help tips within each section of the assessment and ensured each question has a description and links to find out more about the tool or checklist item.
Created an RFC for Listo to help guide teams through the process and to make it clear what the business expects.

The Questions and Checklists

The questions and checklists are what make Listo useful. We used the data from goSDL as a starting point and customised the questions for SEEK’s context. We focussed on the most critical requirements (i.e. common mistakes or high-risk issues) instead of overwhelming teams with every requirement.

Writing easy-to-understand checklists with clear Definitions of Done took a lot of tweaking to strike the right balance between succinctness and clarity. The book The Checklist Manifesto has several useful pointers scattered throughout, which have been nicely summarised within this article by Leigh Dodds.

Resulting Trello Board with cards and checklists

Supporting Team Involvement

Listo does not replace conversations with teams like security and architecture. Instead, Listo helps improve collaboration between engineering and supporting teams:

Every completed Listo assessment sends a Slack notification to a public channel within our SEEK Slack, allowing anyone at SEEK to follow along and view the Trello board.
Listo’s risk assessment component produces a risk score along with a question asking the team completing the assessment if they would like help from the security team. This information is included in the Slack notification and on the project page.
Throughout the documentation, RFCs and our internal marketing, we encourage teams to reach out if they have any questions related to the assessment or Trello board checklists.
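To make the notification step more concrete, here is a minimal sketch of building a message body of the kind a Slack incoming webhook accepts (a JSON object with a `text` field). The `AssessmentSummary` type, the `buildSlackMessage` function, the message wording and the example URL are assumptions made for illustration. The article does not say how Listo formats or sends its notifications.

```typescript
// Hypothetical summary of a completed assessment; field names are assumptions.
interface AssessmentSummary {
  projectName: string;
  boardUrl: string;
  riskRating: string;
  wantsSecurityHelp: boolean;
}

// Build a JSON body with a `text` field, as accepted by Slack incoming webhooks.
function buildSlackMessage(summary: AssessmentSummary): { text: string } {
  const help = summary.wantsSecurityHelp ? "yes" : "no";
  return {
    text: [
      `New Listo assessment: ${summary.projectName}`,
      `Risk: ${summary.riskRating}`,
      `Security help requested: ${help}`,
      `Trello board: ${summary.boardUrl}`,
    ].join("\n"),
  };
}

// Example values (placeholders, not real projects or boards).
const slackMessage = buildSlackMessage({
  projectName: "Example project",
  boardUrl: "https://trello.com/b/EXAMPLE",
  riskRating: "High",
  wantsSecurityHelp: true,
});
console.log(slackMessage.text);
```

The resulting object would then be sent as JSON in an HTTP POST to the webhook URL configured for the public channel. That step is omitted here.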

Example Slack Notification from a Listo Assessment

Our Results

We have tested Listo internally with a handful of different engineering teams and have received both positive and constructive feedback on the process:

The overall experience, from filling out an assessment to triaging and completing the cards within Trello, was intuitive. It allowed teams to self-discover tools, documentation and recommended requirements they had not heard about in other SEEK engineering forums (e.g. meetings, Slack, tech induction, RFCs).
The process was easy to fit into our current SDL, taking us on average 10–15 minutes to fill in the assessment and an hour to triage and prioritise the resulting Trello cards into tasks.
Some of the checklists within the Trello cards are too broad and the Definition of Done is unclear. We are constantly iterating and improving our checklists to make them clear and concise.
Support for other task tracking systems (e.g. Jira, GitHub Issues) has been requested, as not all SEEK teams use Trello. For now, we ask teams to track the Trello board within their own system by creating separate cards for big tasks or linking to the Trello board.

It’s a bit too early for us to share data on whether the Listo process has helped reduce risk and scale specialist knowledge. However, from the assessments completed so far, we can anecdotally say it has helped prevent several projects from going down a riskier path (e.g. rolling their own auth layer instead of using SEEK’s authentication sidecar, or forgetting to turn on log forwarding for their application). We are excited about the results and feedback so far and are planning to roll Listo out to more teams soon.

What’s next?

We have many ideas to improve Listo further, like adding metrics to track Trello board progress, a dashboard for teams to see all their created Listo assessments and many smaller usability features. This year we are planning to get more teams filling out the Listo assessment while developing their products. We’re excited to collect more data regarding its success in reducing risk and whether it helps scale specialist knowledge as we grow our engineering community.

Want to give Listo a try?

If you’d like to try Listo, the tool and related documentation can be found below.

https://github.com/seek-oss/listo

P.S. Listo was carefully crafted during many “lunchtime hack sessions” and at internal Hackathons by a small non-official team of passionate SEEKers, without whom Listo would not exist. I’d like to give a massive shout out to the core team and everyone else who helped shape the development of Listo.