Glasswall SRE — how we work, ‘toil’, and ducks

Sam Gibson
Glasswall Engineering
5 min read · Nov 28, 2019

The Site Reliability Engineering (SRE) team at Glasswall is responsible for making sure our SaaS FileTrust platform is as reliable as possible for our end-users, whilst ensuring this focus on reliability doesn’t hamper innovation in terms of new product features. If you’d like to know more about this, then Alex (our glorious team leader) has previously written a great article all about it.

We manage our work via a Kanban board, similar to the one described by our cloud architect, Paul, in his article. We create work items on this board, and place them under ‘epics’ and ‘features’ depending on the overarching goal of the work. Based on their nature, pieces of work on this board get placed into one of four categories:

  • Planned SRE — engineering/operations work that originates from within the team, with an end goal of improving reliability, visibility, and productivity. It’s ‘planned’ because we can backlog it and allocate time in our sprints for it.
  • Planned business — engineering/operations work that originates from outside the team, with an end goal of improving reliability and product value. Again, we can backlog this and work on it in a sprint.
  • Incident — unplanned work that is related to addressing defects in the platform, causing unreliability.
  • Toil — unplanned work that could be automated, such as release requests, replaying dead letters, and performing common log analytics queries.
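As a rough illustration of how these four categories could be attached to work items and tallied, here is a minimal Python sketch. The `WorkType`, `WorkItem` and `count_by_type` names and the example board contents are hypothetical, invented for this example; they are not how our Kanban tool actually stores anything.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class WorkType(Enum):
    """The four categories of work described above."""
    PLANNED_SRE = "planned SRE"
    PLANNED_BUSINESS = "planned business"
    INCIDENT = "incident"
    TOIL = "toil"

    @property
    def is_planned(self) -> bool:
        # Planned work can be backlogged and scheduled into a sprint.
        return self in (WorkType.PLANNED_SRE, WorkType.PLANNED_BUSINESS)


@dataclass(frozen=True)
class WorkItem:
    title: str
    work_type: WorkType


def count_by_type(items):
    """Count how many work items fall into each category."""
    return Counter(item.work_type for item in items)


# Hypothetical board contents, for illustration only.
board = [
    WorkItem("Add latency dashboard", WorkType.PLANNED_SRE),
    WorkItem("Support new feature rollout", WorkType.PLANNED_BUSINESS),
    WorkItem("Investigate failed requests", WorkType.INCIDENT),
    WorkItem("Replay dead letters", WorkType.TOIL),
    WorkItem("Handle release request", WorkType.TOIL),
]

counts = count_by_type(board)
unplanned_items = [item for item in board if not item.work_type.is_planned]
```

Keeping the planned/unplanned distinction on the category itself means any report over the board can split work the same way without repeating the rule.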

If you’re a fan of The Phoenix Project, you may notice that these categories of work align closely with its ‘four types of IT work’.

Every two weeks, we have a sprint planning meeting, where we pick work from our backlog based on priority and whether we think we can get it done in two weeks.

A snippet of our Kanban board, showing work items categorised into the four types of work

Following Google’s SRE book (from which our entire profession has risen), we aim to put a hard cap on the amount of unplanned work that any given team member does in a sprint. If you don’t set aside time for the planned work that makes the system more reliable, the unplanned work will quickly pile up and become intractable without a larger team.

“Eventually, a traditional ops-focused group scales linearly with service size: if the products supported by the service succeed, the operational load will grow with traffic. That means hiring more people to do the same tasks over and over again.”
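To make the idea of a cap concrete, here is a small Python sketch that works out each person's share of unplanned work in a sprint and flags anyone over the limit. Everything in it is an assumption for illustration: the function names, the hours, and the placeholder engineer names are made up, and the 50% default is the figure Google's SRE book uses for operational work, not necessarily the cap we apply.

```python
from __future__ import annotations

# Categories treated as unplanned work in this sketch.
UNPLANNED = ("incident", "toil")


def unplanned_share(hours_by_type: dict[str, float]) -> float:
    """Return the fraction of logged hours spent on unplanned work."""
    total = sum(hours_by_type.values())
    if total == 0:
        return 0.0
    unplanned = sum(hours_by_type.get(kind, 0.0) for kind in UNPLANNED)
    return unplanned / total


def over_cap(sprint_hours: dict[str, dict[str, float]], cap: float = 0.5) -> list[str]:
    """List the people whose unplanned share exceeds the cap (assumed 50%)."""
    return [
        name
        for name, hours in sprint_hours.items()
        if unplanned_share(hours) > cap
    ]


# Hypothetical hours for one two-week sprint.
sprint = {
    "engineer_a": {"planned SRE": 30, "planned business": 20, "incident": 10, "toil": 20},
    "engineer_b": {"planned SRE": 10, "incident": 10, "toil": 40},
}

flagged = over_cap(sprint)
```

In this made-up sprint, `engineer_a` spends 30 of 80 hours on unplanned work and stays under the cap, while `engineer_b` spends 50 of 60 hours on it and is flagged.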

On unplanned work

It’s always there. You can’t backlog it, you can’t anticipate it, and it has the potential to completely defenestrate your sprint plan.

Google’s SRE book has an entire chapter dedicated to its systematic elimination. To them, unplanned work is ‘toil’, and it is characterised by one or more of the following descriptions:

  • Manual
  • Automatable
  • No enduring value
  • Repetitive
  • O(n) with service growth (scales linearly)
  • Tactical

Sidenote: There really needs to be a better acronym for these. ‘RATNOM’? ‘MAN ROT’? Those don’t sound great.

To us, toil tends to be things like fixing CD pipeline failures due to outdated scripts, updating software on our clusters, carrying out manual processes in the system backend that are currently unautomated, and manually running/checking CD pipelines for production deployments.

All of these things fit almost all of the characteristics listed above. They are also mainly driven by interruption.

As mentioned previously, it’s very difficult to backlog or anticipate toil. It mainly crops up in the middle of the day when something goes wrong, or someone wants something done, and it tends to be fairly urgent. This leads to members of our team being interrupted while carrying out planned work, losing focus, and making mistakes or oversights in what is usually critical production-related work.

This is a fundamental problem with the kind of work we do: how are we supposed to perform mission-critical work to reduce toil and make the system more reliable when the team is interrupted and loses focus whenever that work is being carried out?

The answer? Ducks.

‘Duck duty’

If you made it past the large image of the rubber duck and are still taking this article somewhat seriously, then I congratulate you! Bear with me.

Inspired by Chapter 29 of the SRE book, and Cal Newport’s ‘Deep Work’, we’ve come up with a solution to the interrupt problem: ‘Duck duty’.

We’re currently a three-member team, and our on-call rotations last for one week each. What if the person who is on-call handles all interrupts for the entire week that they’re on call? Interrupts are almost always related to changes in the system, and changes in the system are responsible for incidents more than half of the time (some sources place this figure at around 80% of the time!). It makes sense for the on-call person to have the entire week’s context of change requests if they are to also deal with any incidents caused by them.
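The weekly rotation itself is simple enough to sketch in a few lines of Python. This is only an illustration under stated assumptions: the `on_duck_duty` function, the placeholder team names, the rotation order and the start date are all hypothetical, and it assumes each rotation starts on the same weekday as `rotation_start`.

```python
from __future__ import annotations

from datetime import date


def on_duck_duty(day: date, team: list[str], rotation_start: date) -> str:
    """Return who holds the duck on `day`, rotating through `team` weekly."""
    if not team:
        raise ValueError("team must contain at least one person")
    # Whole weeks elapsed since the first rotation began.
    weeks_elapsed = (day - rotation_start).days // 7
    return team[weeks_elapsed % len(team)]


# Hypothetical three-person rotation starting on a Monday.
team = ["sre_a", "sre_b", "sre_c"]
start = date(2019, 11, 4)

first_week = on_duck_duty(date(2019, 11, 6), team, start)
second_week = on_duck_duty(date(2019, 11, 11), team, start)
fourth_week = on_duck_duty(date(2019, 11, 28), team, start)
```

With three people, the fourth week wraps back around to the first person, so everyone gets two uninterrupted weeks for every week they spend on duty.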

If you have a dedicated ‘interrupt person’, this frees up the other members of the team to get their heads down, focus, and truly achieve the ‘flow state’ of deep work — increasing productivity, and decreasing mistakes.

So, where does the duck come in? Well, Ash (one of our SREs) brought it into work one day and we thought it’d be a good idea if the interrupt person had it on their desk for the week they were on duty. It’s an easily noticeable sign for everyone else in the office that this is the person to grab for any queries, and ‘duck duty’ is a lot more fun to say than ‘interrupt person’.

We’ve been using this system for just over two months now, and it’s working. Routing all interrupts through one person lets us handle the unending flow of toil while still leaving room for quality engineering work to be carried out. If you’re having problems managing toil, get a duck!

Sidenote: no one’s replying to my emails asking for Glasswall-branded ducks. If you’re a rubber duck manufacturer and are reading this, please route all requests through the team member with the duck.

Look out for my next post, which will go into more detail about our approach to planned work that reduces toil.
