
Over the past few days Charity Majors has been at the centre of a wide ranging twitter debate around whether developers should go on call. Her position, which I agree with 100%, is that developers should.

Because it’s a conversation that’s gone on many tangents, it’s hard to pick a single twitter thread that’s representative. This one in the middle is pretty good.

The haters

There are two frustrating themes in the opposing side of the debate right now.

  1. An almost wilful denial that a humane model of on call can (and does) exist
  2. A betrayal between castes of engineers, that developers can’t be on call because they have valid personal reasons, but ops should because it’s their job

My feeling is that the first is a reflexive defensive posture: it lets developers avoid thinking through their unacknowledged position on the second. The really sad part of this dynamic is that, while there is a small moral failing here in the general area of “developer privilege”, examining humane on call programs rather than rejecting them would reveal the underlying systemic problem. Companies are trying to get a free ride on 24/7 availability, and that hurts everyone, including the companies themselves.

Developer Privilege

I cannot rightly comprehend a universe where the accident of your technical speciality makes you accountable for the quality of someone else’s work.

I don’t think I can say it any better than that.

I’m also not a fan of the arguments that “I can’t go on call because I have a family” or “I like camping so I can’t be on call”. Do none of you have any friends in other industries? I guess nurses can’t have families. I guess no doctors go camping. Electricians can’t join organised sports leagues. No train drivers can go to the movies.

Systems that run around the clock need around the clock coverage, whether that’s keeping people alive or selling doo-dads via a webform. Part of why software is so remunerative is that it facilitates sales (or activity, or whatever) 24 hours a day, every day of the year.

When developers are happy to share in that windfall, and then lean on arguments that fly in the face of the lived experience of other professions to avoid sharing in the toil, I run out of all empathy. There will always be exceptional circumstances, but the answer to that is to participate by default and handle the exceptions, not opt an entire industry out to avoid the edge cases.

Humane On Call

  • Pay for on call. Pay a retainer for the rotation period (a week, where I’ve worked) to cover the opportunity cost of the cinema trips you didn’t take and the nights drinking or camping you couldn’t have. Pay a call out fee for responding to an alert. I’ll go into more detail about why this matters in the next section.
  • Rotations should be between 5–10 people. Too few and on call has an outsized, unreasonable impact on life outside work. Too many and each person is on call so rarely that the rotation becomes ineffective. Skills need to be exercised to be retained, and one week in ten is just barely frequent enough.
  • Heavily curate which alerts go to a pager. Not all systems or errors are created equal, and when you can put a dollar price on paging out you would be amazed at how easy that process becomes.
  • Let people swap on call time, from a few hours (to see a movie) to a whole rotation (to go on holiday). Only allow swaps, not giveaways. Everyone takes a fair share, but timing is flexible.
  • Have an escalation policy. It all becomes more manageable with support. Don’t punish people pulling in help from the next level.
  • Let the team fix things that break. Product engineers who have just responded to an alert have the perfect mix of motivation and context to get a fix in place.
  • Train your people. Onboard them to on call properly. Run workshops. Have a game day. Schedule frequent refresher courses.
  • Have an option to opt people out for family hardship and the like.
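To make the rotation rules concrete, here is a minimal sketch in Python. It assumes week-long shifts assigned round-robin and models only whole-week swaps (the few-hours swap to see a movie is left out); `Rotation`, `owner` and `swap` are hypothetical names, not taken from any real scheduling tool. The design choice worth noticing is that there is no way to hand a week to someone else, only to exchange two weeks, so every engineer keeps the same share.

```python
from dataclasses import dataclass, field


@dataclass
class Rotation:
    """Hypothetical weekly on-call rotation: shifts can be swapped, never given away."""

    engineers: list[str]
    # Week number -> engineer, for weeks whose cover has been swapped
    # away from the default round-robin owner.
    overrides: dict[int, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # 5-10 people: fewer is too disruptive to life outside work,
        # more and skills go stale between shifts.
        if not 5 <= len(self.engineers) <= 10:
            raise ValueError("a rotation should have between 5 and 10 people")

    def default_owner(self, week: int) -> str:
        # Plain round-robin assignment, one engineer per week.
        return self.engineers[week % len(self.engineers)]

    def owner(self, week: int) -> str:
        # A swapped week takes precedence over the round-robin default.
        return self.overrides.get(week, self.default_owner(week))

    def swap(self, week_a: int, week_b: int) -> None:
        """Exchange cover for two weeks; each person still carries the same load."""
        a, b = self.owner(week_a), self.owner(week_b)
        if a == b:
            raise ValueError("a swap needs two different engineers")
        self.overrides[week_a], self.overrides[week_b] = b, a

    def weeks_per_year(self) -> float:
        # Average number of on-call weeks each engineer covers in a year.
        return 52 / len(self.engineers)
```

With five people each engineer covers 52 / 5 = 10.4 weeks a year; with ten it drops to 5.2 weeks, which is about the lower bound for keeping on call skills fresh.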

I can’t say the above would work in every context, but it’s worked for me across a couple of roles so far.

The Free Rider Problem

If mum could do it, so could I. Clinton (the other non-founding engineer) and I came back to the table and said we’d participate on those terms.

That startup was called Envato, and it runs one of the highest traffic Ruby on Rails sites in the world. They still run the program roughly that way, and I put the same policy in place at 99designs when I joined the management team there.

It wasn’t until much, much later that I realised it (roughly) solves the free rider problem and sets up a virtuous quality cycle by aligning business and engineering incentives. Unpaid on call allows a business to profit from the round the clock revenue of an always-on platform without paying its share of the cost of keeping that platform running.

  • Once there’s a set cost to maintaining a rotation, deciding which products and services need round the clock availability becomes an easy business decision
  • Once each page comes with a minimum cost (one hour at double pay), separating “informative” alerts from actionable ones becomes an easy business decision
  • Once software quality degrades to the point of services frequently paging and the costs stack up, ensuring the remediation work is completed becomes an easy business decision
  • Engineers still tend to value personal time (sleep, etc) more than the call out fee, and so fixing their software becomes an easy personal decision
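The incentive arithmetic can be sketched in a few lines of Python. This assumes the scheme described in this post: a flat retainer per rotation week plus a call out fee per page of double pay with a one-hour minimum. The function names and the dollar figures in the example are hypothetical, chosen only for illustration.

```python
def page_cost(hourly_rate: float, hours_worked: float) -> float:
    """Call out fee for a single page: double pay, with a one-hour minimum."""
    return 2 * hourly_rate * max(1.0, hours_worked)


def weekly_cost(hourly_rate: float, retainer: float, pages: list[float]) -> float:
    """Retainer for holding the pager for a week plus a call out fee per page.

    `pages` holds the hours worked on each page during that week.
    """
    return retainer + sum(page_cost(hourly_rate, hours) for hours in pages)
```

At a hypothetical $50 an hour and a $500 weekly retainer, a week with two 15-minute pages and one two-hour incident costs 500 + 100 + 100 + 200 = 900. An “informative” alert that pages every night adds at least 7 × 100 = 700 a week, or 36,400 a year, which makes demoting it from the pager an easy call.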

Paying for on call isn’t a panacea. Incentives aren’t a silver bullet for behavioural change, but when a company can page an engineer for free, it’s no surprise there are so many bad actors.

Let me sum up

Originally published on February 11, 2018.

Written by

Bicycle, book, and booze enthusiast. Rubyist and Gopher. Founder @HecateApp, ex @99designs/@envato
