We Ditched our On-Call Rota

Nathan Mclean
Published in spaceapetech
4 min read · Sep 28, 2023
Photo by Gary Chan on Unsplash

The DevOps team at Space Ape has always participated in an on-call rota, with each person taking their turn spending 7 straight days and nights on-call, dragging around a laptop. Now the rota is gone and the laptop mostly stays at home, but we’re still on-call.

In this post I’ll talk through the history of on-call in Space Ape and explain how our new system works.

History

In the early days of Space Ape (before my time) responding to incidents was a constant in life. Space Ape was figuring out how to develop, run and scale games as a scrappy start-up. This meant lots of things broke, a lot.

But lessons were learnt and, by the time I joined in late 2016, on-call was much more manageable. However, out-of-hours alerts still went off more often than we’d like; probably a couple of times a month.

Since then we have massively reduced the number of out-of-hours alerts that we have to respond to. How we did this would probably need a couple more blog posts, but in short we:

  • Stopped alerts that could wait until morning from escalating: that high CPU alert does not need to wake you up
  • Focussed on a few key metrics that indicate poor performance for players: the Request Success Service Level Objective (SLO) and sharp drops in Concurrently Connected Users (CCU)
  • Made a huge number of varied performance and stability improvements to both Client and Server code and to the infrastructure
  • Invested heavily in load testing prior to launch. We test that our servers can handle load beyond even our wildest dreams of success
  • Created a company-wide escalation procedure. Now anyone can point out a serious problem and escalate it to the right people, any time of the day, and get a suitable response
  • Gained a huge amount of experience in running games at scale, which helps us make good decisions in development
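As a rough illustration of the two player-facing signals mentioned in the list above, here is a minimal Python sketch. It is not our actual alerting code: the function names, the 99.9% SLO target and the 30% CCU drop threshold are all assumptions picked for the example.

```python
from typing import Sequence

# Assumed values for illustration only; the real targets are not stated in this post.
SLO_TARGET = 0.999        # hypothetical request success objective
MAX_CCU_DROP = 0.30       # hypothetical "sharp drop" threshold (30% within the window)


def request_success_rate(successes: int, total: int) -> float:
    """Fraction of requests that succeeded; no traffic counts as healthy."""
    return 1.0 if total == 0 else successes / total


def breaches_slo(successes: int, total: int, target: float = SLO_TARGET) -> bool:
    """True when the success rate over the window falls below the objective."""
    return request_success_rate(successes, total) < target


def sharp_ccu_drop(samples: Sequence[int], max_drop: float = MAX_CCU_DROP) -> bool:
    """True when CCU fell by more than max_drop between the first and last sample."""
    if len(samples) < 2 or samples[0] <= 0:
        return False
    return (samples[0] - samples[-1]) / samples[0] > max_drop


def should_wake_someone(successes: int, total: int, ccu_samples: Sequence[int]) -> bool:
    """Only the player-facing signals page out of hours; things like high CPU wait."""
    return breaches_slo(successes, total) or sharp_ccu_drop(ccu_samples)
```

The point of the design is that only symptoms players would notice escalate out of hours; everything else can wait until morning.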

All of this meant that the burden of being on-call 24 hours a day, 7 days a week was starting to outweigh the value to the business. We went months without an out-of-hours incident, but someone still needed to have their laptop with them.

Team members have to adapt their lifestyles when on-call: can I go out tonight? Can I go on that hike? Will I have signal, or even a place to work? Can I keep the laptop safe when out in public?

We also spent a significant amount of time chopping and changing the on-call rota to work around holidays and individual schedules.

So a change was needed.

The New System

Ok, so technically the whole team (currently 4 people) are now on-call, all the time… But let me explain.

Instead of a rota we now send all out-of-hours alerts to one member of the team. If they haven’t acknowledged the alert after 15 minutes, the next member of the team gets the alert, and so on until every member of the team has been alerted. Then it goes back to the beginning.

If you respond to an alert you get moved to the bottom of the list, so that in future you’re the last to be called.

At the time of writing I have been at the top of the list for at least 6 months.
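The ordering rules described above can be sketched in a few lines of Python. This is only a model of the behaviour, not how our paging tool is configured: the `ResponderList` class, its method names and the example team members are all made up for illustration. The only value taken from this post is the 15 minute escalation delay.

```python
from collections import deque
from typing import Iterable, Iterator, List, Tuple

ESCALATION_MINUTES = 15  # from the post: escalate after 15 minutes with no acknowledgement


class ResponderList:
    """Hypothetical model of the ordered list of responders."""

    def __init__(self, names: Iterable[str]) -> None:
        self._order = deque(names)

    def order(self) -> List[str]:
        """Current paging order, first to be called first."""
        return list(self._order)

    def page_schedule(self, max_pages: int) -> Iterator[Tuple[int, str]]:
        """Yield (minutes since alert, person paged), wrapping back to the start."""
        n = len(self._order)
        if n == 0:
            return
        for i in range(max_pages):
            yield i * ESCALATION_MINUTES, self._order[i % n]

    def acknowledge(self, name: str) -> None:
        """Whoever responds moves to the bottom, so they are called last next time."""
        self._order.remove(name)
        self._order.append(name)


team = ResponderList(["alice", "bob", "carol", "dan"])  # made-up names
schedule = list(team.page_schedule(5))
team.acknowledge("bob")  # bob picked up the alert on the second page
```

Here `schedule` pages alice at 0 minutes, bob at 15, carol at 30, dan at 45 and alice again at 60. After bob acknowledges, the order becomes alice, carol, dan, bob, which is also how someone who never ends up acknowledging an alert can stay at the top of the list for months.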

Before we implemented this we had a think about what could go wrong and how we could mitigate these issues.

Responding without your laptop

There’s now a higher chance than before that you might respond to an alert without access to your work laptop.

So we investigated how we could respond using just our phones.

  • We made sure that we could connect to the VPN to gain access to internal tooling (such as logging and deployment tools)
  • We updated internal web tools to ensure they are mobile friendly
  • We made sure that we had the apps we needed to access resources: can we get to passwords and MFA codes, Slack and GitHub?
  • We checked that we could use other web interfaces, like AWS or Kibana, through our phones.

What if No One is Available

With a small team it seemed possible that everyone could be busy and have gone out without their laptop.

Firstly, we checked that we could provide some support from our phones, as described above.

We also have a check-in each week, at the end of our team guild meeting (a chance to share with the rest of the team), to ensure that at least one person will have access to a laptop, or be within an hour of one, over the weekend.

Generally we’ve found that at least one person will be around, or that someone is doing something where carrying a laptop with them isn’t a burden.

Holidays

When you’re on holiday you remove yourself from the list of responders. Holidays are your time and you shouldn’t be responding to incidents, so go and enjoy yourself!

Exceptions

Sometimes we revert to our old system of rota-based on-call. This is generally around times of significant importance, such as the launch of a new game. As it happens we’re currently doing 24/7 on-call to support the release of Country Star.

This is generally for a period of 1–2 weeks post launch and we each get to volunteer for the days that suit us best.

Summary

It’s possible to ditch your on-call rota, provided that the volume of out-of-hours incidents is at a suitably low level and that your team agrees to it.

For us at Space Ape it has reduced the stress of maintaining an on-call schedule and re-arranging lifestyles to ensure we can have a laptop, internet access and a place to work 24/7.

Nathan Mclean is a DevOps Engineer for Space Ape Games, based in London.