Operations in Teams — Part 2

Published in acast-tech, Nov 24, 2021

By: Meidi Tõnisson-Bystam and Miles Tuffs

The first part of this blog series described our journey and transformation as we introduced operational responsibility as a part of every engineer’s job at Acast.

The words “on call”, and an expectation that you’re available 24/7, can be daunting for those who haven’t done it before — and it can also look very different depending on how it’s implemented in different companies.

To help explain what this looks like at Acast, we hear from two of our engineers about their experiences being on call.

Meidi Tõnisson-Bystam, Engineering Manager

When I joined Acast in early 2020, I was asked by my manager if I’d like to eventually join the on-call rotation for the team. As an app developer there was no contractual obligation for me to join — Android development and AWS infrastructure don’t typically have much overlap, after all — but I was immediately curious.

There was another app developer on the team who had joined the rotation, so there was some precedent, and I’ve always been interested in zooming out from my daily development tasks to see how all the different pieces fit together.

We started out gradually, having me ‘shadow’ a more experienced on-call person. All this meant in reality was that I would get the same text messages and alerts they would get, but I wasn’t expected to have my phone turned on at the weekend. I eagerly awaited each alarm and tagged along during troubleshooting sessions, asking lots of questions and taking notes on how to triage and resolve each issue.

I quickly realized that being on call was less about waking up at 3am to duct tape together a failing system than I had previously feared. It was obvious that a lot of time had been spent on fine-tuning alarms and making adjustments to our systems, which meant many of the scarier issues were taken care of and wouldn’t reoccur.

If something did fail during off-hours, the expectation wasn’t for me to sit alone in my bedroom trying to desperately deploy new code in an attempt to get things working again, but rather to do some basic troubleshooting à la “turning it off and on again”.

I learned how to scale up services, re-run batch jobs, and re-post failed messages to a queue after restarting a failing service. I also learned how to triage and make sure an alarm was indicating a real issue and not something that would self-resolve in five minutes.
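As a minimal sketch of that triage step, assuming a five-minute grace period like the one mentioned above: the `Alarm` class, the `needs_attention` helper and the alarm name are hypothetical illustrations, not Acast's actual tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumption: alarms that clear within five minutes count as self-resolving.
SELF_RESOLVE_WINDOW = timedelta(minutes=5)


@dataclass
class Alarm:
    """Hypothetical alarm record, for illustration only."""
    name: str
    fired_at: datetime
    resolved_at: Optional[datetime] = None  # None means it is still firing


def needs_attention(alarm: Alarm, now: datetime) -> bool:
    """Return True if the alarm is still firing after the grace window."""
    if alarm.resolved_at is not None:
        return False  # it already cleared on its own
    return now - alarm.fired_at >= SELF_RESOLVE_WINDOW


# Usage: an alarm that fired at 03:00 and is still firing at 03:06.
fired = datetime(2021, 11, 24, 3, 0)
still_firing = Alarm("queue-depth", fired)
print(needs_attention(still_firing, fired + timedelta(minutes=6)))
```

In practice the person on call makes this judgement by looking at the alarm and the system, but the idea is the same: wait out the short window before treating an alert as a real issue.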

Most of all, I learned the greater scope of how the app I was building fit into the infrastructure of the services and components our team was responsible for. I felt myself growing from ‘Android developer’ to ‘Software engineer’ in a very real sense, which was very empowering.

When I finally did go on call on my own for the first time, after having worked on the team for a few months, I can’t say I wasn’t nervous. I slept the sleep of a new parent — always ready to wake up at a moment’s notice to jump on a potential issue.

Funnily enough, the universe spared me completely that first time around, and not a single alarm popped up on my phone for that entire week.

Miles Tuffs, Software Engineer

When I joined Acast, and specifically within the team I joined, there were some obvious problems that people were aware of — and others that maybe they weren’t.

As far as I saw it, these issues included:

Having had a few people build a lot of necessary (but not necessarily well-architected or documented) products as the company grew from nothing, which led to all the knowledge about the implementation, maintenance, deployment, specific business cases and so on resting in the heads of one or two people;
Having not very much documentation at all;
Being stretched thin on work, with all team members working on different projects, which meant it was very difficult to be familiar with the whole code base we were responsible for, since we were switching contexts so often;
Having too many support issues internally, resulting in (some weeks) one developer’s whole work week being spent fixing support issues.

Taking full days here and there to focus on documentation and knowledge sharing was a temporary solution, but we found it hard to fit those days in and to justify the cost.

In the early months of 2020, Acast implemented an on-call schedule and we began to take turns each week going on call.

Not being a team that takes care of too many business-critical or high-traffic uptime-critical products, we went into this thinking we would need to fix problems here and there, but that on-call would become ‘free money’.

But it’s not about on-call being a lucrative financial scheme. It’s more a mantra we have about removing on-call blockers, coding and reviewing in ways that don’t create risks for whoever is on call (no Friday deploys, for instance), and prioritising knowledge sharing so that, ideally, everyone can at least understand what’s wrong if a system they haven’t touched yet has an issue, and can solve it or, at the very least, roll back.
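A rule like “no Friday deploys” is simple enough to encode as a check. The following is only a sketch under that single assumption; the `deploy_allowed` function is hypothetical and not part of any real deployment pipeline described here.

```python
from datetime import date

FRIDAY = 4  # date.weekday() counts from Monday = 0, so Friday is 4


def deploy_allowed(day: date) -> bool:
    """Hypothetical guard: refuse deploys on Fridays to protect on-call."""
    return day.weekday() != FRIDAY


# Usage: Nov 26, 2021 was a Friday; Nov 24, 2021 was a Wednesday.
print(deploy_allowed(date(2021, 11, 26)))
print(deploy_allowed(date(2021, 11, 24)))
```

A team could just as easily keep this as a convention rather than an automated gate; the point is that the rule exists to reduce risk for the person carrying the pager over the weekend.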

Of the issues I mentioned above, I think we saw major improvements in just about all of them. While I don’t attribute it all to on-call being added to our process, I think it was a crucial part of sharing the ownership of the problems we faced as a team, using this shared burden and the empathy for other team members to prioritise issues surrounding maintenance, documentation, knowledge sharing and reliability.

As for what it’s like to be the person on call, in the beginning there were definitely issues here and there (and many more unknowns that could occur), but it felt like, even within six months, we’d prioritised and sorted a lot of our issues.

I can only speak from personal experience here, but I’ve been motivated to gain a much better understanding of our systems and how they connect together. Fortunately, I find myself on a team that doesn’t have many critical issues waking me up in the middle of the night, and if we did, they could be fixed with a rollback.

And, while Acast is a 24-hour company with employees and customers working and listening around the globe, I find the guidelines we have for on-call — and the control each team has of their own process and what constitutes a “wake me up at 2am” request — make it a relatively stress-free process.

Getting a three-day weekend every now and then is also pretty nice.

Written by Acast Tech Blog