The internet never sleeps, and even with the best design for resilience, one day, your system will go down.
At Teads, we deliver outstream video advertising for the biggest content publishers in the world. Any downtime has significant repercussions on our revenue, and on our publishers’ revenue as well.
In a few years we grew from a start-up to a scale-up. Although we operate globally, our tech team is mostly based in France. For this reason, we decided to think carefully about how to scale our on-duty team in order to minimize downtime when a system fails.
That story is below.
On-duty in a fast-growing company
In a few years, we’ve scaled from a growing startup operating with a few pizza teams into a company where more than 100 developers in 3 different locations deliver new features on a daily basis.
We’ve been able to do so by implementing our own version of the “Spotify model” and it has given us the ability to stay agile while growing the tech team. Applying the same recipe to the on-duty team was a challenge, to say the least.
Initially, the on-duty team was composed of a few developers who had been with Teads since the beginning and who were very knowledgeable about every part of the platform.
We relied on their knowledge, availability and on the fact that they helped build most of the system. As we grew, the system became larger and more complex. The handful of developers keeping the revenue safe overnight were now unable to keep up with the amount of knowledge needed to solve a problem.
First step: Growing the on-duty team
We started looking for people to add to the on-duty team and ideally have someone from each of our feature teams be part of the rotation.
This was our way of implementing “you build it, you run it”.
It meant growing that team to 12 people and that’s when we hit the first wall.
We tried growing the team while having a few visible production incidents (S3 Service Disruption in us-east-1, anyone?) and, of course, no one was volunteering to be on duty.
Lost battle: trying to be ready
One of the main reasons nobody was up to the task was the complexity of our system. It can be frightening to have to react to an incident without a complete understanding of the system.
We tried to tackle this problem, and for a few months we set up meetings, put knowledgeable people in a room and asked them to kindly document the steps to take when incidents happen.
This was too large a mission, even for a highly motivated team. Soon, meetings were skipped, and the documentation was not improving.
At this point, we started thinking about the problem in a different way.
Enter on-duty pairing
The first decision we made was to have two people on duty at the same time for a week-long shift. We tried to choose pairs wisely, with complementary skill sets and experience. For example, we would pair a back-end developer with a data-oriented developer. This allowed us to cover most systems on the critical chain.
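To make the pairing idea concrete, here is a minimal Python sketch of one way to build such pairs and rotate them weekly. It is only an illustration under assumptions: the developer names, the skill labels, the greedy pairing strategy and the helper names `build_pairs` and `weekly_schedule` are all hypothetical and do not describe our actual tooling.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class Developer:
    name: str   # hypothetical names, for illustration only
    skill: str  # assumed skill label, e.g. "back-end" or "data"


def build_pairs(developers):
    """Greedily pair each developer with the next unpaired one whose skill differs."""
    remaining = list(developers)
    pairs = []
    while len(remaining) >= 2:
        first = remaining.pop(0)
        # Look for the first remaining developer with a different skill area.
        partner = next((d for d in remaining if d.skill != first.skill), None)
        if partner is None:
            # No complementary partner left: fall back to whoever is next.
            partner = remaining[0]
        remaining.remove(partner)
        pairs.append((first, partner))
    return pairs


def weekly_schedule(pairs, weeks):
    """Assign one pair per week-long shift, cycling through the pairs in order."""
    rotation = cycle(pairs)
    return [(week, next(rotation)) for week in range(1, weeks + 1)]


team = [
    Developer("alice", "back-end"),
    Developer("bob", "back-end"),
    Developer("carol", "data"),
    Developer("dave", "data"),
]
pairs = build_pairs(team)
schedule = weekly_schedule(pairs, weeks=3)
for week, (a, b) in schedule:
    print(f"week {week}: {a.name} ({a.skill}) + {b.name} ({b.skill})")
```

A greedy pass like this is enough for a small team. With more skill areas or availability constraints, the pairs would more likely be chosen by hand, as we did.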
The benefits that we see with the on-duty pairing are:
- It’s much easier to bounce ideas off someone when a problem is impacting production and you (or your pair) do not know how to fix it.
- Sometimes while on-duty, the incident runs so deep that a critical business decision must be made. It is much easier to share the responsibility of such a decision in the middle of the night. We accept that this may slow down the decision process as there will be back-and-forth between the pair.
- In the rare event of someone not waking up to the PagerDuty calls, there is a backup. Interestingly enough, we had never experienced someone not waking up until we started pairing. This raised the question of whether pairing lowers each individual’s alertness because there is a backup, but in the end we feel it has more benefits than downsides.
We implemented this change in a few weeks and so far we are quite happy with it. The team has scaled to 12 developers, coming from all feature teams, and the rotation goes smoothly.
The traditional way of dealing with increasing complexity is to have an escalation policy. We chose not to implement this and have PagerDuty automatically wake up both pairing developers at the same time.
This automates the decision of waking up another human being and makes PagerDuty responsible for it. We don’t want to be responsible for this hard decision, so we let the robot do it.
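The difference between the two approaches can be modeled in a few lines of Python. This is a toy model, not PagerDuty’s API or configuration format: the function names, the responder names and the 15-minute timeout are assumptions chosen for illustration.

```python
def escalation_notifications(responders, timeout_minutes):
    """Classic escalation: each responder is paged only after the previous one timed out.

    Returns a list of (minute, responder) tuples.
    """
    return [(i * timeout_minutes, name) for i, name in enumerate(responders)]


def simultaneous_notifications(pair):
    """Pairing: both on-duty developers are paged at minute 0."""
    return [(0, name) for name in pair]


on_duty = ["alice", "carol"]  # hypothetical pair
escalated = escalation_notifications(on_duty, timeout_minutes=15)  # assumed timeout
paired = simultaneous_notifications(on_duty)
print("escalation:", escalated)
print("pairing:   ", paired)
```

In the escalation model someone, or some timeout, has to decide when the second person gets woken up. In the pairing model that decision is removed: both are paged as soon as the alert fires.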
Escalation usually also solves the “I need an expert on [insert any well known distributed system here] and I need her right now” problem.
Putting experts on escalation policies is great if you have a big enough pool of experts on each of the systems that you use. For us, this meant that a few people would be on call every other week. We thought this was not acceptable and decided that we could solve this by:
- Telling the on-duty team members we know they will do their best to recover from the issue
- Giving them the confidence that, as engineers, they will find a solution
- Automating routine maintenance operations as much as we can (taking a bad Cassandra node out of the ring, decommissioning and replacing a Kafka broker…)
Post-incident and Playbook
Soon after the incident, we gather everyone from the on-duty team in a room for a blameless, fact-oriented post-mortem. We aim to leave the room after one hour with our very simple post-incident template filled in:
- Summary of the issue
- How to respond to such an issue (should it arise again)
- Action plan
This process allows us to document our interventions and ensure, should the same incident happen again, we have a solution to mitigate its effect in a timely manner.
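As an illustration, the three-part template could be captured in a small structure like the Python sketch below, which also checks that nothing was left empty before the meeting ends. The class name, field names and sample content are hypothetical. In practice the template can just as well live in a wiki page.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PostIncidentReport:
    summary: str                                           # summary of the issue
    how_to_respond: List[str] = field(default_factory=list)  # steps if it happens again
    action_plan: List[str] = field(default_factory=list)     # follow-up actions

    def is_complete(self) -> bool:
        """A report is complete when all three sections have content."""
        return bool(self.summary.strip()) and bool(self.how_to_respond) and bool(self.action_plan)

    def render(self) -> str:
        """Render the report as plain text, one section after another."""
        lines = ["Summary of the issue", self.summary, "", "How to respond"]
        lines += [f"- {step}" for step in self.how_to_respond]
        lines += ["", "Action plan"]
        lines += [f"- {item}" for item in self.action_plan]
        return "\n".join(lines)


# Made-up example content, for illustration only.
report = PostIncidentReport(
    summary="Example: a storage node stopped responding during the night.",
    how_to_respond=["Take the node out of rotation", "Check the replacement is healthy"],
    action_plan=["Automate the node replacement"],
)
print(report.render())
```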
After a few months, we are quite happy with this new on-duty rotation. It has proven useful many times and we now have more documentation than ever on how to react to our alerts.
The post-incident ritual also acts as a team bonding meeting and we are thinking of creating more rituals specifically for the on-duty team (on top of each individual’s feature team rituals).
The biggest difficulty we have encountered since launching was organizing the Christmas rotation period with pairs. It’s always a challenge to find one person available during those holidays, so trying to find two is double the fun.
This article was initially published on SysAdvent 2017, “the annual advent calendar for SysAdmins, Ops, DevOps, and all the other folks that are excited about systems.”