How to survive on-call in 4 steps

ema
The Glovo Tech Blog
6 min read · Oct 30, 2023

Intro

Many tech companies have adopted the concept of “on-call”: being ready to provide support whenever their service is not working as expected.

But while there are well-established best practices for monitoring/observability and incident-response platforms, I think we need to focus more on the preparation and the practice of Incident Response.

So in this article you can find 4 very personal tips, derived from my on-call experience, that I hope you’ll find useful!

Breathe

First things first: Put things in perspective

This depends on the domain, but in the majority of tech companies the incident that popped up on your phone and is making your heart race is not going to put anyone’s life at risk. So the first thing to do is to remind yourself that you’re not a surgeon: nobody is going to die, it’s “just” about money.

I found that thinking about this and taking a couple of deep breaths before starting to look into what happened helps reduce anxiety and makes you more focused, rational and effective in handling the problem.

Finally, remember that your goal now is to mitigate, not to investigate and fix things for good, so:

  • Rolling back is better than pushing a hotfix
  • If you suspect a specific feature could be the cause, don’t hesitate to disable it to check (see the sketch after this list)!
  • The next working day is the right time to investigate deeper
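To make the toggle option concrete, here is a minimal kill-switch sketch in Python. Every name in it (the FLAG_ environment variable, the checkout_v2 path) is made up for illustration; in a real system the flag would live in your feature-flag service:

```python
import os

def is_enabled(flag: str, default: bool = True) -> bool:
    # Hypothetical flag store: a real system would query a feature-flag
    # service; an environment variable keeps the sketch self-contained.
    value = os.environ.get(f"FLAG_{flag.upper()}")
    return default if value is None else value == "on"

def checkout_v1(order: dict) -> str:
    return f"processed {order['id']} with v1"  # old, battle-tested path

def checkout_v2(order: dict) -> str:
    return f"processed {order['id']} with v2"  # new, suspect path

def checkout(order: dict) -> str:
    # The suspect code path can be turned off with a config change,
    # no deploy or rollback needed.
    if is_enabled("checkout_v2"):
        return checkout_v2(order)
    return checkout_v1(order)

print(checkout({"id": 42}))
```

During an incident, flipping the hypothetical FLAG_CHECKOUT_V2 to “off” and watching whether the error rate drops is often the fastest experiment you can run.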

If you forget this and just take your time to dive deep during an incident, it’s:

  • Bad for you ➡️ personal time lost
  • Bad for the other on-call engineers who joined to help you ➡️ their time lost
  • Bad for the company ➡️ it pays extra for overtime

Be prepared

The biggest part of incident-resolution work can be done BEFORE the incident. Here are 4 things to include in your team practices:

Build (and maintain) an on-call handbook

Have a shared team handbook for on-call:

  • It needs to be the index, the source of truth of incident management
  • It needs to be useful ➡️ links to dashboards, logs, toggles, …
  • It needs to be SHORT and FAST ➡️ you won’t have time to read
  • It needs to be shared and maintained by the whole team

Write SOPs

Every time you solve a problem that:

  • Requires some work, like coding a script or crafting an API request
  • Is not extremely specific, so it could be useful again

It’s time to create a Standard Operating Procedure!

This basically means that the next time a situation comes up during an incident, before jumping into crafting a solution you take a look, find an SOP, and just follow the steps.

Example: An event consumer stopped working due to a bug and some critical events got lost. To solve the issue, the on-call engineers will probably do something like:

  • Move the offset of the consumer group back, if the messages are still in the broker (sketched in code below)
  • Force the sender to resend the messages
  • Manually sync data reading from the sender DB
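For illustration, here is roughly what the first step could look like as a small Python script, assuming Kafka with the confluent-kafka client. The broker address, topic, group id and timestamp are placeholders that a real SOP would pin down, and it must run only while the group’s consumers are stopped:

```python
from confluent_kafka import Consumer, TopicPartition

BOOTSTRAP = "broker:9092"         # placeholder
GROUP_ID = "orders-consumer"      # placeholder
TOPIC = "order-events"            # placeholder
REWIND_TO_MS = 1_698_624_000_000  # epoch millis just before the bug hit

consumer = Consumer({"bootstrap.servers": BOOTSTRAP, "group.id": GROUP_ID})

# Find every partition of the topic, then ask the broker which offset
# corresponds to the chosen timestamp in each of them.
partitions = consumer.list_topics(TOPIC).topics[TOPIC].partitions
requested = [TopicPartition(TOPIC, p, REWIND_TO_MS) for p in partitions]
offsets = consumer.offsets_for_times(requested, timeout=10)

# Sanity-check before touching anything: a negative offset means the
# broker found no message at or after that timestamp in the partition.
valid = [tp for tp in offsets if tp.offset >= 0]

# Commit the rewound offsets for the group, so that when the consumers
# restart they re-consume everything from that point on.
consumer.commit(offsets=valid, asynchronous=False)
consumer.close()
```

Note that replaying events like this only helps if the consumer is idempotent; otherwise the SOP should also spell out how to handle the duplicates.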

Even though the root cause was different, this service had already gone out-of-sync another time, so you find an SOP and just follow the steps: after a few minutes, the problem is fixed.

These tasks are not straightforward and could even make the situation worse, and it is much harder to think straight with anxiety, maybe after being woken up in the middle of the night. Imagine how much easier it would be to find a step-by-step guide at that moment.

To recap: common problems have common solutions.

  • It’s hard to think straight during an incident
  • SOPs can immensely reduce mitigation time
  • Do your homework (the day after the incident, not during it)

Roleplaying

Take time to train (especially new joiners) on realistic scenarios!

An experienced engineer who has already managed many incidents can raise a fake incident, then pretend to be an RTO agent who can provide details on the problem. The trainees then need to go through logs and metrics and apply resolutions, while the expert engineer “shadows” them.

If you’ve ever played D&D, this is exactly what I’m talking about. I found this teaching technique to be far more effective than any other.

Prioritize OPS tickets

This point is mainly for engineering managers or lead engineers: you need to make sure that critical ops tickets get prioritized or they will never be done.

By this, I don’t mean forcefully pushing them into the sprint, but having a conversation with your PM/business team and kindly explaining why this is important and what the business implications of NOT doing it are.

If you explain it well, I am sure they will be the first ones willing to push for them.

Team Power

This point seems very straightforward, and yet I’ve seen people skip it so many times!

Call for help: the on-call engineers are a team. Don’t try to solve the problem on your own if another service is involved. For example, if there’s an infrastructure problem and you’re not an SRE: don’t wait, call one!

Mitigation time can be hugely reduced, and they are on-call: being on-call means expecting to be called not only for a problem on your specific service, but for anything in the company you could help solve.

Of course, the same applies to you, so help others if you want to be helped: join other teams’ on-calls and actively help until the problem is solved.

About guilt

It’s very important to set a blame-free culture in your company, not only because it’s the right thing to do but also because it’s more effective: people will have less anxiety and will focus more on the learnings from an error than on the blame.

Also, even if it was you who wrote the bug that is destroying prod, it should not be that easy to compromise a company’s service: there should be processes put in place by engineering leaders, like code reviews, automatic rollbacks, automatic migration checks, … everything that is commonly defined as “guardrails” (see The Staff Engineer’s Path by Tanya Reilly).

If everything else fails

A few suggestions to put in your handbook, for when you encounter a difficult problem that is not easy to debug:

  • Check recent deployments (not only of your service)
  • Check recent feature toggle changes (not only of your service)
  • Focus on a specific case: even if millions of orders/users have errors, pick one and deeply debug it
  • Call for help: call other teams, call all your team, call your manager
  • Talk more with RTO agents:
    - Ask for more cases or if they can spot some pattern in the problem
    - Consider a manual solution, sometimes it’s faster than coding a script
    - Ask them about possible mitigations, like closing the service or sending a message to the customers; you’re less prone to customer churn if you’re honest, declare that you have a problem and say you’re working on it

Recap

DON’T

  • Try to fix the problem for good.
  • Hesitate to call other teams.
  • Blame or fear being blamed: every incident is everyone’s fault.
  • Forget that, even if you wrote the bug, it should not be that easy to compromise a company’s service.

DO

  • Do your homework: Handbook, SOPs, Prioritize Ops tickets, Training.
  • Go for the fastest way to mitigate.
  • Call for help, even if the failing service is your ownership.
  • Talk and coordinate with the RTO.
