Mistakes that I made on my first on-call & how I improved

Ujjwal Gupta
SquadStack Engineering
May 18, 2022

This article is based solely on my personal experience: what happened when I was first assigned on-call duty, and the mistakes I made in carrying out that responsibility.

About On-Call Responsibility

On-call is a crucial duty that keeps the system up and running, and it is a practice we follow at SquadStack. As part of it, we assign a dedicated on-call engineer on a rotational basis. The engineer’s key goal is to ensure uptime of critical systems at the desired quality levels. They take up the following responsibilities:

  1. Resolve end-user escalations & handle requests as they happen.
  2. Act on infra alerts & pages and make sure infra is up and running.
  3. Suggest & set up better alerting and monitoring (if time permits), thereby enabling future on-call engineers to take proactive rather than reactive measures.
  4. (If you’re lucky & time permits) Reduce technical debt.

Even more important than daily work is the improvement of daily work. — Mike Orzen

Mistakes that I made on my first on-call

1. Making Assumptions 🤔

It’s so easy to make an assumption. All you need is incomplete information coupled with urgency about a situation.
Our assumptions are not the truth; they are just a tool our minds use to reach a conclusion about a given scenario.

Why do we make assumptions while we handle on-call?

  1. We might not have the full context of the system.
  2. We want to fix things ASAP, which pushes us to make assumptions.

How can we overcome this?

  1. Avoid making assumptions: it might sound cliché, but it is what it is; the more assumptions you make, the easier it is to keep making them.
  2. Don’t draw conclusions until you know the full context. Try to figure out the “why” behind things & get to the root cause of the problem.
  3. Conclude something only when you have proof and are 100% sure.

(Image: Ignoring errors and assuming that the system will self-heal)

2. Let’s Hack-fix the system 👨‍🚒

I wouldn’t call a one-off hack-fix a mistake, but it becomes far worse than one if it turns into a habit. So, what prompts an on-call engineer to hack-fix a system 🤔?

  1. An ad hoc request with an urgent tag causes panic among engineers, so the first thought that springs to mind is to hack-fix the system. They forget that it’s only a temporary patch, not a permanent fix, and they end up repeating it again and again.
  2. Too many escalations cause frustration, which leads to the idea of just hack-fixing the system.

How to overcome this?

Panic is the root of all errors, so engineers must keep their cool and concentrate on the following steps.

  1. First, identify the problem and assign a priority based on whether it is urgent and, if so, how urgent.
  2. Acknowledge the issue to the business owners and give them a rough time frame.
  3. Based on the severity of the issue, decide whether to go with a temporary or permanent solution.
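
The three steps above can be sketched as a small triage helper. This is only an illustration under assumptions: the `Escalation` fields, the severity levels, the time-frame values, and the rule that only the highest severity gets a temporary patch first are all hypothetical, not a description of any real process.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical severity levels; adapt to your team's definitions.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Escalation:
    title: str
    is_urgent: bool      # step 1: is it urgent at all?
    severity: Severity   # step 1: and if so, how urgent?


# Assumed rough time frames (in hours) to share with business owners.
ROUGH_ETA_HOURS = {Severity.LOW: 48, Severity.MEDIUM: 8, Severity.HIGH: 1}


def triage(escalation: Escalation) -> dict:
    """Return a response plan for one escalation."""
    # Step 1: non-urgent issues are treated as low priority.
    priority = escalation.severity if escalation.is_urgent else Severity.LOW

    # Step 2: acknowledge the business owners with a rough time frame.
    eta = ROUGH_ETA_HOURS[priority]
    acknowledgement = f"Acknowledged '{escalation.title}'. Rough time frame: {eta}h."

    # Step 3: only the most severe issues get a temporary patch first,
    # and even then a permanent fix is planned as a follow-up.
    if priority == Severity.HIGH:
        fix = "temporary patch now, permanent fix as follow-up"
    else:
        fix = "permanent fix"

    return {"priority": priority, "acknowledgement": acknowledgement, "fix": fix}
```

Writing the decision down like this, even informally, makes it harder to skip straight to a hack-fix under pressure.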

3. Too Dependent on Your Peers 🎭

When we begin our first on-call duty, we may not fully understand what we must handle, or what our responsibilities and expectations are. So it’s fine to enlist the help of peers or seniors in these situations. But we often overlook the fact that we, not our peers, are the primary owners of on-call, and we keep pinging them for assistance.

Why does this happen?

  1. As developers, we are overly cautious whenever we take on a new responsibility. We even worry that our actions might break the whole system.
  2. Laziness is also a crucial factor here, as we don’t want to spend time figuring out the root cause of the problem ourselves.

How can we overcome this?

  1. Keep in mind that your peers are there to assist you, not to be your playbook.
  2. Ask for assistance when the issue is genuinely critical or breaking, or when you are facing a pile-up of escalations.

4. Improper Prioritization of Tasks 😕

When it comes to addressing on-call responsibilities, task prioritization is critical, yet we often overlook it.
This occurs because we believe that every problem must be solved as soon as possible, or else there will be consequences. But, as we discussed above, we must first examine the problem and its implications before prioritizing it.

What methods can we use to examine and prioritize problems?

  1. Prioritization is a difficult process, and it demands experience; thus, we should seek help from peers in assessing the problem. But, before seeking help, we must first assess the problem on our own.
  2. Prioritization is entirely determined by the affected audience, i.e., the users who are affected by the problem.
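
As a rough illustration of the second point, open issues can be ordered by how many users they affect. The `Issue` type, its fields, and the sample numbers below are made up for the example; a real setup would likely weigh other factors too.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    name: str            # hypothetical label for the problem
    affected_users: int  # assumed estimate of users hit by the problem


def prioritize(issues: list[Issue]) -> list[Issue]:
    """Order issues so the one affecting the most users comes first."""
    return sorted(issues, key=lambda issue: issue.affected_users, reverse=True)


# Made-up sample data, for illustration only.
queue = [
    Issue("report export is slow", affected_users=3),
    Issue("login fails", affected_users=1200),
    Issue("dashboard widget misaligned", affected_users=40),
]

for issue in prioritize(queue):
    print(issue.affected_users, issue.name)
```

Even a crude estimate of the affected audience is usually enough to decide what to pick up first.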

5. The System will Self-heal 👽

This is a myth that most developers believe in their day-to-day work. When they see one green light in an ocean of red, they feel: “Wooh… everything is going well, we don’t need to worry 😎”

How to overcome this?

Just recall my very first point — never make assumptions.

Make sure to check the reason behind the failure and whether it will self-heal or not.

6. Ignoring Errors

You’ll encounter a slew of errors while managing on-call responsibilities, some of which should be addressed and others that can be ignored. Differentiating between the two is extremely difficult for a new on-call engineer. So we either ignore them all or go straight to our peers for help without first analyzing the problem: “This looks like something unknown; let’s leave it, or let’s ping peer developers, they might have more context on that.”

How to overcome this?

This will pass with time; the more you understand the system, the more comfortable you will be with it.

Other meta points I realized while handling On-call Responsibilities

  1. Things might be hard for you in the beginning, but you should stay calm.
  2. Your peers are always there to help if you need it.
  3. Don’t exhaust yourself. Take breaks whenever you get time.

Thank you for reading this article. If you enjoyed it, please give it a few claps so that it reaches more people who could benefit from it!

PS: We’re hiring across tech & non-tech roles. Check out https://www.squadstack.com/careers/ for all open positions & a sneak peek into our culture.
