On-call. Don’t be scared.


As an engineer for a software development team who participates in an on-call rotation, I am constantly reminded of the following section from “A Tale of Two Cities.”

“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us…”

I have had the honor of working with many fine engineers and system administrators over the years, and one universal truth stands out: being on-call can be a real downer. I am really sugarcoating that last statement because I have seen many people lose their cool over the interruptions that come with being on-call. This isn’t just the opinion of one man either; take a look at this report from VictorOps about the state of on-call.

If you have never had to participate in an on-call rotation, here are a few examples of what you will experience.

  • Oh, you’re working on a new JavaScript framework that will improve team productivity? Well, put that on pause because there’s a Production alarm indicating that customers are experiencing slow login times. Get on it.
  • Planning on going to bed early after a long day of work? False. Looks like the database is having some replication issues and we need you to address this large collection of alerts. Thanks.
  • It doesn’t look like you’ll be able to meet your friend at lunch for some delicious Thai food. We’re seeing an increase in Django errors in Production that you need to investigate immediately.
  • Sorry to interrupt you while you are at your daughter’s soccer game but we are having an issue with server XYZ in Production and it is unclear how to proceed because the support documentation doesn’t cover this scenario.

I could go on, but the point of these scenarios is that your life will be interrupted. How you deal with these interruptions is up to you.

Reality

  • Being on-call is not a punishment. No one will argue that being on-call is one of the crowning achievements of man but it is definitely not the worst thing you could be doing.
  • If the application/environment you are supporting requires an on-call rotation, an organization has deemed its operational readiness to be paramount. Customer confidence, community standing and financial outcomes depend on your ability to handle the situation. People are depending on you.
  • There will be good rotations and bad rotations. More often than not, you will be on-call and life will go on as it always does. Systems will function as designed and everyone will marvel at your technological feats of magic. However, there will be times when the sky grows dark and everything that can go wrong, will go wrong. When this happens, just know that even though the problems are not your fault, it is your responsibility to take ownership of the resolution.
  • Your personal life will be interrupted. This is not the ideal situation but it will happen. You have been warned.
  • Most of your on-call wizardry will be done without applause. You saved the day; now let’s get back to work and make some changes to prevent these issues in the future.

Expectations

  • Stay calm. Breathe. Getting yourself all worked up isn’t going to help you fix the problem at hand any faster.
  • Being on-call means you are the point of contact; you should be available. No one likes going to bed with their phone next to their head or having to check for alerts while watching their favorite television show, but this is part of the job. Are you a deep sleeper? Me too! Doesn’t matter. When we’re on-call, it’s our job to answer the phone.
  • Effective communication. If there’s a problem and customers are impacted, articulating how many are being affected, steps for mitigating the issue or ETA for resolution is a highly valuable skill. In addition, if your organization has an Incident Response team that coordinates communication to customers and other internal teams, keeping them informed is just as important as resolving the problem.
  • “It’s dangerous to go alone!” Troubleshooting problems in the middle of the night can be scary. You’ve just been woken from a deep sleep and are expected to identify what’s breaking your environment, fix it and communicate effectively. All of this needs to be done posthaste whilst sitting on your couch, in the dark, in your pajamas. Take sixty seconds and call for backup. A proper on-call rotation should have a secondary member. Two heads are better than one.
  • Know your environment. Tomcat? Apache? Nginx? Cassandra? MongoDB? Do you have any of these services in your environment? How are they configured? What ports do they use? Where can you review their logs? How does your application react to the failure of one or more of these services? How will this impact customers? Regardless of the technology, you need to have more than a basic understanding of the functionality and integration points of the services for your application.
  • Talk about off-hours incidents. Oh, you were called last night for a failure? Bring it up. Tell your teammates what happened and how it was resolved. Have they run into this situation before? Perhaps this problem is completely new to them and they want to know more. Bringing up issues increases the likelihood of eliminating the problem in the future and makes on-call a better experience for everyone.
  • Sharing is caring. One of the best ways to improve team knowledge about your environment is to share and teach. Do you have a network diagram? Where can we get a list of servers for each environment? How do servers X and Y communicate with each other? Conducting informal teaching sessions for teammates is an easy way to elevate and reinforce a team’s understanding of how your application works in reality.
  • You won’t be able to resolve every problem. Third-party service down? Failed provider line between data centers? There’s not much you can do but communicate, mitigate as best you can and monitor the situation.
  • Escalate. Is this an outage that your leadership team should know about? If the impact is widespread or the issue is severe, don’t hesitate to escalate. A good leadership team will be grateful for the heads up. No one wants to be surprised about a critical issue that took out Production for two hours when they start their day.
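
For the “What ports do they use?” question above, here is a minimal sketch of a reachability check you could keep handy. The service names, hosts and ports in SERVICES are assumptions for illustration (they are the common default ports, not necessarily yours), so swap in your own inventory. An open port only tells you something is listening, not that the service is healthy.

```python
import socket

# Hypothetical inventory: service name -> (host, port).
# The ports are common defaults and are an assumption; use your real ones.
SERVICES = {
    "nginx": ("127.0.0.1", 80),
    "tomcat": ("127.0.0.1", 8080),
    "cassandra": ("127.0.0.1", 9042),
    "mongodb": ("127.0.0.1", 27017),
}


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_services(services: dict[str, tuple[str, int]]) -> dict[str, bool]:
    """Check every service in the inventory and report which ports answer."""
    return {name: port_is_open(host, port) for name, (host, port) in services.items()}


if __name__ == "__main__":
    for name, ok in check_services(SERVICES).items():
        print(f"{name:<10} {'up' if ok else 'DOWN'}")
```

Running something like this at 3 a.m. is no substitute for real monitoring, but it is a quick first answer to “which of my dependencies is even reachable?”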

Make it better

  • Poor monitoring and low visibility into application/system performance will lead to frustrating on-call rotations. Do you participate in ChatOps? Do you use Graphite? Logstash? What kind of monitoring solutions do you use? Icinga? DataDog? If you can’t accurately identify failing services or what errors your customers are experiencing, your on-call rotation is going to be filled with estimations, speculation and pain. Work to eliminate those uncertainties with monitoring dashboards and application performance telemetry. These aren’t fly-by-night solutions and will require dedication and maintenance, but their payoff will be worth it.
  • Create runbooks. This is pretty straightforward, yet I’ve seen teams do a really poor job of documenting how to resolve failure scenarios. Eliminate the guesswork and link your runbook documentation right in your alarm notifications. We actively do this for our applications and our first level support teams really like this approach.
  • Practice problem resolution through game days. Recently I worked with a development team and put them on the hot seat for resolving environment failures. What happened to this Apache service? What’s wrong with this load balancer? Experiencing and resolving alerts is an acquired skill. It takes practice and time for people to get comfortable solving problems under pressure. Plus, you’ll get better by answering questions, inducing failures and sharing your knowledge.
  • After the dust has settled, perform a root cause analysis (RCA) to document lessons learned and how to prevent future issues. This should really be non-negotiable. Blame-free discussions about how to make your systems and applications more resilient to failure, tighten documentation and close knowledge gaps are key to making future on-call rotations better for everyone and, most importantly, improving the customer experience.
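
To make the “link your runbook in the alarm notification” idea concrete, here is a rough sketch. Everything in it is hypothetical: the alert names, the example.com wiki URLs and the Alert fields are placeholders, and your alerting tool will have its own way of templating messages. The point is only that the link travels with the alert, so whoever gets paged doesn’t have to go hunting for it.

```python
from dataclasses import dataclass

# Hypothetical mapping from alert name to runbook URL.
# example.com is a placeholder; point these at your own documentation.
RUNBOOKS = {
    "db_replication_lag": "https://wiki.example.com/runbooks/db-replication-lag",
    "slow_login": "https://wiki.example.com/runbooks/slow-login",
}

# Fallback page for alerts that don't have a runbook yet (also a placeholder).
DEFAULT_RUNBOOK = "https://wiki.example.com/runbooks/unknown-alert"


@dataclass
class Alert:
    name: str     # machine-readable alert identifier
    host: str     # where the alert fired
    summary: str  # one-line human description


def format_notification(alert: Alert) -> str:
    """Build the notification text, always ending with a runbook link."""
    runbook = RUNBOOKS.get(alert.name, DEFAULT_RUNBOOK)
    return (
        f"[ALERT] {alert.name} on {alert.host}\n"
        f"{alert.summary}\n"
        f"Runbook: {runbook}"
    )


if __name__ == "__main__":
    alert = Alert("db_replication_lag", "db-02", "Replica is falling behind the primary")
    print(format_notification(alert))
```

The fallback link is a deliberate choice here: an alert that lands on the “unknown alert” page is a visible reminder that a runbook still needs to be written.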

Are you scared?

Don’t be. While being on-call can be challenging, it is also very rewarding. After a while you will have a deep understanding of your systems, environment and external dependencies. That knowledge and experience will make you a valued team member with a hard-won set of skills and the ability to work under pressure.

Enjoy!

@thematthewgreen