The road to zero downtime is a bumpy one…

…But it doesn’t have to be!

Published in Zenduty · 3 min read · Dec 19, 2018

Talk to most on-call engineers, and they have the same story to tell: their job is tough. The anticipation of a phone call keeps them up at night, and once the phone actually does ring, it’s a scramble to get the service up and running, often single-handedly. Despite the adoption of some of the best tools available today, ITSM (IT Service Management) is usually a chaotic, cluttered space.

In this article, I will outline some major problems that on-call teams face and give you a glimpse of a solution.

Your systems don’t talk to each other

In recent years, the DevOps movement has focused on breaking down the silos that teams traditionally work in. In fact, the term DevOps signifies a closer relationship between your developer and ops teams, with engineers staying close to every part of the software development life cycle. But while we are breaking down walls between people, what about the walls that exist between your tools?

When your payment gateway goes down at 3 AM and you are on call, what do you do? You try to single out the root of the problem, wake a few people up and get them to run diagnostics on their systems. If you wake three teams up, chances are high that at least two of them will be annoyed at being woken up for a system that is running perfectly well. What on-call teams need are diagnostic systems that are triggered automatically when an alert fires, and that summarize their analysis of the problem and attach it to the alert without human intervention. This means that while your mind is still waking up after a critical system goes down in the wee hours of the morning, your incident management bot, like a good assistant, is already preparing reports to help you drill down to exactly what is broken. When systems talk to each other, sleepy humans don’t have to.
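As a rough illustration of alert-triggered diagnostics, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Alert` class, the `DIAGNOSTICS` registry, the `diagnostic` decorator, the service name and the placeholder check results are hypothetical and do not describe any particular incident management product.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Alert:
    """Hypothetical alert record; diagnostics are attached as plain strings."""
    service: str
    message: str
    diagnostics: List[str] = field(default_factory=list)


# Hypothetical registry mapping a service name to its diagnostic checks.
DIAGNOSTICS: Dict[str, List[Callable[[], str]]] = {}


def diagnostic(service: str):
    """Register a check to run whenever an alert fires for `service`."""
    def register(check: Callable[[], str]) -> Callable[[], str]:
        DIAGNOSTICS.setdefault(service, []).append(check)
        return check
    return register


@diagnostic("payment-gateway")
def check_database() -> str:
    # Placeholder result; a real check would query the database.
    return "database: connection pool healthy"


@diagnostic("payment-gateway")
def check_bank_api() -> str:
    # Placeholder result; a real check would probe the upstream API.
    return "bank API: elevated error rate"


def handle_alert(alert: Alert) -> Alert:
    """Run every registered check and attach its summary to the alert."""
    for check in DIAGNOSTICS.get(alert.service, []):
        try:
            alert.diagnostics.append(check())
        except Exception as exc:
            # A broken check should not hide the results of the others.
            alert.diagnostics.append(f"{check.__name__}: failed to run ({exc})")
    return alert


alert = handle_alert(Alert("payment-gateway", "checkout errors above threshold"))
for line in alert.diagnostics:
    print(line)
```

The point of the registry is that each team owns the checks for its own system, so the on-call engineer receives their analysis with the alert instead of having to wake them up to ask for it.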

Getting the original developer involved is a time-consuming process

“Why did this otherwise healthy system fail?”

This is the first mystery every on-call engineer summoned to duty tries to solve. You try to figure out what went wrong, and when. This process becomes harder when documentation is hard to follow or new changes are undocumented. Scanning through all the recent git pushes and builds, trying to figure out what, if anything, was done recently to affect the system, and who is responsible, is a wearying task.

However, this is a process that can (and should) be automated. When you receive an alert, your incident management tool should equip you and other relevant stakeholders with adequate context. You, and the software engineers responsible for the service during the day, should have ready-to-use information on recent builds, infrastructure load patterns, or any other context you may need. Today, you lose precious hours going through someone else’s code to figure out what went wrong while trying to wake up whoever you think is responsible, when automated systems should be doing this for you.
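To make the idea of attaching context concrete, here is a small Python sketch that picks out the deploys made to the affected service shortly before an alert and formats them for the responder. The `Deploy` record, the function names and the 24-hour lookback window are assumptions made for illustration; a real setup would pull this data from its own CI/CD or version control system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class Deploy:
    """Hypothetical deploy record, e.g. exported from a CI/CD system."""
    service: str
    commit: str
    author: str
    deployed_at: datetime


def recent_changes(deploys: List[Deploy], service: str, alert_time: datetime,
                   lookback: timedelta = timedelta(hours=24)) -> List[Deploy]:
    """Return deploys to `service` inside the lookback window, newest first."""
    window_start = alert_time - lookback
    matches = [d for d in deploys
               if d.service == service
               and window_start <= d.deployed_at <= alert_time]
    return sorted(matches, key=lambda d: d.deployed_at, reverse=True)


def build_context(deploys: List[Deploy], service: str,
                  alert_time: datetime) -> str:
    """Format the recent deploys as text that can be attached to an alert."""
    changes = recent_changes(deploys, service, alert_time)
    if not changes:
        return f"No deploys to {service} in the last 24 hours."
    lines = [f"Recent deploys to {service} (newest first):"]
    lines += [f"- {d.deployed_at:%Y-%m-%d %H:%M} {d.commit} by {d.author}"
              for d in changes]
    return "\n".join(lines)
```

With a summary like this attached to the alert, the responder knows what changed recently and whom to contact before opening a single log file.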

Conclusion

In the end, waking up to troubleshoot a system under time pressure is never going to be an easy job. Organizations must create a culture of automation that prioritizes on-call engineers and reduces the effort it takes them to get the resources required for the job. Cross-application automation is the way to keep your on-call engineers sleeping soundly.

With an “engineers first” mindset, Zenduty is an upcoming incident management platform that places greater emphasis on automation and seamless communication between tools, people and technology. Sign up for a private beta here or fill out the form below.


Written by Zenduty