‘99.9999%’ uptime — it’s an illusion.
Say thank you to the engineers
The chime was faint, but my eyes flew open instantly. Before I could process where the sound originated, my phone was already in my hands.
Even in my sleep, the Google Hangouts *ping* notification always triggered the same knee-jerk reaction. It took several moments for my mind to catch up and register the glaring screen in my face. I blinked a few times to adjust to the brightness.
2:16AM. *I have time*, I noted with some relief. Even the early risers would not be online until at least 4AM PST. That gave us roughly two hours — not a large window, but better than many other scenarios I could have awoken to.
‘I’ll be online in 5.’
I sent the response before tossing my phone aside, staring blankly at the ceiling and sighing with resignation. I didn’t need to read the message to know it was some variation of: *we have an emergency.*
Despite my mental grumbling, I groped blindly under my bed to grab the laptop I kept there for exactly these circumstances. With any luck, I could remain snuggled in my cocoon of blankets and this would be resolved quickly with little discussion. Maybe I could even squeeze in another hour of sleep.
5:25AM PST found me sitting on my couch in my pajamas, a blanket draped around my shoulders, my laptop plugged in and charging. My engineers, working remotely from Germany, Russia, and Washington, were on a conference call, virtually scratching their heads as they diagnosed the problem.
My phone rings. It’s my Tech Support Specialist — the sole individual assigned to the early morning shift.
“I need backup. I’m remotely logged in to three client servers right now. I don’t know what’s wrong with any of them, and I haven’t found a fix. I think some type of systemic failure happened. A massive number of support requests is coming in every few minutes, and the phone has been ringing nonstop.”
2AM. 5AM. 9PM. 11PM. Weekdays. Weekends. No matter how late or early, the team learned not to apologize when they reached out, per my request. As the highest point of escalation, the situation had to look pretty bad for me to get a phone call in the first place. Formalities would only be a waste of time.
Sometimes it was a false alarm. Sometimes it was a quick fix. Other times, it was a big problem without a solution, or even a proper diagnosis. On rare occasions, the software would seemingly fix itself, with no obvious explanation.
Our engineers are working on it.
It always boiled down to some version of this — our engineers are working on it. But truly, it was often difficult to offer more context than that. Hopefully, an in-depth explanation wouldn’t be necessary this morning.
8:50AM PST. I flash a quick smile when a coworker greets me with a chirpy ‘good morning’ as I walk briskly past the kitchen towards my office.
- Status: general functionality restored
- Impact: 36% of clients affected
- Root Cause: TBD
The office was a short 10-minute drive from home. The proximity meant that I was never offline for very long. I appreciated this fact, more than anything else, on these early-start mornings.
Once in a while, when 10 minutes was too long to be offline — I’m ashamed to admit — I’d switch the conference call to my cell phone so I could continue providing direction while I drove to work.
Stifling a yawn as I reach my desk, I set my laptop down and connect to a second monitor, giving a curt nod to my team as a greeting.
“Give me an update. Tell me something good.”
My mind is already running over the list of things that should have been done an hour ago. Trying to focus my thoughts, I take a bite of a granola bar and strain to hear the update over the sound of my own chewing.
“Good news. We’re back online. Don’t think too many people noticed. We figured out how to get it working again and fixed all the clients that called in. Bad news. Can’t tell who else is affected unless they reach out… and we still don’t know why it went down.”
An account manager strolls in a little past 1PM. A client could not get a hold of Technical Support this morning. They left a voicemail but did not hear back. This was unacceptable.
The technical support specialist turns to me with a defensive expression. The developers were working on a long term solution. There were too many phone calls. This client was fixed. Why doesn’t the account manager start working at 5AM and return calls?
Internally, I sigh. Externally, I smile. Uptime was my responsibility. This placed me directly between the “technical” and “non-technical” staff — a delicate position to be in.
I likened my department to an emergency room. First things first, always stop the bleeding. Use short term, band-aid solutions if necessary. Stabilize the client. Find the root cause. Develop preventative measures to avoid recurring errors and glitches. Rinse and repeat with the next “emergency.”
After a quick stretch break, I begin drafting the post-mortem update to be sent out to clients and stakeholders. My head hurts from sleep deprivation; it takes me nearly 2 hours to finalize the verbiage and send it to the CEO for approval.
“You’re still here?”
I glance up from my laptop at the voice in the doorway. A colleague from a different department was heading out for the day. I smile wearily and reply that I was going to leave within the hour.
It was 6PM and I was finally getting to my “regular” emails. As much as I would have loved to leave them unanswered, I knew that neglecting them would only result in playing perpetual catch-up.
This wasn’t a typical day. But it wasn’t a rare occurrence either. Behind the promised 99.9999% uptime were countless nights and days of high stress, frantic conference calls, and 16-hour days — all while projecting a calm demeanor.
Sometimes I wake up to phantom notification sounds…

Thank you for reading! I’m still challenging myself to share my experiences more in 2018. Feel free to follow me on Medium or leave your comments below. I’m curious to hear your own experiences. Can you relate?