A calm sysadmin

Tungdam
Coccoc Engineering Blog
4 min read · Jan 7, 2021

I promised my team I would publish a blog post last week, to actually meet our humble target for public blog posts last quarter (we’re trying to write more, but we’re all lazy). It was supposed to continue my tracing series, or to cover a recent finding about how dirty page writeback affects our latency-sensitive Java app, but it ended up like this. You can call it a short piece of prose about a sysadmin’s life.

The idea of a calm sysadmin came from the last afternoon of 2020, when everybody slowed down to enjoy the last day of the year. There were no deployment requests, no hard discussions, no bugs, far fewer notifications… Everything looked calm and peaceful.

But that’s not the calm I’m going to write about. What a sysadmin really needs is calm under pressure. Imperturbable. Only by staying calm can we handle a situation (usually an urgent one) properly instead of screwing things up. That is the first and foremost requirement before entering any incident.

Here’s my list, in random order, of what staying calm lets us do, and why we should stay calm no matter what the situation is. These are things I learned the hard way over the years. You may want to update the list yourself.

  • To not get mad at someone’s mistake. Firstly, because everybody makes mistakes. Secondly, because that mistake sometimes turns out to be your own setting / commit / change from several months ago :)
  • To listen to others’ suggestions / solutions carefully, even though they may sound crazy at first glance. Otherwise, people will throw this at you.
  • To not fix your long-standing bug immediately, but to consider the outcome with third parties before applying changes. Sometimes your bug works OK with others’ bugs :) . Fixing it right away without correcting the others may lead to a catastrophic situation. Trust me, I experienced it the hard way.
  • To note every change you make throughout an emergency situation, and revert the ones without clear impact.
  • To take a walk, take a shower or sleep after hours of debugging without any progress. When you look at the top screen more than 5 times per minute, that’s the signal. Our brains need some rest.
  • To not apply your changes to production on the weekend, even though you’re sure they will definitely boost your system’s performance 2 times. First, you violate your own rule about changing things on production. Second, you may ruin your team’s time off.
  • To read references on Stack Overflow thoroughly :) and not just blindly trust a solution from a random guy on the internet. If the situation allows, we’d better read a paper or a long article instead. But Stack Overflow is still very helpful :)
  • To pick the right metric among dozens of others that may lead in the wrong direction while chasing the root cause.
  • To check the client side when your bosses can’t connect to our company’s homepage while our monitoring system reports nothing (this is actually pretty tricky, you’d better study your bosses more before applying this advice :D)
  • To get help. Know your time limit when debugging production issues on your own, and escalate quickly when that limit is exceeded. Call the Subject Matter Expert by phone in urgent cases. Calm != slow.
  • To let your team solve a problem themselves in the middle of the night and go to sleep, because you know that you have to cover for them in the morning.
  • To guide your team through an emergency situation by listening to their input / reports before giving orders.
  • To truly see that the problem / bug / issue you’ve been stuck with for weeks is actually a great chance to learn new things. It’s NOT OK to let a production issue happen more than once, but with a calm mind and a positive learning spirit, “this too shall pass”.
  • To really conduct a blameless post-mortem. It’s 10 times harder when you’re the one who made the mistake :)
  • To let the juniors struggle and make mistakes. They need room to breathe and learn.
  • To pause and analyze more before telling your subordinate that he’s not doing well. And to tell him so when it’s true, in a warm tone of voice.
  • To not kill your fellow developer when she / he requests a deployment at 6PM on Friday, 31/12/2020

Once again, this is what I learned I should do, not what I have mastered as a calm sysadmin. People who know me can easily spot that I’m not a calm person in real life. But over time, I think we can practice being calm, and thus better, sysadmins, because the industry sometimes needs us for super urgent situations, just like firefighters in real life.

Let’s prepare for the worst and stay calm, so we can be precise in emergency situations. And thus, again, be better sysadmins.

Tungdam
Coccoc Engineering Blog

Sysadmin. Amateur Linux tracer. Performance enthusiast.