Can We Restart this Server?

Jack Yeh
Published in TeamZeroLabs
5 min read · Jan 17, 2021

The answer may be scarier than you can imagine.

Photo by Massimo Botturi on Unsplash

This one server has been up since 2017

Bare-metal servers and virtual machines are wonderful things.

Once they are provisioned, you set them up with your favorite scripts or configuration stack, double-check that logs are forwarded and applications are in good health, then present the service to your team and forget about its existence after a month.

One day, you realize that it needs to be patched and rebooted. Customers are complaining that services are running slow or timing out, so you mindlessly check the server health dashboard in Grafana.

Hmm… 5% memory available, that application should not be using that much RAM.

You log into the server, run a quick top, press shift + m to sort by memory, find the offending application process, and in a moment of poor decision-making, terminate it. Only then do you discover that an external dependency it requires on process start no longer exists. It has been years since you last logged into this server.
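For the record, there is also a non-interactive way to spot that memory hog, which is handy in scripts. This assumes GNU ps, which ships with most Linux distributions:

```bash
# Sort all processes by resident memory, highest first,
# and show the top five offenders (plus the header row).
ps aux --sort=-%mem | head -n 6
```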

You panic and look for ways to get the program running again, but eventually decide to call it in and let the team know that this host and its deployment need to be updated.

With the dependency updated, you come to find rough edges in the surrounding environment’s tooling. Each tool requires updates and depends on new packages, which depend on newer system-level packages, and a restart is looking more and more likely.

It’s One Reboot, How Bad Can It Get?

Below are some of my personal mindless reboot mishaps:

  • The server is encrypted with eCryptfs/LUKS, and the passphrase for unlocking the drive is kept by someone who is no longer with the company.
  • Software RAID drives go missing (GPT headers were not cleared out).
  • Services do not come up correctly due to incorrect permissions on mount points.
  • The hostname does not get set correctly and the application fails to join its cluster.
  • The system fails to boot at all.
  • Cron boot scripts fail silently due to PATH issues.
  • Scary messages appear in /var/log/boot.log.
  • Pulling extra weekend shifts after rebooting on a Friday afternoon without any documented DR failure scenarios.

Luckily, these are all avoidable mistakes.

  • Use a team-shared password manager for passphrases, and ensure the drive key file is backed up somewhere secure. Get familiar with troubleshooting over the serial console for when the system does not get far enough to run the SSH daemon.
  • Review the partition tables of RAID disks before rebooting, and double-check /proc/mdstat (see the sketch after this list).
  • Put a boot-time script in place to set the correct permissions on application folders.
  • Set the hostname with the appropriate tool (hostnamectl, or the hostname settings in /etc/cloud/cloud.cfg).
  • Test major OS upgrades on a spare system or in a staging environment.
  • Review cron scripts, and remember they run without your interactive user’s environment variables.
  • Make time to read /var/log/boot.log; it can reveal existing problems in the boot sequence.
  • Set up a reminder to reboot things before they hit the 6̶ ̶m̶o̶n̶t̶h̶,̶ ̶1̶ ̶y̶e̶a̶r̶,̶ 3 year mark.
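To make those pre-reboot checks harder to skip, here is a minimal bash sketch of a checklist script. It uses only standard Linux tooling (/proc/mdstat, lsblk, hostnamectl, journalctl); the filename is made up, and you should treat it as a starting point rather than a complete audit:

```bash
#!/usr/bin/env bash
# pre-reboot-check.sh: run this before any planned reboot.
set -euo pipefail

echo "== Software RAID state =="
# A degraded array or a missing member should block the reboot.
cat /proc/mdstat

echo "== Disks, partitions, and mount points =="
# Stale GPT headers on RAID members and odd mounts show up here.
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT

echo "== Hostname as the system sees it =="
hostnamectl status

echo "== Errors from the last boot =="
# /var/log/boot.log may not exist on journal-only hosts,
# so fall back to the systemd journal.
grep -iE 'error|fail' /var/log/boot.log 2>/dev/null ||
  journalctl -b -p err --no-pager | tail -n 20

# Reminder: cron @reboot scripts run with a minimal PATH;
# set PATH explicitly inside each script.
```

Run it, read the output, and only then schedule the reboot.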

Or, I can switch to Docker containers

Stateless containers can usually be rotated and restarted without much hassle. But if you run a couple of dedicated instances, I tend to find there will be namespaces and containers that SREs have forgotten about, still running after months and years. When they are finally restarted, they may also have a hard time coming back up and require more attention.
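A quick way to surface those forgotten containers, using nothing but the standard docker CLI:

```bash
# Show running containers with their age; anything measured
# in months belongs on the restart schedule.
docker ps --format 'table {{.Names}}\t{{.RunningFor}}\t{{.Status}}'
```

If the list surprises you, that is the point.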

Programs failing to start is a symptom.

The actual issue is letting critical components become unmaintained due to staffing changes or a lack of interest and bandwidth. Components that work stably and well tend not to get the attention they deserve. Soon, we find that no one on the team knows how to support them properly anymore if an actual issue comes up. Worse yet, bugs in older software will force an upgrade that may not be compatible with the existing configuration and setup.

How can we avoid these traps and maintain a healthier system?

A Laundry List of SRE Tips, in No Particular Order:

  • Use fewer components — The more dependencies and variations you have, the more pieces will require maintenance and monitoring.
  • Use stable components — If something has not yet reached v1, do not let it become a critical piece. Chances are v1 will have breaking changes.
  • Monitor uptime — Prometheus Node Exporter lets you track uptime like this: node_time_seconds{instance="$node",job="$job"} - node_boot_time_seconds{instance="$node",job="$job"} (see the example after this list). Make a Grafana leaderboard showing which servers in the organization have not been rebooted. Better yet, send an alert via Alertmanager after 6 months.
  • List reboots as a critical DR scenario — If you are offering an internal service, make sure you can reboot things without manual intervention, and document the outcome somewhere. Chances are you will find surprising results after a reboot. This leads to new tasks, but it makes the overall service more robust.
  • Identify critical components — Just because something has worked well does not mean we can forget about it. Critical paths that directly impact the bottom line deserve extra maintenance windows. Read their logs and explore what metrics these important components offer, so we can be ready when they break.
  • Document and root-cause all errors — If something can fail under a strange circumstance once, it can always fail again down the line. Your strength as an SRE is not how many services/hosts you can keep healthy, but how fast you can react to errors and downtime, and nothing is more expensive than learning from scratch during an outage window.
  • Raise concerns about lack of ownership — If a central or critical component is no longer being watched over by anybody, now is a good time to bring that up. Yes, even with no recent outages. Especially with no outages, because that means we are likely to falsely believe the component will keep working without maintenance for the foreseeable future.
  • Keep things simple — This point is worth reiterating. Operators are people too, and the amount of experience we can hold in our heads is finite, limited by how fast we can recall keywords and search Google or Confluence. You may not need Kubernetes yet for your solo project. You may not need containers or serverless deployments.
  • Listen to your customers — Customers pay for the service and its maintenance. Without talking to them, we drift out of sync with their changing expectations. They may not need 99.9999% uptime; maybe they need response times to be stable on mobile devices. Different customers have different needs on their journey. You cannot have a healthy system when no one meets with customers and everything runs in a vacuum. Stop guessing and just ask, so you can provide the best experience.
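To make the uptime tip above concrete, here is one way to run that expression against the Prometheus HTTP API from a shell. The prometheus:9090 address is a placeholder for your own server:

```bash
# Days since boot, per instance. The biggest numbers are the
# servers nobody has dared to reboot.
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=(node_time_seconds - node_boot_time_seconds) / 86400'
```

The response comes back as JSON; wire the same expression into a Grafana table panel and you have your leaderboard.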

Wrapping It Up

In my opinion, legacy systems are the most interesting systems out there. They have been somewhat successful in serving customers for years or even decades, and we can learn a lot by maintaining them. You are also more likely to encounter exotic bugs and behaviors on an old system, and getting to the bottom of them will give us a lesson in fundamentals like no other.

I will see you at the next reboot.

Photo by Peter Geo on Unsplash

--

Jack Yeh
TeamZeroLabs

I monitor your full stack deployment in production, so you can sleep at night. Docker | Kubernetes | AWS | Prometheus | Grafana