Incidentally #4

Puneet Awasthi
4 min read · Oct 16, 2023


How outages happen and how to prevent them.

(This is the fourth installment of a series I have started. If you have been here before, welcome back! Each part is fairly independent but here are the links to part 1, part 2, and part 3 if you are interested.)

I have worked in technology operations for years, witnessing costly mistakes that caused not only stress for the responding teams but also financial loss, regulatory compliance issues, and customer dissatisfaction leading to lost revenue for the company.

As they say, an ounce of prevention is worth a pound of cure. In my experience, whenever the root cause of an incident is determined, it’s rarely earth-shattering. Simple mistakes that are easy to avoid at the outset are the ones that cause the most damage when left unfixed. These can be coding errors, procedural errors, or even missing controls. They might all seem like things common sense can handle, until you’re face to face with them. Trust me, I would know.

There is a problem because we have more customers than we expected. Whose sanity check are we talking about?

Example 4: Poor Maintenance

If you are a Freakonomics fan like me, you might have encountered the episode called “In Praise of Maintenance.” It starts with “…our culture’s obsession with innovation and hype has led us to neglect maintenance and maintainers…”.

It is easier to keep your house tidy every day than to make it a weekly project that scares you and gets postponed until, eventually, you need to call a professional cleaning service! Keeping a regular maintenance schedule reduces the chance of a crisis that stops your business operations and requires emergency measures, and the involvement of incident-management professionals!

Maintenance should not be considered a second-class function by any means. Remember, you build the product once, deploy it once, and then it gets used over and over (hopefully!). Therefore, one could argue that ongoing operational maintenance is as critical as the original implementation. The podcast puts it really nicely: “…The value of engineering is much, much more than just innovation and new things. Focusing on taking care of the world rather than just creating the new nifty thing that’s going to solve all of our problems…”

Maintenance comes in many different types, and none are particularly tough to implement. Let’s look at some cases:

  • Space management: Log files, database tables, and application-generated temp files grow over time. Without proper archiving and deletion policies in place, disk utilization keeps climbing until the disk fills up, causing applications or even servers to crash. While cloud providers offer auto-extension for many services, managing these resources wisely is still critical. This requires careful thought and regular review to ensure that archiving policies remain in line with the available space and changing application needs. Another variation of this problem is the number of file descriptors. Even if you have space, having too many files lingering in a directory can cause issues.
  • Patching: It's essential to keep your systems updated, not only for the latest bug fixes but also for security reasons. If you don't have a proper patching schedule in place, your application could be vulnerable to exploits, resulting in downtime and business loss. Additionally, using older versions of open-source or vendor software increases the risk of failure. In some cases, vendors may not be able to provide support if you're still running a version they no longer support. Therefore, it's crucial to stay up to date with the latest versions of your software and ensure that your systems are always patched.
  • Alert optimization: One of my favorite topics is getting alerts and exception thresholds "just right," and I agree with Goldilocks on this one. It's a challenging task, but it's essential. For instance, if you set up an alert for a timeout at 1 second but no longer require that kind of response time, you must fix the alert to avoid unnecessary noise and alert fatigue. Another example is the so-called "sanity checks." Suppose you expect an input file from a vendor and want strong checks on the sanity of the incoming data: the file must have between 100 and 100K records, or else you trigger a failure and a manual workflow. However, if the business changes and 200K becomes the new normal, you must keep these configurations up to date so that exceptions and failures stay in line with current expectations and are not triggered unnecessarily.
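
To make the space-management case concrete, here is a minimal Python sketch of a retention job: it reports how full a filesystem is and finds (and optionally deletes) files that have not been modified for a set number of days. The 30-day retention period, the 80% alert level, and all function names are assumptions made up for illustration, not a recommendation for any particular system.

```python
import shutil
import time
from pathlib import Path

# Hypothetical policy values; tune them to your own retention needs.
MAX_AGE_DAYS = 30
USAGE_ALERT_PERCENT = 80.0


def disk_usage_percent(path):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def find_stale_files(directory, max_age_days=MAX_AGE_DAYS, now=None):
    """List regular files directly in `directory` older than `max_age_days`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds in a day
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def purge_stale_files(directory, max_age_days=MAX_AGE_DAYS, dry_run=True):
    """Delete stale files; with dry_run=True, only report what would be deleted."""
    stale = find_stale_files(directory, max_age_days)
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale
```

Running `purge_stale_files(some_dir)` with the default `dry_run=True` first is a cheap way to review what the policy would remove, and comparing `disk_usage_percent(some_dir)` against `USAGE_ALERT_PERCENT` gives an early warning before the disk actually fills up.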

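The vendor-file sanity check from the last bullet can be sketched the same way. The point of this small Python example is that the bounds live in configuration that is easy to review, not that this is how any specific system does it. The class name and the 400K ceiling are hypothetical; only the 100, 100K, and 200K figures come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordCountCheck:
    """Acceptable record-count range for one incoming vendor file."""
    min_records: int
    max_records: int

    def evaluate(self, record_count):
        """Return (ok, message); a failing result would start the manual workflow."""
        if record_count < self.min_records:
            return False, f"too few records: {record_count} < {self.min_records}"
        if record_count > self.max_records:
            return False, f"too many records: {record_count} > {self.max_records}"
        return True, "record count within expected range"


# The bounds from the example: between 100 and 100K records.
original_check = RecordCountCheck(min_records=100, max_records=100_000)

# Once 200K becomes the new normal, the stale bound rejects a healthy file.
ok, message = original_check.evaluate(200_000)

# Maintenance means revisiting the bound. The new ceiling is a made-up value.
updated_check = RecordCountCheck(min_records=100, max_records=400_000)
```

With the original bounds, a 200K-record file fails the check even though nothing is wrong with it; with the updated bounds it passes. Reviewing these values on a schedule is the maintenance task.
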
That's all, folks. I will write another one when I (or you) get an idea.

Have an incident-free week.
