Incidentally #1

Puneet Awasthi
2 min read · Sep 3, 2023

How outages happen and how to prevent them.

I have been working in the technology operations space for a long time, and I have seen things. Things that cost companies money, lost them customers, and even forced them to change strategic direction.

As they say, an ounce of prevention is worth a pound of cure. In my experience, whenever the root cause of an incident is determined, it’s rarely earth-shattering. Simple mistakes that are easy to avoid at the outset are the ones that cause the most damage when left unfixed. These can be coding errors, procedural errors, or even missing controls. They might all seem like things common sense can handle, until you’re face to face with them. Trust me, I would know.

Example 1: Missed where clause in SQL

The only time it pays to be less inclusive!

SQL is everywhere. Information is stored in database tables and managed with SQL queries. A where clause restricts a statement to the set of records that match a condition. Getting it wrong, or more specifically omitting it, is a really simple problem with big consequences. Here are a few ways things can go wrong:

  • Updating all the records in the table, resulting in data corruption
  • Deleting all the records in the table, resulting in data loss
  • Bringing the whole table into memory, resulting in out-of-memory errors
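To make the first case concrete, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The accounts table and its rows are made up for illustration; the point is only how far the same statement reaches once the where clause is gone.

```python
import sqlite3

# Hypothetical table and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO accounts (id, status) VALUES (?, ?)",
    [(1, "active"), (2, "active"), (3, "active")],
)

# Intended: close a single account.
intended = conn.execute(
    "UPDATE accounts SET status = 'closed' WHERE id = ?", (2,)
)
print(intended.rowcount)  # 1

# Same statement with the where clause omitted: every row is changed.
accidental = conn.execute("UPDATE accounts SET status = 'closed'")
print(accidental.rowcount)  # 3
```

The database accepts both statements without complaint, which is exactly why this mistake tends to be noticed only after the damage is done.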

You may think it’s silly, and perhaps it is, but it happens more often than you think.

  • When running commands manually in a SQL editor, you highlight only part of the statement and execute it by mistake
  • When building a dynamic query, the code path that adds the where clause is never executed
  • When a function was first written, it was easier to pull the whole table into memory because the table was small. As the data grows, the problem gets bigger and bigger until the process runs out of memory
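The dynamic query case is the easiest one to guard against in code. Below is a rough sketch, not a complete query builder: build_delete and filters_from_request are hypothetical helpers I made up for this example, and the table and column names are assumed to come from trusted code, never from user input. The idea is simply that the builder refuses to produce a statement when no filter made it through.

```python
def build_delete(table, filters):
    """Build a parameterized DELETE; refuse to emit one without a where clause.

    table and the keys of filters are interpolated into the SQL text, so they
    must be trusted identifiers. Only the values are passed as parameters.
    """
    if not filters:
        raise ValueError("refusing to build a DELETE without a where clause")
    conditions = " AND ".join(f"{column} = ?" for column in filters)
    return f"DELETE FROM {table} WHERE {conditions}", list(filters.values())


def filters_from_request(customer_id):
    """Hypothetical caller: the filter is only added when a value is present."""
    filters = {}
    if customer_id is not None:
        filters["customer_id"] = customer_id
    return filters


# Normal path: the filter is present, so a where clause is generated.
print(build_delete("orders", filters_from_request(42)))

# Buggy path: the value is missing, so no filter is added. Without the guard
# this would silently become "DELETE FROM orders".
try:
    build_delete("orders", filters_from_request(None))
except ValueError as err:
    print("blocked:", err)
```

Failing loudly here turns a silent full-table delete into an error someone has to look at.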

These types of errors can be avoided by having proper discipline and controls in place.

  • Have Maker-Checker roles, so that a second person confirms queries are run correctly and completed as intended
  • Wrap changes in a transaction and confirm that the expected number of records was updated; otherwise, roll back the transaction
  • Use query scanning tools to catch any where-less queries in your system
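The last two controls can be sketched in a few lines. This is an illustration under my own assumptions, not a production tool: guarded_execute and is_whereless are hypothetical names, the example relies on sqlite3's default behaviour of opening a transaction before an UPDATE, and the scanner is a deliberately naive regular expression that a subquery or a string literal containing "where" would fool. A real setup would use a proper SQL parser or linter.

```python
import re
import sqlite3


def guarded_execute(conn, sql, params, expected_rows):
    """Commit only if the statement touched exactly expected_rows rows."""
    try:
        cursor = conn.execute(sql, params)
        if cursor.rowcount != expected_rows:
            raise RuntimeError(
                f"expected {expected_rows} rows, statement affected {cursor.rowcount}"
            )
        conn.commit()
    except Exception:
        conn.rollback()  # undoes everything since the last commit
        raise


# Naive check: an UPDATE or DELETE with no WHERE keyword anywhere after it.
WHERELESS = re.compile(
    r"^\s*(UPDATE|DELETE)\b(?!.*\bWHERE\b)", re.IGNORECASE | re.DOTALL
)


def is_whereless(sql):
    return WHERELESS.match(sql) is not None


# Hypothetical table and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [(1, "active"), (2, "active"), (3, "active")],
)
conn.commit()

# Touches one row, as expected, so it is committed.
guarded_execute(
    conn, "UPDATE accounts SET status = 'closed' WHERE id = ?", (2,), expected_rows=1
)

# Touches three rows when one was expected, so it is rolled back.
try:
    guarded_execute(conn, "UPDATE accounts SET status = 'closed'", (), expected_rows=1)
except RuntimeError as err:
    print("rolled back:", err)

closed = conn.execute(
    "SELECT COUNT(*) FROM accounts WHERE status = 'closed'"
).fetchone()[0]
print(closed)  # 1

print(is_whereless("UPDATE accounts SET status = 'closed'"))  # True
print(is_whereless("DELETE FROM accounts WHERE id = 1"))  # False
```

The row-count check catches the mistake at run time, while the scanner is meant to catch it earlier, for example in code review or a build step.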

I am planning to make this a series, so stay tuned for upcoming articles with more examples. Have an incident-free week!
