Incidentally #3

Puneet Awasthi
3 min read · Sep 17, 2023


How outages happen and how to prevent them.

(This is the third installment of a series I have started. If you have been here before, welcome back! Each part is fairly independent but here are the links to part 1 and part 2 if you are interested.)

I have worked in technology operations for years, witnessing costly mistakes that caused not only stress for the responding teams but also financial loss, regulatory compliance issues, and customer dissatisfaction leading to lost revenue for the company.

As they say, an ounce of prevention is worth a pound of cure. In my experience, whenever the root cause of an incident is determined, it’s rarely earth-shattering. Simple mistakes that are easy to avoid at the outset are the ones that cause the most damage when left unfixed. These can be coding errors, procedural errors, or even missing controls. They might all seem like things common sense can handle, until you’re face to face with them. Trust me, I would know.

Example 3: Release Management

But it worked on my computer!

When you start with HelloWorld.java (or the equivalent in your favorite language), it’s pretty straightforward to write, build, and run. Any specific piece of business logic by itself is also not complicated. But real-life software development is complicated because we use libraries and tools built by others in the company as well as from open source, all of which can be buggy, can change, or can become incompatible with your evolving software. Apart from the software itself, there are configuration, runtime, and deployment process complexities.

Software version compatibility is critical for a successful release. You may have tested with a system library on the test server that differs from the one on the production server. How about the database, middleware, and other components? As discussed in Part 2, calling a service whose API contract is no longer correct after the release can cause issues as well. Some examples include data type or length changes, different precision, or even something as simple as an unexpected new field in the result set.
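One cheap guard is to compare what you tested against what is actually installed in production before you release. Here is a minimal sketch of that idea. The component names and version numbers are made up for illustration, and how you collect the two inventories (package manager, database query, service endpoint) is left as an assumption.

```python
from typing import Dict, Optional, Tuple

def find_version_drift(
    tested: Dict[str, str], production: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return every component whose version differs between the two environments.

    A value of None means the component is missing from that environment.
    """
    drift = {}
    for name in sorted(set(tested) | set(production)):
        tested_version = tested.get(name)
        prod_version = production.get(name)
        if tested_version != prod_version:
            drift[name] = (tested_version, prod_version)
    return drift

# Hypothetical inventories; real ones would be collected from each environment.
tested = {"openssl": "3.0.13", "postgres": "15.4", "kafka-client": "3.6.0"}
production = {"openssl": "3.0.11", "postgres": "15.4"}

drift = find_version_drift(tested, production)
for name, (tested_version, prod_version) in drift.items():
    print(f"{name}: tested={tested_version} production={prod_version}")
```

Anything this prints is something you have not actually tested against, which is worth knowing before the release rather than after.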

In the world of manual provisioning, there is a risk that the production runtime differs from the test environment because the setup was done ad hoc. Recreating a lost host manually is error-prone. Infrastructure as Code (IaC) has been extremely useful in making the runtime more predictable and repeatable. However, you need to be very sure that production-specific code is doing exactly what you want it to do (say you want three nodes in prod, but only one in test).
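To make the "three nodes in prod, one in test" case concrete, here is a small sketch that resolves per-environment settings and checks the production result before anything is provisioned. This is not any particular IaC tool's API. The setting names, the `resolve` helper, and the invariant are all assumptions for illustration.

```python
from typing import Dict, List

# Shared defaults plus per-environment overrides (all values hypothetical).
BASE_SETTINGS = {"size": "medium", "node_count": 1}
OVERRIDES: Dict[str, dict] = {
    "test": {},
    "prod": {"node_count": 3},
}

def resolve(environment: str) -> dict:
    """Merge the defaults with the overrides for one environment."""
    if environment not in OVERRIDES:
        raise ValueError(f"unknown environment: {environment}")
    return {**BASE_SETTINGS, **OVERRIDES[environment]}

def check_prod_invariants(settings: dict) -> List[str]:
    """Return a list of problems; an empty list means the settings look right."""
    problems = []
    if settings["node_count"] < 3:
        problems.append(f"prod needs at least 3 nodes, got {settings['node_count']}")
    return problems

print(resolve("test"))
print(resolve("prod"), check_prod_invariants(resolve("prod")))
```

The point of the explicit check is that the production-only branch is the one path your test environment never exercises, so it deserves its own verification.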

Configuration management is yet another constant source of incidents. The risk arises when the production configuration is expected to differ from that of the lower environment where the software was tested. For example, in the test environment, you may be mocking the stock exchange connectivity, but when you go to prod, you must make sure you are connecting to the correct stock exchange server.
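A startup-time validation can catch the stock exchange example before any order goes out. The sketch below assumes a simple config object. The field names and hostnames are hypothetical placeholders, not real endpoints.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical hostname for the real exchange gateway.
PROD_EXCHANGE_HOST = "exchange.example.com"

@dataclass(frozen=True)
class ExchangeConfig:
    environment: str
    exchange_host: str
    use_mock_exchange: bool

def validate(config: ExchangeConfig) -> List[str]:
    """Return configuration errors; only prod is held to the strict rules."""
    errors = []
    if config.environment == "prod":
        if config.use_mock_exchange:
            errors.append("prod must not use the mock exchange")
        if config.exchange_host != PROD_EXCHANGE_HOST:
            errors.append(f"prod points at unexpected host {config.exchange_host}")
    return errors

test_config = ExchangeConfig("test", "mock-exchange.example.com", True)
bad_prod = ExchangeConfig("prod", "mock-exchange.example.com", True)
good_prod = ExchangeConfig("prod", PROD_EXCHANGE_HOST, False)

for config in (test_config, bad_prod, good_prod):
    print(config.environment, validate(config))
```

Refusing to start when `validate` returns errors turns a silent misconfiguration into a loud, early failure.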

And lastly, issues with the deployment process itself can cause serious damage. During the deployment, did you make sure to stop or redirect the client traffic? If the release goes wrong and any active traffic gets impacted, how do you plan to replay it? What is the rollback plan, and have you tested it exactly the way it is going to play out in production? It is essential to plan these in great detail and to have multiple team members who are able to perform the rollback. This gives the dual benefit of reviews and additional coverage, because my guess is the primary release person is going to be very tired, stressed, and prone to making mistakes in this situation.
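Writing the release as a script, even a tiny one, forces you to answer those questions up front and lets anyone on the team run the same steps. Here is a rough sketch of the order of operations, assuming each step is a function you supply. The lambdas below only simulate a failed release and are not a real deployment.

```python
from typing import Callable, List

def run_release(
    deploy: Callable[[], None],
    health_check: Callable[[], bool],
    rollback: Callable[[], None],
    set_traffic: Callable[[bool], None],
) -> bool:
    """Drain traffic, deploy, verify, and roll back on failure.

    Returns True if the new version passed its health check.
    """
    set_traffic(False)  # stop or redirect client traffic first
    try:
        deploy()
        succeeded = health_check()
    except Exception:
        succeeded = False  # a crashed deploy is treated like a failed check
    if not succeeded:
        rollback()
    set_traffic(True)  # restore traffic once the deploy is verified or rolled back
    return succeeded

# Simulated steps that just record what happened, in order.
events: List[str] = []
result = run_release(
    deploy=lambda: events.append("deploy v2"),
    health_check=lambda: False,  # pretend the new version is unhealthy
    rollback=lambda: events.append("rollback to v1"),
    set_traffic=lambda on: events.append(f"traffic {'on' if on else 'off'}"),
)
print(result, events)
```

Because the rollback lives in the same script as the deploy, rehearsing the release in a lower environment rehearses the rollback too.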

Bonus: When I was in school, I was really good at Math but always missed a few points because of “silly mistakes”. My teacher told me to leave a few minutes to double-check my work before submitting it. It made a huge difference and got me closer to an A+. The same applies here. No matter how perfectly you executed the plan, do not forget the power of comprehensive post-release checkouts.
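A post-release checkout can be as simple as a named list of checks that you run every time and read the failures from. The sketch below is one way to structure it. The check names and the version string are invented, and real checks would call your actual services instead of returning fixed values.

```python
from typing import Callable, Dict, List

def run_checkouts(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of the ones that failed.

    A check that raises is counted as a failure so one bad check
    cannot hide the results of the others.
    """
    failed = []
    for name, check in checks.items():
        try:
            passed = check()
        except Exception:
            passed = False
        if not passed:
            failed.append(name)
    return failed

deployed_version = "2.4.1"  # hypothetical value read from the running service

checks = {
    "version matches release": lambda: deployed_version == "2.4.1",
    "exchange connection is live": lambda: True,   # stand-in for a real probe
    "order round-trip": lambda: False,             # simulated failure
}

failed = run_checkouts(checks)
print("failed checks:", failed)
```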

What topic should we cover next week then?
