Production Incidents: 5 Learnings From The Trenches

Published in Javarevisited · Aug 3, 2021

Uh-oh: production alarms are ringing and you are on duty. Adrenaline spikes and you rush to your computer to see what is happening. Sound familiar? I’ve been in this situation several times, and I want to share some things that I’ve learned so far.

Have an emergency rollback strategy for high-risk deployments

Deployments carry different levels of risk. Some are minor bug fixes; others are major DB changes that affect the whole application. Figuring out a process to roll those changes back and return the system to a consistent state is a great exercise to do beforehand. During an incident, working out a rollback will take considerably more time (since everybody is panicking), and some options might no longer be possible, e.g. if you dropped the wrong table and you don’t have a backup.
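As a rough illustration of rehearsing a rollback ahead of time, here is a minimal sketch in Python using an in-memory SQLite database. The function names (`backup_table`, `restore_table`, `rehearse_rollback`) and the `orders` table are hypothetical, and a real system would more likely rely on its database's native backup or migration tooling. The point is only that the restore path is exercised before the risky change ships.

```python
import sqlite3


def backup_table(conn, table):
    """Copy `table` into `<table>_backup` before a risky change.

    Note: CREATE TABLE ... AS SELECT copies rows only, not indexes or
    constraints, so this is a sketch rather than a full backup strategy.
    """
    conn.execute(f"DROP TABLE IF EXISTS {table}_backup")
    conn.execute(f"CREATE TABLE {table}_backup AS SELECT * FROM {table}")


def restore_table(conn, table):
    """Recreate `table` from the copy made by backup_table."""
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(f"CREATE TABLE {table} AS SELECT * FROM {table}_backup")


def rehearse_rollback():
    """Simulate a bad deployment and check that the rollback restores the data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

    backup_table(conn, "orders")       # taken before the high-risk deployment
    conn.execute("DROP TABLE orders")  # the deployment drops the wrong table
    restore_table(conn, "orders")      # the emergency rollback

    rows = conn.execute("SELECT id, total FROM orders ORDER BY id").fetchall()
    conn.close()
    return rows
```

Running `rehearse_rollback()` in a test or staging environment is one way to confirm that the rollback path works while nobody is panicking.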

Do not hesitate to involve the support of 3rd party systems

In my experience, most production incidents are caused by a recent deployment; however, several times one of our 3rd party dependencies (e.g. a cloud provider) was at fault. Don’t wait until you are 200% sure that the fault lies outside of your application. If, after checking all the potential causes of the issue, the relevant logs, and the metrics, you are at least 80% certain that the fault lies in a 3rd party system, call their support for help. In the best case, they will solve the issue; in the worst case, you learn that the problem is elsewhere, which is valuable in itself.

It is also useful to be aware of the Service Level Agreements (SLAs) that you have with these 3rd parties and the support request response times that they guarantee. For the most critical 3rd party integrations, it might be a good idea to have a premium level of support in order to be able to solve critical issues quickly to minimize the effect on your customer base.

Be transparent with the customers & communicate the progress

If an incident takes longer than 10–15 minutes to fix, it might be a good idea to notify your customer base. This shows your customers that you are aware of the problem and are actively trying to solve it. They might not even know about the issue yet, but showing them that you are diligently monitoring your product’s health will boost your company’s image. If the incident takes even longer to resolve, keep your customer base updated on the progress.

I’ve learned that customers are far more understanding than I had expected.

Appoint a person during the incident resolution to log everything

Investigating more serious or puzzling incidents usually means involving more and more people. The newcomers will need to be brought up to speed, and in such high-pressure situations it is best to avoid wasting precious time explaining things over and over or, even worse, having people repeat something that has already been done. Therefore, as soon as an incident commences, appoint a person to oversee the process and log the following information (with timestamps and outcomes!):

what has been checked (recent deployments, logs, metrics, etc.);
what has been tried (redeployments, traffic throttling, etc.);
who is doing what.

This log will also be extremely valuable for identifying which parts of the process can be improved, and it can be used when writing the post mortem.
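A shared document or chat channel is usually enough for such a log, but to make the shape of the entries concrete, here is a minimal Python sketch. The names `LogEntry` and `IncidentLog` and the three entry kinds are assumptions made for illustration, not part of any particular incident tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class LogEntry:
    timestamp: datetime
    kind: str               # hypothetical kinds: "checked", "tried", "assigned"
    who: str
    what: str
    outcome: str = "pending"


@dataclass
class IncidentLog:
    title: str
    entries: List[LogEntry] = field(default_factory=list)

    def record(self, kind: str, who: str, what: str,
               outcome: str = "pending",
               at: Optional[datetime] = None) -> LogEntry:
        """Append an entry, timestamped now (UTC) unless `at` is given."""
        entry = LogEntry(at or datetime.now(timezone.utc), kind, who, what, outcome)
        self.entries.append(entry)
        return entry

    def summary(self) -> str:
        """Render the log so that newcomers can catch up quickly."""
        lines = [f"Incident: {self.title}"]
        for e in self.entries:
            lines.append(
                f"{e.timestamp:%H:%M} [{e.kind}] {e.who}: {e.what} -> {e.outcome}"
            )
        return "\n".join(lines)
```

The appointed person would call something like `log.record("tried", "bob", "redeploy previous version", "error rate unchanged")` for each action, and anyone joining later could read `log.summary()` instead of interrupting the investigation.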

Learn & improve from the incident

Ask the main participants of the incident to reflect on the incident and then schedule a meeting to discuss what can be improved in terms of monitoring, response, problem investigation, etc. There are probably some improvements that could be introduced in order to avoid these kinds of incidents in the future or minimize the time spent on identifying/fixing the issue.

Most importantly, remember that production incidents are inevitable: stay calm, do your best, and do not be afraid to ask for help.