Resolving a Production Issue on a Live Server

Introduction

As a software developer or DevOps engineer, life is good until there is an issue in production. Sometimes things get so hectic that we end up creating more problems than we solve.

As a DevOps engineer, I have experienced my share of “fires”. The “weirdest” is usually when the staging environment seems just fine and the production environment is really acting up. Sometimes this can leave you feeling like this guy:

photo credit: me

Well, this is how you feel, but the adrenaline rush will keep you on your toes, especially when you know that money is being lost for every minute that goes by without the issue being resolved.

In this article, I will highlight what I consider the way to go about putting out a “fire” on production servers. Not every item listed here will fully apply to your team, since every engineering team works differently. However, the concepts here can be modified to suit your own case.


Steps Towards Resolving a Production Issue

Notify All Stakeholders: The first step is to communicate to all stakeholders that the server is down. This can be done in the form of a Slack message to the general channel, an email copying the relevant personnel, or through any communication tool that the team uses. The point of communicating is to bring the issue to everyone's attention and to get all hands on deck. Sometimes, someone may simply respond to your message with the reason the issue is happening (assuming it was a mistake on their part). But the goal is to inform and bring in the members of your team.
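
As an illustration of the notification step, here is a minimal sketch that assumes the team uses a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names, the message wording, and the service name used in the usage note are hypothetical and should be adapted to whatever tool your team actually uses.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_outage_message(service, summary, reported_by, detected_at=None):
    """Build a short incident notice as a Slack-style JSON payload."""
    # Default to "now" in UTC so everyone reads the same clock.
    detected_at = detected_at or datetime.now(timezone.utc)
    text = (
        f"Production issue on {service}\n"
        f"What we know: {summary}\n"
        f"Detected at: {detected_at:%Y-%m-%d %H:%M} UTC\n"
        f"Reported by: {reported_by}\n"
        "If you deployed or changed anything recently, please reply here."
    )
    return {"text": text}


def post_to_webhook(webhook_url, payload, timeout=5):
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A call would look like `post_to_webhook(webhook_url, build_outage_message("checkout-api", "502 errors on login", "on-call engineer"))`, where `webhook_url` is the URL your workspace generated. The last line of the message is deliberate: it invites whoever made a recent change to speak up.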

Breathe and Identify the Problem: You have received the Pingdom or Slack notification that the server is down, or your customers have started complaining that they cannot access your services. The first thing to do is to breathe. The initial urge will be to SSH into the instance and start typing magical commands. This urge to solve the problem quickly is understandable, but you cannot solve a problem that you do not fully understand.

Identifying the problem means taking an in-depth look at the outage message you received. If this message is not clear enough, you should look at the application logs. Sometimes, the issue is one the team is very familiar with. In that case, this step can be marked as done. However, if the issue is alien to the team, the best approach is to identify both the problem and its source. It does not pay to act on a baseless theory, that is, an assumption that you cannot back up with facts. For instance, if the frontend application is not displaying data, it would be a mistake to simply assume that the problem is coming from the database. In that case, troubleshooting the database server may be futile, and you may end up creating a bigger problem. Therefore, first identify the problem and the source of the problem.
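
One way to back a theory with facts is to let the application logs say which errors are actually occurring. The sketch below is only an illustration: it assumes each log line looks like `<date> <time> <LEVEL> <message>`, which may not match your application's format, and the sample lines are made up.

```python
import re
from collections import Counter

# Assumed line shape: "<date> <time> <LEVEL> <message>". Adjust to your logs.
LINE_PATTERN = re.compile(r"^\S+ \S+ (?P<level>[A-Z]+) (?P<message>.*)$")


def summarize_errors(lines, levels=("ERROR", "CRITICAL"), top=5):
    """Return the most frequent error messages as (message, count) pairs."""
    counts = Counter()
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match and match.group("level") in levels:
            # Replace digits so the same error with different ids is grouped.
            normalized = re.sub(r"\d+", "N", match.group("message"))
            counts[normalized] += 1
    return counts.most_common(top)


# Made-up sample lines, standing in for a real log file.
sample = [
    "2024-01-02 03:04:05 INFO request 101 served",
    "2024-01-02 03:04:06 ERROR upstream timeout after 30s for order 17",
    "2024-01-02 03:04:07 ERROR upstream timeout after 30s for order 18",
    "2024-01-02 03:04:08 ERROR disk quota exceeded",
]
for message, count in summarize_errors(sample):
    print(count, message)
```

With a real file you would pass the open file object instead of `sample`. If the top entries point at timeouts to an upstream service rather than at failed queries, that is evidence against the "it must be the database" theory before anyone touches the database server.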

Reference the Engineering Runbook: When the problem has been identified, the next recommended step is to consult the engineering runbook. Check whether the runbook contains procedures for resolving the issue, especially if this is an issue that the team is familiar with. Usually, it does. It may even include a guide on who to message next if you do not have sufficient permissions to resolve the issue. If the solution is contained in the engineering runbook, rejoice, for you have been saved.

What if the solution does not exist in the engineering runbook? Well, you are doomed.

Source: https://tenor.com/view/laugh-laughing-laughter-gif-4872552

Okay, you are not doomed. It just means you are moving up to the next step. Sorrryyy!

Bring in the Cavalry: Up until now, you have been a solo soldier trying to take down a gang of bandits. I am not sure this metaphor is correct, but the point is that now is the time for you to set up a call for all engineers to join and put out the fire. With everyone on the call, a lot of theories and suggestions will be flying around. This is the “fun” part because as soon as you bring up a suggestion, a number of people will help scrutinize it. I like this part because, even though the environment is “chaotic” in the sense that everyone is really focused on the issue at hand, it gives the junior engineers on the team a tremendous opportunity to learn. It is like a surgical intern being present in the company of renowned surgeons trying to figure something out. Apart from the learning opportunity it provides for junior engineers, this is usually the stage at which a solid solution to the issue is found.

However, there is one important caveat here. You remember the adage that says “too many cooks spoil the broth”? That can pose a problem here. My recommendation is that one person should take the lead and probably share their screen while the others observe and contribute. If everyone is instead trying to do one thing or another on the server, well, I am sorry, but you may end up creating far worse problems instead of resolving the one at hand.

Back Up the System Before Implementing a Complex Solution: It is good engineering practice to back up your server periodically. However, in this case, you need the most recent backup just in case things go south. It is also a necessary step to take, especially if the solution is complex or if the team has some reservations about the solution they came up with.
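
What "the most recent backup" looks like depends entirely on your stack: it could be a disk snapshot, a database dump, or something else. As one small, hedged illustration, the sketch below archives a single directory (for example, a configuration directory) into a timestamped `.tar.gz` file before anyone edits it. The function name and the paths are hypothetical.

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def snapshot_directory(source_dir, backup_dir):
    """Archive source_dir into backup_dir and return the archive path."""
    source = Path(source_dir).resolve()
    if not source.is_dir():
        raise FileNotFoundError(f"nothing to back up at {source}")

    destination = Path(backup_dir)
    destination.mkdir(parents=True, exist_ok=True)

    # A UTC timestamp in the name keeps successive snapshots apart.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive_path = destination / f"{source.name}-{stamp}.tar.gz"

    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(str(source), arcname=source.name)

    # Re-open the archive to confirm it is readable before touching live files.
    with tarfile.open(archive_path, "r:gz") as archive:
        if not archive.getnames():
            raise RuntimeError(f"backup at {archive_path} is empty")

    return archive_path
```

The read-back at the end is the important design choice: a backup that has never been opened is only a hope, and the middle of an incident is a bad time to find that out.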

In some cases, the engineering team will set up another server and route traffic to it before proceeding. This is not applicable in every case, but if it can be done and the fix seems to be taking a long time, then it should be considered.

Document the Problem and How It Was Resolved: You remember how you consulted the engineering runbook above? Yeah, now is a good time to contribute to that file. When the “fire” is finally put out and the system is working fine again, add the problem and the steps you took to resolve it to the engineering runbook. This will help the team recover quickly if the same issue happens again. It also serves as a reference for the team when they want to rebuild the system to be resistant to such issues.
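
A runbook entry is easier to write under pressure, and easier to find later, if every entry has the same shape. The sketch below is one possible layout and not a standard: the field names, the plain-text format, and the helper names are all assumptions to adapt to your own runbook.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class RunbookEntry:
    """One resolved incident, in a fixed shape (hypothetical layout)."""
    title: str
    occurred_on: date
    symptoms: str
    root_cause: str
    resolution_steps: list = field(default_factory=list)
    escalation_contact: str = "unassigned"

    def render(self):
        """Return the entry as plain text, with numbered resolution steps."""
        lines = [
            f"Incident: {self.title}",
            f"Date: {self.occurred_on.isoformat()}",
            f"Symptoms: {self.symptoms}",
            f"Root cause: {self.root_cause}",
            "Resolution steps:",
        ]
        lines += [
            f"  {number}. {step}"
            for number, step in enumerate(self.resolution_steps, start=1)
        ]
        lines.append(f"Escalate to: {self.escalation_contact}")
        return "\n".join(lines) + "\n"


def append_to_runbook(path, entry):
    """Append the rendered entry to the runbook file, separated by a blank line."""
    with open(path, "a", encoding="utf-8") as runbook:
        runbook.write("\n" + entry.render())
```

Recording the symptoms separately from the root cause matters because, during the next incident, the symptoms are all the responder will have to search for.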


Conclusion

There is no system that cannot develop an issue. A server developing an issue is not necessarily the result of bad engineering practices. It could simply be the result of unforeseen events. What is bad, however, is failing to prepare for how to respond to these failures when they do occur. A really good way to prepare is to have a well-documented engineering runbook that can be easily referenced. Also, some issues can be avoided by following good design patterns.

Thank you for reading.