Everything Everywhere is on Fire (in production)

Lucas Coppio
syngenta-digitalblog

--

A quick “how to” on properly dealing with incidents and keeping people calm

If you’ve taken an interest in this article, you’ve likely already faced, or are currently facing, a critical incident in production and want to know how to deal with these setbacks. Rest assured that you’re not alone: production catastrophes are a constant challenge. After all, we all rely on software at some level, whether it’s a highly complex, distributed system or a monolith running on a single server.

Addressing an incident requires a process that is both well defined and flexible, one that can handle the full range of ways things can break. In this article, I outline how we’ve standardized our incident response process, using an incident I handled in 2017 as a running example. The standard procedure comprises four stages: gathering clues, investigation, recovery, and closure.

Stage 1: Gathering Clues

At nine-fifteen in the morning, a colleague enters the engineering room asking if we know why the product is running slowly. I check the application logs and see errors popping up. The metrics in Grafana show that the requests are slow, but there’s no increase in CPU usage; on the contrary, it seems to have dropped. RAM usage has not changed either.

The first stage of the incident management process is dedicated to gathering clues. In this phase, it’s essential to understand what is happening and how, which usually involves multiple people and opinions. This stage is often chaotic, since right after an incident starts nobody is sure what’s going on. The team must therefore work together to collect clues and build a clear picture of the problem; the more observations people contribute, the clearer that picture becomes.

Examining the logs more closely, it appears that all errors are concentrated in the communication with the database. The resource usage metrics for the database look good and there is free disk space, but something strange is happening: the data read metrics on the disk have increased significantly, and so has the response time of the queries. My manager takes the initiative to inform other departments in the company about the situation and delegates the responsibility of communicating with customers to the customer support team. Meanwhile, I’m noting down on a sheet of paper the actions we’re taking and the time we start each one.

However, having a complete view of the problem is always challenging, so it’s necessary to use additional tools to gather more facts, such as system health dashboards. This means analyzing indicators like CPU, disk pressure, RAM, and network latency; these indicators provide essential data for investigating the incident.
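To make this concrete, the sketch below shows what a quick “first look” at those indicators could collect on a single host. It assumes the psutil Python library and is not the tooling from this story (our numbers came from Grafana dashboards); it only illustrates the kind of facts worth gathering.

# A minimal "first look" health snapshot, assuming the psutil library is
# installed. In a real incident you would normally read these numbers off
# existing dashboards; this only illustrates which indicators matter.
import psutil

def health_snapshot(sample_seconds: float = 1.0) -> dict:
    """Collect a one-shot view of CPU, RAM, disk and network activity."""
    io_before = psutil.disk_io_counters()
    net_before = psutil.net_io_counters()
    # cpu_percent(interval=...) blocks for the interval, which doubles as
    # the sampling window for the I/O and network deltas below.
    cpu = psutil.cpu_percent(interval=sample_seconds)
    io_after = psutil.disk_io_counters()
    net_after = psutil.net_io_counters()

    return {
        "cpu_percent": cpu,
        "ram_percent": psutil.virtual_memory().percent,
        "disk_used_percent": psutil.disk_usage("/").percent,
        "disk_reads_per_s": (io_after.read_count - io_before.read_count) / sample_seconds,
        "disk_writes_per_s": (io_after.write_count - io_before.write_count) / sample_seconds,
        "net_kb_recv_per_s": (net_after.bytes_recv - net_before.bytes_recv) / sample_seconds / 1024,
        "net_kb_sent_per_s": (net_after.bytes_sent - net_before.bytes_sent) / sample_seconds / 1024,
    }

if __name__ == "__main__":
    for name, value in health_snapshot().items():
        print(f"{name:>20}: {value:,.1f}")

Nothing here replaces a proper dashboard; the value of a snapshot like this is having concrete numbers to write down in the incident log while the bigger picture is still forming.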

Other crucial information at this point includes the deployment of a new version, a recent configuration change, or a feature flag that was recently turned on or off.

Stage 2: Investigation

A large migration and a deployment of a new release were carried out the day before, but the logs showed that the issue with the database was widespread. My colleague tried to access the database through the IDE, while another attempted to SSH into the server hosting the database, but neither succeeded.

Once the facts have been gathered, which is usually relatively fast, the team should be trimmed down to the people dedicated to investigating the source of the problem and to mitigation. It’s also essential that someone be responsible for keeping the “log”: noting down the findings and the paths being investigated, along with the time each task began, and communicating the incident to all stakeholders.
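In our case the “log” was literally a sheet of paper, and that works fine. If you prefer something digital, a sketch like the one below (the file name and format are just illustrative assumptions) captures the same habit: every finding and every action gets a timestamp.

# A minimal sketch of keeping the incident log: one timestamped line per
# finding or action. The file name and format are illustrative assumptions.
from datetime import datetime, timezone
from pathlib import Path

LOGFILE = Path("incident-timeline.txt")  # hypothetical location

def note(entry: str) -> None:
    """Append a timestamped entry to the incident timeline."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    with LOGFILE.open("a", encoding="utf-8") as fh:
        fh.write(f"{stamp}  {entry}\n")

# Example entries from an incident like the one in this article:
# note("report of slowness; errors visible in application logs")
# note("starting rollback of yesterday's release")
# note("SSH into the database server is failing; suspecting hardware")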

Additionally, people should be dedicated to different lines of investigation, such as log analysis, error-reporting tools, and request traces. During this stage, one or more people should also handle mitigation actions, such as rolling back deployments, returning feature flags to their previous states, or even provisioning additional resources for the environment.

We rolled back the application and retrieved the database backup made before the migration, in addition to the daily backup. After several attempts, I managed to log into the database server via SSH and tried to start debugging procedures on the machine, but they all failed, leading me to believe the server’s SSD was failing. Each of these results was quickly noted down on the sheet of paper I had with me. By now the manager had set up a WhatsApp channel to communicate with key people from other departments and align the message that would be relayed to customers.

Each person should have a specific role at this stage, which helps speed up identifying the problem and finding a solution; those who have no task to perform don’t need to participate. This is a tense moment, and the team needs to focus.

Stage 3: Recovery

With the backups in hand, we created a new server, quickly installed a database, and started restoring the backup. One person was dedicated to taking the application offline and attempting to extract a copy of the database; fortunately, this worked, and we were able to recover some of the data added after the backup. Another person was responsible for pulling, from the logs, the data customers had entered since the backup was made and preparing it for reinsertion into the database.

Once the root of the problem is identified, the team can begin the recovery phase. The team is now reduced to the minimum number of people necessary to restore the system, tasks are divided among them, and recovery proceeds step by step.

If the investigation revealed that no alert or dashboard demonstrated or helped warn of the issue, then someone should be responsible for filling that gap, creating the charts and alerts needed so that this aspect of the system does not remain hidden in the future.

We took the opportunity to create a dashboard listing the disk IOPS, for both reads and writes, on all servers, and we set up alerts to be triggered if those rates drop significantly relative to the number of file or database accesses. By now we know how long it will take for everything to be recovered, and this is communicated in the WhatsApp channel and in person. People seem more relieved and are even getting up for coffee.
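For illustration, the condition behind that alert could be sketched roughly as below. In practice it lived in our monitoring stack rather than in code, and the threshold and inputs here are assumptions; the idea is simply to flag when disk activity collapses relative to the work the database is being asked to do.

# A rough sketch of the alert condition described above, expressed in Python
# for illustration only. Threshold and inputs are assumptions; the real alert
# was configured in the monitoring stack, not in application code.

def iops_alert(read_iops: float, write_iops: float,
               db_queries_per_s: float, min_ratio: float = 0.05) -> bool:
    """Return True when disk activity looks suspiciously low for the query load."""
    if db_queries_per_s <= 0:
        return False  # no load, nothing meaningful to compare against
    ratio = (read_iops + write_iops) / db_queries_per_s
    return ratio < min_ratio

# Example: 300 queries/s arriving but almost no disk activity -> alert fires.
assert iops_alert(read_iops=2.0, write_iops=1.0, db_queries_per_s=300.0) is True
assert iops_alert(read_iops=200.0, write_iops=80.0, db_queries_per_s=300.0) is False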

Communication at this stage is crucial: the team should share what they’ve discovered and the ETA for complete system recovery, and keep that interaction going until the end. My suggestion is to update people on the recovery status every half-hour or hour if there’s no timeframe in sight; if there is one, a single communication announcing the ETA and another halfway through, confirming that everything is on track, is enough.

Stage 4: Closure

After four hours of hard work, the system is finally back online. One of the tables has not been fully recovered yet, but it is very large, and not having this data for the next few hours won’t significantly impact the clients’ operations. My manager makes the final announcement, and we consider the incident resolved. We schedule the post-mortem for the following day, and I fill in the completion times for the actions on my sheet of paper, adding some things we forgot to describe at the beginning.

When the system returns to a state that is “good enough” for use, we consider the incident closed, even if some things haven’t been completed yet (such as the recovery of data that isn’t critical to operations). It’s then time to wrap up: the team should record their final thoughts in the minutes, schedule a nearby date for a post-mortem on what happened, and review the notes to make sure nothing that occurred during the resolution is missing from the timeline. It’s also important to congratulate everyone involved.

In some incidents, it’s desirable to designate someone involved in the recovery to keep monitoring the system for a few more hours or until the end of their shift. This happens more often when the problem “disappears” without a root cause being identified. A final communication is then sent to all stakeholders and clients detailing the system’s current situation and, as the incident concludes, noting whether the system remains under observation. Technical details of the problem and the solution are welcome in this communication, as they convey transparency and technical competence. It’s always good to remember that the message should be adjusted to the audience receiving it.

Final considerations

Dealing with incidents is not easy. Highly stressful situations interfere with reasoning abilities, making it easier to react rather than plan. That’s why following best practices, rehearsing each stage, and maintaining effective communication throughout the process make it possible to quickly* resolve incidents and minimize the impact on the system and users.

* For some definition of “quickly,” whatever that may be.

--

Lucas Coppio
syngenta-digitalblog

Software developer and staff engineer, even in my free time.