Postmortem (Mock)

Issue Summary

On August 7, 2017, we experienced a major outage that made our service unavailable for about 5 hours and 43 minutes (16:30 to 22:13 PST). Our best estimate is that it affected 85% of our users (roughly 2,000 users). The root cause was a broken symbolic link in the configuration of our NGINX web server.

Timeline

The outage lasted from 16:30 PST to 22:13 PST on 08/07/2017.

  • 16:30 PST: Sumologic sent an alert that our NGINX web server was down.
  • 17:12 PST: Because Sumologic had pinpointed the issue to the NGINX web server, all engineers were assigned to investigate its origin.
  • 18:43 PST: Diagnostics were run.
  • 20:00 PST: One of our engineers used ps to get a report of the processes currently running.
  • 21:24 PST: The problem was identified.
  • 22:13 PST: The problem was resolved.

Root Cause Analysis

First, our team examined the running processes, using ps to get a report of the processes running at the time. The report showed an nginx process running; however, the web server was still not serving the site, so further debugging was needed. Hours later, our team found the root cause: the symbolic link between /etc/nginx/sites-available/default and /etc/nginx/sites-enabled/default was broken. The two paths must be linked, because the configuration file will not be read if the symbolic link between them is broken.
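A check like the one below could have surfaced the broken link sooner than a process listing did. It is a minimal sketch, not the procedure the team actually used: the function name link_status and the file names in the demo are hypothetical, and the demo runs in a temporary directory rather than against the real /etc/nginx tree.

```python
import os
import tempfile


def link_status(link_path: str) -> str:
    """Classify a path as 'ok', 'broken', 'not_a_link', or 'missing'."""
    if os.path.islink(link_path):
        # os.path.exists follows the link, so it is False when the target is gone.
        return "ok" if os.path.exists(link_path) else "broken"
    return "not_a_link" if os.path.exists(link_path) else "missing"


if __name__ == "__main__":
    # Hypothetical stand-ins for sites-available/default and sites-enabled/default.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "sites-available-default")
        link = os.path.join(tmp, "sites-enabled-default")
        open(target, "w").close()
        os.symlink(target, link)
        print(link_status(link))  # ok
        os.remove(target)
        print(link_status(link))  # broken
```

On the affected server, the path to check would be /etc/nginx/sites-enabled/default. A running nginx process says nothing about this state, which is why ps alone did not reveal the problem.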

Resolution

To resolve the issue, the symbolic link between the two paths was recreated.
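The repair can be sketched as follows. This is an illustration under assumptions rather than a record of the exact commands used during the incident: recreate_symlink is a hypothetical helper, and the demo uses a temporary directory instead of /etc/nginx. The new link is built under a temporary name and then renamed into place, so the link path is never left missing.

```python
import os
import tempfile


def recreate_symlink(target: str, link_path: str) -> None:
    """Point link_path at target, replacing any link that is already there."""
    tmp_link = link_path + ".tmp"
    # Clear a leftover temporary link from an earlier, interrupted run.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    # os.replace renames atomically on POSIX and overwrites the old link.
    os.replace(tmp_link, link_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "sites-available-default")
        link = os.path.join(tmp, "sites-enabled-default")
        open(target, "w").close()
        # Start from a dangling link to mimic the broken state.
        os.symlink(os.path.join(tmp, "gone"), link)
        recreate_symlink(target, link)
        print(os.readlink(link) == target)  # True
```

After a repair like this, NGINX typically needs to reload its configuration before the site is served again.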

Corrective and Preventative Measures

Downtime is never acceptable. To ensure this never happens again, we will take several measures to improve our infrastructure:

  • (1) We will create clusters to ensure that if a single web server fails, traffic will be redirected to other servers. At the same time, we will examine the server that failed to gain insight into the issue and prevent future occurrences.
  • (2) Our team will look into backing up our servers more often to limit loss of data.
  • (3) Scripts will be deployed automatically to fix the issue if the monitoring system detects that the link is broken on an NGINX server. While that web server is down, users will still be able to access the other servers, as mentioned above.
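Measure (3) could take a shape like the sketch below, which a monitoring hook might call on each alert. Everything here is an assumption for illustration: ensure_link and the reload_server callback are hypothetical names, and the script that will actually be deployed may differ. The reload step is passed in as a callback so the logic can be exercised without a running NGINX; in production it might wrap the real commands nginx -t and nginx -s reload.

```python
import os
import tempfile
from typing import Callable


def ensure_link(link_path: str, expected_target: str,
                reload_server: Callable[[], None]) -> bool:
    """Repair link_path if needed; return True when a repair was made."""
    healthy = (
        os.path.islink(link_path)
        and os.path.exists(link_path)
        and os.path.realpath(link_path) == os.path.realpath(expected_target)
    )
    if healthy:
        return False
    if not os.path.exists(expected_target):
        # Nothing valid to link to; leave this for a human to investigate.
        raise FileNotFoundError(expected_target)
    if os.path.lexists(link_path):
        if not os.path.islink(link_path):
            # Refuse to delete a regular file or directory.
            raise FileExistsError(link_path)
        os.remove(link_path)
    os.symlink(expected_target, link_path)
    # Assumption: the server must reload to pick up the restored config.
    reload_server()
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "sites-available-default")
        link = os.path.join(tmp, "sites-enabled-default")
        open(target, "w").close()
        os.symlink(os.path.join(tmp, "gone"), link)  # simulate the outage
        print(ensure_link(link, target, lambda: None))  # True
        print(ensure_link(link, target, lambda: None))  # False
```

The function does nothing when the link is already healthy, so it is safe to run repeatedly. It raises instead of guessing when the target file itself is missing.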