Post Mortem Communication
On July 21st, DigitalOcean began having issues in their New York datacenter. They kept customers updated through their status page, and today they sent an email to all their customers explaining the issue in detail.
This is one of the best examples I’ve seen of handling post-mortem communication, and the way they executed it should serve as a standard for the industry.
The first paragraph starts with a sincere apology. It provides a short summary of the incident followed by a reassurance of their commitment to the customer.
Hi, I would like to take a moment to apologize for the problems you may have experienced accessing your droplets in the NYC2 region July 21st, starting around 6PM Eastern time. Providing a stable infrastructure for all customers is our number one priority, and whenever we fall short we work to understand the problem and take steps to reduce the chance of it happening again.
The next few paragraphs focus on the issue directly. There’s no beating around the bush and no attempt to downplay the incident, just three straightforward paragraphs detailing what went wrong.
In this case, we’ve determined what were a few related events which contributed to the outage:
First, we had a problematic optical module in one of our switches that was sending malformed packets to one of the core switches in our network. Under normal circumstances, losing connectivity to a single core switch should not be problematic since each cabinet in our datacenter is connected to multiple upstream switches. In this case, however, the invalid data caused problems with the upstream core switch.
When the core switch received the invalid packet, it triggered a bug in the software on the core switch which caused some internal processes that are related to learning new network addresses to crash. Some of the downstream switches interpreted this condition in a way that caused them to stop forwarding traffic until the link to the affected core switch was manually disabled.
Once traffic forwarding was restored to the core switches, they were flooded with a large volume of MAC address information. Our network is built to be able to handle a complete failure of half of its core switches, however the volume of address updates as a number of cabinets simultaneously cycled between up and down triggered built-in denial of service protection features. This protection caused the core switches to be unable to correctly learn new address information, ultimately leading to connectivity problems to some servers.
Next is a clear call to action. It shows that they are not brushing this off, but are intent on doing something to prevent it from happening again.
Our network vendor has been engaged, and we’ve been working together to attempt to fully understand the scope of the problem and steps that we can take to address it. Concretely, we’ve begun evaluating some software updates that we believe may improve the situation. If we determine, as we hope, that these changes will improve stability in this type of situation we will build a plan to upgrade our core network to this version as soon as possible. In addition, we continue to look for additional configuration changes that we can make in the mean time to help prevent this type of problem.
Reassurance to the Customer
Near the end of the email is another reassurance of what their priorities are. It’s a strong statement that they understand the gravity of the situation and that they’ll do everything they can to validate their findings.
DigitalOcean’s top priority is to ensure your droplets are running 24 hours a day, 7 days a week, 365 days a year. We’ve taken the first steps to fully understand this outage and have begun making changes to greatly reduce the likelihood of a similar event in the future. This work is ongoing and we will continue to make changes and validate our infrastructure to ensure that it behaves as expected in adverse conditions.
Backing the Assurance
Finally, they do something that proves they mean business: they issue an SLA credit for the downtime. They also make it clear that they fell short, and that this is a gesture to stand by their commitment.
We will issue an SLA credit for the downtime you have experienced. We realize this doesn’t make up for the interruption but we want to uphold our promise to our users when we fall short.
Thank you for your patience throughout this process. We look forward to continuing to provide you with the highest possible level of service.
VP, Technical Operations
In a post about Mark Imbriaco on DigitalOcean’s blog (from May 19, 2014), he explains his approach to writing an ideal post-mortem:
- Apologize for what happened.
- Demonstrate you understand what happened.
- Explain what you will do to reduce the likelihood of it happening again.
The post-mortem above is what I’d consider a great example of how to explain the situation and show that the business takes the situation seriously.