Support Post Mortem: Slack’s Downtime

What did Slack do well, and what hurt them during their downtime yesterday?

Yesterday, Slack had some pretty significant downtime, with nearly 100% of users unable to log into their accounts to send messages, or really do anything. As support teams typically do after major incidents, we’re going to run our own mini post mortem to determine how well Slack handled the incident from a support perspective, and what they (and you) can learn from it.

What went right:

  • Slack was open and transparent about the issue on Twitter, and they were responding to seemingly everyone; they were genuinely creative and empathetic in their responses. Even better? It wasn’t even a full-time support agent responding: it was a member of Slack’s product team during his weekly support shift.
  • Their response times during the outage were fast. A co-worker of mine submitted a ticket through Slack’s support email address, and he received a response in under an hour.
  • They have a status page that reports the state of the Slack system, so users can check it during future issues instead of contacting support unnecessarily.

What went wrong:

  • While they have a great set of tools at status.slack.com, the outage affected their entire site, and so their (self-hosted) status tool was unavailable.
  • The messaging inside the Slack app was ambiguous and offered no explanation of why the app wasn’t working. Users were presented with a generic “Reconnecting in 60sec” message.
  • Their support form on the Slack website was also down, meaning users were unable to report issues through any typical support channel EXCEPT Twitter.
  • Twitter, while usually great for quick responses, is a pretty terrible support channel in general due to its a) public nature, b) limited characters, c) lack of routing or intelligent tools, and d) 1–1, reactive style of communication.

What can we learn?

  • First, it’s a good idea to have key pieces of your infrastructure hosted on separate, or at least redundant, systems: a status page should likely be hosted on a third-party site so that major downtime doesn’t also take out a critical piece of your support strategy.
  • Communication about the issue should be clearer where most people will see it: in the app. Instead of seeing a blank screen or a “Reconnecting” dialog, users should be directed to the aforementioned status page to see if anything is up.
  • While creative Twitter messages are great, 1–1 communication during an outage is still a huge waste of resources and time when it’s the same message being said over and over again. Unless there’s something unique about the situation, one-to-many is a better approach. An in-app push notification on iPhone/Android, sent through a third party, would be ideal to notify users and prevent them from creating a support contact (coincidentally, Helpshift is releasing the new Campaigns feature for Support teams to do just this).
  • Have employees from other teams help out with support! This is one of the best ways for a company to stay in tune with their users, and it helps other teams understand how their actions impact support, reducing the support burden in the long run.
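
For the in-app messaging point, here is a minimal sketch of what a more helpful "Reconnecting" dialog could look like. This is not how Slack's client works; the retry threshold, the status URL, and the names `Banner` and `outage_banner` are all hypothetical, chosen only to illustrate the idea of escalating from a generic retry notice to a pointer at a third-party-hosted status page.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical values for illustration only.
STATUS_PAGE_URL = "https://status.example.com"  # assumed to be hosted by a third party
MAX_SILENT_RETRIES = 3  # assumed threshold before we treat failures as an outage


@dataclass
class Banner:
    """What the app shows the user while it cannot connect."""
    text: str
    link: Optional[str] = None


def outage_banner(failed_attempts: int, retry_in_seconds: int) -> Banner:
    """Pick the in-app message based on how many reconnects have failed."""
    if failed_attempts < MAX_SILENT_RETRIES:
        # Probably a brief blip: a plain reconnect notice is enough.
        return Banner(f"Reconnecting in {retry_in_seconds} seconds")
    # Repeated failures suggest a real outage, so send users to the status page
    # instead of leaving them to open a support ticket.
    return Banner(
        "We're having trouble connecting. Check our status page for updates.",
        link=STATUS_PAGE_URL,
    )
```

The point of the design is that the link target lives outside your own infrastructure, so the dialog stays useful even when the main site is down.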

What are your thoughts? How did Slack’s outage affect you, and what would you have liked to see them do?