How to handle a Service Outage

Lessons Learned from Ritual Coffee Roasters.

Thomas Schranz
Product Love ❤

--

Service outages happen. Usually at the worst possible time.
Then again, is there ever a great time to have a service outage? ☺

The real question is: how do you handle them in the best possible way?

Put yourself in your Customer’s Shoes.

Earlier today I went to Ritual Coffee on Valencia Street to grab a coffee.
To my surprise they were closed for the day for a renovation. They had probably announced it, but I didn’t know.

Yet they did a fantastic job putting themselves in my shoes.

Two of their super friendly baristas were ready to greet me in front of the store. They apologized, explained the situation to me and offered everyone free coffee to make up for the inconvenience.

My Facebook status update on Ritual Coffee’s fantastic service.

This is first-class downtime handling. They weren’t operating as usual, but I still got my coffee, the main reason I visit them regularly.

They clearly care a lot about their customers.
They put themselves in the shoes of their customers.

What does a service outage mean for your customers?
Take time to think about it.

Be prepared.

The main thing you can do is to be prepared for an unexpected outage.

Whenever an outage happens, you won’t have the time & resources to keep a calm head and think about how to deliver an ideal outage experience.

Having an action plan is invaluable. It’s like a checklist to go through so you don’t forget something important. Think fire drills.

Companies like Netflix take being prepared so seriously that they even have a service called Chaos Monkey that randomly shuts down parts of their infrastructure.

The Chaos Monkey open source project by Netflix.

Being prepared for outages is really important. It involves many aspects, from robust, fault-tolerant infrastructure to communication skills …

Get on top of Communication.

I’m your customer, I care about your service, I’m worried.
Show me that you’re on top of the situation. Communication is key.

Buffer is another fantastic example of a company that did an outstanding job at handling an unexpected situation.

Here are a few thoughts on service outage communication …

  • Give me a heads-up.
    If you expect downtime, tell me in advance.
    Give me enough time to prepare and make sure the word reaches me.
    Don’t be afraid to overcommunicate. Email, banners in your app, …
  • Tell me what’s going on.
    When Buffer had a security breach late last year they handled the situation like no company I’ve ever seen. I think they got back to every single tweet that went out. They did a fantastic job of keeping everyone in the loop about what was going on.
  • Follow up once the situation is resolved.
    Once the situation is resolved, tell me what happened and what you will do about it going forward. Again, what does it all mean for me as your customer?

Show me a Status Page.

As your customer I’m interested in what’s going on,
not in cryptic error messages.

A common pattern for handling service outages is to prepare a status page.
So in case of an expected or unexpected outage you can flip a switch
and have the status page up in an instant.
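
To make the “flip a switch” idea a bit more concrete, here is a minimal sketch in Python. It assumes the switch is nothing more than a flag file that you create when an outage starts. The function names and the flag file are made up for illustration. In a real setup the switch often lives in your load balancer or DNS instead.

```python
from pathlib import Path


def outage_mode_enabled(flag_file: Path) -> bool:
    """The switch is 'on' as long as the (hypothetical) flag file exists."""
    return flag_file.exists()


def choose_page(flag_file: Path, normal_page: str, status_page: str) -> str:
    """Return the prepared status page while the switch is on,
    and the normal page otherwise."""
    if outage_mode_enabled(flag_file):
        return status_page
    return normal_page
```

Flipping the switch is then just creating or deleting the flag file, which is easy to put on a checklist and hard to get wrong under stress.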

GitHub’s Status Page. They also post updates on Twitter.

Some thoughts about status pages …

  • Think Information Hub
    I just came here to use your service. What do I need to know?
    When are you back up? What can I do in the meantime? Where do I get further information?
  • Add your Twitter Timeline
    On a similar note, add the timeline of your main Twitter account or even a dedicated status account. This is a super simple way to keep everyone informed. Twitter is independent of your infrastructure (which might be affected) and everyone has a client for it.
  • Simple & static
    Prepare a super simple page, ideally static, as low-tech as possible.
    Chances are your status page will get a lot of traffic, and you don’t want to worry about scaling dynamic requests.
  • Independent of your Infrastructure
    Sometimes unexpected things happen that are hard to prepare for.
    In 2007 a truck crashed into a power transformer feeding a Rackspace datacenter, taking down a lot of web services. Even if you have a super redundant setup on EC2, you might want to have your status page set up at a different provider, maybe even multiple independent providers.
  • Use 503 as HTTP Status Code
    You don’t want search engines to replace your indexed content with the content of your status page. Whenever the status page stands in for pages that would usually work just fine, serve it with HTTP status code 503 (Service Unavailable) instead of 200.
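
Here is a rough sketch of what serving a static status page with a 503 could look like, using only Python’s standard library. The page content, the 600 second Retry-After value and the port are assumptions for illustration, not recommendations. Retry-After is a standard HTTP header that tells clients and crawlers when it makes sense to try again.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical status page. In practice this would be your prepared static HTML.
STATUS_PAGE = (
    b"<html><body><h1>We will be right back</h1>"
    b"<p>We are working on it. Updates follow on our status account.</p>"
    b"</body></html>"
)


def maintenance_response(retry_after_seconds):
    """Build status code, headers and body for the outage response."""
    headers = {
        "Retry-After": str(retry_after_seconds),  # hint for clients and crawlers
        "Content-Type": "text/html; charset=utf-8",
        "Content-Length": str(len(STATUS_PAGE)),
    }
    return 503, headers, STATUS_PAGE


class MaintenanceHandler(BaseHTTPRequestHandler):
    """Answers every GET request with the status page and a 503."""

    def do_GET(self):
        status, headers, body = maintenance_response(600)  # assumed: back in ~10 min
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)


def serve(port=8503):
    """Run the status page server until it is stopped (port is an assumption)."""
    HTTPServer(("127.0.0.1", port), MaintenanceHandler).serve_forever()
```

Calling serve() starts the server locally. Every page then shows the status page while still signalling that the real content is only temporarily unavailable.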

Service outages happen. Put yourself into my shoes.
Be prepared, keep me posted, use a status page.

If you found this post helpful, follow me on Twitter, where I tweet about Software Development & Product Management ☺

Also make sure to check out Blossom, an Agile/Lean Project Management Tool I’m currently working on ☺
