The system is down: March 5, 2017 Snippets
Unless you spent the last week offline in the woods, chances are you were affected in some way by Amazon Web Services’ S3 outage on Tuesday. As is often the case with these kinds of things, the cause of the outage was traced back to a command-line mistake made while taking servers offline for routine maintenance: in other words, human error. But there’s an important lesson to be learned here, and we should seize the occasion to learn it: in large, complex systems, blaming human error in hindsight can be misguided and even dangerous. The problem lies with the systems themselves, not with their practitioners.
To understand why, let’s consider a very simple system: the Push Notification. The system has one job: If something happens, notify the user. This is not a complicated system. It sounds pretty easy, actually. What could go wrong?
As it turns out, quite a lot. Systems may start out small, but they encroach. They continually expand to fill all available volume, and then some more. Furthermore, as systems expand, they tend to oppose their own function. Here is how that plays out in our push notification example:
Our system will initially be useful; push notifications are typically helpful at first. As such, its users will come to rely on the system to stay alert to important information. They will use the system more, and the number of notifications delivered will increase. As our notifications expand to more messages and more use cases, an increasing number of subsystems must be dedicated to preventing notifications from being displayed. After all, if a message channel gets too saturated, the user gets annoyed and mutes it (or worse, quits the app entirely). So resources must be allocated to determining which messages the user does want to see, which they don’t, and which depend on the situation. More resources must be allocated to understanding the context of the situation to resolve grey areas: is the user at work? Are they on their laptop, or on their phone? Are they likely to be asleep? What if the message is really important? Who gets to decide what’s important: the sender or the recipient? Who gets to override whom? Inevitably you get something like this:
If you think this is a made-up example, it’s not: this is the decision tree for how Slack determines whether or not to send a push notification to your phone. And it’s quite sensible! But within this system lies hidden danger. The potential for an important message going unseen — in other words, the exact opposite of what the system was designed for — expands dramatically as the system itself gets larger and more important. This isn’t the user’s fault. Nor is it Slack’s fault. It’s an intrinsic property of systems.
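To make that hidden danger concrete, here is a deliberately tiny sketch of what a few of those suppression rules might look like in code. To be clear, this is not Slack’s actual logic: every field name, rule, and the order of the checks below is a hypothetical assumption, invented for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Message:
    # Hypothetical message attributes, invented for this sketch.
    mentions_user: bool = False
    sender_marked_urgent: bool = False


@dataclass
class Context:
    # Hypothetical signals about the recipient's situation.
    channel_muted: bool = False
    do_not_disturb: bool = False
    active_on_desktop: bool = False
    likely_asleep: bool = False


def should_push(msg: Message, ctx: Context) -> bool:
    """Return True if the message should be pushed to the user's phone."""
    if ctx.channel_muted:
        # The recipient overrides the sender, even for urgent messages.
        return False
    if ctx.do_not_disturb or ctx.likely_asleep:
        # The sender overrides the recipient, but only when marked urgent.
        return msg.sender_marked_urgent
    if ctx.active_on_desktop:
        # Assume the user will see the message on their laptop instead.
        return False
    return msg.mentions_user
```

Each rule is sensible on its own. But with only four rules, an urgent message that mentions the user in a channel they muted last month, as in `should_push(Message(mentions_user=True, sender_marked_urgent=True), Context(channel_muted=True))`, never reaches their phone. Every rule added to protect the user’s attention is also one more path by which something important can go unseen.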
Now imagine the size and complexity of a system like AWS; of course something like Tuesday’s outage is going to happen from time to time. But in a way, it’s good that it does. By revealing the weaknesses and unanticipated dependencies of the system, small errors help prevent big errors; big errors help prevent catastrophes. Tuesday’s event was certainly a big error, and possibly a ‘small catastrophe’, but it wasn’t as bad as it could have been. Rather than blame our unknown engineer for taking out S3 for the afternoon, we should really be saying thank you. When the entire world becomes made of software and that software increasingly runs on infrastructure like AWS, we’d better get good at understanding the nature of risk in these kinds of systems. And the only way to gain that experience is by dealing with failure itself.
Across the pond:
Foursquare’s remarkable comeback:
Other reading from around the Internet:
And just for fun:
In this week’s quick news and notes from the Social Capital family:
Kevin Worth, formerly of Bloomberg Digital Media, has joined CoinDesk as CEO:
It comes at a pretty exciting time for the cryptocurrency ecosystem, with the price of Bitcoin up 20% and Ether surging more than 40% in the last month. With blockchain talent rapidly becoming a key constraint on industry growth, community resources like CoinDesk could end up playing an important role in drawing new people into the industry and helping existing community members find new opportunities. Congratulations to Kevin and to the entire Digital Currency Group family; we’re looking forward to the ride ahead.
“Keep a healthy skepticism about your data. As an analyst will tell you, correlation does not mean causation. In layman’s terms, it means that it’s never easy to get to the root cause of an issue. … It’s important to set realistic expectations and boundaries around the purpose of data in your plans. Clear-cut answers are disappointingly rare — your role as a manager or product lead is to make judgments based on limited, nuanced, or even inconclusive analysis. You could debate most questions about your product or business indefinitely. When putting together everything from quarterly business goals to engineering sprint plans to product briefs, define the few key metrics you’re trying to drive, your targets, and the strategy for hitting those targets. Then step away from the dashboards.”
And finally, Box reported a great Q4 2016, with fourth-quarter revenue up 29% year over year and record cash flow from operations. Most importantly, they are officially surging out of the “J-curve”, reporting their first quarter of positive free cash flow. Congratulations to Aaron and the team on their long, rewarding journey!
Have a great week,
Alex & the team at Social Capital