What We Learn from Something We Can’t Fix

Dale Myers
Published in ITHAKA Tech
Mar 21, 2017 · 4 min read

I can’t remember the last time I felt as helpless as a CTO as I did during the AWS outage a few weeks ago. My thoughts went out to the researchers and students who depend on our cloud-based digital research platform, JSTOR, to access the journals, books, and other content they need to complete their coursework, presentations, and research papers. In some very strongly worded messages, we heard their frustration that JSTOR “only goes down every time I have a paper due.” That’s not really true, of course, but I know it feels that way when you are under the gun to finish an assignment or a project. I felt equally bad for our staff, who desperately wanted to solve the problems for our users but were dependent on Amazon’s restoration of S3 to do so.

Silver linings are rare in these kinds of situations, but amidst the feeling of helplessness, something great happened. I saw the culture we have worked so diligently to nurture at its absolute best. While social media erupted with cartoons of software developers kicking back and enjoying a break while they waited for S3 to return to life (symbolic of teams that do not take ownership of problems caused by failures outside their control), our team rallied. I saw them meet the problems head on, communicate effectively with stakeholders and end users, and prepare to restore service as fast as possible when S3 came back. And they were profoundly successful in achieving that goal. With our move to the cloud, we face, like others, some of the same issues we had before DevOps and ownership of an entire system. But our teams have met those challenges and embraced the ideas of total system ownership, accountability, and performance. I saw it that day even more starkly than usual.

I sent the following message to our staff after the outage:

“By now you’ve probably all seen Amazon’s report on Tuesday’s outage. I think as engineers and technologists we can all empathize with the mistake that engineer made and can easily imagine ourselves in circumstances with a similar level of risk for JSTOR users. That said, I want to share a few thoughts on the outcome.

First and foremost, I am extremely proud of the way our teams took ownership of this problem; I saw no abdication of responsibility, no treating the event as solely an AWS issue. This represented the best of our blameless culture. Our response was swift, and we did a good job of communicating with users and stakeholders. I’m sorry I couldn’t attend the blameless postmortem (BPM), but I have read the document that came out of the session. I’m extremely impressed with the thinking and the ideas for solutions, which manifest your sincere desire to continually improve our performance and solve difficult problems. To be honest, as intimidating as the outage was, I was more concerned with our ability to recover. Again, your dedication, creativity, and old-fashioned elbow grease enabled us to recover in the best way possible. Your efforts saved us from what could have been a significant negative impact on our users and on the KPIs we use to measure impact and performance. I feel an off-site happy hour coming.

Second, I am very pleased with our approach of treating this event as a learning opportunity. Again, your live chronicling of the events (all 19 pages), their impacts, and the actions taken, using the BPM framework, will be an outstanding learning tool going forward. From this narrative, I believe you will innovate and design new solutions that will make our site more resilient. It will also help us plan for failure, which will mean faster recovery and fewer long nights and obnoxious PD alerts.

Third, I trust that you can all see the value in the efforts we’ve put into building a blameless culture where everyone is accountable for performance and all failure is systemic. If you haven’t, I urge you to carefully read Amazon’s remediation, which focused on improving the system and tools. They were transparent about the human failure but were clear that the solution was not directed at it. I have the utmost confidence that we will continue to focus on improving our systems and our systems thinking when there are failures, which is the only way to build lasting and reliable solutions for our customers.”

While I hate that the AWS outage occurred, I value having had the experience and the impetus to write to our teams. I hope our story and performance will be helpful to others working to shift their own cultures and practices. It’s a hard-won road, but worth the effort.

You can read more about aspects of our move toward team ownership and our culture in a few other posts: How do Teams Really Achieve Product Quality?, The Spirit and Practice of Agile, and Change for the Better. And please feel free to drop us a comment with your own experience or any questions.
