Why UX Practitioners Should Learn About SRE

Google SRE Book

When you work on a platform or an API product, it can be challenging to form a cohesive vision and a prioritized roadmap. Agile and Design Thinking frameworks start with the end user. Focusing on research, feedback, and user needs is a good way to surface unsolved problems, and a necessary step, but it should not be the only methodology teams use. Products that are expected to “just work” are often difficult to gather meaningful feedback around.

That said, technologies that “just work” have achieved the ultimate technology UX: you forget they exist despite them being integral to your life. Think of GPS, Wi-Fi, or 4G LTE. All of these went through an innovation cycle to make them consumer-ready, but doing end-user research on them as products would be difficult. They are perfect candidates for the practices outlined by Google in their Site Reliability Engineering book.

Researching Reliability

Understanding reliability is as complex a problem as understanding user needs, and we still need to consider the user: even more important than poor reliability is the perception of poor reliability. That’s why it’s essential that balanced teams start involving UX researchers in the reliability research of their product; ultimately this is a tool for product design.

That said, understanding reliability requires a new vocabulary and a comfort with automation and statistical analysis that may not be familiar to some. These are new skills for most researchers, but they are teachable. Most researchers will find it most natural to start with the concepts of Operator Experience Design (OX), but the research toolkit is much richer than that.

Operational Readiness Review

Building products that can scale is hard. So hard, in fact, that it is the stumbling block for many new products. Even Google may have missed social in part because of scaling challenges on their otherwise successful social network Orkut. These experiences (and many more, I assume) prompted Google to adopt a review process for products before they are operated at scale.

This may sound orthogonal to agile practices, but the idea is not to achieve perfection; rather, it is to define a Minimum Operable Product. The review is also designed to identify unexpected operating scenarios and plan for them. This research is done with the engineers of the product, and the result is an assessment of each scenario on two axes: likelihood and impact.

From the Google Cloud Platform blog.

I’d argue you can mostly ignore the rare and minimal occurrences, leaving you with one of my favorite tools, and one most UX practitioners know well: the 2x2.
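To make the quadrants concrete, here is a minimal Go sketch of how a team might tag each reviewed scenario along the two axes. The quadrant labels and suggested responses are hypothetical examples, not part of Google’s process; where the thresholds for “frequent” and “damaging” sit is a judgment call for your team.

```go
package main

import "fmt"

// quadrant places a reviewed operating scenario on the 2x2.
// The suggested responses below are illustrative only; each team
// decides what its own quadrants demand.
func quadrant(frequent, damaging bool) string {
	switch {
	case frequent && damaging:
		return "Frequent & Damaging: fix before scaling further"
	case frequent:
		return "Frequent & Minimal: automate away the toil"
	case damaging:
		return "Rare & Damaging: write a runbook and practice it"
	default:
		return "Rare & Minimal: accept the risk for now"
	}
}

func main() {
	fmt.Println(quadrant(true, true))
}
```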

Measuring Reliability

While an operational readiness review is the equivalent of traditional qualitative UX research, it is also important to develop instrumentation to measure the reliability of your product. This is a complicated process, and an anecdote about how Loggregator defined its Service Level Objectives might be helpful, but the core concepts are the following.

Blackbox Monitoring — Blackbox monitoring comes from the idea of treating your product like a black box and measuring only the inputs and outputs of the system. It’s incredibly effective, especially at removing bias and unknowns, and it is easy to explain and understand.
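As a sketch of the idea, the probe below measures only what a user would observe from the outside: whether a request succeeded and how long it took. The URL is a placeholder, and this is not Loggregator’s actual instrumentation.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// probe issues a request against the system from the outside,
// recording only the user-visible outputs: success and latency.
func probe(url string) (ok bool, latency time.Duration) {
	start := time.Now()
	resp, err := http.Get(url)
	latency = time.Since(start)
	if err != nil {
		return false, latency
	}
	defer resp.Body.Close()
	// Treat server errors as failures; what counts as "success"
	// is a product decision, not a technical one.
	return resp.StatusCode < 500, latency
}

func main() {
	// Placeholder endpoint for your product's public surface.
	ok, latency := probe("https://example.com/healthz")
	fmt.Printf("success=%v latency=%v\n", ok, latency)
}
```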

Error Budgets — Once you have functional and precise blackbox monitoring deployed, the logical next step is to set a goal for these metrics. In many organizations this focuses on counting 9’s of reliability, but the SRE practice encourages focusing on the inverse: the budget of failure you can spend while still staying within the Service Level Objective (those are the 9’s). Thinking about this as a budget allows operators to plan upgrades and new features.
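The arithmetic behind a budget is simple. As a sketch, assuming a hypothetical 99.9% availability SLO measured over a 30-day window, the remaining 0.1% is the downtime you can spend on upgrades, experiments, and bad luck:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical SLO: 99.9% availability over a 30-day window.
	slo := 0.999
	window := 30 * 24 * time.Hour

	// The error budget is the inverse of the objective.
	budget := time.Duration((1 - slo) * float64(window)).Round(time.Second)
	fmt.Printf("error budget per 30 days: %v\n", budget) // 43m12s
}
```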

It’s worth pointing out that this creates a healthy tension between business expectations, pressure to innovate, and operational demands. That’s why it’s so important that teams “go to bat” in advocating for their objectives and budgets. Thinking of reliability objectives as budgets for innovation is a conversation that all stakeholders, including UX designers, should be involved in.

MTTD and MTTR — These acronyms are shorthand for Mean Time to Discover and Mean Time to Repair, essential metrics for operators to assess the production readiness of a system. It’s important to remember that operators plan for things to break, so naturally their star metric for software is the time it takes to fix it.
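Both are plain averages over incident timestamps, as in the sketch below. The Incident type and its field names are illustrative, not from any particular tracker, and note that some teams measure repair time from the start of the incident rather than from discovery:

```go
package main

import (
	"fmt"
	"time"
)

// Incident captures the key timestamps of one outage.
// The type and field names are illustrative only.
type Incident struct {
	Started  time.Time
	Detected time.Time
	Resolved time.Time
}

// meanTimes computes MTTD (start to discovery) and MTTR
// (discovery to resolution) across a set of incidents.
func meanTimes(incidents []Incident) (mttd, mttr time.Duration) {
	var detect, repair time.Duration
	for _, i := range incidents {
		detect += i.Detected.Sub(i.Started)
		repair += i.Resolved.Sub(i.Detected)
	}
	n := time.Duration(len(incidents))
	return detect / n, repair / n
}

func main() {
	t := time.Now()
	incidents := []Incident{
		{Started: t, Detected: t.Add(5 * time.Minute), Resolved: t.Add(35 * time.Minute)},
		{Started: t, Detected: t.Add(3 * time.Minute), Resolved: t.Add(13 * time.Minute)},
	}
	mttd, mttr := meanTimes(incidents)
	fmt.Printf("MTTD=%v MTTR=%v\n", mttd, mttr) // MTTD=4m0s MTTR=20m0s
}
```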

Synthesis

As with user research, much of the insight comes from synthesizing your findings along with anecdotes from users and operators. While this may seem daunting, I have found that a persistent artifact capturing each feature’s evidence, readiness, and complexity is a helpful thing to maintain while your team prioritizes work.

Living Roadmap Artifact

I have found that using this process to form a “Living Roadmap” is especially effective for medium-range planning. Ultimately I like to rely on metrics to inform the strategic direction, and with reliability this comes down to answering two simple questions.

  1. Are we currently meeting our users’ reliability expectations? If not, in what way, and how do we address it?
  2. If we are meeting expectations, how do we plan to measure the reliability of our next feature?

Practical Examples

These concepts can seem abstract, but the process has proven extremely helpful for the Loggregator team. Based on an operational readiness review, we identified two Frequent and Damaging reliability issues.

To see how we distilled this down into deliverable features see my post about Rapid Troubleshooting of the Cloud Foundry Logging System.