Designing for the Operator Experience

Published in Statuscode, Jan 3, 2018 (5 min read)

With the rise of cloud providers and wider adoption of agile practices, an increasing number of organizations are faced with operating a large and growing amount of software.

This means that software not only needs to solve users’ problems, it also needs to be easy to operate, scale and maintain.

Solving for the latter can be thought of as Operator Experience (OX).

Understanding “Operators”

Of course, the foundation of any design approach is research. After all, Operators are people, and in general you design for them the same way you would design an app that allows people to pay bills online. In fact, paying bills online is often an important part of the Operator’s job.

That said, there are many unique aspects of the Operator’s background and experience that are worth exploring. For a deeper dive into operator culture, searching for “DevOps” and “Site Reliability Engineer (SRE)” will give you a sense of the mindset most operators have. Here are a few distilled observations from a year of working with Operators of Cloud Foundry.

Usually they work for large corporations or governments. I see growing demand for SREs emerging among small and medium-sized organizations, but it is still diffuse for the most part. That said, some players like Rackspace are stepping up their investment in this market.

They are comfortable at the command line and understand Linux operating systems. Surprisingly, their local operating systems are a fairly even blend, with Linux, Windows and Mac all having prominent usage. This diversity makes them fluent across many OSes, but they usually specialize in one infrastructure. The most common infrastructure skill is AWS, but only by a small margin overall. Managing on-premises infrastructure is still a huge need for their organizations, and Azure is increasing in popularity (I like GCP, but it’s still not very common).

They take security seriously. It’d be false for me to say that all operators follow security best practices all the time, but I find they are usually adopters of password vaults, two-factor authentication, and strong passwords. They are willing to sacrifice at least some time and convenience for security.

Finally, most have a deep understanding of software development, but often only in one or two programming languages. This is challenging. It means that in some cases operators may feel comfortable downloading and compiling source code, but in other cases you may need to provide explicit guidance. It’s also worth noting that while most SREs have a background in software engineering, some may have limited respect for agile practices and rapid value delivery.

YAML Driven Design

As your organization becomes more “cloud native” there is often an explosion of declarative state files associated with applications and their corresponding infrastructure. These files give operators a deployable “environment” for teams to test with, and let them replicate test environments at production scale.

The way I usually see this work in the real world is that developers port over template markup files and append to them as the application grows, leaving operators to make sense of the result without much guidance. Product Managers are often not even aware of these “spec files”, as they are generally considered “code”, not product.

Observing this behavior, I have found that Product Managers and Designers are missing a big opportunity to design sensible operator experiences using markup from the outset.

Let’s explore an example. Suppose you are asked to design an application that tracks time for employees and stores it in a database. During the Discovery and Framing of this product you realize that each department actually needs its own instance of the application, integrated with its own legacy systems. Fine, so the Product team starts mocking up a basic GUI to select and configure the time tracker for each department. Of course, this tool needs a landing page, so let’s just drop a dashboard with some summary stats on there: things we heard about during feedback rounds as “cool ideas” from executives. Oh, and since this is an internal tool, let’s try out this new UI framework while we’re at it. We can ship this last and configure things manually for now.

In reality, though, this is a classic trap of not designing for your actual user: the Operator. As mentioned, most operators are comfortable at the command line, and markup is an ideal way to manage configuration because state changes can be tracked in source control. Powerful features you’d never get to in an admin tool, such as change history and review, are available for free from GitHub.

As an alternative, I like to start with markup (YAML specifically) and get operator feedback early in the process. For example, I might create something like the following for our hypothetical app.

department details:
   department-name: Site Reliability Engineering
   department-billing-codes: ['1234', 'abcdef']
dependencies:
   ingress:
      additional-time-tracking-api: http:my-time.com
      format: json
   egress:
      database: my-sql
secrets:

Of course, if I have access to someone who will be operating the application, I can quickly get feedback and make changes before breaking this down into stories for the Engineers. Like any prototype, this is likely to be an oversimplified representation of the end state, but it can drastically shorten the feedback loop between Software Engineers and Site Reliability Engineers.
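
One way to turn a spec prototype like this into engineering stories is to write down which keys the application would expect and check a draft against them. The Python sketch below is only an illustration under assumptions: the spec is shown as the nested dictionary a YAML parser would typically produce, and `REQUIRED_KEYS` and `find_missing_keys` are hypothetical names, not part of any real tool.

```python
# Hypothetical sketch: the key names mirror the prototype spec above, but
# which keys count as "required" is an assumption made for illustration.
REQUIRED_KEYS = {
    "department details": ["department-name", "department-billing-codes"],
    "dependencies": ["ingress", "egress"],
}


def find_missing_keys(spec):
    """Return a sorted list of missing sections or 'section.key' paths."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        block = spec.get(section)
        if not isinstance(block, dict):
            # The whole section is absent or is not a mapping.
            missing.append(section)
            continue
        for key in keys:
            if key not in block:
                missing.append(f"{section}.{key}")
    return sorted(missing)


# The prototype spec, written as the dictionary a parser would hand back.
example_spec = {
    "department details": {
        "department-name": "Site Reliability Engineering",
        "department-billing-codes": ["1234", "abcdef"],
    },
    "dependencies": {
        "ingress": {
            "additional-time-tracking-api": "http:my-time.com",
            "format": "json",
        },
        "egress": {"database": "my-sql"},
    },
    "secrets": None,
}

print(find_missing_keys(example_spec))
```

A check like this also gives operators a concrete artifact to react to: if they disagree with what is marked as required, that is feedback worth having before any code is written.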

Prototyping CLI Tools

If you are asking operators to do lots of orchestration or navigate a set of APIs to manage your software, you should consider creating a command line interface for Operators. This may sound hard, but chances are your engineers will be excited to build one, and it can be an incredibly tight feedback loop for developing new features.

That said, before you embark on fleshing out a full-featured CLI tool, it’s a good idea to prototype some of the commands you are considering. There is a great post on Hacker Noon by Matt Rothenburg covering this approach. Even just mocking up inputs and outputs in a text file can be incredibly powerful.
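
If a text file feels too static, a throwaway script that parses the proposed commands and prints canned output is a small step up. The sketch below uses Python’s standard `argparse` module. The tool name `timetracker`, its `departments` and `deploy` commands, and the `--dry-run` flag are all hypothetical, invented here to match the time tracking example, and nothing in it talks to a real service.

```python
import argparse

# Canned data stands in for real behavior; this prototype only exists so
# operators can react to command names, arguments and output wording.
CANNED_DEPARTMENTS = ["Site Reliability Engineering"]


def build_parser():
    # "timetracker" and its subcommands are hypothetical names.
    parser = argparse.ArgumentParser(prog="timetracker")
    commands = parser.add_subparsers(dest="command", required=True)
    commands.add_parser("departments", help="list configured departments")
    deploy = commands.add_parser("deploy", help="deploy an instance from a spec file")
    deploy.add_argument("spec", help="path to the department spec file")
    deploy.add_argument("--dry-run", action="store_true",
                        help="describe the deployment without performing it")
    return parser


def run(argv):
    """Parse argv and return the mocked output as a string."""
    args = build_parser().parse_args(argv)
    if args.command == "departments":
        return "\n".join(CANNED_DEPARTMENTS)
    mode = "Would deploy" if args.dry_run else "Deployed"
    return f"{mode} time tracker from {args.spec}"


print(run(["deploy", "sre.yml", "--dry-run"]))
```

Because `argparse` generates `--help` text from the same definitions, the prototype also doubles as a first draft of the CLI’s documentation.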

Package Management and READMEs

If you are going down the route of creating tooling that an operator can use to manage your software, you’ll likely need to consider how the operator will install those tools. Rather than lay out the pros and cons of various package managers, I am going to stress the importance of READMEs. Operators actually read README files, so they should have clear and concise installation directions front and center.

Open Source Overlap

Finally, one adjustment you may need to make when approaching solutions for Operators is that they often benefit from having multiple routes to the same functionality. For example, it is not uncommon to have a CLI tool that directly mirrors the functionality of a closed source GUI. This isn’t poor coordination; it’s how large software products can iterate on backing services out of sync with interface development. It’s also often how for-profit companies can build a business around open source tools.