Towards Operational Excellence

Part 1 — Customers, Culture, and why you should care.

Adrian Hornsby
Jan 15 · 11 min read

I would like to express my gratitude and appreciation to Peter Vosshall, Distinguished Engineer at AWS, for the inspiration behind this series of blog posts. Peter’s work has always been an inspiration to me, and without him, this series wouldn’t exist.


Once systems are designed, implemented, and tested, we come to what is arguably one of the hardest aspects in the lifecycle of a system: bringing it to life and sustaining it in operations. In this series of posts, I’ll discuss Operational Excellence, focusing on the three essential interconnecting elements that enable you to successfully operate the technology you’ve built — Culture, Tools, and Processes.

Part 1 of the series will cover the cultural side of Operational Excellence. Part 2 will discuss Tools. And Part 3 will cover Processes — or what we prefer to call Mechanisms.

Let’s get to it!

Note: This series of blog posts differs from what I usually write about. So, to respect your time: if you expect a deep dive on technology and some code, feel free to move on. If, instead, you are interested in culture and change, this is for you.


What is Operational Excellence?

Before answering that question, I should point out that when the whole business is fundamentally dependent on technology, which is the case for Amazon’s activities, including AWS, Operational Excellence (OE) is critical. That was already true when Amazon launched in 1995 and only sold books. If the online bookstore was down, it was as if Amazon had locked its front doors. Since Amazon already had customers across the globe at that time, there was never an appropriate time to close those doors.

Amazon.com bookstore — 1995

If you look at the past, I think it’s fair to say that since then, Amazon, including AWS, has had a successful track record. Being a successful technology company requires OE. But what is OE? And how do you achieve OE?


Once upon a time

Back in the early days, systems running the amazon.com bookstore were reasonably straightforward. Not quite as simple as shown in the picture below, but this was the essence of it. It was a simple, three-tier architecture, with a front page, a few internal tools, a web server, and a database.

The same database was used by the bookstore, the customer service tools, and the fulfillment center tools. As you can imagine, with this “everything shared” architecture, Amazon quickly ran into scaling issues and a lack of innovation.

This “monolith” was slowing teams down. Long delays between development, testing, and release to production were common problems. It was challenging to deliver value to customers fast enough.

In other words, it was choking innovation.

Breaking the monolith

Amazon invested heavily in cleaving this monolith and decoupling its components by moving to a Service-Oriented Architecture (SOA), where each functional domain would be split out into its own service and provided to other services through an API over a network. But that wasn’t all. Each service would be managed by a separate team that owned and operated that service.

In doing so, Amazon was able to innovate quickly and easily support hundreds of millions of active customers on an architecture composed of thousands of different services.

So, this brings me back to our initial question:

What is Operational Excellence?

Let’s be honest: OE is a difficult term to define. One thing is for sure, however: it isn’t a list of things that you do or boxes you check. Instead, as Aristotle said, it is more of a habit, something you repeatedly do.

According to Wikipedia, a habit is “a routine of behavior that is repeated regularly and tends to occur subconsciously.”

And indeed, OE resembles a habit, a philosophy, a mindset — one that embraces problem-solving, one that values continuous improvement, and one that aims to exceed goals consistently. It’s a way to anticipate, address, and effectively respond to issues. And, for Amazon, it also means doing all of that at a significant scale, where significant can mean thousands of people and millions of servers across the globe.

So, maybe a better question to ask is:

How does a technology organization move toward OE?

In my view, it takes three essential interconnecting elements to operate the technology we build successfully. First, you need to have the right culture. Second, you need great tools. And third, you need complete processes.

As mentioned in the introduction, I’ll discuss each element in this series of posts. This one is about culture.


Culture

Amazon’s culture is probably best understood by examining the Leadership Principles. They are the core values by which every Amazonian operates — the DNA of our company. They have been a revelation to me and a big reason why I have loved working with AWS.

Unlike most companies I’ve worked with, Amazon’s Leadership Principles aren’t just inspirational wall hangings. Amazonians live and breathe them every day. Whether they’re interacting with customers, discussing new ideas, deciding on the best way to patch a server, or hiring people, the Leadership Principles are the guiding lights. While they are called “leadership” principles, they apply to everyone, equally — whether it’s Jeff Bezos or a software developer hired out of college. It’s one of the things that makes Amazon peculiar.

All fourteen of them are important for OE, but in my opinion, a few hit the nail on the head.

Let me address three of the most obvious ones first, and I’ll touch on the others in following blog posts.

Customer Obsession

Leaders start with the customer and work backward. They work vigorously to earn and keep customer trust. Although leaders pay attention to competitors, they obsess over customers.

Insist on the Highest Standards

Leaders have relentlessly high standards — many people may think these standards are unreasonably high. Leaders are continually raising the bar and drive their teams to deliver high quality products, services, and processes. Leaders ensure that defects do not get sent down the line and that problems are fixed so they stay fixed.

Ownership

Leaders are owners. They think long term and don’t sacrifice long-term value for short-term results. They act on behalf of the entire company, beyond just their own team. They never say “that’s not my job.”


Customer obsession is, arguably, the keystone value for the entire company. So much so that it is the focal point of the company’s vision statement.

“Our vision is to be earth’s most customer-centric company; to build a place where people can come to find and discover anything they might want to buy online.” — Amazon Vision Statement

First, it is essential to note that customer obsession is much more than customer satisfaction or customer happiness. The philosophy is about doing what is right for the customer first, then working backward. It’s been that way since the early days.

In engineering, that means you put the customers first and find ways to do more for them, faster, without the resources of much larger companies. At times the challenges seemed daunting, but as the old adage says:

“necessity is the mother of invention.”

To illustrate the importance of customer obsession related to OE, let me explain the Amazon Flywheel.

Amazon Flywheel
  1. As you may have noticed, customer experience is critical here. It’s where it all started.
  2. As customer experience improves and excels (e.g., the One-Click checkout and Go Stores), it naturally drives more traffic to Amazon.com.
  3. As more people use Amazon.com, it attracts more sellers.
  4. More sellers naturally provide a greater selection of products for customers.
  5. Thanks to increased sales on Amazon.com, Amazon can, in turn, lower its cost structure and reduce prices for the customers.

Once the Flywheel starts, it’s a snowball effect; traffic increases, leading to more sales and lower prices.

As Werner Vogels explains in this blog post, people often think being innovative is only about discovering uncharted territories. Yet, as the Flywheel demonstrates, it’s also essential to innovate across dimensions that are important for customers, such as low prices, a wide selection of products, fast delivery, and of course, convenience.

“Even when they don’t yet know it, customers want something better, and your desire to delight customers will drive you to invent on their behalf.” — Jeff Bezos, 2016 Letter to Shareholders

For AWS, the dimensions important to customers are directly related to OE: scale, reliability, security, performance, ease of use, and of course, pricing.

Any improvements made to these dimensions will generate both long-term and immediate benefits for AWS customers. For example, as the AWS platform grows, its scale enables AWS to operate more efficiently, and it can pass the benefits back to customers in the form of cost savings. Since AWS launched in 2006, prices have been reduced 78 times.


Peculiar ways

Amazon does many things every day to be better for customers — to operate better, innovate faster, and reduce customers’ costs. They are Amazon’s “peculiar ways.” You have to be willing to do things differently from others — and sometimes, that way might seem a little odd or hard to understand.

Let me touch on some of these peculiar ways.


All customers are equal

Amazonians genuinely obsess over customers, and working backward from them helps us raise our standards.

In the early days of AWS, a person under the alias Low-Flying-Hawk regularly suggested new features in forums, and AWS teams began to look forward to that feedback so much that they would ask in meetings:

“What would Low-Flying-Hawk say?”

The funny and peculiar thing is that Low-Flying-Hawk didn’t spend a considerable sum with AWS ($3 a month, to be precise), but because this person’s input was so valued, Amazon named a building after the alias.

Great products and services come from a deep understanding of the customer. If we jump straight to a solution without spending time listening and thinking about customer needs, we limit our options for inventing a delightful experience for customers and thus limit our ability to set the Flywheel in motion.


Two-pizza teams

We’re wedded to the idea of using “single-threaded teams” to accomplish things for customers. While many organizations structure themselves into functional silos (separating operations and engineering), AWS believes that a customer-focused organizational model, where all the functions live under one roof and are focused on a single customer need, gets the best results.

Each service is owned and operated by a “two-pizza team,” so called because it is small enough to be fed with two pizzas: rarely more than 8–12 people, who build, deploy, maintain, and operate the service.

Yes, you read correctly! Build. Deploy. Maintain. Operate.

They even Deprecate if they have to. That’s not something developers love to do, but in many cases, it’s essential to do correctly to minimize tech debt; done poorly, it results in increased costs, unmanaged dependencies, and randomization of the team.

But as you can see in the picture below, they aren’t responsible for building the tools. For that, AWS has dedicated teams.


You build it; you ship it

Amazon has a strong culture of ownership, and ownership extends to operating software in production.

Amazon doesn’t believe in “throwing it over the wall.”

That means developers get to see how their code works in production. One of the significant benefits is that developers have the insights to immediately understand and address the contributing factors of the operational problems they encounter.

Of course, two-pizza teams have distinct quality assurance and operations resources, but they are part of the same service team and work closely with developers.

While many companies have focused on tearing down the wall of confusion between development and operations — a movement known as DevOps transformation — Amazon hasn’t had to do this, because, to begin with, it never had a wall.


Take away

You are probably wondering what to take away from all that — and that’s a fair question. The first thing that I want to point out is that you don’t have to tear down the entire organization or change its culture. Try some of these ideas on a single team or a small project first, and observe what happens.

For your next project, instead of implementing new features first, ask yourself — honestly — if this is right for the customers. Ask them too. Spend more time with them.

Look at ways to improve the feedback loop from your customers back to the developers. Can your developers talk directly with customers?

Think of ways to break your architecture and teams down to enable the agility you need.

Experiment with the two-pizza team idea. Let your team own the whole process, from the initial design to the release to production and maintenance. I have never met a developer who doesn’t like the idea of releasing their work to production. Never doubt that a small group of empowered, committed developers can do miracles.

Finally, give some thought to developing your own Leadership Principles. After all, this is where it all starts. Good Leadership Principles will be very inspirational for your employees. But don’t feel bad if you can’t find yours quite yet.

One peculiar mechanism that Amazon uses to deal with ambiguity or uncertainty is the effective use of tenets. Teams define their own tenets. Anyone in the organization can challenge tenets if they know of better ones.

What are your tenets for developing, delivering, and operating secure applications in the cloud?

Taking even a short amount of time to think of tenets for your team, and allowing them to be challenged, amended, or modified, will open the door for change and improvements to naturally happen.

“You can write down your corporate culture, but when you do so, you’re discovering it, uncovering it — not creating it. It is created slowly over time by the people and by events — by the stories of past success and failure that become a deep part of the company lore.”

— Jeff Bezos, 2015 Amazon.com letter to shareholders


Wrapping up

Organizations have to manage their operations with greater efficiency to provide high-quality services to their customers at a reduced cost. By focusing on incremental improvements and small changes, you will quickly enable the success of your business by increasing the efficiency and effectiveness of your operations. And before you know it, the culture of your organization will slowly start to develop too.

If you decide to put OE on your company’s agenda (and you should), you have to think carefully about the culture in your organization and the habits it forms. And remember, you must turn OE into a habit, not a one-time thing, because it’s those habits that will eventually knock at the door of your customers, and when they do, you want your customers to be smiling :)


That’s all for Part 1, folks. Please don’t hesitate to share your feedback and opinions. In the next post, I’ll discuss why tools are so crucial to Operational Excellence.

-Adrian

UPDATE: Access Part 2 and Part 3

Written by Adrian Hornsby

Principal Developer Advocate, Architecture @awscloud ☁️ I break stuff .. mostly. Opinions here are my own.
