Engineering & Operational excellence at “ASOS” scale — Part I

Scott Frampton
Published in ASOS Tech Blog · 15 min read · Jun 8, 2023


This is the first in a series of posts which will discuss how we continually improve our engineering and operational practices to meet the needs of a global customer base.

Today I’ll be walking you through the ASOS Fundamentals: a set of definitions and measures used to guide teams in building reliable and resilient software. They were born from the following question:

How can we understand whether our many services are built to the right standard?

The answer to this fundamental question (sorry, it’s the only time I’ll make that joke… possibly) is what I aim to describe in this post — what we did, what we learned, and what we plan to do in the future.

Firstly, who am I?

I’m Scott, a Principal Software Engineer working in the Operations and Reliability area. I joined ASOS in 2013 and, aside from a short stint away to “try something new”, I happily returned in 2021.

Over the last year or so I’ve worked within Reliability & Operations, having spent the time before supporting a variety of our digital platforms, and I’m loving every minute of it.

So what does a Principal Software Engineer do?

I’m part of the Principal Engineering team. As well as providing strategy, thought leadership, coaching and mentoring across the technology organisation, we work alongside our peers, amongst the platform teams and other “central” functions, to champion good engineering culture and to drive continuous improvement in the tooling, processes and practices that enhance our ability to deliver software.

We are less about what we are building, and more about how.

Where do Principal Engineering sit in the organisation?

A description of the entire ASOS organisational structure is outside the scope of this article, but the diagram below hopefully gives a snapshot of where Principal Engineering, our peers in Architecture, and Reliability & Operations sit in relation to the platform teams (in light grey) who ultimately build, deploy and operate the logical services they own, services which make up part of a large-scale microservices estate.

That’s nice, but what does this have to do with anything?

So now we understand that it’s the platform teams that ultimately build the services our customers enjoy, and they do so using established engineering and delivery practices like Agile and DevOps — in particular “you build it, you run it”.

Historically, beyond some light (but non-negotiable) technical constraints such as “Cloud first, Azure first”, we gave teams the freedom and autonomy to solve problems in the ways they thought most appropriate.

This is powerful. We know there’s rarely a “one size fits all” when it comes to software engineering, and we try to avoid prescribing solutions from outside. However, it’s fair to say that over time this made it difficult to gauge how teams were balancing that autonomy with the responsibilities that come with operating complex software: could we diagnose issues quickly? How resilient was the software? Was it appropriately performance tested? Could it be deployed at 3am?

“Reliability is the most important feature of any system”, Google Site Reliability Engineering

Yeah but “they build it, they run it”… so why does it matter to anyone outside the team?

In a nutshell, it’s because services do not operate in isolation when you zoom out a bit.

They’re part of a complex, interwoven estate. While we of course establish clear boundaries between services, separate concerns where we can, and strive for asynchronous, event-based approaches where possible, there’s no getting away from the fact that microservice architectures, for all they offer, bring challenges too: at some point you need to stitch things together.

Consider the not-impossible scenario where a key technology in our underlying cloud platform becomes unavailable. Do we want every individual team using that technology to reach out to our suppliers or partners? They could, but this is inefficient and, most crucially, impacts our ability to restore service to our customers as quickly as possible.

That’s where Reliability and Operations (now known as Re:Ops) comes into the picture. We see the breadth of the estate rather than the depth, act as a safety net and an enterprise-scope incident response capability, and as a result are an important voice in shaping the language used to understand whether each service has been built with the right amount of engineering and operational rigour and readiness.

Enter the Fundamentals 🏗️

In late 2017, a group of people from across Tech created the ASOS Fundamentals, which at their core described four pillars focused on employing greater reliability engineering practices and driving improved operational maturity into our software estate.

This is reflected in the diagram below, which was lifted directly from the original presentation deck:

As you’ll see, as well as defining the pillars, we also took the first steps on our Site Reliability Engineering journey, establishing SLOs and SLIs to measure Availability and Latency.

This wasn’t because SRE was the new and glamorous “thing” (well, not wholly anyway 😊), but to emphasise that these new Fundamentals were not an arbitrary reporting mechanism or a stick to beat teams with; they were designed from the start to be a fundamental (sorry, not sorry) agent for change.

We wanted the Fundamentals to prompt action and meaningful change: to give teams a set of objective measures (and guidance) that could help them understand what “good” looked like, provide a means to prioritise related work alongside their many other commitments, and, importantly, measure the outcomes and impact of this investment through SLOs.
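To make that measurement idea a little more concrete, here’s a minimal, purely illustrative sketch of how an availability SLI and a latency SLI might be calculated against example targets. The numbers, thresholds and targets below are hypothetical, not our actual SLOs.

```python
# Illustrative only: hypothetical SLIs compared against hypothetical SLO targets.

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully over the measurement window."""
    return successful_requests / total_requests

def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """Fraction of requests completing within the latency threshold."""
    return sum(1 for latency in latencies_ms if latency <= threshold_ms) / len(latencies_ms)

AVAILABILITY_SLO = 0.999  # hypothetical target: 99.9% of requests succeed
LATENCY_SLO = 0.95        # hypothetical target: 95% of requests complete within 300 ms

availability = availability_sli(successful_requests=999_412, total_requests=1_000_000)
latency = latency_sli(latencies_ms=[120.0, 180.0, 250.0, 480.0], threshold_ms=300.0)

print(f"Availability SLI {availability:.4%} vs SLO {AVAILABILITY_SLO:.1%} "
      f"-> {'met' if availability >= AVAILABILITY_SLO else 'error budget burning'}")
print(f"Latency SLI {latency:.1%} vs SLO {LATENCY_SLO:.0%} "
      f"-> {'met' if latency >= LATENCY_SLO else 'error budget burning'}")
```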

For this article I’ll be focusing on the Fundamentals, as we’ll be delving into SLOs and our observability journey in a future article by our very own Dylan Morley.

The Four Pillars

There were a number of influences on our pillars, ranging from the experience, learnings and knowledge we already had within the organisation to insight gathered from other organisations and partners such as Microsoft.

Around the time we were creating the Fundamentals, the seminal book “Accelerate” came out, and it’s fair to say this book shaped much of our thinking. It prompted us to include not only pure operational concerns, but also categories which reflected the engineering practices proven to lead to better software delivery and reliability.

That said, while “Accelerate” cautioned against maturity models and instead urged a focus on capabilities and continuous improvement, at the time we were starting effectively from zero, so whilst we strove to weave these capabilities into our Fundamentals, we opted to wrap them in the familiar structure of a maturity framework.

“Adaptability is one of the most potent human skills”, Sukant Ratnakar

The diagram below shows the pillars and a subset of their associated categories:

As you can see, there’s a wide range of categories here, so rather than go through all of them, I’m going to delve into a couple to demonstrate how the Fundamentals were framed both to drive greater operational maturity and to encourage reliability-centric engineering practices.

Random interesting thing — When you look at the image above do you see some random dots appear? I didn’t notice it initially, but one of our editors did and now I see them all over the image! Turns out it’s this — https://en.m.wikipedia.org/wiki/Grid_illusion

Monitoring & Observability — Logging

Let’s first take a look at the definitions within the Logging category, which also demonstrate the associated scoring approach we used, a standard RAG (Red/Amber/Green) status; nothing revolutionary:

These are the definitions we used for many years to help teams evolve their logging practices. While they are quite verbose (a topic I’ll revisit later), you can hopefully see how the RAG status works: the goal for teams is to “aim for green”.
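If it helps to picture the mechanics, here’s a small, purely illustrative sketch of a RAG-scored category; the level descriptions below are hypothetical stand-ins, not the actual definitions from the framework.

```python
from enum import Enum

class Rag(Enum):
    RED = 1
    AMBER = 2
    GREEN = 3

# Hypothetical level descriptions for a Logging-style category, purely to show the shape.
LOGGING_LEVELS = {
    Rag.RED: "Little or no useful logging; diagnosis relies on guesswork",
    Rag.AMBER: "Logs exist, but coverage, structure or correlation is patchy",
    Rag.GREEN: "Structured, correlated logs that support fast diagnosis",
}

def levels_to_green(current: Rag) -> int:
    """How many levels a team still has to climb to 'aim for green'."""
    return Rag.GREEN.value - current.value

print(levels_to_green(Rag.AMBER))  # -> 1
```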

Now let’s look at the Deployability — Provisioning Pipelines category:

Again, you can see how, as we progress through each “level”, the practices associated with each score/colour become more advanced, in this case reflecting a drive towards fully automated, secure and repeatable pipelines.

While this category’s definitions are a little less verbose than those in the Logging category, I hope these two samples give an indication of the degree of work that went into creating the definitions, work which almost immediately threw up an interesting challenge:

The ASOS technology estate is vast, and the range of technologies used to solve problems is wide.

Trying to come up with definitions which accommodated all these differences would have been a significant, time-consuming challenge, so we needed a way to narrow our focus so that we could launch the Fundamentals, put them to the test and evolve them over time.

“The Coliseum in Rome wasn’t built with drawings and words, it was built brick by brick”

Introducing The “Golden Thread”

While all the services we build are important, there are obviously degrees of criticality, and like many organisations we use a fairly standard set of criteria to classify the software we build and operate, ranging from Experimental, where unavailability could cause “Disruption to a small proportion of a non-mission-critical customer journey”, to Mission Critical, where it could lead to “Widespread business stoppage with significant revenue impact” and “Public, wide-spread harm to reputation”.

Alongside these definitions we also have The Golden Thread: the Mission Critical services which underpin the customer experience and without which we are unable to provide customers with the ability to browse and buy product using the most common payment type, a debit/credit card.

“All Golden Thread services are Mission Critical, but not all Mission Critical services are Golden Thread.”

Golden Thread services, which also include our front-end UI components, are deemed our most important customer-facing services and, as such, provided a natural scope for us to focus on.

That’s not to say other Mission Critical services are not equally important (there’s no point taking an order if you can’t process it within an acceptable timeframe), but the crucial systems that meet those needs are more asynchronous and message/queue based, so they are intrinsically more tolerant of interruptions in service. They’re also not customer facing.
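To make the relationship explicit, here’s a tiny, purely illustrative sketch of that classification; only the two named tiers are modelled, and the service names are hypothetical.

```python
from enum import Enum

class Criticality(Enum):
    EXPERIMENTAL = "Experimental"
    MISSION_CRITICAL = "Mission Critical"

class Service:
    """Illustrative model: Golden Thread membership implies Mission Critical, not vice versa."""

    def __init__(self, name: str, criticality: Criticality, golden_thread: bool = False):
        if golden_thread and criticality is not Criticality.MISSION_CRITICAL:
            raise ValueError("All Golden Thread services must be Mission Critical")
        self.name = name
        self.criticality = criticality
        self.golden_thread = golden_thread

# Hypothetical examples of the two cases described above.
browse_api = Service("browse-api", Criticality.MISSION_CRITICAL, golden_thread=True)
order_fulfilment = Service("order-fulfilment", Criticality.MISSION_CRITICAL)  # asynchronous, not customer facing
```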

So we took some intentional steps when we launched the Fundamentals:

  • Frame the definitions around the patterns and technologies used across the Golden Thread, which is essentially (but not exclusively) conventional Web/API centric.
  • Prioritise the rollout to teams building and operating Golden Thread services over other services.

And this is how we approach much of what we do at ASOS. You can only get so far by squirrelling away in rooms writing documents or drawing diagrams; it’s important to do enough thinking and planning, but then to put initiatives to the test as soon as possible, learning and adapting as you go. Which gives me the opportunity to use one of my favourite sayings:

Progress over perfection.

So now we’ve discussed how the Fundamentals came about, the work that went into them, and how we narrowed their scope, let’s talk about how we track progress over time…

The Spreadsheet of Glory

At ASOS in 2023 there’s a concerted effort by many of us to eradicate the Excel spreadsheet from the engineering landscape, something in which our recent adoption and launch of Spotify Backstage (of which I’ll be doing a future post) for software cataloguing and engineer productivity plays a large part. At the outset of the Fundamentals initiative, however, something quick and easy was needed, and, like pretty much every organisation on Planet Earth, we created a spreadsheet to meet that need.

“Everything should be made as simple as possible, but no simpler”, Albert Einstein

Using the spreadsheet, teams were responsible for managing their scores month by month. They were assisted in “grading” their services by Reliability Engineering and Principal Engineering, who were able to offer an objective view from outside the team, provide clarifications and wider context, and relay and incorporate feedback about the Fundamentals themselves that we could use to hone and improve them on an ongoing basis.

So we had the scores; what did we do with them?

The Monthly Tuesday Afternoon Meeting

Data itself is not useful without context and a conversation, so a monthly meeting was put in place where the scores and any available SLO data were shaped into a format that allowed each platform team to talk through their progress, surface any concerns or risks, and discuss learnings and priorities.

As you can imagine, given the number of services involved, this was a large meeting, so a lot of effort was put into making it as efficient as possible. For the next few years the Fundamentals became a crucial means for us to weave operational awareness and reliability into the culture, practices and technical capabilities of the teams using them, and the monthly meetings were a key touchpoint for keeping momentum going.

However, with the number of people and platforms involved, the (necessary) structure and order needed to keep the meeting efficient contributed to a subtle change in how the Fundamentals were perceived. It was that, along with a whole range of other learnings and feedback, which prompted us to refresh the Fundamentals entirely, something we are in the process of doing now and plan to launch in the coming weeks.

So what do we know, what did we learn and what are we going to change?

What we know, both quantitatively and qualitatively, is that the Fundamentals delivered demonstrable, significant and positive change and improved reliability across the services in scope, something we’ll likely cover in more detail when we discuss our SLO performance in the coming article.

That said, we learned a great deal along the way, and living the ASOS value of “Being Brave”, we’re changing some things. And those things are pretty much everything. And that’s OK — it’s all part of the journey, and we couldn’t have got to where we are without walking this road.

So what are we changing?

1) Fundamentals for all!

You’ll recall I mentioned that we intentionally limited the scope of the Fundamentals to our “Golden Thread” services, both from a technological and an engagement perspective.

Well, one of the successes of the framework is that it very much helped to prompt and maintain operational awareness within teams building these services, and it’s fair to say it also helped engineers and delivery leaders have more informed conversations about the prioritisation of work outside of feature delivery: eradicating toil, continuous improvement, and tech debt reduction.

However, teams that were building critical services that weren’t measured by the Fundamentals were not able to use this language as a negotiation or prioritisation aid.

Furthermore, even if they had wanted to, their technology estates were quite different from the technical shape of the services the Fundamentals were written for, so many of the definitions simply didn’t apply, or were too specific to the Golden Thread technology stacks.

So we re-wrote the definitions from the ground up. We kept a lot of what was there, but our guiding principles were to simplify and to make the definitions as technology-agnostic as possible.

Let’s revisit the example I shared earlier around the Logging definition:

Original version:

New version:

As you can see, the definitions have been dramatically simplified — this makes it much easier for teams to interpret what we’re looking for.

Additionally, you’ll see that we refer to “logging guidance”, which allows us to use the Fundamentals and the associated scoring to signpost teams to guidance which, if followed, will allow them to increase their score (and ultimately the reliability of our services for customers).
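As a flavour of the kind of practice such guidance typically points teams towards, here’s a small, purely illustrative sketch of structured, correlated logging. The event and field names are hypothetical and aren’t lifted from our actual guidance.

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("orders-api")  # hypothetical service name

def log_event(event: str, correlation_id: str, **fields) -> None:
    """Emit one structured log line carrying a correlation id."""
    logger.info(json.dumps({"event": event, "correlationId": correlation_id, **fields}))

correlation_id = str(uuid.uuid4())  # propagated across service boundaries
log_event("order.received", correlation_id, orderId="ORD-123", items=3)
log_event("order.payment_authorised", correlation_id, orderId="ORD-123")
```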

Over time we hope to build up a solid body of guidance, help and templates which teams can use to learn from others, reuse established patterns and “level up”. This was born from feedback we received where some teams felt they were being measured on things we didn’t actually help them solve.

It was fair feedback on what was ultimately an unfair situation.

2) One Less Spreadsheet in the World

As I said, there’s absolutely nothing wrong with using a spreadsheet, or something quick and dirty, if it allows you to get something off the ground, gather feedback and improve as you go. However, it’s important to accept when this has taken you as far as it can, and to recognise when that necessary “quickness to market” has started to become an impediment to future expansion.

We’d simply reached the point where maintaining this data in a spreadsheet became too labour intensive and data quality suffered.

It also became quite toilsome to gather and frame the scores, the associated SLO data, and the reporting materials needed for the monthly review meeting; this work either fell on one of our Performance Analysts or was hastily cobbled together by extremely busy platform teams just ahead of the meeting.

We needed to eradicate this sense of “slog”, make the data and reporting as simple as possible, and build a tool that prompted and encouraged conversation while enabling lightweight tracking and trend analysis over time.
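To make “lightweight tracking and trend analysis” a little more concrete, here’s a minimal, purely illustrative sketch of the kind of data shape and trend check involved. It isn’t the tool itself (that’s in the screenshot below), and the service and category names are hypothetical.

```python
from datetime import date

RAG_ORDER = {"Red": 1, "Amber": 2, "Green": 3}

# Hypothetical monthly scores per service and category.
scores = [
    {"service": "browse-api", "category": "Logging", "month": date(2023, 3, 1), "rag": "Amber"},
    {"service": "browse-api", "category": "Logging", "month": date(2023, 4, 1), "rag": "Green"},
    {"service": "browse-api", "category": "Provisioning Pipelines", "month": date(2023, 4, 1), "rag": "Amber"},
]

def trend(records: list[dict], service: str, category: str) -> str:
    """Compare the two most recent months for a service/category pair."""
    history = sorted(
        (r for r in records if r["service"] == service and r["category"] == category),
        key=lambda r: r["month"],
    )
    if len(history) < 2:
        return "not enough data"
    latest, previous = RAG_ORDER[history[-1]["rag"]], RAG_ORDER[history[-2]["rag"]]
    return "improving" if latest > previous else "declining" if latest < previous else "steady"

print(trend(scores, "browse-api", "Logging"))  # -> improving
```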

As you can see from the screenshot below, we built a tool which we believe meets these requirements, and have worked closely with a number of our platform teams to ensure it is easy to use and prompts good conversation.

And on the topic of conversations…

3) We’ve made the monthly review meeting more focused and collaborative

I mentioned before that over time the monthly meeting grew in size, and as a result a subtle shift in the culture around the Fundamentals emerged. Because we had to structure the meeting quite rigidly to make the best use of time, it became quite a dry “status update” session with, it’s fair to say, a “high dread factor”.

This is not what we wanted the Fundamentals to be about. Culture is everything, and it was important that the Fundamentals didn’t become too “governance”-esque.

Solving this was incredibly important to us — while we’d achieved much with the Fundamentals, if we were going to be taking them to the wider tech organisation we wanted to make sure that we reset the culture around them.

We wanted to steer the Fundamentals back to simply being a common language we could all use to understand what “good” looked like, one which fostered a community and encouraged shared learnings, ultimately prompting the change needed to help us continually satisfy customers.

With this in mind, going forward we’ll be having more targeted sessions every 8 weeks at the “domain” level (a domain is a set of platforms), and these sessions are less a “review” and more a conversation centred on these principles:

  • Involves a smaller group of people empowered to make decisions and priority calls which balance the myriad (and often conflicting) demands of operating software in a complex, fast-moving environment.
  • Is action-orientated: using the tool we’ve built, we can track outstanding work and commitments, and more easily surface focus areas or hot spots which require external support.
  • Is conversational and collaborative: through closer relationships with teams we build greater empathy and understanding of each other’s needs and challenges. Everyone works better from a place of psychological safety and trust.

We’ve piloted this format with some very helpful teams, absorbed the feedback and learnings, and we feel confident that it is going to be much more enjoyable, productive and meaningful.

What’s next

We’re going to be launching the new version of the Fundamentals over the coming weeks, and will be following up with an article on how things are going in a couple of months.

Alongside this exciting work, we’ve also got a number of other activities and initiatives going on which we’ll be sharing with you in the coming months. Some of these include:

  • We’re reinventing how we operate (pardon the pun) as an Operations function alongside our colleagues both inside and outside of Tech, and a fundamental part of this is how we engage with teams going forward. Look out for more on this from our Head of Reliability and Operations, Jack Bramhall, who joined us from Cinch in January.
  • As discussed previously, we’ve recently launched our version of Spotify Backstage, and we see this becoming a critical part of our overall operations and engineering productivity ecosystem. I’ll be writing an article on our journey with it very soon!
  • We’re in the formative stages of creating an Internal Developer Platform which looks to simplify and (dare we say it) standardise how we deliver software at ASOS. This isn’t to constrain innovation or take the joy out of engineering; it’s to make it faster and easier to focus on the fun stuff and spend less time on the “plumbing”. You’re likely to hear more about this from some of our Principal Software Engineers, who are driving this great work.

That’s all for now — hope you’ve enjoyed this article — please keep an eye out for the ASOS Backstage article coming soon 👍

Scott has worked in Technology for over twenty years, and has enjoyed software engineering and architecture roles at a range of organisations including The Football Association, Microsoft/Skype, EDF Trading and Avanade.

In his spare time Scott enjoys watching a range of sports (is a long suffering Spurs fan), sea swimming, weight training, natural health and nutrition, running, walking and reading.

He is passionate about men’s mental health advocacy and is in the process of writing a book about his lived experience of depression, and the stigma which still surrounds it.
