Engineering & Operational Excellence at “ASOS” scale — Part II

Scott Frampton
Published in ASOS Tech Blog
6 min read · Feb 15, 2024

This is the second in a series of posts where we discuss how we continually improve our engineering and operational practices to satisfy the needs of a global customer base while meeting the challenges of operating an increasingly complex technical estate.

In the first article of the series we covered the ASOS Fundamentals, which is a framework we use to guide teams in building more reliable software.

Today we’re going to be talking about our adoption of Spotify Backstage, our journey so far in establishing the platform and how we’re using its software cataloguing capabilities to build out a holistic view of our ever-growing software estate.

Firstly, what is Spotify Backstage?

Spotify define Backstage as “an open source framework for building developer portals”, and it exists to:

  • Provide a central hub for your engineering organisation’s tools, resources and documentation
  • Reduce everyday friction, cognitive overhead and operational toil, so developers can stay in their flow state longer
  • Provide an extensible, plugin architecture that allows the platform to be customised to meet the needs of any organisation

That said, central does not mean single…

As we all know, there’s no one tool that solves every problem, and Backstage is no exception. Backstage isn’t your build and deployment pipelines, it isn’t your primary monitoring suite and it isn’t the software you use to manage incidents. It simply stitches these systems together, making it easier for people to get to the information they need.

It’s not just about engineers either…

While engineers are a primary audience of Backstage, with features geared towards engineer productivity across the platform, Backstage is not just about engineers. In fact, our journey started from an operational footing and was driven out of our Site Reliability Engineering and Operations functions to help us establish and maintain a consistent understanding of our technical estate.

Why is it important that we understand our service estate?

The technology organisation at ASOS has grown significantly over the years, not just from a personnel perspective, but in technical complexity too. Over 160 teams operate hundreds of services made up of thousands of components, many of which are being released regularly into production.

With this pace of change, and with teams operating autonomously as I described previously, there was no way for us to understand the overall “lay of the land”. That creates challenges, and scope for delays in resolution, when you face incidents or problems that cut across services owned by multiple teams.

An example of this might be an outage of a key platform technology. While we build fault tolerance and resilience into our software as part of our standard engineering practices (guided by the ASOS Fundamentals), there are some situations you can’t reasonably guard against (without huge, sometimes prohibitively expensive investment).

It’s when these incidents occur that you need to be able to identify those affected, swarm around them and optimise efforts to return service to customers. In our case, these efforts are typically orchestrated by our central operational teams, and nothing slows them down more than not being able to quickly identify who owns a service.

Optimise your response to failure. Don’t try to eradicate failure completely; it’s an inevitable aspect of operating complex distributed systems.

We needed a reliable way to catalogue our software estate…

When you’re talking about services which number in the hundreds, it’s not practical to maintain a register of services in a central spreadsheet, or indeed in any tool built on the premise of “centralised” administration. In an organisation as fast moving as ours, it would be out of date within minutes or hours of being updated.

What was compelling to us about Backstage is that the Software Catalogue capability harvests software metadata from code, allowing us to place the responsibility for maintaining data correctness in the right place — in the teams, using the tooling and languages they use every day.

Using the structured (but extensible) data model and code-based process that Backstage is built on, we are able to define the types of metadata we require teams to provide. We get the best of both worlds — we federate ownership of the data to the community, while at the same time setting the parameters that ensure the quality of the data collected is high.
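To make the idea of "required metadata" a little more concrete, here is a minimal TypeScript sketch. The entity shape follows the publicly documented Backstage Component descriptor (normally written as YAML alongside the code). The component, team and system names, and the particular set of fields treated as mandatory, are hypothetical and are not a description of our actual rules.

```typescript
// Subset of the Backstage Component descriptor, as it looks once the YAML is parsed.
interface ComponentEntity {
  apiVersion: string;
  kind: string;
  metadata: { name: string; description?: string; tags?: string[] };
  spec: { type?: string; lifecycle?: string; owner?: string; system?: string };
}

// Hypothetical policy: the spec fields every team must supply (an assumption for illustration).
const REQUIRED_SPEC_FIELDS = ["type", "lifecycle", "owner", "system"] as const;

// Returns a list of human-readable problems; an empty list means the entity passes.
function validateComponent(entity: ComponentEntity): string[] {
  const problems: string[] = [];
  if (entity.kind !== "Component") {
    problems.push(`expected kind "Component" but found "${entity.kind}"`);
  }
  if (!entity.metadata.name) {
    problems.push("metadata.name is missing");
  }
  for (const field of REQUIRED_SPEC_FIELDS) {
    if (!entity.spec[field]) {
      problems.push(`spec.${field} is missing`);
    }
  }
  return problems;
}

// Hypothetical entities: one complete, one missing its owner and system.
const complete: ComponentEntity = {
  apiVersion: "backstage.io/v1alpha1",
  kind: "Component",
  metadata: { name: "checkout-api", description: "Handles checkout requests" },
  spec: { type: "service", lifecycle: "production", owner: "team-checkout", system: "checkout" },
};

const incomplete: ComponentEntity = {
  apiVersion: "backstage.io/v1alpha1",
  kind: "Component",
  metadata: { name: "returns-api" },
  spec: { type: "service", lifecycle: "production" },
};

console.log(validateComponent(complete));
console.log(validateComponent(incomplete));
```

A check along these lines could run in a team's own pipeline, so that gaps in the metadata are caught where the data is maintained rather than by a central administrator.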

A discussion of the structure of these metadata files and how the ingestion process works is beyond the scope of this article, and is comprehensively detailed by Spotify themselves, so I’ll focus on the end result. What does this ultimately give us?

A nice interface which allows our teams to view rich, useful information about the systems, services and components which make up our estate.

From here, users can see which team owns a given service and which other services it integrates with, and are signposted to other useful information about the service and the team. This benefits not only the engineers building the software, but also people outside the team who are involved in managing incidents.
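As a rough illustration of how ownership and integration data can help during an incident, the sketch below groups the components that directly depend on a failing resource by their owning team. It assumes entities carry Backstage-style relations (`dependsOn`, `ownedBy`, each with a `targetRef`); the entity names and the `ownersAffectedBy` helper are hypothetical, and only direct dependencies are considered.

```typescript
interface Relation {
  type: string;      // e.g. "ownedBy" or "dependsOn"
  targetRef: string; // e.g. "group:default/team-checkout"
}

interface CatalogEntity {
  kind: string;
  metadata: { name: string; namespace?: string };
  relations?: Relation[];
}

// Collects the targets of all relations of one type on an entity.
function targetsOf(entity: CatalogEntity, relationType: string): string[] {
  return (entity.relations ?? [])
    .filter(r => r.type === relationType)
    .map(r => r.targetRef);
}

// Groups the names of entities that directly depend on `failingRef` by owning team.
function ownersAffectedBy(
  failingRef: string,
  entities: CatalogEntity[],
): Record<string, string[]> {
  const byOwner: Record<string, string[]> = {};
  for (const entity of entities) {
    if (targetsOf(entity, "dependsOn").indexOf(failingRef) === -1) continue;
    for (const owner of targetsOf(entity, "ownedBy")) {
      if (!byOwner[owner]) byOwner[owner] = [];
      byOwner[owner].push(entity.metadata.name);
    }
  }
  return byOwner;
}

// Hypothetical catalogue extract: two components depend on the failing database.
const catalogue: CatalogEntity[] = [
  {
    kind: "Component",
    metadata: { name: "checkout-api" },
    relations: [
      { type: "ownedBy", targetRef: "group:default/team-checkout" },
      { type: "dependsOn", targetRef: "resource:default/orders-db" },
    ],
  },
  {
    kind: "Component",
    metadata: { name: "returns-api" },
    relations: [
      { type: "ownedBy", targetRef: "group:default/team-returns" },
      { type: "dependsOn", targetRef: "resource:default/orders-db" },
    ],
  },
  {
    kind: "Component",
    metadata: { name: "search-api" },
    relations: [{ type: "ownedBy", targetRef: "group:default/team-search" }],
  },
];

console.log(ownersAffectedBy("resource:default/orders-db", catalogue));
```

The result is a simple map from team to affected components, which is the kind of answer an incident manager needs first: who to contact, and about what.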

Additionally, this data is exposed via a REST API, which allows us to integrate this consistent service metadata across a number of other applications, something we’ve started doing in earnest.
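A minimal sketch of such an integration might look like the following. It uses the Backstage catalog's documented `entities/by-name` endpoint to look up a component and read its owner. The base URL, component name and bearer-token handling are assumptions for illustration, and the HTTP call is injected as a function so the sketch runs without a network.

```typescript
// Minimal transport abstraction: performs a GET and resolves with the parsed JSON body.
type HttpGetJson = (url: string, headers: Record<string, string>) => Promise<unknown>;

interface OwnedEntity {
  spec?: { owner?: string };
}

// Builds the catalog URL for a single entity, e.g. a Component in the default namespace.
function entityByNameUrl(
  baseUrl: string,
  kind: string,
  name: string,
  namespace: string = "default",
): string {
  const root = baseUrl.replace(/\/+$/, ""); // tolerate a trailing slash
  const path = [kind, namespace, name].map(part => encodeURIComponent(part)).join("/");
  return `${root}/api/catalog/entities/by-name/${path}`;
}

// Reads spec.owner from a catalog entity, returning undefined if it is absent.
function ownerOf(entity: unknown): string | undefined {
  const owner = (entity as OwnedEntity | null)?.spec?.owner;
  return typeof owner === "string" ? owner : undefined;
}

// Looks up the owner of a component. The optional token is an assumption about auth.
async function lookupOwner(
  get: HttpGetJson,
  baseUrl: string,
  componentName: string,
  token?: string,
): Promise<string | undefined> {
  const headers: Record<string, string> = token ? { Authorization: `Bearer ${token}` } : {};
  const entity = await get(entityByNameUrl(baseUrl, "component", componentName), headers);
  return ownerOf(entity);
}

// Stub transport returning a canned, hypothetical entity instead of calling a real server.
const stubGet: HttpGetJson = async () => ({
  kind: "Component",
  metadata: { name: "checkout-api" },
  spec: { owner: "team-checkout" },
});

lookupOwner(stubGet, "https://backstage.example.com", "checkout-api")
  .then(owner => console.log(owner));
```

In real use the stub would be replaced by a thin wrapper around an HTTP client, for example one that calls `fetch(url, { headers })` and returns `response.json()`.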

How did we drive software modelling efforts and general adoption?

Establishing a new tool which demands that people adapt their ways of working is never easy. Our principles for adoption were:

  • Clearly (and repeatedly) articulating the problem we were trying to solve, and the benefits solving this problem would bring to everyone.
  • Getting senior leadership buy-in, which is crucial.
  • Writing detailed documentation which helped teams understand what they were supposed to model and how, and evolving it as we learned.
  • Emphasising progress over perfection, and keeping things simple. We focused first on modelling Systems, Components and APIs, and are only now exploring the feasibility/usefulness of modelling individual resources.
  • Introducing a Badge system to acknowledge and recognise people who actively engaged and contributed to the initiative. Everyone loves a badge!

Where are we now?

We now have over 400 systems and 1,000 components modelled, and this data is being fed directly into a number of our other systems. As I’ve said before, this gives us a common, consistent thread which ties together data about each of our services across our ecosystem.

What’s next?

We’re far from done, not only with software cataloguing but also with exploiting the other features Backstage offers us. In the next article in the series we’ll talk more about how we are:

  • Extending the base data model with even more of our own metadata, providing more service intelligence and relevance to our teams.
  • Using the Software Templating capability to distil our best practices into configurable templates which teams can use to provision a range of applications and the associated Infrastructure-As-Code, all through a wizard-like interface.
  • Building custom plugins which surface and signpost data from our various systems through Backstage, effectively serving as a one-stop Engineering Portal for our community.

We’re very excited about a future with Backstage, and through increased community adoption and engagement, we see no reason why this tool can’t significantly improve the day-to-day lives of our engineers, improve our incident response procedures and ultimately boost productivity.

Scott has worked in Technology for over twenty years, and has enjoyed software engineering and architecture roles at a range of organisations including The Football Association, Microsoft/Skype, EDF Trading and Avanade.

In his spare time Scott enjoys watching a range of sports (he is a long-suffering Spurs fan), sea swimming, weight training, running, walking and reading.

He has recently gained certifications in plant-based nutrition, and is passionate about nature, the environment and helping people transition to a plant-based diet to support strength training and general wellness, and to reduce our impact on the natural world.

ASOS are hiring across a range of roles in Tech. See all our open positions.

Scott Frampton
Principal Software Engineer - Reliability Engineering & Operations