In this series of blog posts, we’re going to talk through how we prepare for the Black Friday weekend, or ‘Peak’ as we call it at ASOS, to make sure we’re set up to deliver the best possible experience for our customers over our busiest weekend.
In this first post, we’ll take you through how we got to where we are today. In the second, we’ll talk about the importance of performance testing and of being confident that we’re ready to run at scale. Finally, we’ll wrap up the series by looking at how we prepare for and run Peak internally, along with the lessons we learnt as we pivoted to remote working during the COVID pandemic.
Black Friday is one of our busiest periods of the year. We experience extraordinary customer demand, which in turn places a huge load on the business and, critically, on our tech. 2020 was another record-breaking year, which saw us handling:
- Up to 1,921 orders per minute at peak, with bursts of up to 67 orders per second
- 11.92 billion requests to Akamai, with a success rate of 99.96%
- 3.5 million emails and 692K app notifications sent on the Friday alone
So yes, Peak is very busy. But it’s also fun and rewarding, with all our teams pulling together towards a common goal: delivering for our fashion-loving 20-something customers.
Over the Black Friday weekend, engineers from our various platform teams sit alongside their colleagues in Service Delivery, App Support, Reliability Engineering, Infrastructure and Networks, and Data, all closely monitoring our systems to make sure they run smoothly so that our customers can grab those deals. Of course, in 2020 most of that collaboration had to happen remotely, with a core skeleton team on-site, but we all shared that common experience and the satisfaction of a job well done.
However, the work to get ready for the weekend starts well in advance, so we’re going to take you through the build-up, how we prepare, and how we make it all happen.
Firstly, let’s cover a little bit of background on our re-platforming efforts and how we’re organised.
One of the key enablers of our future growth will be the ability of our tech to hold up at even greater scale. To help us achieve this, we’ve been migrating our technology from its original architecture to being fully hosted in the cloud. Although a few systems still reside in our on-premises data centre, we’ve made great progress and are already benefiting from the higher scale limits the cloud provides, which far exceed what our on-premises systems alone could offer.
Our first-generation order processing system was built on a database that grew to gigantic proportions, alongside older Windows Services and messaging systems. Replacing that legacy estate is very much part of business as usual (BAU), but in 2016 we accelerated a programme to re-platform that architecture.
We’ve replaced our previous compute strategy with Kubernetes and serverless, and our databases with Cosmos DB and SQL Azure. Much of our order processing workflow now runs on Azure Functions, and we’ve had great success with them. Cosmos DB also gives us much more elastic scalability, so we can increase our throughput during busy periods and turn it back down afterwards.
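Turning Cosmos DB throughput up for Peak and back down afterwards can be pictured as a simple schedule. The sketch below is a hypothetical illustration, not ASOS’s tooling: `ThroughputWindow`, `target_throughput` and every RU/s figure are assumptions made for the example. It only decides which request-unit level (RU/s) should apply at a given moment; actually applying that value to a container (through the Azure portal, CLI or SDK) is left out.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ThroughputWindow:
    """Hypothetical period during which a container should run at a raised RU/s level."""
    start: datetime      # inclusive
    end: datetime        # exclusive
    ru_per_second: int


def target_throughput(now: datetime, windows: list[ThroughputWindow], baseline: int) -> int:
    """Return the RU/s to provision at `now`.

    If several windows overlap, the highest level wins; outside every
    window we fall back to the everyday baseline.
    """
    active = [w.ru_per_second for w in windows if w.start <= now < w.end]
    return max(active, default=baseline)


# Illustrative figures only: a raised level for the Black Friday 2020 weekend
# and a higher one for the Friday itself.
peak_windows = [
    ThroughputWindow(datetime(2020, 11, 26), datetime(2020, 12, 1), 20_000),
    ThroughputWindow(datetime(2020, 11, 27), datetime(2020, 11, 28), 40_000),
]

print(target_throughput(datetime(2020, 11, 27, 9, 0), peak_windows, baseline=4_000))  # 40000
print(target_throughput(datetime(2020, 12, 2), peak_windows, baseline=4_000))          # 4000
```

Cosmos DB also offers autoscale provisioned throughput, which scales automatically up to a configured maximum; a schedule like this is just one way to reason about when extra capacity is needed.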
As a business, ASOS runs on its technology. We are a pure-play online retailer, so we rely on many hundreds of services to provide the capabilities we need to operate.
But how do we manage that breadth of services and ensure they are maintained and supported, especially during a busy sales period?
The answer lies in how we are organised. Our tech teams are divided into platform teams, with each platform owning one or more logical services. Here is a picture that shows how we currently organise a platform.
A platform team’s goal is to own its code from inception to decommissioning: the product strategy, the build, the execution and the support. With this model, we have seen our teams show real innovation and entrepreneurship, taking full ownership of and accountability for enhancing their services’ capabilities and running them in production.
Throughout the year, teams are performance testing their systems and re-platforming services where necessary to get the scale we need.
They work in conjunction with the Performance Team, who test the end-to-end customer journey in a production-like environment so that we can be confident of maintaining an awesome customer experience under very high loads.
It’s all about getting each customer order from the website to the warehouse as quickly as possible, because we have cut-off times to meet. For example, if we offer next-day delivery, we need to get those orders to the warehouse in time to ship them the next day.
Our website is made up of a number of different logical services. Each one may have an API, a database, a cache or a message bus, and together these components make up the logical service. We monitor the activity, performance and utilisation of each of these components within each of our core services.
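Monitoring each component of a logical service can be sketched as a set of per-component threshold checks. In this minimal Python sketch, the metric names and the limits (80% utilisation, 1% errors, 500 ms p99 latency) are hypothetical; in practice the figures would come from a monitoring platform and be tuned for each component.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentMetrics:
    """Snapshot of one component (API, database, cache, message bus) of a logical service."""
    name: str
    utilisation: float      # fraction of capacity in use, 0.0 to 1.0
    error_rate: float       # fraction of failed operations, 0.0 to 1.0
    p99_latency_ms: float   # 99th-percentile latency in milliseconds


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical alerting limits; real values would be tuned per component."""
    max_utilisation: float = 0.8
    max_error_rate: float = 0.01
    max_p99_latency_ms: float = 500.0


def unhealthy_components(components, thresholds=Thresholds()):
    """Return (component name, breached metrics) for every component over a limit."""
    alerts = []
    for c in components:
        breaches = []
        if c.utilisation > thresholds.max_utilisation:
            breaches.append("utilisation")
        if c.error_rate > thresholds.max_error_rate:
            breaches.append("error_rate")
        if c.p99_latency_ms > thresholds.max_p99_latency_ms:
            breaches.append("p99_latency")
        if breaches:
            alerts.append((c.name, breaches))
    return alerts


# Hypothetical snapshot of one logical service's components.
service_snapshot = [
    ComponentMetrics("api", 0.55, 0.001, 120.0),
    ComponentMetrics("database", 0.92, 0.0, 30.0),
    ComponentMetrics("cache", 0.30, 0.05, 900.0),
]
print(unhealthy_components(service_snapshot))
# [('database', ['utilisation']), ('cache', ['error_rate', 'p99_latency'])]
```

Checking every component separately, rather than only the service’s overall response, helps show which part of a logical service is approaching its limits before customers notice.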
With any complex operation, it’s inevitable that issues arise; that’s life! We’ve had our share, from bugs to power cuts, so it’s essential that we have a Plan B should something go wrong.
When running at very high loads, it’s vital that we react to issues quickly. Retry loops can pile even more load onto a struggling system, and errors can propagate to downstream systems. To counter this, we build in circuit breakers and procedures that help us fail quickly and gracefully and prevent knock-on effects. Because our mitigations must work when we need them, we review and test these safety valves and procedures throughout the year.
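The circuit-breaker idea can be sketched in a few lines. This is a generic, single-threaded Python illustration of the pattern, not ASOS’s implementation; the class name and the `failure_threshold` and `reset_timeout` parameters are assumptions. After a run of consecutive failures the breaker opens and rejects calls immediately, so callers stop piling retries onto a struggling dependency; once the timeout has passed it lets a trial call through (half-open) and closes again if that call succeeds.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None     # None means the breaker is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # the next call is allowed through as a trial
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            # Fail fast instead of adding load to a dependency we know is unhealthy.
            raise CircuitOpenError("circuit open: call rejected")
        try:
            result = fn()
        except Exception:
            self._record_failure()
            raise
        # Any success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failed trial call (half-open) or too many failures (closed) opens the breaker.
        if self.state == "half-open" or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


# Example with a fake clock so the behaviour is deterministic.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])


def flaky_downstream():
    raise ConnectionError("downstream unavailable")


for _ in range(2):
    try:
        breaker.call(flaky_downstream)
    except ConnectionError:
        pass
print(breaker.state)                              # open
now[0] = 10.0
print(breaker.state)                              # half-open
print(breaker.call(lambda: "ok"), breaker.state)  # ok closed
```

Production-grade versions, such as the circuit-breaker policies in Polly for .NET, also deal with concurrent callers, which this sketch deliberately ignores.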
Why Peak at ASOS works so well
Before we dive deeper in later articles, we thought it would be good to share a few thoughts on what makes Peak at ASOS successful.
ASOSers all recognise how important Peak is — it’s at the forefront of everyone’s mind which means our people are working together towards a common goal.
Peak is extremely important to us as a business, not only for our sales but also for growing our customer base globally. During this period we attract a lot of new customers, who we then retain throughout the year. This is well understood at leadership level, which is why we are given the time to focus on Peak year-round. If we tried to prepare only two months beforehand, it would be difficult to give our customers a great experience during peak traffic.
It’s a tough weekend with long hours; however, it always finishes with the high of a job well done and great teamwork. At ASOS, our purpose is to give our customers the confidence to be whoever they want to be, and the same goes for our people too. By planning effectively and working together, our teams have the confidence they need in our tech, our systems and our delivery, time and time again.
Want to know more? Look out for the later posts in this series, where we will dive into performance and resiliency testing, as well as how we organise ourselves across the business to adapt quickly to the needs of our fashion-loving 20-somethings.
Thanks to Andy Potts, Cat Smith and Chris Trenter for helping make this post a reality.