Performance Testing for Peak @ ASOS
In this series of blog posts, we are going to talk through how we prepare for the Black Friday weekend, which we internally refer to as Peak at ASOS, so that we can provide the best experience for our customers over our busiest weekend.
In the previous post, we gave you some background on how we got to where we are today. In this article, we are going to talk about the importance of performance testing and of being confident as we head into Peak.
Without further ado, here is the second part of the series — Performance testing for Peak at ASOS.
As discussed in the first article in this series, we operate at scale and to do this during periods of heavy and sometimes unexpected traffic, we need to be confident that the systems will perform well and be able to handle this traffic seamlessly.
As an example, in 2019 when we launched our sale, we delivered a strong social media push and traffic was a lot higher than expected. However, we were simply able to scale our tech further, confident that we had already proven our ability to support unexpected levels of throughput.
How we prepare our testing model
Peak readiness is not just a ‘two months before Peak’ activity; it’s year-round.
As soon as Black Friday is over, we spend time gathering statistics on how we performed that year. For each of our services we analyse our throughput, availability, performance and utilisation for the busiest hour of the entire period, which is typically on Friday evening.
We use the utilisation metrics to understand whether our scaling was appropriate; we need to remain mindful of cost optimisation and failover capacity when scaling for future sales.
We pull our throughput data for the busiest hour into a workload model in readiness for building tests that will prepare us for the following Peak. Once we have the throughput data for all of our services, we compare it to the peak hour throughput from the previous year to understand the percentage growth factor for each service. We make an initial assumption that our growth next year will match this year’s percentage growth factor but, knowing that our growth isn’t always the same year on year, we add a further percentage of headroom to our forecasts.
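As a rough illustration of the arithmetic described above, here is a minimal Python sketch. The service names, throughput figures and the 20% headroom value are all made up for the example; they are assumptions, not our real numbers or tooling.

```python
def forecast_target(previous_peak: float, latest_peak: float, headroom_pct: float) -> float:
    """Project next Peak's busiest-hour throughput for one service or endpoint."""
    if previous_peak <= 0:
        raise ValueError("previous_peak must be positive to derive a growth factor")
    growth_factor = latest_peak / previous_peak  # e.g. 1.25 means 25% growth
    projected = latest_peak * growth_factor      # assume the same growth again next year
    return projected * (1 + headroom_pct / 100)  # then add headroom on top


# Hypothetical busiest-hour throughputs in requests per second.
previous_year = {"product-service": 800.0, "bag-service": 200.0}
latest_year = {"product-service": 1000.0, "bag-service": 300.0}

forecast = {
    name: forecast_target(previous_year[name], latest_year[name], headroom_pct=20.0)
    for name in latest_year
}
```

With these illustrative figures, the hypothetical product-service grew by 25%, so the projection is 1,250 requests per second, and the 20% headroom lifts the target to 1,500.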
We do our growth calculation on a per-endpoint basis because we get quite a wide variation. We used to do a flat growth factor across all of our services but on further analysis identified that throughputs barely changed year on year on some services. We use Akamai as our content delivery network (CDN), so static content is heavily cached on CDN edge servers. When we’re testing we only simulate load to our origin servers, taking Akamai out of the mix.
Once these calculations are done, our workload model is complete: we have target throughputs for every endpoint on every service, which gives us a model of load for the following Peak’s biggest hour. Now we have to modify our tests so that we can simulate this traffic in our non-Production environment.
Creating our tests
Our workload model is complex and it’s critical that we have accurate throughputs on all of our endpoints to simulate real customer traffic accurately. Therefore, we need flexibility in our load test tooling. We originally built our tests using Microsoft’s VSO (Visual Studio Online), but they deprecated support for this. We looked at other tools that had a cloud-based load test offering, but none could give us the flexibility we needed to be able to support our workload model. We also do large-scale performance testing day in, day out, plus our many engineering teams do component performance testing in smaller environments; the cost of this from any industry-standard load testing tool would be millions per year. We ended up building an awesome in-house custom framework using JMeter.
Modifying our tests for our new load model is quite complex and can take a few weeks to nail. It takes a fair bit of tuning and tweaking of the tests to ensure that we hit our throughput targets across all of our endpoints without significantly overhitting. While proving our test pack, we’re also tweaking our scaling across the estate and proving that our services are supporting the targets on appropriate infrastructure. In the meantime, we’re doing our BAU testing; there’s no pause on deployments or major changes while we work on this! Once we’re there though, we have our recommendations ready for future major sales.
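To give a feel for the kind of check involved in that tuning, here is a small Python sketch that compares achieved throughput against targets after a test run. The endpoint names, the numbers and the 5% overhit tolerance are hypothetical, and this is not how our JMeter framework is actually implemented.

```python
def classify_endpoint(target: float, achieved: float, overhit_tolerance_pct: float = 5.0) -> str:
    """Label one endpoint as 'under', 'ok' or 'over' relative to its target."""
    deviation_pct = (achieved - target) / target * 100
    if deviation_pct < 0:
        return "under"  # the test did not reach the target throughput
    if deviation_pct > overhit_tolerance_pct:
        return "over"   # the test significantly overhit the target
    return "ok"


def review_run(targets: dict, achieved: dict, overhit_tolerance_pct: float = 5.0) -> dict:
    """Classify every endpoint in the workload model; missing results count as 0."""
    return {
        endpoint: classify_endpoint(target, achieved.get(endpoint, 0.0), overhit_tolerance_pct)
        for endpoint, target in targets.items()
    }


# Hypothetical targets and results in requests per second.
endpoint_targets = {"GET /search": 1200.0, "POST /bag": 400.0, "GET /saved-items": 300.0}
endpoint_achieved = {"GET /search": 1230.0, "POST /bag": 380.0, "GET /saved-items": 345.0}

report = review_run(endpoint_targets, endpoint_achieved)
```

In this made-up run, the search endpoint is within tolerance, the bag endpoint needs more load and the saved-items endpoint needs dialling back before the test pack can be considered proven.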
We are in a continuous integration and deployment model; we deploy a lot of change to the website over a typical week. There could be hundreds of changes so we run one of our peak load tests weekly as a regression test to ensure that we can still support our peak targets with our recommended scaling plan. Some of our changes are major and need specific targeted testing on their path to live and may require changes to our test pack.
Sales patterns change
We used to build all our test scenarios from the Black Friday workload model, focussing our efforts on prepping for the next Black Friday and our May sale, which was the other big event of the year. Now, our biggest sales are coming from flash sales.
Flash sales have a big spike in traffic at the beginning when we send out that first push notification, which then levels off a little, with another spike at the end of the sale. This is very different to the pattern of traffic for the cyber weekend in November. The profile of traffic is also different: in flash sales many customers are moving items from their saved lists to their bag, causing a different profile of load. It became clear after a huge sale last January that we needed a different test profile, so we built a Flash workload model to complement our Black Friday workload model. This is another thing we have to maintain and periodically refresh using the same approach as outlined above.
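The shape of a flash sale can be sketched as a simple load multiplier over time. The Python below is only an illustration of that spike, level-off, spike pattern; the sale length, spike length and peak multipliers are assumptions chosen for the example rather than values from our Flash workload model.

```python
def flash_load_multiplier(
    minute: int,
    sale_minutes: int,
    spike_minutes: int = 10,
    opening_peak: float = 3.0,
    closing_peak: float = 2.0,
) -> float:
    """Return load relative to the levelled-off rate (1.0) at a given minute of the sale."""
    if sale_minutes < 2 * spike_minutes:
        raise ValueError("sale must be long enough to hold both spikes")
    if minute < 0 or minute >= sale_minutes:
        return 0.0  # outside the sale window
    if minute < spike_minutes:
        # Opening spike: starts at opening_peak and decays linearly towards 1.0.
        return opening_peak - (opening_peak - 1.0) * (minute / spike_minutes)
    closing_start = sale_minutes - spike_minutes
    if minute >= closing_start:
        # Closing spike: ramps linearly from 1.0 up to closing_peak in the final minute.
        return 1.0 + (closing_peak - 1.0) * ((minute - closing_start + 1) / spike_minutes)
    return 1.0  # levelled-off middle of the sale


# Multiplier for each minute of a hypothetical 60-minute flash sale.
profile = [flash_load_multiplier(m, sale_minutes=60) for m in range(60)]
```

A real profile would also change the mix of requests, for example weighting saved-list-to-bag journeys more heavily, which this sketch does not attempt to model.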
Future Scalability and Resiliency
As well as our Black Friday and Flash load tests, we also have a scalability test that pushes our infrastructure to its limits. The purpose of our scalability test is to break stuff; we want to identify the limits that we can’t scale ourselves out of. We want to identify where we will need design or technology changes that will allow us to scale to support the workload we could have during Peak in three years, for example. This gives us enough time to take action and embed any new tech well in advance of hitting scale limits.
ASOS.com needs to have high availability; it’s the foundation of our business. With a complicated tech estate and a variety of third parties, we need to be as resilient as possible to internal and external challenges. At a high level, our resiliency tests break or compromise parts of the system while under peak traffic, validating that our customer journey remains available.
For regression purposes, we run our peak load test and our flash load tests fortnightly on alternate weeks. But we’re mostly running these test packs much more frequently for specific releases or for tuning, especially in the run up to major sales. We aim to do a scalability test at least once a quarter. We have a large number of resiliency tests, each needing to be executed at least every six months, so our schedule has to be managed carefully, especially in the run up to Peak.
When we do our peak and scalability tests, we always run with the assumption that we have a regional Azure outage, flipping all traffic to the other active region(s), so we’re always simulating that worst case scenario.
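The effect of that assumption on sizing can be shown with a little arithmetic. This Python sketch assumes traffic is spread evenly across the surviving regions, which is a simplification, and the region count and throughput are invented for the example.

```python
def load_per_surviving_region(total_rps: float, active_regions: int, failed_regions: int = 1) -> float:
    """Throughput each remaining region must absorb if traffic is spread evenly."""
    surviving = active_regions - failed_regions
    if surviving < 1:
        raise ValueError("at least one region must survive to serve traffic")
    return total_rps / surviving


# Hypothetical peak target of 9,000 requests per second across three active regions.
normal_load = load_per_surviving_region(9000.0, active_regions=3, failed_regions=0)
failover_load = load_per_surviving_region(9000.0, active_regions=3)
```

In this example each region would normally carry 3,000 requests per second, but must be proven at 4,500 once one region is assumed to be out.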
We also perform soak tests regularly, which run sustained traffic for a full day. What we’re looking for here are problems that only occur under sustained load, for example memory leaks, or buffers and queues filling up.
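One simple way to spot a memory leak in soak test results is to fit a straight line through memory samples and look at the slope. The Python sketch below does this with a plain least-squares fit; the sample data and the 1 MB per hour threshold are assumptions for illustration, not a description of our monitoring.

```python
def slope_per_hour(samples: list) -> float:
    """Least-squares slope of (hours_elapsed, memory_mb) samples, in MB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    denominator = sum((x - mean_x) ** 2 for x, _ in samples)
    if denominator == 0:
        raise ValueError("samples must span more than one point in time")
    return numerator / denominator


def looks_like_leak(samples: list, threshold_mb_per_hour: float = 1.0) -> bool:
    """Flag a steady upward trend in memory over the soak period."""
    return slope_per_hour(samples) > threshold_mb_per_hour


# Hypothetical memory readings (hours into the soak test, MB in use).
steady = [(0, 500.0), (6, 502.0), (12, 499.0), (18, 501.0), (24, 500.0)]
leaky = [(0, 500.0), (6, 560.0), (12, 620.0), (18, 680.0), (24, 740.0)]
```

Here `looks_like_leak(steady)` is false because memory only fluctuates around 500 MB, while `looks_like_leak(leaky)` is true because usage climbs by 10 MB every hour.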
When running tests we actively monitor, using Grafana as a visualisation tool for our end to end journey, and Azure Monitor for more detailed investigations into individual services.
Where we test
We complete all of our testing in non-Production. One of the benefits of using Microsoft Azure is that we can replicate an equivalent environment with configuration as close to Production as possible. We don’t tend to see issues in Production that we can’t replicate in our non-Production environment.
The benefit of running in non-Production is that when we’re running breaking tests, such as resiliency or scalability tests, we don’t risk affecting our customers’ experience.
We are all up for Production testing if it’s for the greater good, i.e. if it helps us meet testing needs that can’t be met elsewhere, or if it allows us to get rid of large scale test environments that are costly to set up, run and maintain. However, as a lot of our tests are breaking we continue to need a non-Production environment to support them. Testing in Production is OK, but only if it’s meeting a particular business need without adding undue risk.
Testing in Production is also complex. We can’t pollute the integrity of our Production data with fake customers and fake orders, so we need our systems to be able to handle test transactions, and we haven’t built all our services to be able to support test traffic. For now the benefit of testing in Production doesn’t outweigh the cost, but it always makes for interesting debate and we will likely take those steps one day.
Thanks to Cat Smith for helping make this post a reality.
Did you know that ASOS are hiring across a range of roles in Tech? See our open positions here.