Preparing digital John Lewis for peak events — Live Load Tests

Mary Gomes
John Lewis Partnership Software Engineering
9 min read · May 16, 2022

Hi, I am Mary, a Product Engineer at the John Lewis Partnership. Two years ago I joined team Sonic, which focuses on improving the overall performance of the website. I joined to help with the website’s live load test, which is run periodically to minimise the risk of not being able to cope with a large number of customers searching, browsing and buying items during peak events like Black Friday.

https://www.johnlewis.com/black-friday/c6000670128
JohnLewis website — Black Friday 2022

Not having a great deal of experience with performance testing meant a steep learning curve: this is a complex test involving multiple teams that manage a large number of evolving microservices running in the cloud, which also integrate with on-premise and third-party systems. It was, and still is, a daunting activity, but having the support of a team with knowledge, experience, curiosity and commitment has made the experience less intimidating and a great opportunity for continuous learning and for contributing to the business’s success.

A live load test at John Lewis simulates real customer behaviour on the website and the native mobile app by using a large number of virtual users making requests to search and browse products, place orders, log in, add items to a wish list and so on. This allows us to assess performance by gathering important information such as response times and the number of errors per microservice and per type of request. The assessment helps us identify areas of concern, which are discussed with the appropriate team or teams, and corrective action is taken if needed to ensure the site will perform well under a high level of demand.
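To give a flavour of what a load-test scenario looks like, here is a minimal Gatling sketch of virtual users browsing a site. The base URL, request names and injection profile are illustrative placeholders, not the figures or code used in our real test.

import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

// Illustrative only: a tiny browse scenario ramping up virtual users.
class BrowseSimulation extends Simulation {

  val httpProtocol = http.baseUrl("https://www.example.com") // placeholder, not johnlewis.com

  val browse = scenario("Search and Browse")
    .exec(http("home page").get("/").check(status.is(200)))
    .exec(http("search").get("/search?search-term=headphones").check(status.is(200)))

  setUp(
    browse.inject(rampUsersPerSec(1).to(50).during(10.minutes)) // placeholder injection profile
  ).protocols(httpProtocol)
}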

So, why run a Live Load Test, a.k.a. LLT?

At JohnLewis.com, we have a number of peak events where discounts are applied to a large range of products, which predictably attracts a high number of visitors and a higher number of orders placed online. Typically, we have the following peak events during the year: Summer Sale, Black Friday, Cyber Monday and Boxing Day.

One of the main reasons to run a live load test is to minimise the risk of losing sales through being unable to handle a high number of visitors and orders at peak levels. It is not only a question of handling high levels of traffic, but also of how well the site performs while doing so, as it is widely known that customers quickly move away from websites or apps that take too long to load pages or respond. Performance needs to be addressed on two fronts: backend and frontend. The LLT addresses backend performance and is an in-house Gatling implementation. For client-side performance we use Sitespeed, in-house monitoring and alerting for Core Web Vitals using Google’s Chrome UX Report API and, coming soon, WebPageTest, which together highlight when there is a performance problem.
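As an aside, the Chrome UX Report API mentioned above can be queried over plain HTTPS. The sketch below shows the general shape of such a query from Scala using the JDK HTTP client; the API key is a placeholder, the origin and metric are just examples, and this is not our in-house monitoring code.

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Illustrative query against the public Chrome UX Report API.
// Replace YOUR_API_KEY with a real key; the origin and metric are examples.
object CruxQuery extends App {
  val body =
    """{"origin": "https://www.johnlewis.com",
      | "formFactor": "PHONE",
      | "metrics": ["largest_contentful_paint"]}""".stripMargin

  val request = HttpRequest.newBuilder(
      URI.create("https://chromeuxreport.googleapis.com/v1/records:queryRecord?key=YOUR_API_KEY"))
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString(body))
    .build()

  val response = HttpClient.newHttpClient()
    .send(request, HttpResponse.BodyHandlers.ofString())

  // The JSON response includes histogram buckets and a p75 value for the metric.
  println(response.body())
}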

It is important to highlight that the live load test is not the first time a service is tested to peak levels. Every team has the responsibility of designing, configuring and testing their services to cope with the demand they will be subjected to during a peak event, and teams generally include performance tests in their pipelines. The LLT is the last opportunity to identify a performance issue that could potentially impact the live environment, and it is important to execute it in production because it is only there that the actual components (infrastructure, configuration, networks, etc.) that serve real customers are tested.

Recently, we have also experienced what we call a “hot product launch” type of event, which refers to selling online a product, or a small set of products, with high demand in the market and low levels of stock. These tend to sell out in a short period, usually minutes, as is the case with the launch of new game consoles. This type of event has its own traffic profile and generates a different type of customer behaviour, so it is not discussed in this article. A hot product launch involves a large number of customers accessing one particular product, either by going directly to its product details page, via a search results page, or from a wish list or basket to which it was previously added.

How do we define the Live Load Test’s targets?

Traditionally, load tests focus mainly on the number of concurrent users but, for a retailer like John Lewis, order rate is also a critical target because it is the direct source of income, and the test needs to ensure that all orders can be processed successfully and efficiently under peak load. Therefore, the two main targets tracked by the test are:

  • Number of orders per second
  • Number of unique visitors per minute

Defining the target level is not an easy feat, as nobody has a crystal ball to predict exactly what the real peak number of visitors and orders per second will be. However, there are a few elements that can be taken into account to aid the process of defining them.

When available, actual historical data from previous peak events, showing the number of visitors and the order rate reached, is a great starting point. A popular rule of thumb for a load test is to take a historic peak figure and add an extra comfort margin, usually 25%. The extra headroom increases confidence that the site will be able to handle the traffic levels during the peak event. We also ensure that our test targets are aligned with the business goals for the event.

Example of historical graph with peak data (does not contain real data)

But it is also important to consider contextual factors that might influence the targets for the next peak event you are preparing for. Considering the sample graph above, the conditions that generated peak A might no longer be present, so it might be worth setting the new target using peak B plus the extra comfort percentage instead.

The size of the extra percentage should also reflect the many factors that can affect demand. In our case, when we went into lockdown due to COVID last year, stores were closed for some time and only started to reopen just before the Summer Sale. Customer behaviour changed and sales shifted online, as customers felt safer buying online rather than visiting a store. In this instance we were expecting a large number of customers online before the Summer Sale, so we increased the targets using a comfort uplift of 50%, which proved to be the right call as the order rate reached those levels, albeit for a few seconds.

What is included in the LLT?

While in an ideal world the best thing to do might be to test everything, QA engineers know well that this is often a never-ending game. In some cases the effort required to test all possible scenarios is simply not feasible. In others, the actual process cannot be tested because of its consequences; for example, during a test we do not want to process real payments, as we don’t want real money changing hands, so finding an alternative approach that simulates the real process is the right choice. The general principle of risk-based testing, that risk should drive the scope of testing, applies to load testing too.

The LLT mainly simulates the following aspects for John Lewis’ website and native mobile apps:

  • Search and Browse (S&B) behaviour
  • Checkout and Payments
  • Access to My Account

To simulate S&B behaviour, we use the activity logs that capture real customers’ requests during the day, which are then skewed to match peak traffic instead of BAU (business-as-usual) traffic. In our case these logs are stored in BigQuery, and we use the following pseudo query to extract them:

SELECT timestamp, userAgent, method, url, queryParams, responseStatus
FROM `<service_dataset>.stdout_<date>`
WHERE …
AND timestamp BETWEEN TIMESTAMP(<start_datetime>, 'Europe/London') AND TIMESTAMP(<end_datetime>, 'Europe/London')

Once extracted into CSV files, this BAU data is manipulated and transformed into something that resembles the load profile of a peak event. The test then uses the transformed CSV files as feeders and replays the traffic using the url, queryParams (if applicable), method and user agent, expecting a response code equal to the one recorded in responseStatus.
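As a rough illustration of a feeder-driven scenario (not our actual simulation code), the sketch below replays requests from such a CSV inside a Gatling Simulation class, with the usual Gatling imports assumed. The file name is hypothetical; the column names mirror the query above.

// Sketch only: replay GET requests from a CSV feeder whose columns match the
// BigQuery extract (url, queryParams, userAgent, responseStatus).
val sAndBFeeder = csv("s_and_b_peak.csv").circular // hypothetical file name

val searchAndBrowse = scenario("Search and Browse replay")
  .feed(sAndBFeeder)
  .exec(
    http("S&B request")
      .get("#{url}") // most S&B traffic is GET; other methods would need their own chain
      .header("User-Agent", "#{userAgent}")
      .check(status.is(session => session("responseStatus").as[Int]))
  )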

To simulate the Checkout journey, the test replays the different steps it takes to place an order: viewing a product, adding it to the basket, logging in, making a payment and viewing the order confirmation. The simulation includes a drop-off rate, as happens with real customers: many start the checkout process but some do not complete it, dropping out along the way. As we don’t want payments to go through (i.e. real money exchanging hands), and we don’t want these orders to be fulfilled, test payments are redirected to a stub version of the payment system and the orders generated by the test are intercepted and discarded before they reach the order management system.
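A drop-off rate like this can be modelled in Gatling with a weighted branch. The sketch below is a fragment of a Simulation and is illustrative only; the step names, paths and percentages are made up and are not taken from our real checkout simulation.

// Sketch: roughly 70% of virtual users complete checkout, the rest drop out
// after adding to basket. Steps, paths and weights are illustrative.
val viewProduct = exec(http("product page").get("#{productUrl}"))
val addToBasket = exec(http("add to basket").post("/basket").formParam("sku", "#{sku}"))
val loginAndPay = exec(http("login").post("/login").formParam("user", "#{user}").formParam("pass", "#{pass}"))
  .exec(http("pay (stubbed)").post("/payment"))
  .exec(http("order confirmation").get("/order-confirmation"))

val checkout = scenario("Checkout")
  .exec(viewProduct)
  .exec(addToBasket)
  .randomSwitch(
    70.0 -> loginAndPay
    // the remaining 30% of users fall through here, simulating drop-off
  )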

The test uses real products, which are reserved from stock as if the orders were genuine, so the test only “buys” products with a high volume of items in stock. For test orders, we make the reservations expire automatically after a shorter period than normal.

The My Account simulation uses test users and exercises the authentication service and some operations that only logged-in customers are allowed to perform.

How frequently do we run the LLT?

Our services are under constant change, and any change has the potential to introduce a performance issue that can only be detected in Live under realistic peak levels of traffic. The LLT therefore runs weekly, providing regular feedback about the performance of our services.

On occasion, we schedule ad-hoc tests if a previous LLT reports an issue with the potential to cause serious problems for the website or the mobile apps, or when teams have applied corrective actions that need to be tested.

We run two types of test: a fully automated test that requires no manual steps as a prerequisite, and a fully supervised test where engineers from our team oversee the execution. The automated test has circuit breakers so that it stops traffic at any sign of difficulty for a particular type of request, and stops the entire test if many parts show signs of struggle. Since introducing the automated, unsupervised test, we have been able to run it regularly without the cost of engineers supervising it, while running the supervised test more frequently as we approach a peak event.
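The circuit breakers are part of our in-house tooling, but the general idea can be sketched as follows. This is a simplified, hypothetical illustration of a per-request-type breaker, not the real implementation.

// Hypothetical sketch: trip a breaker when the error rate over a sliding
// window of recent requests exceeds a threshold, so traffic for that
// request type can be stopped.
import scala.collection.mutable

class CircuitBreaker(windowSize: Int = 100, maxErrorRate: Double = 0.05) {
  private val window = mutable.Queue.empty[Boolean] // true = request failed

  def record(isError: Boolean): Unit = synchronized {
    window.enqueue(isError)
    if (window.size > windowSize) window.dequeue()
  }

  def tripped: Boolean = synchronized {
    window.size == windowSize &&
      window.count(identity).toDouble / windowSize > maxErrorRate
  }
}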

What happens after a test?

Once the test is done, we start a process to analyse the results, mainly using the output report from Gatling, the various services’ Grafana dashboards and New Relic. A formal report is also produced, containing high-level metrics (capacity, latency and errors) for each test category: S&B, Checkout and Payments, and My Account, for both the website and the mobile app.

There is a utility developed by the team that processes the simulation log file created by Gatling and produces the data for the metrics we later use in the report. The following is an example of the error rate by category:

Example of high level error rate figures generated by the metrics calculation utility
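In spirit, the utility does something like the sketch below: scan Gatling’s simulation.log, count OK and KO request records per request name, and derive an error rate. The exact log format varies between Gatling versions, so the column positions here are assumptions; this is not the real utility.

// Hypothetical sketch: derive an error rate per request name from Gatling's
// simulation.log. Assumes tab-separated REQUEST lines that contain the request
// name and an OK/KO status token; real column layouts differ between versions.
import scala.io.Source

object ErrorRates extends App {
  val counts = Source.fromFile("simulation.log").getLines()
    .filter(_.startsWith("REQUEST"))
    .map(_.split("\t"))
    .foldLeft(Map.empty[String, (Int, Int)]) { (acc, fields) =>
      val name = if (fields.length > 2) fields(2) else "unknown" // assumed column position
      val isKo = fields.contains("KO")
      val (ok, ko) = acc.getOrElse(name, (0, 0))
      acc.updated(name, if (isKo) (ok, ko + 1) else (ok + 1, ko))
    }

  counts.foreach { case (name, (ok, ko)) =>
    val errorRate = ko.toDouble * 100 / (ok + ko)
    println(f"$name%-40s error rate: $errorRate%.2f%%")
  }
}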

Reports broken down by request category are also produced, so that we can drill down into what exactly is causing the error rate to rise when the percentage is high, or simply check at a lower level that the percentage is acceptable for all request types.

The following graph is an example of the analysis we do by category; it helps us report how latency behaved during the test. In this case it was stable and we report the average, but if a spike is observed we report the percentage increase and dig into which types of request and which services caused it.

Example of Latency graph for S&B — data provided by metrics calculation utility

All issues highlighted by the test are identified and weighed against the risk they present to the site. Communication with teams to highlight issues starts as early as the night the test is run, and finally a formal report is produced and shared with all teams, covering the capacity reached in visitors per minute and orders per second, latency and the percentage of errors by category of traffic. We encourage teams to create their own LLT results reports. Collaboration from all teams to investigate and solve any potential problem is key to making the LLT a success.

At the John Lewis Partnership we value the creativity of our engineers to discover innovative solutions. We craft the future of two of Britain’s best loved brands (John Lewis & Waitrose).

We are currently recruiting across a range of software engineering specialisms. If you like what you have read and want to learn how to join us, take the first steps here.

