Load testing at Boozt

Published in Boozt Tech · 8 min read · Aug 9, 2021

A new way of load testing at Boozt. Leveraging Markov chains to create more realistic load testing scenarios.

Since 2017, Black Friday preparations at Boozt have included several load testing sessions to measure the impact that the increased traffic would have on our webshop systems, and they have proven extremely useful. Just last Black Friday, in 2020, load testing helped us find an improvement opportunity in the checkout phase; after we sent the report to the team responsible for that domain, they addressed it, making our users' experience smoother (and our SREs happier).

Why do load testing?

The objective of load testing is to get a clear picture of how a system reacts when accessed by many users concurrently. The aim is to verify that the system under test is ready to handle the amount of traffic expected, but also to find its upper limits in terms of the maximum number of concurrent users it can sustain.

When writing a load test, developers create what is usually called a scenario. In a scenario, we perform the same logical steps that a real user would: for example, we could have a scenario where we access the home page, proceed to log in, then search for an item, visit its page and add it to the cart. In such a case, we would be hitting important parts of our infrastructure (database reads and writes, the search engine, perhaps secondary services), and if we run this scenario with tens of thousands of concurrent users, we will put significant load on all systems; with proper observability and monitoring in place, we will be able to gauge how ready we are.
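To make this concrete, here is a minimal sketch of such a linear scenario as plain Python, using the requests library. The endpoint paths and payloads are invented for illustration and are not Boozt's actual API; a load testing framework would run thousands of these sessions concurrently.

import requests

# One session's worth of steps for the scenario described above.
# Endpoints and payloads are hypothetical, purely for illustration.
def shopping_scenario(base_url):
    session = requests.Session()
    session.get(base_url + "/")  # home page
    session.post(base_url + "/login",
                 data={"email": "user@example.com", "password": "secret"})
    session.get(base_url + "/search", params={"q": "sneakers"})  # search
    session.get(base_url + "/product/12345")  # product page
    session.post(base_url + "/cart", data={"product_id": "12345"})  # add to cart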

When writing a load test in this way, we apply uniform load to all parts of the infrastructure involved in honouring the requests: in other words, every part has to handle the same level of load.

On one hand, this is useful because it lets us quickly identify which part of the infrastructure is least ready to handle increased traffic, and is therefore the highest-priority area to fix. On the other hand, by stressing every section of this chain equally, these static workloads can only push the infrastructure to its first point of breakage, and no further: if we want to see how the webshop handles 20,000 users but get stuck at 5,000 users logging in at the same time and locking our databases, we will never get a clear picture of how the rest of the infrastructure performs with 10,000 or 20,000 users.

This simulated behaviour also lacks the characteristics of real users: shaping the traffic this way does not reproduce the traffic patterns real users would generate. While it is definitely a useful tool to have at our disposal, it would be beneficial to push further and assess the limits of our systems when confronted with arbitrary levels of realistic user behaviour.

Towards more dynamic load test scenarios

We know there is a lot of variance in the ways our users use our webshops: pick 100 random people off the street and ask them to interact with Boozt.com while recording their sessions. Some will log in, some will sign up, some will do neither, and in general they will all visit different pages, in different orders, for different amounts of time.

Our load test should try to account for all of these many and different actions, but how can we encode this variance? After all, taken individually, those recorded sessions will look different from each other, sometimes radically so.

The key intuition for us was that, instead of looking at each individual session, we could observe a large number of them and try to identify patterns. This proved to be an opportunity to leverage data from various sources, such as our monitoring tools and Google Analytics, to understand the different ways our users interact with our webshops.

Websites as state machines and Markov chains

Armed with actual numbers, we could then identify the major usage patterns of the areas of our websites that we wanted to cover in our load tests: areas deemed important both from a performance perspective and for the impact they have on our infrastructure.

These patterns can be translated into a model by considering every major operation (such as visiting a given page or performing a particular activity) as a state in a state machine, where the transitions between states are regulated by probabilities.

[Image: a Markov chain diagram]

A tool at our disposal for working with these complex state models is the Markov chain. A Markov chain is a mathematical model of a random process, describing a series of states and the transitions between them, where each transition is governed by a probability that depends only on the current state (and nothing else: this is called the Markov property). For our load test scenarios, each page or high-level operation we want to simulate can be seen as one of the states in a Markov chain.
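In code, a single step of such a chain needs nothing more than the current state and its outgoing probabilities. A minimal sketch (the states and numbers here are illustrative):

import random

# Transition probabilities out of each state: the next state depends
# only on the current one (the Markov property).
TRANSITIONS = {
    "home":   {"login": 0.3, "search": 0.6, "nothing": 0.1},
    "login":  {"search": 1.0},
    "search": {"search": 0.9, "nothing": 0.1},
}

def next_state(current):
    options = TRANSITIONS[current]
    # Weighted random draw over the possible next states.
    return random.choices(list(options), weights=list(options.values()))[0]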

However, the constraint posed by the Markov property creates an issue for us: when using a website, our navigation typically does not depend only on where we are at any given point in time, but may also depend on something we did earlier. For example, we will not go to the checkout page if we never added anything to our cart during a shopping session. In this sense, the Markov property could complicate how we write our tests. We immediately identified the need to correctly handle the state of a user session, and we wanted to make managing it as easy and intuitive as possible, so that writing tests would be less of a chore.

Choosing the testing framework and writing the test

The framework we had used until this point was not going to give us the freedom we needed to implement these requirements, and with the contract about to expire, we decided to go shopping. We surveyed around a dozen different offerings, both testing frameworks and test runners, grading them in categories such as ease of use, vendor lock-in, quality of the available integrations, and APIs.

Of the various solutions we tried for writing our load testing scripts, we were particularly satisfied with both JMeter and Locust. For this particular use case, we selected Locust: we found that it stayed mostly out of our way, providing the necessary scaffolding while giving us space to define the complex scenarios we intended to code.

A Locust test defines a User class whose behaviour is implemented in methods decorated with @task. When we start a load testing session, we provide the Locust script with a file defining our states and their transitions, along with the probabilities; a simple scenario could look like this:

states:
  home:
    login: 0.3
    search: 0.6
    nothing: 0.1
  login:
    search: 1
  search:
    search: 0.9
    nothing: 0.1
  product_page:
    search: 0.5
    add_to_cart: 0.4
    nothing: 0.1
  add_to_cart:
    search: 0.2
    payment: 0.75
    nothing: 0.05
  payment:
    nothing: 1

Each key in this YAML dictionary defines a state, identified by its name; its values are the other states we can transition to, together with the probability of doing so.

This file is then validated and processed, and from it we generate a transition matrix that is fed to a simple Markov chain implementation built on top of numpy. Each User, when starting, uses this Markov chain to generate the sequence of states it will perform during its execution.
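As a rough sketch of what this processing could look like (our reconstruction for illustration, not Boozt's actual code; the function names are ours):

import numpy as np
import yaml

# Build a transition matrix from the YAML scenario above, then walk it.
def load_chain(path):
    with open(path) as fh:
        spec = yaml.safe_load(fh)["states"]
    # Collect every state name, including ones that only appear as targets.
    names = sorted(set(spec) | {t for targets in spec.values() for t in targets})
    index = {name: i for i, name in enumerate(names)}
    matrix = np.zeros((len(names), len(names)))
    for state, targets in spec.items():
        for target, probability in targets.items():
            matrix[index[state], index[target]] = probability
        # Validate: outgoing probabilities of each state must sum to 1.
        assert abs(matrix[index[state]].sum() - 1.0) < 1e-9, state
    return names, matrix

def generate_steps(names, matrix, start="home", terminal="nothing"):
    steps, current = [], start
    while current != terminal:
        steps.append(current)
        current = np.random.choice(names, p=matrix[names.index(current)])
    return steps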

Every state is mapped to a function that performs the actual HTTP requests and interactions needed, and every time a function is called, we might need to store some state for later use. As mentioned above, the need for an easy way to manage this state was clear to us from the start: complex state to keep track of is a consequence of the complex scenarios we want to execute. Because of the Markov property, when executing a function we never know the path taken through the website to get to that point: for example, we could be visiting a brand page having arrived there from several different pages, and depending on that path we might want to take different decisions on the brand page (also, back navigation is something our users do, and therefore something we want to simulate).

Additionally, some state transitions are invalid, such as trying to log out without having previously logged in.

In order to handle this state, we took inspiration from another tool we use, Elixir, and particularly its GenServer behaviour, which streamlines the handling of state. Every time a GenServer callback is invoked, the GenServer calls the function with the last known state as input, kept in a basic data structure, and expects the function to return the (possibly modified) state. We found that this architecture (receive the state as input, return the state as output, never store it anywhere) greatly simplifies the complexities inherent in state management, so we decided to adopt the same approach. Our functions receive an instance of a State data class containing everything collected up to that point, mutate it by adding or removing data according to their internal logic, and return it as output. With this architectural choice, our test became trivial to generalize, and the main method that runs these complex scenarios is just a few lines long:

@task
def random_user_flow(self):
    pages = self.chain.generate_steps()
    state = State()
    for page in pages:
        fn = getattr(self, page)
        state = fn(state)

Thanks to this structure, we could isolate each high-level operation in its own unit, and we are able to validate operations based on their internal logic and the session's current state: for example, when we log in we save this information in the State instance, and later, should the log out operation be drawn, we first validate that it is possible by checking the isLoggedIn property of the State instance.
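A sketch of this state-in, state-out pattern (the class and field names besides isLoggedIn are illustrative, not the actual Boozt code):

from dataclasses import dataclass, field

@dataclass
class State:
    # Everything a simulated session has collected so far.
    isLoggedIn: bool = False
    cart: list = field(default_factory=list)

def login(state):
    # ... perform the actual HTTP login request here ...
    state.isLoggedIn = True
    return state

def logout(state):
    # An invalid transition for this particular session: skip the request.
    if not state.isLoggedIn:
        return state
    # ... perform the actual HTTP logout request here ...
    state.isLoggedIn = False
    return state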

The most important consequence of this architectural choice is that every function must return a state valid for all other possible transitions: due to the randomness of the process, we never know at any given moment which operation will execute next, so we must be prepared for all of them.

Thanks to this simple core, we managed to easily generate scenarios simulating complex interactions. The flexibility provided by a mature programming language such as Python is a big boon for us, because it lets us define how a simulated user should interact with a page precisely the way we want. In addition, it is easy to codify other aspects that make our load tests more realistic, such as small individual differences in how each User behaves, in the form of different timings between operations, or randomness in the various interactions with a page.
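Locust's built-in between helper, for instance, gives each simulated user a random pause between tasks:

from locust import HttpUser, task, between

class WebshopUser(HttpUser):
    # Each simulated user waits a random 1 to 5 seconds between tasks,
    # so the generated load is not perfectly synchronized.
    wait_time = between(1, 5)

    @task
    def random_user_flow(self):
        ...  # the Markov-driven flow shown earlier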

Conclusion and future developments

With the introduction of this approach (novel for us, though with some existing literature in the academic world), we are pushing the boundaries of how reliable we can make our infrastructure. What makes us particularly excited is that, with this effort, we managed to blend traditional business needs and KPIs with data-driven decisions and software architecture, and we look forward to building even more on top of it.

About the author

Mattia Ziulu | Platform Engineer at Boozt. Sardinian in Sweden, I have opinions about programming languages.
