Why Your Load Test Will Hide Problems and Lead to Crashes in Production

Richard Garand · Published in The Startup · Aug 2, 2020 · 13 min read

What do you do to make sure your website, backend, or API will scale to higher levels of traffic so that it doesn’t crash on the biggest day of the year for your business?

Why of course you do a load test!

Pick any tool, paste the URL, punch in a good number of users, and wait for a pass or fail. If it fails, fix the problems and run it again. Once it passes you’re all set.

Unfortunately, this is a recipe for being unprepared and having the system crash on the biggest day of the year for your business.

Not all load tests are equal. Or even remotely valid.

It’s very easy to create a load test that is completely detached from the real traffic conditions you will face and gives you misleading results.

It would be like having someone drive a car around a race track and using their performance there to judge whether they would be good at navigating an unfamiliar city in a foreign language, with stop-and-go traffic and drivers honking at them from all sides.

Your live traffic can be chaotic and have a very different impact than what you see in a simplified artificial scenario.

Strangely, when it comes to testing backend performance, this is often accepted as a good approach!

I’ve seen these situations with several websites, typically running on AWS clusters and handling millions of dollars and/or millions of visitors.

On all of these it was important to know when and how the system would break down so that it could be prevented. However, the load tests I saw being run would only provide misleading information. There was no way to fix the problems with that approach.

With a few changes to the load testing I was able to produce more realistic results that helped optimize the right parts and know the limits that can’t be changed in the short term. Those limits can be problematic but at least it gives the business valuable information to plan around.

Read on to see how I found effective solutions through targeted load testing!

Real World Load Tests

The controlled conditions of a load test can be very different from the real world results you will experience.

If you don’t do this right you may allocate way more servers than you need, you may spend months optimizing things that don’t matter and delay new development, or in the worst case you may miss the real bottlenecks and have crashes when you hit high traffic levels — the worst possible time.

All of those come with real costs. Having a few extra servers is not the worst outcome. The cost may be small relative to the engineering time being spent on keeping the system running smoothly. The others could be a real problem though!

How do you avoid that? It depends on the situation. I’ll break down where load testing goes wrong, using a couple of examples I’ve worked on.

Case #1: I worked with a large eCommerce site that had a complex interaction between several systems. It sometimes got overloaded, which led to millions of dollars in missed sales.

A full load test outside of production wasn't possible due to all the components involved, so the tests had to be run in the middle of the night against the production system while avoiding downtime and lost sales.

An outside contractor was running the load tests and I noticed several issues that needed to be corrected so it produced valid results. A proper load test helped to identify the limitations of those external systems and make sure they didn’t get overloaded. It also showed that there was no need to optimize other parts of the stack that had more than enough capacity, and helped to avoid wasted development effort.

Case #2: Another site involved exhaustive testing to prepare in advance for a worldwide event that brought a lot of traffic.

Since I didn’t have data from prior years and there was only one chance to get it right, it took a carefully designed load test to make sure all the minor issues were optimized, the caching was layered and scaled correctly, and architectural changes to remove bottlenecks were successful in increasing capacity.

Prior to my involvement, there were several unanticipated issues that had led to the website crashing in the middle of the previous event. That created a very stressful situation as the team attempted to stabilize the servers while keeping the website up to date with what was going on!

After doing the load tests with my guidance and resolving the issues identified, there was not a single issue with the system on the day of the event. In fact some of the backend servers were running so smoothly they barely seemed to show any activity even with high traffic!

The goal was to be over-prepared so nothing could go wrong, and we succeeded. As a bonus, the data collected from that day could be matched up against the load test to fine-tune the capacity estimates even further.

Common Load Test Failures

Here are some load test issues I’ve seen. When these happen they hide the real bottlenecks and produce results that are either too optimistic (showing that it’s fine when it actually crashes in production, with no sign of where the problem is) or too pessimistic (crashing in the test with a load that’s a lot smaller than actual traffic, giving no information on what the actual limits are).

Ignoring caching, repeating the exact same action

Let’s say you have an eCommerce site that involves customers searching for what they want to start the purchase process. There is some work done to process the search and then results are cached.

If you capture that user journey and repeat it 1,000,000 times, guess what — you’re just hitting the cache 1,000,000 times. Fail!

Your load test will tell you the site can handle unlimited traffic, while it keeps crashing in production. Why?

In real life you have customers searching for a lot of different things. Some will be more common and see a lot of cache hits. Some will be very rare and almost never touch the cache.

A proper load test will simulate the non-cached hits. The cache probably has very high performance (although we’ll address that later). The backend behind the cache is what you want to test first. And you want to see how many requests actually get that far.

To do this you need some idea of how many different varieties of searches there are. You can get this by analyzing traffic, and confirm it by running a load test that shows the backend activity is similar to the production site.

Once you’ve done this, a main goal for the load test will be to increase the variety of searches gradually and see where you hit system bottlenecks.

You can do this by creating a bunch of different searches in a load testing tool. In one case I wrote my own simple script that recursively looped through several variants to produce as many unique searches as I needed, and hit the website with those.

This script was sufficient since I was more concerned with how many non-cached hits could be handled instead of how many repeat requests could be handled, which is where traditional load testing tools are focused.
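Something along these lines works as a starting point. This is a rough sketch in Python (the original script may have looked quite different); the search URL, parameter name, and word lists are placeholders to swap for your own:

```python
import itertools
import requests  # assumes the requests library is available

# Hypothetical search endpoint and parameter name.
BASE_URL = "https://example.com/search"

# Recombining a few word lists produces thousands of unique, plausible
# queries, so most requests miss the cache and exercise the real backend.
ADJECTIVES = ["red", "blue", "large", "small", "vintage", "wireless"]
NOUNS = ["jacket", "lamp", "speaker", "backpack", "keyboard", "mug"]
QUALIFIERS = ["", "sale", "gift", "cheap", "2020"]

def unique_queries():
    for adj, noun, qual in itertools.product(ADJECTIVES, NOUNS, QUALIFIERS):
        yield " ".join(part for part in (adj, noun, qual) if part)

def run(limit=1000):
    session = requests.Session()
    for i, query in enumerate(unique_queries()):
        if i >= limit:
            break
        resp = session.get(BASE_URL, params={"q": query}, timeout=10)
        print(f"{resp.status_code} {resp.elapsed.total_seconds():.3f}s {query}")

if __name__ == "__main__":
    run()
```

Because the queries come from recombining short word lists, you can dial the variety up or down just by adding or removing words, which is exactly the knob you want when probing for the point where non-cached traffic becomes a problem.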

Why this is important: as traffic increases you will see more hits served out of the cache, but you will also see more unique long tail searches. You need to prepare for that. If there’s a major promotion for a specific item you may see a surge of traffic that is mostly focused on a few narrow searches that get cached efficiently. Either way, make sure your test is measuring the site activity patterns of your anticipated traffic!

Bypassing caching completely

The opposite effect can be a problem too. If you just skip the cache, or somehow make every request unique so it can’t be cached, you will quickly exceed the actual backend activity generated by real traffic.

Your load test might crash with 1/10 of the users you see in production, causing confusion. Instead you want the right balance where you get close to the number of non-cached requests that the backend has to handle.

Why this is important: as above, it’s not realistic! You might get lucky and actually fix real problems. But most likely if your test environment does not come close to production you will end up with a lot of wasted effort.

Complete overwhelm

On a site that regularly saw 2,000–3,000 concurrent users, where 5,000 customers at a time was a big day, the first load test I observed was configured to run 100,000 parallel processes.

Not surprisingly the results weren’t useful.

Once it was dialled back to a more realistic level, and adjusted to follow real life caching patterns, the results highlighted the real bottlenecks and led to new conversations about architectural changes and product design fixes. Those were very effective in making the site more stable, and didn’t take long to implement.

Why this is important: unless you’re running ads in the Super Bowl, you want realistic traffic. Can you handle 2–5x what you’re used to? That’s where you’ll get the most useful results. In rare cases you might need to prepare for a larger burst of traffic once you have done that optimization.
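If your tool doesn't make gradual ramps easy, the idea is simple enough to sketch yourself: step the load up from roughly your normal level toward 2–5x and watch where response times and errors start to climb. A minimal illustration in Python (the URL, step sizes, and durations are placeholders):

```python
import threading
import time
import requests

# Placeholders: point this at a representative page and pick steps that go
# from roughly your normal concurrency toward 2-5x of it.
URL = "https://example.com/"
STEPS = [100, 200, 300, 500]   # concurrent workers per step (illustrative)
STEP_DURATION = 120            # seconds to hold each step

def worker(stop_event):
    while not stop_event.is_set():
        try:
            requests.get(URL, timeout=10)
        except requests.RequestException:
            pass  # a real test would record errors and latency here

def ramp():
    stop = threading.Event()
    workers = []
    for target in STEPS:
        while len(workers) < target:
            t = threading.Thread(target=worker, args=(stop,), daemon=True)
            t.start()
            workers.append(t)
        print(f"holding at {target} workers")
        time.sleep(STEP_DURATION)
    stop.set()

if __name__ == "__main__":
    ramp()
```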

Testing the easy parts, not the hard ones

Imagine this system: a front end and a middle layer that your team owns, which depend on an external backend run by someone else.

Your team is responsible for the front end and middle part. So you test them exhaustively, do a little optimization, and find that they are ok with 10x your regular traffic. Great!

Only one problem: the external backend crashes when you have 10% more traffic than a normal busy day. You aren’t really prepared for a higher load.

But since you don’t control it, should you test it? Yes! It’s better to know the limitations, find ways to work around those if you can’t change them, and then test the solutions to see if they actually prevent a crash.

I’ll write another article showing how I was able to stabilize a very unreliable system that couldn’t be changed in the short term, using a very simple fix!

Why this is important: even if you can’t reconfigure or optimize a system that’s limiting you, knowing what breaks it means you can put in smarter limits and workarounds. Then you actually can handle a lot more traffic.
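To make that concrete: one generic way to shield a fragile external system (this is just an illustration, not the specific fix I'm referring to above) is to cap how many requests you send it at once, based on the limit your load test uncovered:

```python
import threading
import requests

# Illustrative only: the URL is hypothetical and 20 is a stand-in for
# whatever concurrency limit your load test actually uncovered.
EXTERNAL_URL = "https://backend.example.com/api"
MAX_CONCURRENT = 20
_slots = threading.BoundedSemaphore(MAX_CONCURRENT)

def call_external(payload):
    # Cap concurrent calls to the external system at the level it can survive.
    if not _slots.acquire(blocking=False):
        # Past the limit, degrade gracefully: serve stale data, queue the
        # work, or show a friendly message instead of overloading it.
        return None
    try:
        return requests.post(EXTERNAL_URL, json=payload, timeout=5)
    finally:
        _slots.release()
```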

Over-reliance on one load testing tool

Most load testing tools give you an easy way to repeat a sequence of requests many times across multiple different sources. As we saw, that often results in a load test that is not valid for your real world traffic.

Sometimes you can configure a pattern of requests in the tool that is close enough to realistic. In one case we ended up with a suite of 5–8 different tests that we ran at the same time. The result was exactly what we needed to validate architectural changes that allowed 3–5x as much traffic as before.

In another case that I mentioned earlier I just wrote a script that made curl requests. I could easily scale it up or down to do accurate load tests (or single-handedly crash the site).

Every tool has its limitations. Don’t let that prevent you from running the tests you need! You may need to run several tests at the same time in one tool, or use several tools, to fully understand what you’re dealing with.

Why this is important: use the tool that allows you to do a good test. Don’t just do the test that the tool allows you to do.

Not testing the caching system

An easy way to avoid the cache interfering in your load tests (as mentioned above) is to either direct your requests to servers that are behind the cache so it’s not in play, or make requests that can’t be cached.

However, you still need to make sure the cache is working correctly. Using a separate test, or a modified version of the test that verifies the backend load, you can confirm that the cache hit rate is what you expect when you have repeated traffic.
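A quick way to check this is to repeat a request and count the cache status headers coming back. The header name varies by CDN or proxy (X-Cache, CF-Cache-Status, Age, and so on), so treat the one below as a placeholder:

```python
import collections
import requests

URL = "https://example.com/popular-page"  # placeholder
CACHE_HEADER = "X-Cache"                  # adjust to what your cache returns

def measure_hit_rate(total=200):
    counts = collections.Counter()
    for _ in range(total):
        resp = requests.get(URL, timeout=10)
        counts[resp.headers.get(CACHE_HEADER, "unknown")] += 1
    for value, n in counts.most_common():
        print(f"{value}: {n} ({n / total:.0%})")

if __name__ == "__main__":
    measure_hit_rate()
```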

And you can make sure that request parameters won’t break the cache. On one site I worked with, there was an issue before my involvement where a surge in traffic included a new URL parameter passed by Facebook.

The cache was not configured to handle this parameter, so it treated all traffic as uncacheable and had a hit rate near zero. It had to be hotfixed in production, with a very high level of traffic on the site, to bring the site back up.
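The real fix usually lives in the CDN or proxy configuration, but the underlying idea is to build cache keys only from the parameters you actually vary content on, so unknown tracking parameters don't make every request look unique. A rough Python illustration with a hypothetical whitelist:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical whitelist: only these parameters should produce distinct
# cache entries; anything else (fbclid, utm_*, ...) is ignored.
CACHEABLE_PARAMS = {"q", "page", "category"}

def cache_key(url: str) -> str:
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in CACHEABLE_PARAMS]
    return f"{parts.path}?{urlencode(sorted(kept))}"

# Both of these map to the same key, so the tracking parameter
# no longer defeats the cache:
# cache_key("/search?q=lamp&fbclid=abc123") == cache_key("/search?q=lamp")
```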

If you have actions that trigger a full or partial cache flush, those should be tested too. You want to make sure that suddenly refreshing a significant portion of the cache during a high traffic period won’t overload your servers.

Constant cache flushing may take away most of the benefits of the caching system. But you can test a high frequency of these actions and see what the effect is. A system that I tested was able to handle cache flushing about 100x more than we realistically expected so it was clear this wouldn’t be a problem and we didn’t have to spend time on this.
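A simple way to run that kind of test is a small loop that fires cache-invalidation requests at a steady, configurable rate while your main load test runs alongside it. A sketch, assuming a hypothetical PURGE endpoint:

```python
import time
import requests

PURGE_URL = "https://example.com/purge/homepage"  # hypothetical endpoint
FLUSHES_PER_MINUTE = 60                           # crank this up to find the limit

def flush_repeatedly(duration_seconds=600):
    interval = 60.0 / FLUSHES_PER_MINUTE
    end = time.time() + duration_seconds
    while time.time() < end:
        # requests lets you send arbitrary HTTP methods such as PURGE.
        resp = requests.request("PURGE", PURGE_URL, timeout=10)
        print(f"purge -> {resp.status_code}")
        time.sleep(interval)

if __name__ == "__main__":
    flush_repeatedly()
```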

Sidebar: while you’re doing this you might also want to manually check that the cache flushing actually pushes through updates. If you have several layers of caching, they may continue serving old content even after you flush a lower level cache.

Why this is important: if you have fairly consistent traffic, you can monitor and tune the cache in production to get good results. If you are preparing for a large surge in traffic then you’ll need additional testing to make sure the cache works as planned, has a good hit rate, and caches the right requests.

Not using the real distribution of traffic

A site that I worked with used a content management system where pages could be cached but it was critical for updates to go out quickly, so there was a short TTL on the caching.

Many of the pages had relatively low traffic and weren’t that complex to regenerate. However one of the highest traffic pages, that needed very quick updates, also created a very high backend load when it was refreshed.

There were also some infrequent user actions that couldn’t be cached at all without significant changes and there wasn’t enough time to do those.

If we ran a load test that simply did each of the requests 1,000 times that would give us worthless results. It would be a mix of cached requests that don’t tell us the true backend capacity, and hammering the non-cached parts way more than we needed — which could crash the system before the test gave any useful results!

The solution required a set of tests that ran concurrently. Some of them loaded the high traffic, high workload pages and did regular cache resets to see how they handled that. Others did the infrequent actions that couldn’t be cached, and ran slower to simulate the actual amount of traffic those would get.
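As a rough sketch of what such a suite can look like (the URLs and rates are made-up placeholders, not the real site's values, and a real run would spread each profile across many more workers):

```python
import threading
import time
import requests

# Each profile runs at its own rate: heavy cached pages, an occasional
# cache reset, and slow uncached user actions. All values are placeholders.
PROFILES = [
    {"name": "hot pages", "url": "https://example.com/live-updates", "rps": 5},
    {"name": "cache reset", "url": "https://example.com/purge/live-updates", "rps": 0.05},
    {"name": "uncached actions", "url": "https://example.com/api/submit-entry", "rps": 0.5},
]

def run_profile(profile, duration=300):
    interval = 1.0 / profile["rps"]
    end = time.time() + duration
    while time.time() < end:
        resp = requests.get(profile["url"], timeout=10)
        print(f"{profile['name']}: {resp.status_code}")
        time.sleep(interval)

threads = [threading.Thread(target=run_profile, args=(p,)) for p in PROFILES]
for t in threads:
    t.start()
for t in threads:
    t.join()
```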

That suite of tests was what allowed us to get useful results, find out actual capacity limits, confirm that fixes worked, and check the stability of each part individually to make sure there were no weak points. None of that would have been possible if it simply ran through a pre-set list of URLs for a predetermined number of times.

Why this is important: most of your traffic is probably focused on a few parts of your site or system. Your caching rules are probably optimized around this already but there may be some exceptions. If you want to test how it handles higher traffic, the test needs to approximate the behaviour of those real users.

Getting Your Load Tests Right

These are a few common scenarios I’ve seen. They all come down to the same thing: your load test has to replicate real world conditions.

It’s just like debugging in development. If your development environment is nothing like production, you wouldn’t expect to find bugs quickly and produce a fix that works well. The first step is to make sure you’re seeing the same thing that you do in production.

And that’s the best way to think about load testing. First make sure it produces the same results you see in production. Then increase traffic using similar patterns to create a realistic workload. As we saw above that may involve different mixes of cached vs non-cached requests and unique vs repeating requests. Once you find a breaking point that you want to prepare for in live traffic, come up with fixes and run the same test against them.

This takes more time than just putting your URL into a load testing tool.

Can you just skip all of this and increase your server capacity so you have no problems? Maybe. If the cost to do that is less than a few weeks of engineering time that could be the best solution.

But that essentially means you have no load test or scalability measurement.

You can’t explain to the business and marketing teams how much traffic you can handle (especially when you expect a huge surge in traffic that doesn’t happen regularly). You won’t know which parts of the system scale gracefully and which parts actually get less efficient as the load grows.

You won’t come up with the best options to shield the unreliable parts and still create a good customer experience. And the added servers may not even be at the bottleneck so you might not see any benefit.

If you’re seeing crashes during high traffic periods, adding more servers has not been effective, and it’s causing problems for the business, you need a real and accurate load test.

You might have a CMS that delivers a constant stream of content, or an API that handles user requests for a popular mobile app. Either way you need to think about similar concepts of caching and traffic patterns to design that right load test that will help you.

A load test tailored to your specific system is the only way to get accurate results.

Do this and you will learn the real limitations you need to address. Before long you’ll be impressing the rest of the business with the stability and scalability of the backend!

Interested in more insights about scalability and performance? Follow me on Twitter for future articles and quick tips.


PHP developer. 21 years of experience with LAMP systems, scaling to support millions of users and billions of dollars in revenue.