Load Testing

FanDuel
May 20, 2016

A Little Background

The daily fantasy sports market is growing significantly — the number of users we have continues to multiply every year. American football (NFL) is the most popular fantasy sport by a fair margin, annually acquiring the most new users. This means that around September every year — the start of the NFL season — we see new peak loads that are significantly higher than previous maximums.

With no ‘warm up’ or slow ramp up of traffic, we need to be confident that the platform will hold together when the first NFL Sunday arrives and the floodgates open:

[Image from sploid.gizmodo.com]

Our users need to be able to edit their contest lineups right up to the point where the contests start. If a popular athlete is injured or otherwise taken out of a team’s starting lineup (known as a “late scratch”), then there will be a mad scramble to replace them. This behaviour is exacerbated in the last hour before game start.

With only a few minutes until game start (when all our users’ contest lineups become ‘locked’), it is critical that our platform can handle these sudden and significant traffic spikes. This situation is further compounded by the write-heavy nature of the traffic. FanDuel puts a lot of effort into calculating expected levels of throughput and then load tests the platform to those levels and beyond.

Acquiring Target

Initially, we approached the project from a “stress testing” perspective, but that name implies testing to the point of degradation or destruction rather than to a specific load target. In some ways, stress testing is easier than load testing — one can just keep adding load until something explodes. Knowing when a system degrades or explodes is interesting, but it’s only half of the story. The other half is determining the load that the system needs to handle. Most people don’t care if the system blows up at 1,000x the expected peak production load, which raises the question: how do we know what the load targets should be?

We identify the expected number of users based on two factors:

  • the existing user base
  • information collected during FanDuel’s marketing campaign in the lead-up to the NFL season

We then combine the upper end of expected user numbers with models of user behaviour derived from previous years. This gives us the expected maximum throughput for various user actions such as sign up, deposit, enter contest and so on. It also gives us some idea of how these actions may be combined. For example, it’s common for a user to sign up, immediately deposit and then enter a contest.
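
As a rough illustration of the arithmetic (the user counts, rates and headroom below are made up for this post, not our real projections), the per-action targets come from multiplying the upper user estimate by the modelled per-user action rates:

```python
# Hypothetical sketch of deriving load targets: the user counts and per-user
# rates below are made up for illustration, not FanDuel's real projections.

# Upper estimate of users active in the final hour before lineups lock.
peak_active_users = 500000

# Modelled actions per active user per minute, derived from previous seasons.
actions_per_user_per_minute = {
    "sign_up": 0.02,
    "deposit": 0.05,
    "enter_contest": 0.5,
    "edit_lineup": 2.0,  # late scratches make this the hot path
}

# Convert to a target requests/sec per action, with headroom beyond the model.
headroom = 2.0
for action, per_minute in actions_per_user_per_minute.items():
    target_rps = peak_active_users * per_minute / 60.0 * headroom
    print("%s: ~%d requests/sec target" % (action, target_rps))
```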

The load tests we perform fall broadly into two categories: individual system tests and integration tests. We discovered that doing a good job of the former activity makes the latter much easier. We’re well on the way to a fully service-oriented architecture, and having each service provide comprehensive self-monitoring is a huge benefit. I dream of the day when a single monitor shows a bunch of green boxes with interconnecting green lines, and smiley emojis popping all over it like fireworks. Maybe one for our next hackathon!

Tools — Test Runner

Many of the internal systems under test operate some form of request/response cycle. Some are externally-facing web systems (several different systems drive various parts of FanDuel’s main website); others are internal services for things like user authorisation, payments, entering contests and so on. We did some experiments with Locust, Tsung and JMeter; Locust seemed the best fit for a number of reasons:

  1. Locust is heavily geared towards testing websites, with good support for HTTP request/responses.
  2. Since Locust is implemented in Python, we are able to plug our pre-existing internal service interface packages (which also happen to be written in Python) straight into it, making it very easy to extend Locust to load test a number of non-HTTP systems (see Locust’s docs for an example of how to do this, and the sketch after this list).
  3. Locust provides a decent set of events that makes hooking our own monitoring into it quite straightforward.
  4. Locust supports a master / slave setup out of the box, which means it’s easy (with enough slaves) to produce the levels of load we need, which in some cases is well above 100k requests/sec.
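
The pattern for non-HTTP systems boils down to timing each call ourselves and firing Locust’s request events, as described in the Locust docs. Below is a minimal sketch using the pre-1.0 Locust API that was current at the time; `internal_clients.AuthClient` and its `authorise` call are hypothetical stand-ins for one of our Python service-interface packages:

```python
# Minimal sketch of wrapping a non-HTTP internal client so that its calls show
# up in Locust's statistics (pre-1.0 Locust API). "internal_clients.AuthClient"
# and its methods are hypothetical stand-ins for an internal service package.
import time

from locust import Locust, TaskSet, task, events

from internal_clients import AuthClient  # hypothetical internal package


class MeasuredAuthClient(object):
    """Proxy that times each call on the wrapped client and reports to Locust."""

    def __init__(self, host):
        self._client = AuthClient(host)

    def __getattr__(self, name):
        func = getattr(self._client, name)

        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                result = func(*args, **kwargs)
            except Exception as e:
                elapsed_ms = int((time.time() - start) * 1000)
                events.request_failure.fire(
                    request_type="auth", name=name,
                    response_time=elapsed_ms, exception=e)
            else:
                elapsed_ms = int((time.time() - start) * 1000)
                events.request_success.fire(
                    request_type="auth", name=name,
                    response_time=elapsed_ms, response_length=0)
                return result

        return wrapper


class AuthTasks(TaskSet):
    @task
    def authorise(self):
        # Hypothetical call on the wrapped internal client.
        self.client.authorise(user_id="123", session_token="abc")


class AuthServiceUser(Locust):
    task_set = AuthTasks
    min_wait = 1000  # milliseconds between simulated user actions
    max_wait = 3000

    def __init__(self, *args, **kwargs):
        super(AuthServiceUser, self).__init__(*args, **kwargs)
        self.client = MeasuredAuthClient(self.host)
```

Because these are the same timing events Locust fires for HTTP requests, the master’s statistics table and the monitoring hooks described below work unchanged for non-HTTP tests.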

We use Ansible to provision the Locust cluster, as it’s well suited to the on-demand provisioning we need when tests change rapidly. We use Puppet to provision the actual systems under test, which gives us confidence that the test systems are representative of those in production. The Locust master provides a basic web interface, letting us start and stop tests and control how many simulated users to run and how quickly to ramp up the load. During and after the test, it displays a table of timing statistics: min and max response times, requests/sec and so on. This information is useful as a snapshot of the running test, but we’ve also hooked into a number of the events provided by Locust to report statistics to our monitoring platform.
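
For example, attaching listeners to Locust’s request events and forwarding each timing to a StatsD daemon looks roughly like the sketch below (again the pre-1.0 event API; the StatsD address and metric names are illustrative):

```python
# Sketch: forward Locust's per-request timings to a StatsD daemon.
# The StatsD address and metric names are illustrative, not our real config.
import statsd
from locust import events

stats_client = statsd.StatsClient("localhost", 8125)


def report_success(request_type, name, response_time, response_length, **kwargs):
    # response_time is already in milliseconds.
    stats_client.timing("loadtest.%s.%s" % (request_type, name), response_time)


def report_failure(request_type, name, response_time, exception, **kwargs):
    stats_client.incr("loadtest.%s.%s.failure" % (request_type, name))


# Locust fires these events for every simulated request on every slave.
events.request_success += report_success
events.request_failure += report_failure
```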

Tools — Monitoring

Historically we’ve used Graphite to monitor systems, with StatsD and collectd used to gather the underlying data. Recently we’ve started migrating to Datadog, a commercially hosted service. Datadog provides the same basic functionality as Graphite, augmented with several improvements and new features.

There are two key metrics we want every system to report:

  • Its own response times
  • The response times of any other system with which it communicates

We’re primarily using Datadog’s histogram metric via its own beefed-up StatsD daemon for timing measurements. This provides the basic min, max, total, rate etc. Datadog also allows us to tag metrics with any number of arbitrary key-value pairs, and you can then filter, aggregate or compare data based on the tags, as well as the main metric name. This lets us slice the same metric a number of different ways, without having to report it under different names, as you would need to do with vanilla StatsD. We can report a metric like ‘my-services-redis.response-time’, and tag it with things like ‘command:get’, ‘shard:123’ etc. We can then easily see the average Redis response times for our service as a whole, as well as break it down by command, shard, or any other tag we’ve set. This is a big help if we have, for example, a single badly-behaving shard in a large cluster.
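
A sketch of what this looks like with the Python datadog library’s DogStatsD client (the Redis call, metric name and tag values mirror the example above and are purely illustrative):

```python
# Sketch: reporting a tagged timing to Datadog via DogStatsD.
# The Redis call, metric name and tags are illustrative examples.
import time

import redis
from datadog import initialize, statsd

initialize(statsd_host="localhost", statsd_port=8125)
redis_client = redis.StrictRedis(host="localhost", port=6379)

start = time.time()
redis_client.get("lineup:123")
elapsed_ms = (time.time() - start) * 1000

statsd.histogram(
    "my-services-redis.response-time",
    elapsed_ms,
    tags=["command:get", "shard:123"],
)
```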

We monitor various non-timing metrics on a per service basis — thread or connection pool counts and statuses, number of connected clients, and so on. There are Datadog integrations for a wide variety of services, and we are using these for most of the services that we didn’t write ourselves. Doing this gives us a way to trace the response times of the (public) front end all the way down through the stack, with each system reporting its own metrics as well as which sub-systems are slowing down. We monitor CPU, memory, network etc on all the instances, so we are able to see at a glance if a system is overloaded.

Combining the information from each system with the stats reported by the Locust slaves, we have a complete picture of the throughput, response times and statuses, all the way from Locust at the outside edge, through to the various backend data stores.

Example

Here’s an example of some of the data captured during a system test of one of our core services. We configure Locust to ramp up to 40,000 users, although the test is manually stopped at around 20,000. Each Locust user is configured to log in and then make 10 authorisation requests per second, with a small chance of logging out and then logging back in as a different user after each request.
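
A rough sketch of that simulated behaviour (pre-1.0 Locust API) is below; `MeasuredAuthClient` is the hypothetical wrapped client from the earlier sketch, and the `log_in`/`log_out`/`authorise` calls are illustrative:

```python
# Sketch of the simulated user behaviour described above.
# MeasuredAuthClient is the hypothetical wrapped client from the earlier
# sketch; the log_in / log_out / authorise calls are illustrative.
import random
import uuid

from locust import Locust, TaskSet, task

from loadtest_clients import MeasuredAuthClient  # hypothetical module


class AuthorisationBehaviour(TaskSet):
    # 100ms between tasks gives roughly 10 authorisation requests/sec per user.
    min_wait = 100
    max_wait = 100

    def on_start(self):
        self.log_in_as_new_user()

    def log_in_as_new_user(self):
        # Each simulated user logs in with throwaway credentials.
        self.user_id = uuid.uuid4().hex
        self.client.log_in(self.user_id)

    @task
    def authorise(self):
        self.client.authorise(self.user_id)
        # Small chance of logging out and back in as a different user.
        if random.random() < 0.01:
            self.client.log_out(self.user_id)
            self.log_in_as_new_user()


class AuthServiceUser(Locust):
    task_set = AuthorisationBehaviour

    def __init__(self, *args, **kwargs):
        super(AuthServiceUser, self).__init__(*args, **kwargs)
        self.client = MeasuredAuthClient(self.host)
```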

The number of requests per second increases linearly up to around 140,000/second, at which point the response times from our service start to degrade. Both Locust’s statistics and the service’s own statistics agree that the response time is degrading. We can therefore reasonably deduce that the bottleneck lies within the service component, rather than in the network layer between the Locust and service instances. If the network had been congested or slow, we’d expect to observe the service component responding quickly whilst Locust reported slow responses.

We can now look at other metrics recorded by the service component to trace the degrading response times to their root cause. The load average and CPU utilisation are quite reasonable — we know from previous tests that response times aren’t significantly affected until instances drop below 10–20% CPU idle. The service uses a number of Redis databases for persistence, and we can see from the statistics that Redis response times are degrading. Because the data is split by tags (cluster, command), we can see that all commands to one specific cluster are degrading. Further analysis (of metrics not shown here) confirms that this Redis cluster is CPU bound. In this specific example we use twemproxy to horizontally scale the Redis backing store, so the next step would be to add more capacity to that cluster.

Next steps

The main goal of the project is to identify the bottlenecks that could prevent us from reaching target capacity. In some cases these can be mitigated by adding servers to a pool or by increasing the upper bound of auto-scaling groups. In other cases the software and systems need some re-work to remove the pinch point, which is often a single node. The (horizontal) scaling of systems is a significant topic in its own right, so I’ll leave that as the subject of a future post. The project is also providing a significant secondary benefit: improved monitoring capability across the platform. Each time a bottleneck is found, we have monitoring in place to detect it. The ideal scenario is for the metric to show some form of early warning signs, which can be fed into the existing automated alerting tools.

Tom Dalton, Senior Python Engineer and Part Time Load Tester
