Intentional denial of service — Load Testing

Emanuel Velho
Jan 12, 2018

At Unbabel, we find ourselves with the power to change the world. As incredible as it sounds, we must first ensure that we don’t end up destroying it in the process. Like Peter Parker’s uncle once said, “with great power comes great responsibility”.

Every day we create and enhance systems and applications which must not only work with everything that came before, but become more performant and robust overall. While we can rely on unit test coverage and perform in-depth end-to-end system integration and exploratory tests to ensure that our changes are functionally correct, we also need to know that we’re on the right track towards a scalable solution, and not just building a different version of what came before.

The path to perfection is long and never-ending, but to better understand where we are along it, we use Load Testing. A non-functional test technique, it allows us to assess how an application behaves under a high volume of transactions, as if all of our users started making requests to the system at the same time. This way we can trigger errors in a controlled environment and validate whether our applications can handle random peaks of high usage, and whether they degrade gracefully or succumb to the load and crash not so gracefully.

At Unbabel, we start by identifying which services may constitute bottlenecks in our pipeline, either because of their extreme complexity or because they are known to be resource incinerators (CPU, RAM, disk I/O, etc.).

When we identify a service that falls into one of these categories, we start with a hypothesis for analysing it. Let’s take a hypothetical (and simplified) makeover of a service that receives two sets of data as input. The makeover itself isn’t important here, so we will skip over those details. This service must return a response in less than 300 ms, and we estimate that it should be able to satisfy 400 requests per second.

In terms of tooling, we chose JMeter, an open-source tool from the Apache Software Foundation, which we use to load, measure and analyse the performance of our services.

We start by creating a test plan that will be used to validate our hypothesis and later to compare results.

In JMeter, we parameterise the input request with the variables ${set1} and ${set2}. This is how we make our test plan data-driven: we create a CSV file in the same folder as our test plan, where we set different values for each test scenario.
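The original screenshot of the plan isn’t reproduced here, but as a rough sketch (and only an assumption about the exact layout), a minimal data-driven plan along these lines would sit in the test plan tree more or less like this, using standard JMeter components:

```
Test Plan
└── Thread Group                 (number of threads, ramp-up period, loop count)
    ├── CSV Data Set Config      (reads set1, set2 and result from the CSV file)
    ├── HTTP Request             (sends ${set1} and ${set2} as input parameters)
    │   └── Response Assertion   (checks the response against ${result})
    └── Listeners                (Summary Result plus the Over Time graphs below)
```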

This file has 3 columns: set1 and set2, which are sent as input, and result, which is used to assert the request’s response.
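The actual values depend on the service under test, so here is only a minimal sketch of such a file with made-up placeholder values; if the CSV Data Set Config’s “Variable Names” field is left empty, JMeter takes the variable names from the header row:

```
set1,set2,result
inputA1,inputB1,expected1
inputA2,inputB2,expected2
inputA3,inputB3,expected3
```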

Test Plan Listeners

Please note that these listeners come with the “Graphs Generator Listener” plugin, so you have to install it first, either through JMeter’s Plugins Manager or by downloading it from the plugin page, unzipping it and dropping it into the JMeter folder “jmeter/#version/libexec/lib/ext/”.

Now we are ready to run the test!
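With the plan and the CSV file in place, the test can be run from the GUI or, more realistically for load tests, in non-GUI mode; the file names below are just placeholders:

```
# Run the test plan in non-GUI mode and write the raw samples to a results file.
jmeter -n -t load-test.jmx -l results.jtl
```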

We start by analysing the results with the Summary Result listener:

We ran 100K samples with no errors and an average response time of 347 ms; the standard deviation of the response time is around 23 ms.

In JMeter, throughput is the ratio of requests to elapsed time, and in this report we can see that the average throughput is 270 rps (requests/transactions per second), which for 100K samples works out to roughly six minutes of test time.

Although this already hints that our service doesn’t satisfy the performance requirements, we should check our Over Time results so we can better understand what’s wrong and find out what to improve.

In the Transactions Per Second Over Time listener, we can see that the throughput is not evenly distributed. Despite being able to reach the 300 rps mark, we were not able to keep it steady, which means we are not able to process the required throughput.

As the name indicates, the Response Times Over Time graph represents the time that was spent connecting, processing and receiving the requests over time. We can see that response times sit steadily above 300 ms, which means we are also not complying with the requirements.

The Response Times Over Threads listener is used to understand the effort that each additional concurrent user brings to our application and to check when our performance starts degrading. We may accept that a higher number of concurrent threads/users leads to a small increase in response time, but ideally it should be steady, with almost no oscillations. In our results, we swing between 320 ms and 365 ms, which is acceptable but a sign that we need to make our service faster.

The Connect Times Over Time graph illustrates the time elapsed until we connect to the server, including the SSL handshake, per request over time. Connect time can significantly impact the response time, but it also depends on other factors, such as our internet connection.

As we can see, it takes at most 25 ms to connect to the server, so it looks like we are not facing any problems on this front.

As we didn’t have any errors in our requests, this graph is the same as the Transactions Per Second graph, but Response Codes Per Second is useful to assess whether our system can easily recover in case of an error. If we observe that, after an HTTP error code such as a 400, the number of 200 responses decreases, we should take a look at how we recover from errors.

After analysing our code, we found that we had made it overly complex, so we were able to simplify the service and make the code more efficient.

After assuring that the service was still functionally correct, we needed to run the load tests again to check whether our refactor works as expected.

So let’s check the results. This is our Summary Result after the changes:

Now the average throughput is at 450 rps and the average response time hits 200 ms, with no errors at all.

So we’re compliant with the requirements! Aw yeah!

Still, let’s make sure and look a little further.

This is the current behaviour of our service under load. We managed to perform the same 100K requests in 3 minutes and 40 seconds.

As this is an academic example, we didn’t stretch out the period under stress, but in real life we should. The easiest way to do it is to go to the test plan’s Thread Group and increase the load, either by setting it to loop continuously for a longer time period or simply by setting a higher loop count, as in the sketch below.
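One convenient way to do this (a sketch with a property name we made up, not something from the original plan) is to let the Thread Group read the loop count from a JMeter property and override it at run time:

```
# In the Thread Group, set "Loop Count" to: ${__P(loops,100)}
# __P falls back to 100 when the property isn't set; override it when launching the test:
jmeter -n -t load-test.jmx -l results.jtl -Jloops=100000
```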

With the exception of 3 outliers, we managed to keep our response times well under 300 ms. So this is also a check on our list.

Response times that continually increase over time are a strong sign that our service has a bottleneck that needs to be fixed.

We’ve enabled our service to handle 100 concurrent users in under 300 ms. Still, we could try to find out at how many users we start having a hard time. To do this, we need to increase the number of threads and the ramp-up period in our Thread Group configuration, as shown in the sketch below.
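The same property trick from the loop-count sketch works here; again, the property names are hypothetical:

```
# In the Thread Group, set the number of threads to: ${__P(threads,100)}
# and the ramp-up period (seconds) to:               ${__P(rampup,60)}
jmeter -n -t load-test.jmx -l results.jtl -Jthreads=200 -Jrampup=120
```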

Connect times were also as expected, close to 0 with a few acceptable outliers.

If in your tests you get high connect times in relation to the total response time, try to ping other servers. If the ping time is significant, it may indicate that your internet connection is impacting your connect time, so you should also check your connection.
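A quick sanity check from the machine generating the load could be as simple as this (the host is just a placeholder):

```
# Send 10 pings and look at the round-trip times reported at the end.
# -c works on Linux/macOS; on Windows use -n instead.
ping -c 10 example.com
```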

As expected, since there were no errors, we ended up again with a graph similar to the Transactions Per Second graph. The less colourful this graph is, the happier we are.

We covered some of the most common aspects of load tests:

  • Assessing non-functional requirements such as minimum requests per second, peak usage and average response time with Transactions Per Second, Response Times Over Threads and Response Times Over Time;
  • Looking for bottlenecks in our services by analysing Response Times Over Threads;
  • Determining whether our tests are being affected by a poor internet connection with Connect Times Over Time;
  • And finally, checking whether our system can handle errors and is able to recover with Response Codes Per Second.
