Ready, Steady, Crash!

My load testing journey …and how it came crashing down!

I was tasked with testing the technical limits of our Node application. Specifically, we needed to know where those limits lay, whether the application would crash under strain, and if so, whether it would fail gracefully for users or cause an absolute meltdown. The answer, I heard, could be found using load testing.

This blog post will cover the core concepts behind load testing. It will look at what it means to ‘load test’ your application, how to get started with writing your first script, and how to use the resulting metrics to identify a performance issue within your application.

It will also provide some useful tools and techniques that can be used to help solve a performance issue once identified.

What exactly is load testing? ¯\_(ツ)_/¯

A colleague of mine recommended an excellent talk by Jad Meouchy: ‘Load testing your node application’.

He described load testing in simple terms: “It is about simulating ordinary user activity and then applying stress until failure.”

Load testing the Biz Ops API

STEP ONE: App Overview

The purpose of the Biz Ops API is to provide a central store of all business information at the FT.

Image 1.1 illustrates the Biz Ops model and the type of information that is going to be stored within the database. As you can see, it will include information about our systems, how they relate to one another and how they relate to teams and people who are working at the FT.

Image 1.1 The Biz Ops Model

STEP TWO: How can I apply this so called ‘STRESS’ to my application?

First, I needed to decide on a testing framework.

There are many libraries out there, including popular ones like Apache Bench, Locust, Artillery.io and Wrk. I decided to go with Artillery.io because it allows you to run custom JS code within your scripts, it’s targeted towards continuous integration (CI), and it provides detailed metrics as you go along. The scripts can be configured using YAML and it can be installed using npm. So, for me, it was really a ‘no-brainer’ if I wanted to get started quickly.

npm install artillery --save-dev

Next, I needed to think about the type of requests that would represent ordinary user activity. My scripts needed to mirror actual user behaviour. The application is going to be the central store of all business information at the FT, so I figured a good starting point would be to investigate the load of read queries as users will need to be able to retrieve this data.

I needed to consider the type of load that would be hitting my application. As it’s still in the development stage, it was difficult to predict the number of concurrent users and how they would ramp in over time. However, this was crucial if I wanted to get my test to be as robust as possible. I therefore decided to split my script into four separate phases, each capturing a different type of traffic that might be hitting the application:

  • Warm up — an arrival rate of 10 virtual users/second, lasting 60 seconds.
  • Ramp up — ramping from 10 to 25 new virtual users/second over 120 seconds.
  • Cruise — an arrival rate of 25 virtual users/second, lasting 1200 seconds.
  • Crash — an arrival rate of 100 virtual users/second, lasting 30 seconds.

STEP THREE: What should my first script look like?

As you can see below, I have written my first script in order to test the load of read queries on the Biz Ops API.

The script’s written in YAML and has been split into two main sections: config and scenarios.

The config section is where you choose the target of your load test (the address of the API server under test), specify any plugins that you would like to use, and define the load phases.

The scenarios section is where you identify what the virtual user’s behaviour is going to be during your test. I’ve chosen four common read queries.

config:
  target: 'https://biz-ops-staging.api.ft.com'
  plugins:
    statsd:
      host: localhost
      port: 8125
      prefix: 'load.'
  phases:
    - duration: 60
      arrivalRate: 10
      name: 'Warm up'
    - duration: 120
      arrivalRate: 10
      rampTo: 25
      name: 'Ramp up'
    - duration: 1200
      arrivalRate: 25
      name: 'Cruise'
    - duration: 30
      arrivalRate: 100
      name: 'Crash'
  processor: "./setup"
scenarios:
  - name: "GraphQL query that returns a list of all the systems and their corresponding properties."
    flow:
      - post:
          url: "/graphql"
          beforeRequest: "setHeaders"
          json:
            query: "{ Systems { code name description primaryURL systemType serviceTier serviceType hostPlatform containsPersonalData containsSensitiveData lifecycleStage SF_ID } }"
  - name: "GraphQL query that returns the url of all the healthchecks that are live."
    flow:
      - post:
          url: "/graphql"
          beforeRequest: "setHeaders"
          json:
            query: "{ Healthchecks (isLive: true) { url } }"
  - name: "GraphQL query that returns a list of all the teams that are active and their corresponding properties."
    flow:
      - post:
          url: "/graphql"
          beforeRequest: "setHeaders"
          json:
            query: "{ Teams (isActive: true) { code name email slack phone isActive supportRota contactPref } }"
  - name: "GraphQL query that returns a list of systems that have a Bronze service tier and their corresponding properties."
    flow:
      - post:
          url: "/graphql"
          beforeRequest: "setHeaders"
          json:
            query: "{ Systems(first: 600, serviceTier: Bronze) { code name description primaryURL systemType serviceTier serviceType hostPlatform containsPersonalData containsSensitiveData lifecycleStage SF_ID } }"
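The script references a processor file (`processor: "./setup"`) and a `setHeaders` hook, which is how Artillery lets you run custom JS before each request. Below is a minimal sketch of what such a processor might look like. The header names and the environment variable are assumptions for illustration, not the real Biz Ops code:

```javascript
// setup.js — hypothetical sketch of the processor referenced by `processor: "./setup"`.
// The header names and the BIZ_OPS_API_KEY env var are assumptions.
'use strict';

function setHeaders(requestParams, context, ee, next) {
  // Artillery calls this hook before each request; mutate requestParams in place.
  requestParams.headers = Object.assign({}, requestParams.headers, {
    'content-type': 'application/json',
    'x-api-key': process.env.BIZ_OPS_API_KEY || '',
  });
  return next(); // hand control back to Artillery so the scenario continues
}

module.exports = { setHeaders };
```

Forgetting to call `next()` is a common mistake here, as the scenario will simply hang waiting for the hook to finish.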

Once you have the script set up, you can run the load test using the following command:

artillery run scripts/load-testing/readQueries.yaml

STEP FOUR: Great, I’ve got the metrics. But what do they actually mean?

Once you have run the above command and the load test has completed, Artillery will provide you with a summary report which is shown in Image 4.1.

Image 4.1 The Summary Report of the Load Test

As you can see, Artillery provided us with detailed metrics that helped analyse the behaviour of our application when under load.

  • Scenarios launched: the total number of virtual user sessions created.
  • Requests completed: the total number of HTTP requests and responses or WebSocket messages sent.
  • RPS sent: the average number of requests per second completed throughout the test.
  • Status codes: the HTTP status code responses from each request that has been completed.
  • Request latency: values are in milliseconds; p95 and p99 are the 95th and 99th percentile values.
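To make the percentile figures concrete, here is a simple nearest-rank sketch of how a p95 or p99 value can be derived from a set of latencies (an illustration of the idea, not Artillery's exact algorithm; the sample numbers are made up):

```javascript
// Nearest-rank percentile: sort the latencies, then pick the value at the
// index that covers p percent of the samples. Illustrative only.
function percentile(latenciesMs, p) {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

const latencies = [67, 70, 72, 80, 95, 120, 300, 900, 1500, 30000];
console.log(percentile(latencies, 50)); // -> 95: half of the requests took 95ms or less
console.log(percentile(latencies, 95)); // -> 30000: the slowest 5% dominate this figure
```

This is why a healthy-looking median next to enormous p95/p99 values points at a long tail of slow requests rather than uniformly poor performance.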

We noticed the request latency degraded rapidly during our performance test. It began with a minimum of 67.6ms, then the median request latency suddenly climbed beyond 30000ms. This was concerning, as we were only sending 20 RPS on average, and we expected our application to handle a much higher number of concurrent users.

Moreover, out of the 6000 scenarios that were launched and completed, more than 90% of requests received an HTTP status code of 5XX, indicating that the server had encountered an error or was simply incapable of performing the request.

STEP FIVE: How do we find the culprit?

The metrics about the response times, throughput rates and HTTP status codes helped identify a serious performance issue. What was the cause of the problem, and how could we find the breaking point?

We ran several tests, singling out each query and sending the metrics to Graphite and Grafana. After a couple of attempts, we had found the culprit. We were hitting 100% CPU when running the following query:

query: "{ Systems(first:600, serviceTier:Bronze) {code name description primaryURL systemType serviceTier serviceType hostPlatform containsPersonalData containsSensitiveData lifecycleStage SF_ID}}"

at a rate of 10 RPS (requests per second).

Image 5.1 Graph showing CPU usage and system load after running our load test

STEP SIX: How do we go about solving this and is it an application issue?

The CPU is the central processing unit in the computer that does all the computational work. The fact that we were hitting 100% CPU utilization at a very low level of RPS therefore led us to believe it was an application issue. We wanted to determine whether there was something in our code base that was causing the CPU to reach 100% utilization. As a result, we generated a Flame graph — using the plugin 0x — in order to profile CPU usage. With a Flame graph, all the data is on the screen at the same time, and the hottest code paths are immediately obvious as the widest functions. Annoyingly, there was nothing unusual on display: no single hot code path dominating CPU usage that we could try to fix. So, we needed to think about what else could be causing the performance issue.

Image 6.1 Flame Graph generated using 0x- profiling CPU usage

STEP SEVEN: Could it be the Neo4j driver?

Our database is a Neo4j instance run by GrapheneDB. When a user makes a read query using GraphQL, it will hit an execution layer (an npm module called neo4j-graphql-js) that will translate the GraphQL query to a Cypher query.

For example,

Systems(first:600, serviceTier:Bronze) {code name description primaryURL systemType serviceTier serviceType hostPlatform containsPersonalData containsSensitiveData lifecycleStage SF_ID}

would become

MATCH (system:System {serviceTier: "Bronze"}) RETURN system { .code , .name , .description , .primaryURL , .systemType , .serviceTier , .serviceType , .hostPlatform , .containsPersonalData , .containsSensitiveData , .lifecycleStage , .SF_ID } AS system SKIP 0 LIMIT 600

This would then send the request directly to our Neo4j instance.

We decided to look into the configuration of the neo4j-graphql-js and see whether we could identify a performance issue there.

The driver had a maximum pool size of 100 connections. This meant that if a session tried to acquire a connection while the pool was at capacity, it would have to wait until a free connection became available, or the request would time out, causing an HTTP status code 503. This seemed to suggest that we could be facing the performance issue because the number of connections was exceeding the limit when running our script.

Therefore, we increased the pool size to 300 and decreased the retry timeout to 10 seconds to see if this would resolve the performance issue. However, we were still not hitting the capacity of 100 connections and so increasing the pool size had no impact. Great.
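For reference, the change amounted to something like the following driver configuration. Treat this as a hedged sketch rather than the exact Biz Ops code: the option names follow the neo4j-driver 1.x documentation, and exact names can vary between driver versions.

```javascript
// Hypothetical sketch of the pool-size experiment described above.
const driverConfig = {
  maxConnectionPoolSize: 300,           // raised from the default of 100
  connectionAcquisitionTimeout: 10000,  // give up after 10s instead of queueing indefinitely
};

// With the neo4j-driver package (1.x) this would be applied roughly as:
//   const neo4j = require('neo4j-driver').v1;
//   const driver = neo4j.driver(uri, neo4j.auth.basic(user, password), driverConfig);
console.log(driverConfig);
```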

We then tried upgrading the version of the Neo4j driver from 1.3 to 1.6.2 to make sure that we had an up-to-date driver that was stable and performant.

We then discovered that the queries were not being parameterized correctly when using the driver. This has a significant impact on query performance because it prevents the caching of statements within the Cypher engine, meaning the query plan cannot be reused. Once this was corrected, we thought it would be the ‘fix’, but we were still not even close to solving the issue. We had ruled out an application issue and the Neo4j driver, so what next? Perhaps it was something to do with the query itself?
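To illustrate why parameterization matters, here is a hedged sketch (hypothetical helper functions, not the actual neo4j-graphql-js internals):

```javascript
// Interpolated: every distinct tier value yields a different query string,
// so the Cypher engine compiles a fresh plan each time and cannot reuse a cached one.
function interpolatedQuery(tier) {
  return `MATCH (s:System {serviceTier: "${tier}"}) RETURN s LIMIT 600`;
}

// Parameterized: the query text is constant and the values travel separately,
// so one cached query plan serves every tier.
function parameterizedQuery(tier) {
  return {
    text: 'MATCH (s:System {serviceTier: $tier}) RETURN s LIMIT $limit',
    params: { tier, limit: 600 },
  };
}

// With the neo4j-driver package, the parameterized form would be run roughly as:
//   session.run(parameterizedQuery('Bronze').text, parameterizedQuery('Bronze').params);
```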

STEP EIGHT: Cypher tuning, what is it?

Thankfully with Neo4j, you are able to see how a particular query performs by prefixing it with PROFILE.
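For example, profiling the problem query from Step Five looks like this (based on the generated Cypher shown earlier, with the property list trimmed for brevity):

```
PROFILE MATCH (system:System {serviceTier: "Bronze"})
RETURN system { .code, .name, .serviceTier } AS system
SKIP 0 LIMIT 600
```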

As you can see below in Image 8.1, the query performs a NodeByLabelScan, followed by a Filter and then a Projection, which gives us 14,213 total db hits.

A database hit is an abstract unit of storage engine work and could correspond to looking up a node, a property or a relationship.

The objective of query profiling is to find a way in which you can reduce the number of database hits required to perform the task whilst keeping the output level constant.

One way to reduce the number of database hits is to create an index on the starting point of your query. Therefore, we added an index on the serviceTier attribute in the hope that it would improve performance.

CREATE INDEX ON :System(serviceTier)
Image 8.1 Running PROFILE on the query without an index
Image 8.2 Running PROFILE on the query with an index

As you can see above in Image 8.2, the number of db hits reduced by 2,148. However, this did not resolve the performance issue. We went back to the docs to find more information on improving the performance of the query.

We came across the topic of compressed storage. It explained that Neo4j will try to classify strings in a short string class, and if it manages that then there is no string record needed to store the property. This means that it can be read and written in a single lookup, leading to performance improvements and less disk space required.

We removed long string attributes, such as monitoring, architecturalDiagrams, troubleshooting and descriptions, where values exceeded 200 characters, in order to determine whether this would have an impact on the query. Again, we saw little, if any, improvement.

STEP NINE: It must be a configuration issue with the database…right?

As the Cypher tuning didn’t fix our performance issue, we decided to look at the configuration of our Neo4j instance. We started by running :systeminfo in the Neo4j browser, which provided us with information about the size of our database. For example, the number of nodes and the number of relationships that exist in our Neo4j instance.

Image 9.1 Running :systeminfo in the Neo4j browser

From this, we could then use the Neo4j hardware sizing calculator. It recommended that we should increase the size of our total RAM and also the number of cores we had on our server. Perhaps this could explain why the CPU was maxing out, so we enquired about the different plans available for our GrapheneDB instance. We also noticed that the current plan (Standard 1) was hosted on shared resources. We wanted to see whether making the required changes would resolve the performance issue, or at least stop the CPU from reaching capacity when running our test.

Image 9.2 Neo4j hardware sizing calculator output

So, we upgraded the plan to Performance 1 and ran the scripts again. Confusingly, increasing the GrapheneDB instance size to Performance 1 with 8GB of RAM, 2 cores on the server and having a dedicated resource had actually reduced the overall performance.

This led us to believe that it could be a network I/O-bound issue. Perhaps, with limited I/O, the database results queued for processing by the CPU would back up, eventually overwhelming a struggling CPU.

STEP 10: What was the fix?

Unfortunately, there was no ‘obvious’ fix. The problem we had been trying to solve for over a week or so had all of a sudden vanished. From one day to the next, we were able to sustain 25 RPS, with the median request latency having decreased significantly to 999ms. Brilliant.

So, why have I shared this arduous journey with you?

The Key Takeaways

Part of being a Junior Engineer (or even a Programmer) is that sometimes, the solution to a problem isn’t always as straightforward or obvious as it might appear. The important thing is to be aware of the tools available to identify an issue, becoming comfortable with their use and being able to apply them with confidence. The process is a continuous one, and our troubleshooting skills improve in line with the number of tools we have managed to include within our own personal toolboxes.

This journey was, at times, incredibly frustrating, but it provided me with a real insight into the importance of load testing and the business implications that can arise as a consequence of not running a performance test on your application.

More importantly, I hope that in documenting my journey, this blog post can act as a starter guide for those new to load testing and will provide you with some key takeaways. In particular, what it means to ‘load test’ your application, what your first script should look like and the type of decisions you might need to make along the way.

I hope it has also provided a basic introduction to the meaning of the metrics and their interpretation, as well as introducing some different tools that can help identify and troubleshoot a performance issue. For example, sending the metrics to Graphite and Grafana can make visualising performance over time and identifying bottlenecks a lot easier.

Even though it didn’t lead to finding the culprit in this case, my journey has provided me with several new ways to diagnose our applications and that will help me identify and eradicate future culprits.

Whilst my load testing journey may not have been an easy one, it was certainly rewarding. It gave me another vital tool to include in my Programmer’s toolbox, and in sharing my experience I hope it will allow you to do the same.