Load Testing an API Gateway

Sneha Narayana Swamy
Just Eat Takeaway-tech
9 min read · Jan 6, 2022

We have a lot of independent, maintainable microservices owned by teams across JustEat. An API gateway sits in front of these microservices as an infrastructure layer that routes traffic from clients to the right microservice. The main goal of the API gateway, which we call Smart Gateway, is to route traffic securely while providing scalability and stability. The gateway reduces the possibility of security vulnerabilities, as it provides restrictive access to our microservices from the outside world, and it also protects service instances by load-balancing incoming requests across all available instances. We rate-limit the services to guard against unexpected spikes in traffic, which helps protect our platform from DDoS attacks.

Our team implemented and continuously maintains the Smart Gateway for microservices across all of JustEat, handling around 300 million requests per day. As the team responsible for the entry point to the JustEat platform, handling all the calls from consumer apps and websites, it is crucial for us to make sure that our platform is fast and efficient. We want our applications to be highly responsive and our API gateway to operate with as little latency as possible while providing high throughput, which in turn provides the best user experience for our customers. One of the factors that ensures our platform can handle this volume of traffic is a reliable suite of load tests that provides crucial data, helping us analyse system behaviour under various loads and track down performance issues before they reach production.

In this article, we are going to discuss:

  • How we stress-tested our infrastructure to find the breaking point and set our thresholds and scaling policies using the load generator tool, Hey
  • How we created a load test suite for continuous integration and regression testing using Locust, a scriptable and scalable load testing tool

Before we started our testing, we defined our scope: test everything that is part of the Smart Gateway infrastructure owned by our team, not the entire JustEat platform. We focused on testing whether Smart Gateway could route the traffic and on determining at what point the system stops coping with the expected load.

To have a production-like setup and still isolate the tests from calling the real backend services, we:

  • created a test/fake API, which we call FakeBackend, to be used as the backend service. This API returns a 200 status code and generates test response data (a minimal sketch of such a stub follows this list).
  • pointed the Smart Gateway to FakeBackend by replacing the services' host with the FakeBackend host in the Smart Gateway configuration, only for the duration of the test run
  • ensured the FakeBackend instances were scaled sufficiently so that they would not be a bottleneck during the test run

This forms the initial setup step, and FakeBackend is detached from the Smart Gateway at the end of the test run.
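
A minimal sketch of such a stub, using only the Python standard library, could look like the following; the port and response payload are placeholders, not our actual FakeBackend implementation.

# Minimal, illustrative FakeBackend-style stub: every request gets a 200 status code
# and a small JSON test payload. Port and payload are placeholders.
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class FakeBackendHandler(BaseHTTPRequestHandler):
    def _respond(self):
        body = json.dumps({"data": "test"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    # Answer the common HTTP methods routed through the gateway.
    do_GET = do_POST = do_PUT = do_OPTIONS = _respond

    def log_message(self, format, *args):
        pass  # keep the stub quiet under load


if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8080), FakeBackendHandler).serve_forever()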

Stress testing with Hey

There are a lot of load testing tools, each with their own pros and cons, but we chose Hey mainly for its simplicity, efficiency, and ability to generate a huge load. The main aim here was to run the tests with the expected load and ramp it up iteratively until we saw error status codes.

The tests were run with 3 instances of FakeBackend and 1 instance of the Smart Gateway. We looked for the threshold on a single Smart Gateway instance because the database load is minimal and we knew that the limiting factor would be the instances, not the database; we then used the resulting data to set the scaling policy. We created a test upstream service in Smart Gateway with the host set to FakeBackend and sent Hey requests to the Smart Gateway from a powerful EC2 instance.

./hey -n 1000000 -c 300 -m GET -T "application/json" http://smartgatewayHost/testService

We ran the tests with 300 concurrent connections and ramped up the total number of requests on each run, in increments up to 50 million requests.
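
As a rough illustration of that ramp-up loop, a small wrapper script around hey could look like the sketch below; the request counts, host and output file names are placeholders rather than our exact values.

# Illustrative ramp-up wrapper around hey (assumes the hey binary sits in the working
# directory); the request counts and host are placeholders, not our exact values.
import subprocess

SMART_GATEWAY_URL = "http://smartgatewayHost/testService"
REQUEST_COUNTS = [1_000_000, 5_000_000, 10_000_000, 25_000_000, 50_000_000]

for total in REQUEST_COUNTS:
    result = subprocess.run(
        ["./hey", "-n", str(total), "-c", "300", "-m", "GET",
         "-T", "application/json", SMART_GATEWAY_URL],
        capture_output=True, text=True, check=True,
    )
    # hey prints its latency histogram and status-code breakdown to stdout.
    with open(f"hey_{total}.txt", "w") as report:
        report.write(result.stdout)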

There are a lot of metrics that can be used to measure the performance of a system, but to analyse the results we focused mainly on the following:

Status code — we expected the status code to be 200 in this test run, indicating a successful response. As FakeBackend returns 200 for all requests, a status code other than 200 means that we have errors somewhere in the API gateway or that the system has reached its breaking point at that specific load.

Latency — the time from when the client makes a request until a response comes back from the upstream backend service. This is really important as it gives us insight into the additional latency added by introducing the API gateway, and we want to keep it as low as possible. We check the 99th percentile latency, which means that 99% of requests completed within that time, so it tells us the response time we can serve the vast majority of our consumers within.

Throughput — the number of requests the gateway can handle per second. We want this value to be as high as possible: the more requests we can handle per second, the more users we can support at the same time.

CPU load — CPU utilisation on the instance is important to note, as we want our instances to maintain stable CPU utilisation while handling a high volume of requests. This indicates that we have the right instance size for our requirements.
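
To make the latency and throughput metrics concrete, here is a small, purely illustrative calculation of the 99th percentile and requests per second from a list of per-request timings; the sample numbers are made up.

# Purely illustrative: 99th-percentile latency (nearest-rank) and throughput
# from a list of per-request timings. The numbers are made up.
import math

latencies_ms = [12, 15, 14, 13, 250, 16, 14, 13, 15, 14]  # sample request latencies
test_duration_s = 0.5                                      # sample wall-clock duration

ordered = sorted(latencies_ms)
rank = math.ceil(0.99 * len(ordered))   # nearest-rank method
p99_latency_ms = ordered[rank - 1]      # 99% of requests completed within this time
throughput_rps = len(latencies_ms) / test_duration_s

print(f"p99 latency: {p99_latency_ms} ms, throughput: {throughput_rps:.0f} req/s")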

By analysing the metrics mentioned above across these test results, we could gauge the performance of Smart Gateway and define our thresholds. This has given us confidence in the volume of traffic we can handle on our platform without any disruption.

Locust Load Tests

The Hey tool was easy to use for generating load and analysing results, which fitted perfectly for our exploratory testing and benchmarking. But we needed a tool that would give us more flexibility in:

  • Adaptability and easy maintenance of tests whenever there are changes in our Smart Gateway configuration — we have hundreds of routes that go through Smart Gateway, using different plugins and configurations, and we wanted to load test all of them to ensure changes don't break or degrade the performance of any route. With routes at this scale, it is hard to manually keep the test cases in the suite in line with the production configuration.
  • Easy integration with our powerful in-house load test orchestration tool, Rambo, which manages test configuration, execution and scheduling.

Locust was the obvious choice of testing tool, as it is easy to write and maintain the tests without a complex GUI like JMeter's. We also use Taurus as the automation tool: an abstraction layer on top of Locust that helps control the test runs with various test inputs through YAML config files. Both tools integrate well with our Rambo framework.

We went through several iterations of our Locust-based test framework to improve test coverage and make the suite easier to maintain. One of the biggest challenges was keeping the test cases in line with changes to the routes and services that go through Smart Gateway, which was manual and tedious. Locust proved to be the best choice when we decided to generate the test scripts at runtime from templates, removing the manual effort of maintaining these test cases.

  • After every change is merged into the master branch, a test Smart Gateway configuration is generated and uploaded to Artifactory, to be used in the setup step of the test run. The only difference in the test configuration is that the services are pointed to the fake backend.
  • We created a .NET application that takes the Smart Gateway configuration file and a Locust test file template and generates the test Python files. A binary of this application is stored in Artifactory and pulled during the setup step of the test run to generate the test files (a simplified sketch of this generation step is given below).
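
The generator itself is a .NET application, but the core idea can be sketched in a few lines of Python. The configuration shape and field names below are assumptions made for the sketch, not the real Smart Gateway configuration format; they only show how registered routes could be turned into the Classes/Cases model that the template expects.

# Illustrative only: our real generator is a .NET application. The config shape,
# field names and file format here are assumptions made for this sketch.
import json

def build_template_model(config_path):
    with open(config_path) as config_file:
        gateway_config = json.load(config_file)
    classes = []
    for service in gateway_config["services"]:
        cases = [
            {"Name": route["name"], "Method": route["method"].lower(), "Path": route["path"]}
            for route in service["routes"]
        ]
        classes.append({"Name": service["name"], "Cases": cases})
    # This Classes/Cases model is what the Handlebars template below is rendered with.
    return {"Classes": classes}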

An example Locust test file Handlebars template is given below. Each test case in the template is a call to a specific route, and the test cases are created from the routes registered in the Smart Gateway configuration.

from locust import HttpUser, TaskSet, task, events
import requests as req


@events.test_start.add_listener
def on_test_start(**kwargs):
    print("Setup Steps")


@events.test_stop.add_listener
def on_test_stop(environment, **kwargs):
    print("Teardown Steps")


{{ #each Classes }}
class {{ Name }}(TaskSet):
    {{ #each Cases }}
    @task
    def {{ Name }}(self):
        self.client.{{ Method }}(
            "{{ Path }}",
            verify=False,
            headers={"Content-Type": "application/json"},
            {{ #eq Method "post" "put" "options" }}json={"data": "test"},
            {{ /eq ~}}name="{{ Name }}")
    {{ /each }}
{{ /each }}


class LoadTestUser(HttpUser):
    tasks = [{{ #each Classes }}{{ Name }}, {{ /each }}]
    min_wait = 10
    max_wait = 15
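
Rendered against a registered route, each entry in the template expands into a plain Locust task. For a hypothetical service named OrderService with a GET route /orders, the generated code would look roughly like this:

# Roughly what the generator emits for one hypothetical GET route (names are illustrative);
# in the generated load_test.py this sits alongside the imports and listeners shown above.
class OrderService(TaskSet):
    @task
    def GetOrders(self):
        self.client.get(
            "/orders",
            verify=False,
            headers={"Content-Type": "application/json"},
            name="GetOrders")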

Test Suite Components

Putting all the pieces together, our test framework has a Rambo config file, Taurus config files, setup & teardown scripts and a Locust test file generated at runtime.

Rambo config .yaml — Schedules and orchestrates the test run and specifies which Taurus config to run for the profile.

---
owners: team1
plans:
  - environment: qa1
    tenant: uk
    profile: adhoc
    cpu: 2048
    memory: 4096
    testEngine: taurus
    params:
      # The Taurus config file that needs to be run
      command: test.yaml

Taurus config file — This is where we make use of Taurus's shellexec module to run our setup steps (i.e. scaling FakeBackend instances, applying the test Smart Gateway configuration and generating the load test file) and then execute the tests.

---
settings:
  env:
    smartGatewayHostAddress: http://test.com

services:
  - module: pip-install
    packages:
      - requests
      - urllib3
      - boto3
  - module: shellexec
    startup:
      - command: python3 ./aws_setup.py qa1
      - command: sh apply-test-smartgateway-config.sh
      - command: sh download-loadtest-generator-binary.sh
      # Generates load_test.py test file
      - command: ./SmartgatewayLoadTestGenerator
    shutdown:
      - command: python3 ./aws_teardown.py qa13
    post-process: echo $TAURUS_STOPPING_REASON

execution:
  - executor: locust
    concurrency: 10
    ramp-up: 1m
    hold-for: 1m
    scenario: LoadTest-example

scenarios:
  LoadTest-example:
    default-address: http://test.com
    # Run the load test cases from the load_test.py file generated in the setup step
    script: load_test.py

included-configs:
  - /bzt-configs/base-config.yaml

AWS helper scripts — These scripts, which use the AWS Boto3 SDK for Python, set up the test environment by scaling the FakeBackend instances up before the run and back down afterwards, ensuring FakeBackend is not the cause of any errors or performance issues.
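
A trimmed-down sketch of what such a helper could look like with Boto3 is shown below; the Auto Scaling group name and instance counts are placeholders, not our real values.

# Illustrative Boto3 helper for scaling the FakeBackend fleet before and after a run.
# The Auto Scaling group name and capacities are placeholders.
import boto3

autoscaling = boto3.client("autoscaling")

def set_fakebackend_capacity(desired_capacity, group_name="fakebackend-asg"):
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=desired_capacity,
        HonorCooldown=False,
    )

# Setup: make sure FakeBackend is not the bottleneck during the test run.
set_fakebackend_capacity(3)
# Teardown: scale back down once the run has finished.
set_fakebackend_capacity(1)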

Locust test file — The Python script containing the test suite is generated at runtime in the setup steps (see the Taurus config file example above).

With this setup in place, we use our Rambo framework to orchestrate the ad-hoc and scheduled test runs. Rambo also integrates with our logging and monitoring systems, which provide us with detailed stats and logs after each test run.

In summary, we have proved our ability to serve millions of users with quick response times, and our test suite automatically includes additional services as they are added to the gateway, ensuring it always reflects production traffic. With the confidence that comes from these load test results, we are always ready to provide our customers with a smooth food-ordering experience.

Just Eat Takeaway.com is hiring. Apply today!
