How we scaled a Data microservice on Kubernetes

The story of how our data team performed load testing to validate the scalability of one of their key microservices.

Tanakorn Kriengkomol
Vestiaire Connected
Mar 30, 2023


Kubernetes services | Growtika via Unsplash

What is load testing?

Load testing is a performance test that measures how software responds under different real-world load conditions.

This phase is a critical part of a microservice's lifecycle: it is the main piece of the puzzle that ensures the software will handle real-world load as expected.

At Vestiaire Collective, Predator is our official load-testing tool, maintained by our platform team. It's a powerful, flexible tool that lets us run unlimited tests at low cost.

Predator UI

In this article, we're going to walk through each step of load testing our CRIME service. CRIME is an in-house piece of software dedicated to flagging counterfeit products. Recently, we developed a new machine learning model and integrated it into CRIME.

The goal of this post is to share the strategy and learnings from CRIME's load testing process, so that you can better understand how we make sure that every microservice we ship to production is scalable.

At the end of the performance improvement phase, we wanted CRIME to:

1. Be able to handle 50 RPS (requests per second).

2. Have a p95 response time < 1 second.

A little context

In the next sections, we will mention several elements that play a role in the CRIME microservice architecture, namely Snowflake, DatAPI, the Pricing service, Datadog and Grafana. Getting a rough idea of the role each one plays will help you follow CRIME's load testing process.

Here’s a brief glossary of those elements. Don’t hesitate to come back to these definitions later in your reading!

Technical glossary

Since a picture is worth a thousand words, here’s a diagram of the CRIME service architecture.

CRIME service architecture

Let’s put CRIME to the test

We ran the test in the production environment, gradually increasing the load on CRIME. This way, we could keep the impact on the other microservices relying on it as small as possible.

You can find the results in the table below.

First load testing results for CRIME

*At 25 RPS, the load caused high latency on the CRIME service and affected other production calls. When we checked the dependent services, here's what we saw.

Database query latency

DatAPI uses PostgreSQL as a back-end database.

The table used to serve CRIME's features didn't have an index on its key column.

Spike in CPU usage

CPU utilization of the CRIME service rose far above its usual level.

Metrics from Grafana
Metrics from Datadog
Other metrics of the service during the load test

The spikes in figures 4 and 5 represent the increase in CPU usage for the 10 RPS, 20 RPS, and 25 RPS (canceled early) load-testing scenarios respectively. The Grafana and Datadog screenshots cover roughly the same period.

What we concluded

CRIME was able to serve at most around 20 RPS for a short time (testing tasks lasted 5 minutes each) and was very sensitive to DatAPI performance.

Possible improvements

Thanks to the various tests, we were able to identify three different areas of improvement that could boost CRIME’s performance.

1. Add an index on the key column of the PostgreSQL table.

2. Change the CPU and memory configuration of our pods.

3. Increase the number of serving pods for CRIME and DatAPI.

Optimizing CRIME in the preproduction environment

Every time we reach the optimization phase, the idea is to get a general feeling for which change will be most impactful. That's why we decided to optimize CRIME along two axes: CPU and max replicas.

The tests were all run with the Predator settings below (a rough equivalent of this load profile is sketched right after the list).

Starting RPS: 10 RPS

Ramp to: 100 RPS

Duration: 10 min

Baseline: Initial configuration before optimization
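Under the hood, Predator runs Artillery-based load scripts, so the ramp above corresponds roughly to a phase definition like the one below. This is only a sketch: the target host and endpoint are placeholders, not the real CRIME URLs, and the exact schema our Predator setup generates may differ.

```yaml
# Rough Artillery-style equivalent of the Predator ramp used for these tests.
# The target and endpoint are placeholders, not the real CRIME service.
config:
  target: "https://crime.example.internal"
  phases:
    - duration: 600     # 10 minutes
      arrivalRate: 10   # start at 10 RPS
      rampTo: 100       # ramp linearly up to 100 RPS
scenarios:
  - flow:
      - post:
          url: "/predict"         # hypothetical scoring endpoint
          json:
            product_id: 12345     # hypothetical payload
```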

First, we set the initial CRIME preproduction pod configuration to match production, so that our baseline environment was as close to the production environment as possible. We later tweaked this configuration to improve CRIME's performance.

Here is the initial configuration of the CRIME pod (the corresponding manifests are sketched after the list).

CPU: 200m, 400m

Memory: 500Mi, 1Gi

Min Replicas: 2

Max Replicas: 3

Target CPU Utilization: 70%
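For readers less familiar with how these values map onto Kubernetes objects, here is a rough sketch of the corresponding manifests, reading each pair of CPU and memory values as the request and limit. The names, labels and image are illustrative, not our actual configuration.

```yaml
# Illustrative sketch only, not our actual manifests.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: crime
spec:
  selector:
    matchLabels:
      app: crime
  template:
    metadata:
      labels:
        app: crime
    spec:
      containers:
        - name: crime
          image: registry.example.com/crime:latest  # placeholder image
          resources:
            requests:
              cpu: 200m
              memory: 500Mi
            limits:
              cpu: 400m
              memory: 1Gi
---
# The HPA keeps between 2 and 3 replicas, targeting 70% average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: crime
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: crime
  minReplicas: 2
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```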

Results

Throughput maxed out at around 27–30 RPS, creating a bottleneck that made the remaining requests stagnate. CPU was also maxed out and could not serve more requests.

Initial configuration
CPU utilization for each pod — Initial configuration

Optimization #1: Increase max replica pods

We increased the maximum number of replicas from 3 to 10.

CPU: 200m, 400m

Memory: 500Mi, 1Gi

Min Replicas: 2

Max Replicas: 10

Target CPU Utilization: 70%

Results

Increasing max replicas did help to a certain extent, but the response time was still too high.

Also, the deployment never reached its new limit of 10 replicas; it maxed out at around 6–7 instances.

Only increase max replicas
CPU utilization by each pod — Increase replica

Optimization #2: Increase CPU size

Next, we increased the CPU from 200m / 400m to 700m / 1.2 (i.e. 1.2 cores).

CPU: 700m, 1.2

Memory: 500Mi, 1Gi

Min Replicas: 2

Max Replicas: 3

Target CPU Utilization: 70%

Results

Increasing the CPU size helped much more than simply increasing the maximum number of replicas. With this configuration, we concluded that we should be able to serve up to 50 RPS, which was our target!

Only increase CPU size
CPU Utilization by each pod — Increase CPU

Note

Most of the request time was spent waiting for external calls to return. The bottleneck was not the model inference itself, which completed in under 50 ms for most calls.

Results of optimization in production

After moving from the initial findings to testing multiple configurations of the CRIME service, we reached the end of our optimization process.

CRIME could now handle loads of 40 RPS, provided there were no load spikes on its external dependencies, i.e. DatAPI and the Pricing service.

The charts below show load testing results obtained from the production environment.

Most of the time, request latency was under 1 second. However, we could still observe high latency spikes whenever CRIME spawned new instances. This happens because, in the current implementation, each CRIME container initializes and sets up its models for inference at startup (the classic cold-start behavior).

Load test latency — final configuration
Load test RPS — final configuration
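One standard Kubernetes way to soften such cold starts is to gate traffic behind a readiness probe that only succeeds once the model is loaded, so new pods don't receive requests while they are still initializing. The snippet below is an illustration of that pattern rather than what we actually shipped, and the /healthz path and port are hypothetical.

```yaml
# Illustrative readiness probe on the serving container: the /healthz endpoint
# is hypothetical and would need to report ready only after the model is loaded.
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10   # give the container time to load the model
  periodSeconds: 5
  failureThreshold: 3
```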

Final configuration

CPU: 500m, 1000m

Memory: 500Mi, 1Gi

Min Replicas: 2

Max Replicas: 10

Target CPU Utilization: 60%
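Compared with the baseline manifest sketch earlier in the post, only a few fields change. Assuming the same illustrative layout, the final values map onto the Deployment and the HPA roughly as follows.

```yaml
# Illustrative fragments only: the fields that differ from the baseline sketch.
# Deployment container resources:
resources:
  requests:
    cpu: 500m
    memory: 500Mi
  limits:
    cpu: 1000m
    memory: 1Gi
---
# HorizontalPodAutoscaler spec (scaleTargetRef omitted):
minReplicas: 2
maxReplicas: 10
metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
```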

Conclusion

Although we did not reach the target of 50 RPS, the throughput we achieved after optimization is good enough for our planned use case. On the response time side, it is more than good enough, as most requests were answered within 1 second.

For the CRIME service, most of the bottlenecks came from insufficient CPU resources. Increasing the CPU size of each pod really helped scale up the load the service could handle. But since the service still depends on DatAPI, our next step will be to look into improving DatAPI as well, to ensure that all of our data team's services work well together.

This optimization was only possible thanks to the good tooling available to us. Predator, as a load-testing tool, makes it very easy to iterate on a change and see its impact immediately. In addition, Datadog and Grafana, which we use for service monitoring, give us a detailed view of service performance and valuable insight into where the bottlenecks are.

Picture by Fab Lentz via Unsplash
