Episode 3: The one with resiliency.

Concurrency and Distributed Systems: A guide with Kubernetes

Published in

Concurrency and Distributed Systems: A guide with Kubernetes

6 min readJun 4, 2022

In the previous episode we used K8s replicated deployment to deploy two instances of the service. The replicated architecture improves availability of the service and it’s a more resilient and fault tolerant architecture.

In this episode(and perhaps a few upcoming ones) we want to investigate the things we can do to make our distributed system architecture resilient to failures and issues.

Replicas

How many replicas are right given a set amount of hardware? This is actually an important question which could affect the final performance. Why is that?

More replicas is not always good and could negatively affect performance as we will see.
Too few replicas means less availability and hence less resiliency.
Too many replicas could contribute to too much network traffic.
Too many replicas could contribute to too many CPU context switches if multiple instances are hosted on one machine.

Let’s explore the last bullet point in action. In episode 2, we chose two replicas for our service and assigned 5 cores to each. What if we increase the number of replicas with the same amount of hardware? 5 replicas with 2 cores for each of them. We don’t expect the performance to improve, because we cannot create performance out of thin air. But, could performance degrade with this new replicas configuration?

spec:
  replicas: 5
...
            requests:
              cpu: "2"
              memory: "1G"
            limits:
              cpu: "2"
              memory: "1G"

Running the scale test with the same workload and 15 QPS yields the following results:

http.codes.200: ............................. 1500
http.request_rate: .......................... 15/sec
http.requests: .............................. 1500
vusers.session_length:
  min: ...................................... 80.6
  max: ...................................... 10583.2
  median: ................................... 963.1
  p95: ...................................... 4065.2
  p99: ...................................... 6312.2

Comparing to results from previous episode with two replicas, we have:

Execution time comparison between 2 vs 5 replicas (in milliseconds)

As we can see in the above table, the performance has degraded by at least 100 percent for 95th percentile. That is not good. Why is it though?

When you increase the number of replicas on a single machine, you have to consider that each of them is a separate process in its own container. There is scheduling and context switching cost that we will have to pay at the OS level as well as K8s level. That cost manifests itself in terms of time and affects the performance.

The take-away lesson is that, we need to allocate enough CPU cores to each replica and choose the number of replicas carefully. This is only one example of why more replicas are not necessarily good. Another example is when you have a server-workers architecture where you have a bunch of workers that are talking to a server or coordinator to get a job done. More worker nodes could contribute to artificial load on the server and could keep the server busy which then could adversely affect the clients.

We need to find that sweet spot for the right number of replicas based on:

Application architecture
Hardware resources
Workload

Make sure to carefully think through the auto-scaling scenarios as they could become your enemy in the production instead of your friend. We will talk about Auto-scaling and its implications in a future episode.

Throttling

In previous episode we saw that increasing the QPS to 20 caused the Prime Sum server to really struggle to keep up and 95th percentile increased by more than an order of magnitude. It went from 1400 ms to 26000 ms! That is not good for any of the users of the API. If we cannot serve at that high of a QPS, perhaps we should not allow that to happen. To tackle this problem, API and system designers usually resort to throttling to control the rate of input requests.

Throttling can be done in many places in the system: Gateway, Load-balancer or inside each of the service nodes and even in K8s(See this article for more info on this) as well:

In this episode we will see how we can implement a simple throttling strategy in the service layer(service nodes) to prevent bad players or attackers or simply well-intended users who don’t know that we cannot handle more than 15 QPS. This will help with the resiliency of our service in the following ways:

Reduces the queue backup in each process.
Reduces the context switching between two many tasks on the server side.
Improves resiliency of the server against possible DDoS attacks.

Throttling can be applied:

Globally on the whole service
Per user
Per IP address
Per API endpoints
Based on quota
Or a combination of any of the above

So, this topic would deserve its own whole set of episodes perhaps, but I’ll do my best to cover the basics here.

Alright, now the goal is to get practical and create a simple throttling mechanism in our beloved Prime Sum service. Let’s get into that.

Throttling in Action

We saw that our service can only handle 15 queries per second given the hardware we have. We had two replicas. So we can deduce that each replica can handle 7.5 requests per second. This number can change based on the given hardware and workload. Therefore, we should make sure that the throttling threshold is configurable.

One simple throttling strategy that we can implement is as follows:

Track how many requests are in flight(we are already doing this in the code)
Implement a guard in the API endpoint to quickly respond back with a 429 HTTP throttle code if the number of in flight queries are above a certain threshold.
Make sure the threshold is configurable.
Communicate this to the API users. This step is a formality here, but in real business cases, we need to make sure to announce throttling changes to the users. This will make sure they account for throttling in their code and implement the right retry mechanisms.

I am going to make the appropriate changes for throttling and scale test the new service. I might have to play around with the throttling threshold to make sure we are not throttling when we don’t have to as long as we are complying with the SLA.

Thankfully, implementing this in Spring framework is fairly simple. All we have to do is to check the current in flight amount with the threshold and throw an exception with the right HTTP status code if we are above it like so:

Throttling bit

Now, we need to set the environment variable THROTTLING_THRESHOLD in the deployment configuration like so:

Throttling configuration for K8s

I chose a very conservative value of 10 for the threshold. You have to realize that this threshold can actually be higher than what the process can handle(The nominal 7.5 QPS). With this threshold value we will run the scale test and:

http.codes.200: ..... 1483
http.codes.429: ..... 17
http.request_rate: .. 15/sec
http.requests: ...... 1500
http.response_time:
  min: .............. 11
  max: .............. 1958
  median: ........... 550.1
  p95: .............. 1326.4
  p99: .............. 1652.8

As can be seen from above, 17 out of 1500 requests were throttled. This is not terrible, but unnecessary. If we increase the threshold to 20(for 15 QPS test), we should see that the throttled requests will disappear. I’ll leave that to the interested readers to try.

But now for the actual part which is scale testing at 20 QPS with throttling. Previously we saw that without throttling, at QPS=20, the median execution time was about 21 seconds. Now here are the results with QPS 20 and throttling threshold set to 20:

http.codes.200: ........ 1723
http.codes.429: ........ 277
http.request_rate: ..... 20/sec
http.requests: ......... 2000
vusers.session_length:
  min: ................. 3.6
  max: ................. 5420.2
  median: .............. 1300.1
  p95: ................. 3752.7
  p99: ................. 4492.8

As can be seen 277 requests out of 2000 requests have been throttled. That is 13.85%. That is not too bad considering:

We could keep the execution time for the remaining 86% of users within the SLA.
Server is not over-utilizied and is stable.
Service is resilient to possible client misuse or abuse.

Those are actually huge wins we get with throttling. The ~14% of throttled users will come back later when the server is less busy.

This concludes this episode for now. But as you can imagine, there is a lot more to be covered in this area(and we will cover them). Fairness and starvation are two important things when it comes to throttling and they are no easy feat. In addition, there is more to resiliency than just replica counts. We are just taking some baby steps :)

Links

Episode 3 code

Episode 3: The one with resiliency.

Concurrency and Distributed Systems: A guide with Kubernetes

Written by Amin Shali