Episode 2: The one with replication.

Concurrency and Distributed Systems: A guide with Kubernetes

Published in

Concurrency and Distributed Systems: A guide with Kubernetes

8 min readJun 1, 2022

Before we get into deploying our service with Kubernetes, let’s ask why are we doing this? To answer that we need to establish a baseline. A baseline for performance and failure recovery.

Baseline

To establish the baseline(or control test), let’s say we only have one machine with 10 CPU cores(that is the machine I am testing with), but we have a similar kind of workload as stated in Episode 1. So we need to support a QPS of 20(or at least close to that). The first thing that comes to mind at this point is multi-threading to take advantage of available cores. We didn’t have that in the bare-bone code we wrote in the previous episode.

Both Java and NodeJS have nice libraries for concurrent programming. Java has Thread Pool Executor and NodeJS has Worker Threads. We are writing this service with Java, therefore we will take advantage of Executors and ExecutorService to create a multi-threaded application. Spring framework has already support for thread pools which makes it very easy to do this.

The general idea is that as requests come in, each will be assigned to a thread to execute. We will have as many threads as we have CPU cores. If we receive more requests than the available CPU capacity, they will be queued until a CPU becomes free.

Doing so is fairly easy with Spring framework. We already marked the return result for our method to be asynchronous. All we have to do is to tell spring framework through a config file that we want to have. The file is application.properties which goes into the resources directory.

spring.task.execution.pool.max-size=10

And we should annotate the service class with @EnableAsync:

@EnableAsync
public class Calculator {
  ...
}

Ton run a single node server simply do:

The shell command to package and run a simple docker container for the service

Scale Testing

To establish the baseline we need to scale test the above setup. For that we have a lot of options. We simply choose artillery for simplicity of use for now.

From the requirement we know that the workload is between 1 and 5,000,000. So we need to generate a fake workload to test the system. To do that we can take advantage of bash’s RANDOM function like so:

for i in `seq 1 10000`; do \
echo "$(( ($RANDOM) * 10 )),$(( ($RANDOM + 10000) * 100 ))"; \
done > ./small-queries.csv

The above command will generate 10000 lines of random values such as:

244280,2836800
42230,1338100
266460,3250500
38370,3423000
127320,2690200
297210,3902100
80470,1089500
296410,3144300
312070,1488000
...

We can then easily feed the above workload into artillery and from there to our service. Our artillery test script is as follows:

The artillery test script

As can be seen in the script, it reads the entries from the small-queries.csv and pass them to the specified API path with the requested JSON format. We can specify our API server endpoint on the command line.

To run the load test, run the following command:

artillery run -t http://localhost:8080 load-test.yaml

And the results are:

http.codes.200: ............................. 1500
http.request_rate: .......................... 15/sec
http.requests: .............................. 1500
http.response_time:
  min: ...................................... 163
  max: ...................................... 2706
  median: ................................... 685.5
  p95: ...................................... 1465.9
  p99: ...................................... 2231
http.responses: ............................. 1500
vusers.completed: ........................... 1500
vusers.created: ............................. 1500
vusers.created_by_name.sumPrime load test: .. 1500
vusers.failed: .............................. 0
vusers.session_length:
  min: ...................................... 166
  max: ...................................... 2711.6
  median: ................................... 685.5
  p95: ...................................... 1465.9
  p99: ...................................... 2231

The above test was run with a QPS of 15(arrivalRate). If we increase the QPS to 20, we will see that the server really struggles to keep up with the load and a lot of queries will be queued.

http.codes.200: ............................. 2000
http.request_rate: .......................... 20/sec
http.requests: .............................. 2000
http.response_time:
  min: ...................................... 330
  max: ...................................... 33730
  median: ................................... 15218.6
  p95: ...................................... 26643.2
  p99: ...................................... 30040.3
http.responses: ............................. 2000
vusers.completed: ........................... 2000
vusers.created: ............................. 2000
vusers.created_by_name.sumPrime load test: .. 2000
vusers.failed: .............................. 0
vusers.session_length:
  min: ...................................... 354.5
  max: ...................................... 33820.3
  median: ................................... 15218.6
  p95: ...................................... 26643.2
  p99: ...................................... 30040.3

The above results are outside the agreed upon SLA and with the current implementation and hardware resources, we definitely cannot achieve 20 QPS and provide an accepted SLA.

But what is the maximum QPS we can provide, with the current system, which could be satisfactory? To answer that we can do a binary search on QPS. We already saw that QPS of 15 was very promising and under SLA, but QPS of 20 was not. So maybe we go with QPS of 17 and see what happens.

Here are the results of the scale testing using arrivalRate of 17:

http.codes.200: ............................. 1700
http.request_rate: .......................... 17/sec
http.requests: .............................. 1700
http.response_time:
  min: ...................................... 247
  max: ...................................... 13127
  median: ................................... 4770.6
  p95: ...................................... 10617.5
  p99: ...................................... 12213.1
http.responses: ............................. 1700
vusers.completed: ........................... 1700
vusers.created: ............................. 1700
vusers.created_by_name.sumPrime load test: .. 1700
vusers.failed: .............................. 0
vusers.session_length:
  min: ...................................... 268.3
  max: ...................................... 13195.3
  median: ................................... 4770.6
  p95: ...................................... 10617.5
  p99: ...................................... 12213.1

The above p50, p95 and p99 times are NOT within the SLA we want. So perhaps QPS of 15 is the best we can do. This will concludes our baseline establishment.

Now, we can move on to the Kubernetes deployment and we should be expecting nearly the same results on scale testing(we need to factor in some overhead due to K8s networking though).

Replication

One of the main pillars of any distributed system is replication. Replication, as the name suggest, is to have multiple copies of the same thing. There are two questions that we want to address here: why we want replication and how we can implement the replication?

To answer why we need replication, let’s take a step back and look at the service we used to run the baseline implementation. This service has only one instance running on a single node. That is a single point of failure. If that node dies for whatever reason(e.g. hardware failure, network or power outage), we cannot serve any requests. That means zero availability for our service. That is a serious problem in the production environment. Let’s say there is a 5% chance that our service fails for any reason at any moment. That means that the probability of our service’s availability is 95%. What if we had the service running on two nodes? What is the probability of our service’s availability?

Probability of both nodes at the failure state = 0.05 * 0.05 = 0.0025 = 0.25%

That means the service’s availability is now at 99.75%. That is actually amazing. Just by adding one node, we could increase the availability by 4.75%!

To further support the idea of replication and why we need it, let’s go back to our baseline scale testing. We saw that we could not support more than 15 queries per second within the accepted SLA. How can we support a QPS of 20 then? We can potentially get a better node with more CPUs and that will probably get us to 20 QPS. However, that design still suffers from the previous problem; the single point of failure. The better thing to do is to get another node and run the same service on two nodes. Now, each node will support 15 QPS and we can support a QPS of 30 in total which is better than what we need for the MVP.

So replication helps our service in two important ways:

Higher availability of the service
Better QPS and SLA(if we add enough hardware)

Kubernetes Deployment

At this point we know why we need replication. K8s will help us with how to do replication. Getting replication right is actually no easy feat. That is why K8s is one of the best platforms out there to help us do it correctly.

I am assuming that you know a little bit about K8s. However, if you don’t know, do not worry. I’ll try to explain the concepts as we need them.

What are some of the challenges when it comes to replication that K8s will handle for us:

Networking and name service: being able to address the service with a single name.
Load balancing requests between back-end nodes.
Reliable updates and restarts: deploying a new version of application or restarting the service in case of failure in a node.

K8s has concept of Deployment. A deployment is a collection of processes which have a single name. Each process is hosted in a Pod. A pod is the smallest deployable unit. It is a logical unit which has some statistically defined containers in it.

A K8s’ deployment will make sure that the state of the pods are maintained according to the spec. For example if you need to have 5 replicas of your service running at all times, K8s will try to maintain that property by monitoring the replicas and making sure to launch new instances in case of failures.

The final concept we need here is the Service. A K8s’ service is a way of exposing a set of pods which are running an application. There are a few different types of service, but here we need a LoadBalancer type. A load-balancer simply balances the load across multiple pods and can be accessed via an external IP address or DNS name.

The following is a YAML file, which declarativly describes how we want to deploy our service in K8s:

2 replicas of a pod with primesum-calculator container in it.
Each pod has 1 GB of RAM and will get 5 CPU cores at the most.
Service, primesum-calculator, is a LoadBalancer hosted on port 80.

The K8s deployment configuration for the prime sum service

I have minikube installed on a Xubuntu Linux machine with 12 CPU cores and 16GB of RAM.

To deploy, first we need to run the minikube cluster:

minikube start
# to enable external IP addresses for load balancers, run:
minikube tunnel

To deploy the service we run:

kubectl apply -f deployment.yaml

After this if we run kubectl get pods,services -n default , we should see something like:

NAME                                       READY   STATUS    RESTARTS   AGE
pod/primesum-calculator-7c54bcbd89-252bw   1/1     Running   0          2m16s
pod/primesum-calculator-7c54bcbd89-fx2r6   1/1     Running   0          2m16sNAME                          TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)        AGE
service/kubernetes            ClusterIP      10.96.0.1       <none>          443/TCP        12d
service/primesum-calculator   LoadBalancer   10.105.200.98   10.105.200.98   80:31275/TCP   2m16s

We can verify the deployment on K8s by running a simple query:

curl -s -X POST -H "Content-Type: application/json" http://10.105.200.98/sumPrime --data '{"a": 1, "b": 10}'

Scale Testing K8s Deployment

Similar to scale testing the baseline deployment, we will use artillery and the very same test workload for scale testing K8s. To do so, we run the following command for QPS of 15:

artillery run -t http://10.105.200.98 load-test.yaml

And the results are:

http.codes.200: ............................. 1500
http.request_rate: .......................... 15/sec
http.requests: .............................. 1500
http.response_time:
  min: ...................................... 151
  max: ...................................... 3385
  median: ................................... 713.5
  p95: ...................................... 1720.2
  p99: ...................................... 2369
http.responses: ............................. 1500
vusers.completed: ........................... 1500
vusers.created: ............................. 1500
vusers.created_by_name.sumPrime load test: .. 1500
vusers.failed: .............................. 0
vusers.session_length:
  min: ...................................... 164.4
  max: ...................................... 3387.7
  median: ................................... 713.5
  p95: ...................................... 1720.2
  p99: ...................................... 2369

These results are very good and similar to the baseline with the added bonus that K8s is a replicated deployment which offers better availability.

Execution time comparison between baseline and K8s deployment (in milliseconds)

Running the load test with QPS of 20 will yield similar results to baseline test as well which suggests that K8s does NOT magically improve your performance:

http.codes.200: ............................. 2000
http.request_rate: .......................... 20/sec
http.requests: .............................. 2000
http.response_time:
  min: ...................................... 236
  max: ...................................... 57771
  median: ................................... 20958.1
  p95: ...................................... 42205.5
  p99: ...................................... 47586.7
http.responses: ............................. 2000
vusers.completed: ........................... 2000
vusers.created: ............................. 2000
vusers.created_by_name.sumPrime load test: .. 2000
vusers.failed: .............................. 0
vusers.session_length:
  min: ...................................... 239.2
  max: ...................................... 57772.2
  median: ................................... 20958.1
  p95: ...................................... 42205.5
  p99: ...................................... 47586.7

In the next episode, we will take a closer look at replicated nodes and failure recovery.

Links

Episode 2: The one with replication.

Concurrency and Distributed Systems: A guide with Kubernetes

Written by Amin Shali