Illustration: Taylor Vick

Solving the C10k challenge, in easy mode with ELK, on-prem.

Yoni W
6 min read · Aug 27, 2019

--

There are not many challenges left in the tech world that can be explained in a single sentence, especially once you take out all the junior interview questions.

One of the few survivors is the C10k challenge. Simply put: build a back-end architecture that can handle 10,000 concurrent users. (This is a software-engineering spin on the original.)

Well, this challenge had some bite to it 10 years ago, when we were vertically scaling servers. Today, any cloud engineer will simply reply “make it Lambda”. Not that challenging. But for me, who needs to deploy an on-premises (🤢) solution, the challenge is back on the table.

According to our client in the financial sector, peak concurrent usage is around 10,000 users.

The problem: Nothing online

Things get real when there is nothing online that solves your problem. Sure, everywhere you look, a 3-node Elasticsearch setup is recommended behind nginx as a load balancer.

But before you can find any benchmarks, you are thrown into the abyss called “config file optimization”. Do I even need it? What kind of load are you showing me? Is it per second?

Also, every additional server I declare necessary is a cost to my client. It will also take time to provision from their on-premises server farm (likely running ESX). Therefore, I will always prefer the most minimalist setup I can get.

To summarize: all the existing articles show a cloud architecture, and most don't even give a straightforward benchmark to base decisions on.

Fun fact: According to Wikipedia, nginx was created to solve the original C10k problem, posed in 1999 (the OS-socket version) — 🎈Happy 20th Birthday 🎉

The Elasticsearch atom setup

3 Dockers inside an AWS EC2 VM

To approach the problem, I decided to start with the minimal setup. All I did was create a server on AWS with the following:

  • nginx (port forwarding) docker
  • logstash (to handle JSON) docker
  • elasticsearch (heap raised to 8 GB) docker
  • 4 vCPU
  • 16 GB RAM
  • Ubuntu 19
  • The biggest JSON that logstash is expected to digest, used as the designated POST payload
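For reproducibility, here is roughly how such a setup can be brought up with Docker. This is a sketch, not my exact commands: the image versions, the mounted config paths, and the network layout are all assumptions you would adjust to your environment.

```shell
# Assumed image versions, ports, and config paths -- adjust as needed.
docker network create elk

# elasticsearch, single node, JVM heap raised to 8 GB
docker run -d --name elasticsearch --network elk \
  -e "discovery.type=single-node" \
  -e "ES_JAVA_OPTS=-Xms8g -Xmx8g" \
  docker.elastic.co/elasticsearch/elasticsearch:7.3.0

# logstash, with a pipeline that accepts JSON and writes to elasticsearch
docker run -d --name logstash --network elk \
  -v "$PWD/pipeline:/usr/share/logstash/pipeline" \
  docker.elastic.co/logstash/logstash:7.3.0

# nginx in front, forwarding incoming POSTs to logstash
docker run -d --name nginx --network elk \
  -v "$PWD/nginx.conf:/etc/nginx/nginx.conf:ro" \
  -p 80:80 \
  nginx:1.17
```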

I used Redline13 to do the load testing. Redline13 is a great tool that spawns VMs using your AWS account and orchestrates all the tests from a simple dashboard. So if your start-up received some AWS credits, you are good to go. And even if not, none of the tests shown here cost more than $3 each.

The mission

Characterize, as precisely and as thoroughly as you can, the maximum performance one can get from this 1-node setup. With some luck, I will then multiply it as needed to support 10K concurrent users. (Spoiler alert: happy ending.)

C10k Hard Mode

I, of course, started out not knowing much and ran the following test: 2,500 users making 7 requests each, once per second, with a ramp time of 10s. I figured that if no request failed, I would conclude that my single server supports 2.5K concurrent users, and that I would only need 4 servers to support 10K.

The test did not complete successfully: 21% of the requests failed. Inspecting the errors showed what I feared most: “Connection timed out after 30000 milliseconds”.

I was quite devastated. “My server can't even handle 2.5K!”, I told myself. So, like any engineer trying to understand, I kept playing with the variables. I ran the next test with more delay between iterations, to spread the load over time. It worked! Now only 12% failed. That led me to my first realization: “2,500 users” is just a headline; you need to look at other measurements to really understand what's going on.

The Digestive system

Comparing the last 2 tests, I investigated when users started getting errors.

Worse test (21% of requests failed):

In the worse test, users failed since 1:34:54
Worse test, requests over time, 1:34:54 is marked in circle

Better test (12% of requests failed):

In the better test, users failed since 1:56:59
Better test, requests over time, 1:56:59 is marked in circle

I noticed only 1 difference between them: how long the system was under load. In the worse test, the server received 400 req/s for ~35s; in the better test, the server still received 400 req/s, but only for 30s.

I realized that an important property of your complex system is its digestive rate. There is no point receiving 1,000 requests per second if your software can only handle 300 of them per second. The queue will fill up (however big it is) and clients will time out. We can conclude from the graphs: my system's digestive rate for my biggest JSON is under 400 requests per second.
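To make that example concrete (the 1,000 and 300 req/s figures are the hypothetical ones from the paragraph above, not measurements): the backlog grows at the difference of the two rates, no matter how large the queue is.

```shell
# Hypothetical rates: 1,000 req/s arriving, 300 req/s digested.
arrival=1000   # req/s hitting the front end
digest=300     # req/s the stack can actually process

# Backlog grows linearly at (arrival - digest) req/s of sustained load.
for t in 1 5 10; do
  backlog=$(( (arrival - digest) * t ))
  echo "after ${t}s of sustained load: ${backlog} requests queued"
done
```

Once the queue ahead of a request takes longer to drain than the client timeout, that request fails, which is exactly the “Connection timed out” behavior seen above.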

Of course, nginx can handle more than 400 concurrent sockets; we are far below the rates at which OS-level socket problems start to occur.

To verify this, I ran another test with 2,500 users making 1 request each within a span of 1 second. With a 0% failure rate, you can now read your estimated digestive rate off the user-completion-over-time graph:

Every second, 380 users are handled, and since each makes 1 request, we handle 380 requests per second.

I can handle 2.5K concurrent users! (given 2,500/380 ≈ 6.6 seconds to handle all of their requests)

C10k Easy mode ✌

With the understanding that my server can digest 380 req/s, I moved forward. By simple math, given 10,000 users and a digestive rate of 380 req/s, you can process one request from each of them in under 27 seconds.
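The arithmetic, as a quick shell check (380 req/s being the rate measured above):

```shell
users=10000
rate=380   # digestive rate measured above, req/s

# ceiling division: seconds needed to drain one request from each user
secs=$(( (users + rate - 1) / rate ))
echo "draining ${users} requests at ${rate} req/s takes ${secs}s"
```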

Here is a test where I ran 10K users creating 3 events each. I split them into 3 time windows (each window 40 seconds long, to avoid exceeding 380 requests per second due to randomness).
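A quick check of why 40-second windows keep us on the safe side, using the numbers from this test:

```shell
users=10000
events=3    # events per user, one per window
window=40   # seconds per window

total=$(( users * events ))
duration=$(( window * events ))
avg_rate=$(( users / window ))   # each window carries one event per user
echo "${total} requests over ${duration}s, ~${avg_rate} req/s per window"
```

That average of 250 req/s sits comfortably below the measured 380 req/s digestive rate, leaving headroom for uneven arrival.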

For exactly 40*3 seconds we handle 3 * 10,000 requests from 10K users
Because we took a long time window to be on the safe side, the server wasn't challenged with more than 250 req/s
The view from Kibana

The result: 0% failure! All events were handled. I solved the C10k challenge on-premises with only a 1-node setup! 🎉🎉🎉

But you may ask: how does that apply to the original C10k challenge? Well, I told you before that I need to send data from a user device to the back-end system. What I didn't tell you is that, in my particular case, I can store the data on the device and send it over time; the sending itself doesn't have to be real-time. All I have to do is make sure each user only sends once every ~40 seconds.

To handle a number of requests per second higher than your digestive rate, spawn more servers behind a load balancer to match your needs. At that point you can use the recommended 3-node setup found everywhere.
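As a rough sizing rule (380 req/s being the single-node digestive rate measured earlier; the target rate here is a hypothetical example):

```shell
target=1500    # hypothetical required req/s
per_node=380   # measured single-node digestive rate, req/s

nodes=$(( (target + per_node - 1) / per_node ))   # ceiling division
echo "need ${nodes} nodes behind the load balancer"
```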

I still think that the digestive property I showed will play a key role even in a 3 or more node setup. Good luck!

To recap:

Steps to measure your ELK server performance

Given your unique architecture, and the maximum payload:

  1. Build the minimal setup
  2. Using 1 request per user (1 iteration), with different time-spans, find your system's digestive rate.
  3. Is it enough? (Can you spread the requests over time?)
  4. If not, multiply as needed and repeat those steps until satisfied.

Note: Before hitting me over the head about creating a single point of failure, rest assured we will have more than 1 server at the client. But it is very reassuring to know that any one of them can take the entire load by itself. ⬛
