Client-side Observations about Web Service Technologies: Using Apache Bench

Thundering Web Requests: Part 3

Published in The Startup, Dec 2, 2019

This is the third post in a series of posts exploring web services related technologies.

With the web service implementations in place, I evaluated them using an off-the-shelf benchmarking tool. This post documents the client-side observations from this evaluation.

Setup

Node Setup

In the cluster of six Raspberry Pi 3B nodes, one Raspberry Pi was designated as the server node while the remaining five were designated as client nodes. Different web service implementations were executed in isolation on the server. Each web service accepts a positive number n and returns a list of n random integers in the range 0 through 999,999. Apache Bench was executed concurrently on the clients to issue service requests.

Benchmarking Tool Selection

Apache Bench (the Apache HTTP server benchmarking tool) and wrk were the candidate tools. While wrk is more recent and multi-threaded, I chose Apache Bench version 2.3 as it was fast enough for this exercise, has been around for a long time, was available as part of the Ubuntu distribution, and supports concurrent requests.

Number of Concurrent Requests (Connections) Configuration

To figure out how the various service implementations fare with an increasing number of concurrent requests, I decided to subject the implementations to 500, 2500, 5000, 7500, and 12500 concurrent requests in total. Consequently, each of the five clients issued 100, 500, 1000, 1500, and 2500 concurrent requests, respectively.

Network Traffic Configuration

Since the Raspberry Pis were connected via a 100 Mbps Ethernet switch, each node could support up to 94 Mbps (~12 MBps) of network traffic. Since the server was serving five clients, it could uniformly support ~19 Mbps (~2 MBps) of network traffic with each client. So, to evaluate the server at the limit of and below its network capacity, I chose 2 MBps, 6 MBps, and 10 MBps network traffic configurations.
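
The arithmetic behind these figures can be checked with a short sketch. It assumes 8 bits per byte and an even split of the server's link across the five clients; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the network budget described above.
USABLE_MBPS = 94   # usable megabits per second per node (from the text)
CLIENTS = 5        # client nodes sharing the server's link

server_mbytes_per_s = USABLE_MBPS / 8        # megabits -> megabytes
per_client_mbps = USABLE_MBPS / CLIENTS      # even split across clients
per_client_mbytes_per_s = per_client_mbps / 8

print(f"server link: ~{server_mbytes_per_s:.2f} MBps")
print(f"per client:  ~{per_client_mbps:.2f} Mbps (~{per_client_mbytes_per_s:.2f} MBps)")
```

The 10 MBps configuration therefore sits just under the server's usable link capacity, while 2 MBps leaves plenty of headroom.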

Number of Random Integers (Compute Load)

Since the response to each service request is a comma-separated list of 6-digit integers in quoted string format, each integer contributes 9 bytes to the response payload. So, in the worst case, when all 5 clients issue c concurrent requests each, the desired maximum network traffic of p MBps can be achieved by each request asking for p / (5*9*c) random integers (with p expressed in bytes). This resulted in 15 different combinations of concurrent requests and number of integers.

The number of requested integers was controlled to ensure that, in a given network traffic configuration, the compute load associated with generating responses was constant across all concurrent requests sub-configurations.
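
As a rough illustration of the sizing rule above, the sketch below enumerates the 15 configurations. It is not the code used in the experiment: it assumes 1 MB = 1,000,000 bytes and rounds down to a whole number of integers, neither of which is stated in this post, and the function name is made up.

```python
# Sketch of the p / (5 * 9 * c) sizing rule.
# Assumptions (not stated in the post): 1 MB = 1,000,000 bytes, and the
# result is rounded down to a whole number of integers.
BYTES_PER_INTEGER = 9   # bytes contributed by each integer (from the text)
CLIENTS = 5
TRAFFIC_MBYTES_PER_S = (2, 6, 10)
CONCURRENT_PER_CLIENT = (100, 500, 1000, 1500, 2500)


def integers_per_request(traffic_mbytes_per_s: int, concurrent_per_client: int) -> int:
    """Number of random integers each request should ask for."""
    budget_bytes = traffic_mbytes_per_s * 1_000_000
    return budget_bytes // (CLIENTS * BYTES_PER_INTEGER * concurrent_per_client)


configurations = [
    (p, c, integers_per_request(p, c))
    for p in TRAFFIC_MBYTES_PER_S
    for c in CONCURRENT_PER_CLIENT
]

for p, c, n in configurations:
    print(f"{p:>2} MBps, {c:>4} concurrent requests per client: {n} integers per request")
```

Within one traffic configuration, the product of concurrent requests and integers per request stays roughly constant, which is what keeps the compute load comparable across sub-configurations.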

Execution

An Ansible script orchestrated the execution of the service and the clients. For a given service implementation, an execution of the Ansible script starts the service implementation on the server, warms up the service by issuing 200 requests, and then concurrently executes Apache Bench on every client to issue a specified number of concurrent service requests (with a timeout of 540 seconds). For each service implementation, the Ansible script was executed five times.
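
For readers unfamiliar with Apache Bench, here is a hypothetical sketch of how one client's invocation could be assembled. The -n, -c, and -s options are real ab options, but the URL, the helper name, and the mapping of the 540-second timeout to -s are assumptions; the actual orchestration was done by the Ansible script, not by this code.

```python
# Hypothetical sketch: builds (but does not run) an Apache Bench command line.
from typing import List


def ab_command(url: str, concurrent_requests: int, timeout_s: int = 540) -> List[str]:
    """Return an ab invocation that issues `concurrent_requests` requests at once."""
    return [
        "ab",
        "-n", str(concurrent_requests),  # total number of requests to issue
        "-c", str(concurrent_requests),  # number of requests in flight at a time
        "-s", str(timeout_s),            # socket timeout in seconds (assumed mapping)
        url,
    ]


# The host, port, and query parameter below are made up for illustration.
cmd = ab_command("http://server.local:1234/random?num=444", 100)
print(" ".join(cmd))
```

Setting -n equal to -c reflects the assumption that each client issues every one of its requests in a single concurrent burst.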

Keeping the Raspberry Pis Cool

Since the Raspberry Pis are not equipped with heat sinks or fans, they get pretty hot under constant compute load. To avoid overheating, the orchestration ensured every Raspberry Pi cooled down to below 60 degrees Celsius between every two consecutive executions. This was accomplished by repeatedly polling the content of /sys/class/thermal/thermal_zone0/temp, which reports the temperature in millidegrees Celsius, and checking that it was less than 60000. A timeout of 500 seconds with 5-second polling was used for this step.
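
The cool-down check can be sketched in a few lines of Python. This is an illustrative stand-in for the Ansible step, not the original orchestration code; the function name and its injectable sleep and clock parameters are assumptions added to keep the sketch testable.

```python
import time
from pathlib import Path

THERMAL_FILE = Path("/sys/class/thermal/thermal_zone0/temp")


def wait_until_cool(path=THERMAL_FILE, threshold=60000, timeout_s=500.0,
                    poll_s=5.0, sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll the temperature (in millidegrees Celsius) until it is below threshold.

    Returns True once the node is cool enough, or False if the timeout
    expires first.
    """
    deadline = clock() + timeout_s
    while True:
        if int(Path(path).read_text().strip()) < threshold:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

On a Raspberry Pi, calling wait_until_cool() with the defaults blocks for at most roughly 500 seconds; the caller decides what to do when it returns False.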

Observations about Performance

For client-side observations, I considered only the data from the Ansible script execution (out of the five executions) with the largest number of successfully completed executions of Apache Bench across the five client nodes. Ties were broken, in order, in favor of executions with the most completed requests, the most successful requests (a subset of completed requests), and the highest median (across the five clients) requests per second.
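
One way to express this selection rule is as a tuple-valued sort key, where earlier fields dominate and later fields break ties. The sketch below is an illustration under assumed names (ClientRun, ranking_key, pick_execution); the actual analysis code lives in the GitHub repository mentioned at the end of this post and may differ.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Tuple


@dataclass
class ClientRun:
    finished: bool               # Apache Bench ran to completion on this client
    completed: int               # number of completed requests
    failed: int                  # completed requests that were deemed failed
    requests_per_second: float   # 0.0 if Apache Bench did not report it


def ranking_key(execution: List[ClientRun]) -> Tuple[int, int, int, float]:
    """Larger tuples are preferred; later fields only break ties."""
    return (
        sum(run.finished for run in execution),
        sum(run.completed for run in execution),
        sum(run.completed - run.failed for run in execution),
        median(run.requests_per_second for run in execution),
    )


def pick_execution(executions: List[List[ClientRun]]) -> List[ClientRun]:
    """Choose the execution (one ClientRun per client) used for reporting."""
    return max(executions, key=ranking_key)
```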

At 2 MBps

The graph below suggests:

None of the implementations could service 10K requests in a second on a Raspberry Pi 3B.
Actix-Rust and Go implementations consistently performed ~2x better than all other (slower) implementations with Actix-Rust implementation performing better than Go implementation.
Of the slower implementations, in all but one configuration, the Trot-Elixir implementation consistently performed best. It was followed by the Cowboy-Erlang, Flask+uWSGI, Phoenix-Elixir, and Vertx-Kotlin implementations in different concurrent requests configurations.

At 6 MBps

The graph below suggests all of the observations from the 2 MBps configuration also held in the 6 MBps configuration, with minor changes.

The Go implementation performed better than the Actix-Rust implementation in a few concurrent requests configurations.
NodeJS-JavaScript replaced Phoenix-Elixir in the set of implementations that immediately followed Trot-Elixir in different concurrent request configurations.

Performance at 6 MBps

At 10 MBps

The graph below suggests the observations from the 2 MBps configuration also hold in the 10 MBps configuration, with minor changes to the leaders among the slower implementations.

Why does the number of requests per second increase with number of concurrent requests?

When the network traffic does not change, as the number of concurrent requests increases, the number of requested random integers decreases. Consequently, the processing associated with each request decreases. Hence, the number of requests that can be served per second will increase if there is no additional processing; however, this is not the case.

As the number of concurrent requests increases, the processing associated with handling requests (connections) increases. However, this additional processing does not seem to affect the observed performance.

This behaviour can be explained as follows: the number of concurrent requests n received by a service implementation is below the maximum number of concurrent requests m that can be handled by the implementation, i.e., n≤m. Hence, the performance improves as n increases towards m.

The above explanation is plausible in configurations in which n is likely to be less than m; say, n≤1000. For it to also hold in configurations in which n is likely to be close to or larger than m (say, n>1000), m would have to be at least 1500 or 2500. If so, then there shouldn’t be any failures in such configurations.

Observations about Failures/Reliability

Since Apache Bench was executed without the -r switch, an execution of Apache Bench could exit prematurely upon socket errors. Also, while all requests may complete without socket errors, a fraction of them could be deemed “failed” for reasons such as data corruption or an invalid response. So, it is possible that, during an Ansible script execution, failures on some clients reduced the pressure on the service, thus improving the performance of the service in serving the remaining clients.

Based on Failing Clients

To verify the above possibility, I considered the least and the most number of failing clients in various configurations as given in the below table. All instances where at least one client failed in all five executions of the Ansible script are highlighted in red. Instances with no failed clients are not shown, i.e., empty cells and non-existent columns.

Least (Best-Case) / Most (Worst-Case) number of failed clients in an Ansible script execution in various configurations.

As seen in the table, in the worst case, there were numerous instances where clients failed when the number of concurrent requests was above 500. One likely reason for these failures is that the service implementations were pushed beyond their capacity. Hence, the earlier explanation (that all configurations were within the capacity of a web service implementation) does not suffice to explain the performance improvement at higher concurrent requests configurations. So, more exploration is needed.

Before digging deeper, here are a few reliability-related observations from the above table.

Across the board, the Ktor-Kotlin, Vertx-Kotlin, and Flask+uWSGI-Python3 implementations were the most reliable, with zero failed clients in every execution.
The Actix-Rust, Go-Server, and Trot-Elixir implementations were the next most reliable, with no more than two failed clients in at most two configurations when considering worst-case executions.
In many configurations, NodeJS-Express-JavaScript, NodeJS-JavaScript, Micronaut-Kotlin, Ratpack-Kotlin, Phoenix-Elixir, Tornado-Python3, and Cowboy-Erlang implementations involved failed clients when considering worst-case executions.
In some configurations, every execution of Micronaut-Kotlin, Tornado-Python3, and Cowboy-Erlang implementations involved at least one failed client. [The cells with red text.]

While the above reliability observations are interesting, they may be incomplete. Specifically, implementation X can involve more failed clients than implementation Y but still serve more requests than implementation Y. For example, in the 1000 concurrent requests configuration, 3 and 1 clients could have failed for implementations X and Y, respectively, in an execution where X served 4997 = 2*1000 + 3*999 requests and Y served 4300 = 4*1000 + 1*300 requests. Hence, the above observations need to be further examined.

There are two ways to examine their validity: 1) examine client-side request-level failures or 2) examine server-side data. In the remainder of this post, I’ll examine request-level failures (and explore server-side data in a later post).

Based on Completed Requests

When Apache Bench is executed without the -r switch and completes successfully, it outputs the total number of completed requests and the number of failed requests (e.g., insufficient response content). If Apache Bench crashes (e.g., due to a socket error), it only outputs the total number of completed requests. These two bits of information can be used to refine/validate the above reliability observations and the observed performance improvements in higher concurrent requests configurations.
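
Extracting these two numbers from Apache Bench output could look like the sketch below. The "Complete requests:", "Failed requests:", and "Total of N requests completed" line formats are based on my recollection of ab 2.3 output and should be treated as assumptions to verify against real output; the function name is made up.

```python
import re
from typing import Optional, Tuple

# Line formats assumed from ab 2.3 output; verify against real runs.
COMPLETE = re.compile(r"^Complete requests:\s+(\d+)", re.MULTILINE)
FAILED = re.compile(r"^Failed requests:\s+(\d+)", re.MULTILINE)
CRASHED = re.compile(r"^Total of (\d+) requests completed", re.MULTILINE)


def parse_ab_output(text: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (completed, failed); failed is None when ab exited early."""
    complete = COMPLETE.search(text)
    if complete:
        failed = FAILED.search(text)
        return int(complete.group(1)), (int(failed.group(1)) if failed else None)
    crashed = CRASHED.search(text)
    if crashed:
        return int(crashed.group(1)), None
    return None, None  # no data recorded for this client
```

A (None, None) result corresponds to the case where no data was recorded for a client, which matters when interpreting entries such as 0/0 in the tables.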

For this analysis, I considered the Ansible script executions that yielded the largest number of completed requests (best-case, similar to the least number of failed nodes) and the smallest number of completed requests (worst-case, similar to the most number of failed nodes).

In the tables below, for each web service implementation and concurrent requests configuration, the number of successfully completed requests (X) and the number of completed requests (Y) are reported (in X/Y format) when X is less than the maximum number of requests. Since five clients use Apache Bench to each issue n concurrent requests, the maximum number of requests in a concurrent requests configuration is 5 * n (given in the last row). Instances where the number of successfully completed requests is less than or equal to 95% of the maximum number of requests are highlighted in red.

The first table reports numbers from the Ansible script execution with the largest number of completed requests (best-case) while the second table reports numbers from the execution with the smallest number of completed requests (worst-case). Instances where the maximum number of requests were successfully completed are not shown, i.e., empty cells and non-existent columns.
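
The rule for what appears in a table cell, and when it is highlighted, can be summarized in a small sketch. The function name and return shape are illustrative assumptions, not taken from the analysis code.

```python
from typing import Optional, Tuple

CLIENTS = 5


def table_cell(successful: int, completed: int,
               concurrent_per_client: int) -> Optional[Tuple[str, bool]]:
    """Return ("X/Y", highlight), or None when every request succeeded."""
    maximum = CLIENTS * concurrent_per_client
    if successful >= maximum:
        return None  # the cell is left empty
    # Highlight when X is at most 95% of the maximum; integer arithmetic
    # avoids floating-point surprises at the boundary.
    highlight = successful * 100 <= 95 * maximum
    return f"{successful}/{completed}", highlight
```

For instance, with 1000 concurrent requests per client the maximum is 5000, so a cell with 4750 successful requests is highlighted while one with 4997 is not.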

The Ansible script execution considered in these tables may be different from the ones considered in the table reporting about failed clients. Such differing instances are marked by * in the tables.

Number of successfully completed requests (X) and number of completed requests (Y) (in X/Y format) in various configurations when considering Ansible script executions with most number of completed requests (Best-Case).

In the above best-case table,

While Flask+uWSGI-Python3 implementation did not involve failed clients, it involved failed requests in higher concurrent requests configurations.
While the Phoenix-Elixir, Trot-Elixir, and Cowboy-Erlang implementations fared well in terms of the number of failed clients, they did not fare well in terms of the number of successfully completed requests.

In the above worst-case table,

Even the worst-case behaviour of Ktor-Kotlin and Vertx-Kotlin implementations did not involve failed requests!!
If both performance and reliability were considered, then Vertx-Kotlin would be preferred over Ktor-Kotlin.
Actix-Rust and Go implementations exhibited the next best worst-case behaviour (in order) with at most two executions involving failed requests.
The implementations in the bottom-half involved failed requests in 1000, 1500, and 2500 concurrent requests configurations.
Deviating from the best-case executions, NodeJS-*, Micronaut-Kotlin, Ratpack-Kotlin, and the implementations in the bottom half of the table exhibited failed requests in pretty much all configurations involving 1000 or more concurrent requests in all network traffic configurations.
Ratpack-Kotlin, *-Elixir, Flask+uWSGI-Python3, and Cowboy-Erlang implementations involved failed requests in 500 concurrent requests configurations as well.
More than 5% of the issued requests failed in almost all instances/configurations involving failed requests. [Highlighted in red]
The data in the table suggests the Cowboy-Erlang implementation failed at the 100 concurrent requests and 6 MBps configuration. Upon closer examination of the raw data, no data was recorded for the executions of Apache Bench on three client nodes and the other two executions crashed prematurely. Further, the remaining four Ansible script executions completed successfully without failures. So, this 0/0 entry should be ignored.

Summary

Based on the client-side data,

None of the web service technologies could serve 10K requests per second on a Raspberry Pi 3B.
Actix-Rust and Go implementations were most performant.
Ktor-Kotlin and Vertx-Kotlin were the most reliable implementations followed by Actix-Rust and Go implementations.
If both performance and reliability were considered, then Vertx-Kotlin should be preferred over Ktor-Kotlin.

A Thought about Choosing Technologies

Every considered web service technology performed well in terms of speed and reliability as long as the corresponding web service implementation was executed in an environment that was not resource constrained, e.g., fewer concurrent requests and the 2 MBps configuration. So, given the current trend of scaling out, using ease of programming (language and libraries) as the primary criterion to choose web service technologies makes sense unless/until resource costs and reliability are/become concerns.

Source Code

The code used for data analysis and graphics is available on GitHub.

Next Up

In my next post, I will examine the server-side data from this experiment.