Server-side Observations about Web Service Technologies: Using Apache Bench
This is the fourth post in a series of posts exploring web services related technologies.
Having run the experiment using Apache Bench and made client-side observations, I will examine the server-side data from the same experiment in this blog post. To get the most out of it, please read the previous post first.
Server-side Data Collection
Captured Time Per Request
On the server-side, each web service implementation logged the time taken to process a request by a single thread of execution. Specifically, this is only the time taken by the handler code to interpret the web request parameters, generate a list of random integers, and serialize the list as the JSON payload of the response. It does not include the time taken by the underlying web service technology to accept HTTP requests, parse the raw bytes of a request, route the request, and ship the response payload to the client. Hence, this time serves as a good lower limit on the true time per request.
Derived Requests Per Second
The inverse of the logged time was used as the derived requests per second. Given the nature of the logged time, this derived value serves as a good upper limit on the true requests per second.
Serviced Requests, Expected Requests, and Unserviced Requests
For this exercise, each logged time entry was treated as representing a serviced request. Since the implementations did not log information about partially serviced requests, the difference between the number of expected requests (e.g., 2500 requests in the 100 concurrent requests configuration) and the number of serviced requests was treated as the number of unserviced requests.
As with the client-side observations, the data from the Ansible script execution with the highest number of serviced requests was used to make observations.
Observations about Performance
To make observations about performance, the median requests per second is plotted against the varying number of concurrent requests.
At 2 MBps
As in the client-side observations, the Actix-Rust and Go implementations were ~2x faster than the slower implementations.
If the logged time were the only time needed to process a request, then it would seem both the Actix-Rust and Go implementations can serve 10K requests per second. However, this is highly unlikely, as the time to handle connections was not captured and the performance observed on the client-side suggests otherwise.
Of the slower implementations, the NodeJS-* implementations performed better than the rest.
At 6 MBps
While the Actix-Rust implementation was still the best performer, its performance dropped below 10K requests per second. More strikingly, at 100 concurrent requests, the performance of the Go implementation dropped sharply from 400 requests per second at 2 MBps to ~30 requests per second at 6 MBps.
Of the slower implementations, both NodeJS implementations still performed better than the remaining implementations.
At 10 MBps
All of the observations from 6 MBps held at 10 MBps. Further, the drop in performance of the Go implementation extended to the 500 concurrent requests configuration, from ~1000 requests per second to ~300 requests per second.
What's with the change in performance of the Go implementation?
Of the 15 concurrent requests and network traffic configurations, only six request more than 200 integers. Most of these configurations are associated with the 100 and 500 concurrent requests configurations.
The performance of similar service code logic in Go (i.e., generate and serialize random integers) at varying levels of concurrency (i.e., varying numbers of goroutines) starts to differ significantly when the number of generated and serialized integers rises above 200; observe the region between 0–500 on the x-axis in the below graph.
This explains the observed change in performance of the Go implementation. In other words, had a higher number of integers been requested in the higher concurrent requests configurations, the change in performance of the Go implementation would have seemed less pronounced (as the performance would have been lower across all configurations).
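A micro-benchmark of this kind can be sketched as below. This is not the experiment's benchmark code; the function names, iteration counts, and payload sizes are illustrative choices meant to probe the region around 200 integers at different goroutine counts:

```go
package main

import (
	"encoding/json"
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// work mirrors the service handler logic: generate n random
// integers and serialize them to JSON.
func work(n int) {
	nums := make([]int, n)
	for i := range nums {
		nums[i] = rand.Intn(1000)
	}
	json.Marshal(nums)
}

// timeAtConcurrency runs `workers` goroutines, each performing the
// handler logic `iters` times, and returns the mean time per call.
func timeAtConcurrency(workers, iters, n int) time.Duration {
	var wg sync.WaitGroup
	start := time.Now()
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < iters; i++ {
				work(n)
			}
		}()
	}
	wg.Wait()
	return time.Since(start) / time.Duration(workers*iters)
}

func main() {
	// Payload sizes spanning the region where the divergence was
	// observed (above ~200 integers), at a few goroutine counts.
	for _, n := range []int{100, 200, 500, 1000} {
		for _, workers := range []int{1, 4, 16} {
			fmt.Printf("n=%4d workers=%2d mean=%v\n", n, workers,
				timeAtConcurrency(workers, 50, n))
		}
	}
}
```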
Why does the number of requests per second increase with the number of concurrent requests?
At a given network traffic level, as the number of concurrent requests increases (e.g., from 100 to 500), the number of requested random integers decreases. Consequently, the processing associated with serving each request decreases. Hence, the number of requests served per second (as determined by the execution time of the request handler) increases.
However, this does not imply the number of requests being served per second as observed by the client will increase because the cost of handling requests/connections by the web service technology and the underlying system could play a non-trivial factor in the total time required to serve a request.
Why do some implementations exhibit fewer than one request per second?
Many of the implementations in the experiment were both concurrent and parallel. So, an implementation could have been handling multiple requests in different threads of execution at the same time. Hence, the derived requests per second should be scaled by the number of threads of execution serving requests in parallel to arrive at the true number of requests per second.
The number of threads of execution that were active at all times during the lifetime of a web service is unknown in this experiment. However, we know Raspberry Pi 3Bs have four CPU cores. So, assuming maximum parallelism at all times, the reported requests per second can be scaled by four to arrive at the upper limit on the actual number of requests per second. With such scaling and considering measurement errors, most of the requests per second measurements would be close to or more than one.
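The scaling argument amounts to a small calculation; the numbers below are illustrative, not measurements from the experiment:

```go
package main

import "fmt"

// derivedRPS converts a logged per-request handler time (in seconds)
// into derived requests per second, scaled by the number of CPU cores
// assumed to be serving requests in parallel.
func derivedRPS(handlerSeconds float64, cores int) float64 {
	return float64(cores) / handlerSeconds
}

func main() {
	// Illustrative: a logged handler time of 2.5 s per request yields
	// a derived 0.4 requests/second on a single thread of execution.
	fmt.Printf("single-threaded: %.2f req/s\n", derivedRPS(2.5, 1))
	// Assuming maximum parallelism on the Pi 3B's four cores, the
	// upper limit becomes 1.6 req/s, i.e., close to or above one.
	fmt.Printf("scaled by 4 cores: %.2f req/s\n", derivedRPS(2.5, 4))
}
```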
In most configurations, Actix-Rust and Go implementations were ~2x faster than the other implementations.
Observations about Failures/Reliability
As with the client-side observations, I looked for failures on the server-side. Specifically, did the implementations receive and handle the expected number of requests? Also, how did they fare in the best-case (highest number of serviced requests) and worst-case (lowest number of serviced requests) scenarios?
The related data is given in the below tables. In the tables, the configurations in which the number of serviced requests matched the expected number of requests are not shown, i.e., they appear as empty cells. Also, the configurations where less than 95% of the expected requests were serviced are highlighted in red.
In the above best-case table,
- The implementations in the top-half of the table serviced all of the expected number of requests!!
- All of the implementations in the bottom-half of the table led to unserviced requests in the 2500 concurrent requests configuration.
- Micronaut-Kotlin, Cowboy-Erlang, and both Elixir implementations led to unserviced requests in the 1000, 1500, and 2500 concurrent requests configurations.
- Almost all entries match the entries in the best case table in the client-side observations post.
In all configurations with unserviced requests, the number of requests serviced as observed on the server-side is greater than the number of successfully completed requests as observed on the client-side. This implies service implementations can function well when observed on the server-side and still appear to fail on the client-side due to errors 1) on the server after the service handler finishes executing, 2) during network transmission, or 3) on the client before the requestor receives the response.
Reliability/Failures as observed on the server-side need not be the same as the reliability/failures as observed on the client-side.
In the above worst-case table,
- Of the implementations in the top-half of the table, only Ktor-Kotlin and Vertx-Kotlin implementations serviced all of the expected number of requests!!
- Actix-Rust and Go implementations exhibited the next best worst-case behaviour, with only a handful of failed requests in the 2500 concurrent requests configuration.
- NodeJS-*, Micronaut-Kotlin, Ratpack-Kotlin, and many implementations in the bottom-half of the table led to unserviced requests in almost all concurrent requests configurations at all network traffic configurations. This wasn't the case in the best-case executions.
- Compared to the client-side worst-case table, the Trot-Elixir (at 1000 concurrent requests and 2 MBps) and Cowboy-Erlang (at 100 concurrent requests and 10 MBps) implementations serviced all of the expected number of requests while a few of these requests failed on the client-side.
Even the worst-case behaviour of Ktor-Kotlin and Vertx-Kotlin implementations did not involve any failed requests!!
The server-side data supports almost all of the client-side observations. Specifically,
- Based purely on request serving cost and not request/connection handling cost, Actix-Rust and Go implementations seem to be capable of serving 10K requests per second on a Raspberry Pi 3B. However, this is highly unlikely.
- Even so, Actix-Rust and Go implementations were most performant.
- Again, Ktor-Kotlin and Vertx-Kotlin were the most reliable implementations followed by Actix-Rust and Go implementations.
- If both performance and reliability were considered, then Vertx-Kotlin would be preferred over Ktor-Kotlin.
A Thought about Choosing Technologies
In terms of reliability, not all technologies are alike. Some technologies exhibit better reliability across a broader range of configurations than other technologies; supporting observations were also made on the client-side. Also, locally (server-side) observed reliability does not always translate into remotely (client-side) observed reliability. The reason for such change in reliability could be on the server end or on the client end. Further, the granularity of the data (failed clients vs failed requests) considered to make observations has a non-trivial influence on the observations. So, while choosing technologies, reliability of technologies should be thoroughly evaluated; specifically, as observed locally and remotely, under different loads/configurations, and at a level of granularity that matters to clients.
The code used for data analysis and graphics is available on GitHub.
In my next post, I will examine the data from repeating the experiment using custom web clients.