How to benchmark a Kafka consumer that makes async HTTP requests?

Ramya Venkatesh
helpshift-engineering
8 min read · Oct 16, 2019

In this article, I want to share what we learned from a recent task at Helpshift: benchmarking a system that consumes events from Kafka and makes async HTTP requests based on information in the payloads.

Before we begin, just a quick note on the two types of HTTP requests. Async HTTP requests differ from sync HTTP requests in that async requests do not block the caller; the responses are handled via callbacks. Sync requests, on the other hand, are blocking. So it's simpler to calculate the latency of systems that make sync requests: just check how long the request took. For systems that make async requests, we need to keep track of both when the request was triggered and when the response was received. The trick we used was to send a timestamp with the request and use it in the response callback to calculate how long the request took.
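To make that concrete, here is a minimal sketch using clj-http's async API; report-latency! is a hypothetical stand-in for whatever records the measurement (a metrics call, a log line, etc.):

    (require '[clj-http.client :as http])

    ;; Capture a timestamp when the request is triggered, then use it in the
    ;; response callback to compute how long the round trip took.
    (defn timed-async-get
      [url report-latency!]
      (let [started-at (System/currentTimeMillis)]
        (http/get url
                  {:async? true :throw-exceptions false}
                  (fn [response]                         ;; success callback
                    (report-latency! (- (System/currentTimeMillis) started-at)))
                  (fn [exception]                        ;; failure callback
                    (report-latency! (- (System/currentTimeMillis) started-at))))))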

With that context set, I’ll be covering these things:

  1. The system
  2. Goals
  3. Setup
  4. The benchmark process
  5. Conclusions

The system

In a nutshell, the consumer, upon receiving events, extracts the information required to make the HTTP request and handles the response via callbacks. We used Lee Hinman's clj-http client library.
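A rough sketch of that flow with clj-http's async mode; extract-request, handle-response and handle-error below are hypothetical stand-ins for the real extraction and callback logic:

    (require '[clj-http.client :as http])

    ;; Hypothetical stand-ins for the real extraction and response-handling logic.
    (defn extract-request [event]
      {:url (:url event) :body (:payload event)})

    (defn handle-response [event response]
      (println "status" (:status response) "for event" (:id event)))

    (defn handle-error [event exception]
      (println "failed event" (:id event) ":" (.getMessage exception)))

    ;; Pull the HTTP call out of the Kafka event payload and fire it asynchronously.
    (defn process-event [event]
      (let [{:keys [url body]} (extract-request event)]
        (http/post url
                   {:async? true :body body :content-type :json}
                   (fn [response] (handle-response event response))
                   (fn [exception] (handle-error event exception)))))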

Goals

We had 2 basic goals:

  1. Find the acceptable latency of the consumer at the expected throughput
  2. Identify the optimal HTTP client config

Simply put, we wanted to utilize the resources on the consumer node as well as possible, and needed to understand the number of events we could process per second on the planned hardware configuration.

Setup and terminology

The setup required 3 nodes:

  1. Producer node: This had a traditional datastore component and a Kafka producer component. The datastore was needed to hold some sample data that the consumer would fetch and use to make async HTTP requests.
    Note: In the real world, the datastore is independent of the producer (not on the same node). Since the intent of the benchmark was to test against a controlled flow of events, it didn't matter that the producer and datastore shared a node.
  2. HTTP endpoint provider (mocked): A node running a simple HTTP server with some sample routes. The endpoint provider performs operations on data in an in-memory database, with randomised response times based on observations in production (a sketch of such a route follows this list).
  3. Consumer node: The node running the system under test.
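For the randomised response times, a mocked route can simply sleep for a random interval before responding. A minimal Ring-style sketch, where the 50-500 ms range is illustrative:

    ;; Hypothetical mocked endpoint: sleep for a randomised interval before
    ;; responding, to mimic response times observed in production.
    (defn mock-endpoint [request]
      (Thread/sleep (+ 50 (rand-int 450)))   ;; 50-500 ms, illustrative range
      {:status 200
       :headers {"Content-Type" "application/json"}
       :body "{\"ok\": true}"})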

Apart from this, we also have Graphite as our primary monitoring platform; we used it to plot consumer metrics as well as JVM metrics.

The benchmark process

Once we had all the nodes ready, the next step was to run the benchmarking code with varying RPS (rate of events per second) and HTTP client configs to find the most efficient numbers. We started with 500 RPS, throttling the events being produced using Bruno Vecchi's throttler.
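A minimal sketch of that throttling, using throttler's throttle-fn; produce-event! and the 500-per-second rate below are illustrative:

    (require '[throttler.core :refer [throttle-fn]])

    ;; Hypothetical producer fn; in the real benchmark this pushed an event to Kafka.
    (defn produce-event! [payload]
      (println "producing" payload))

    ;; Wrap the producer so it is called at most 500 times per second.
    (def throttled-produce-event! (throttle-fn produce-event! 500 :second))

    (doseq [payload (range 10000)]
      (throttled-produce-event! payload))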

Initial Run

For the first run, we started with an initial HTTP config on the consumer.
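A representative sketch of such a config; the pool sizes here are assumptions, and the 5s timeouts mirror our SLA:

    (require '[clj-http.conn-mgr :as conn-mgr])

    ;; Pooled async connection manager shared by all requests (sizes are assumed).
    (def conn-manager
      (conn-mgr/make-reusable-async-conn-manager
        {:threads 50               ;; total connections in the pool
         :default-per-route 10}))  ;; max simultaneous connections per route

    ;; Options passed with every async request.
    (def http-opts
      {:async? true
       :connection-manager conn-manager
       :socket-timeout 5000   ;; ms to wait for data on an established connection
       :conn-timeout 5000})   ;; ms to wait while establishing a connection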

After the first run, we noticed the response time for HTTP requests was around 5s. We have an SLA of 5s for the entire round trip, including internal processing of the response, so 5s was unacceptable. The consumer lag of 1.8K, though not alarming, was a bit on the higher side. On doing some basic checks, we realised that this was due to one endpoint in the mocked endpoint provider that was deliberately made to time out (added for heterogeneity of responses). Even though the endpoints were picked at random, there was a certain pattern to the randomness which caused this particular endpoint to be picked more often in the runs.

For the next run, we removed this endpoint and the response time quickly went below 2s. After this, we tried increasing the RPS further to 700; however, incoming requests on the mocked server were maxing out at 600 RPS. We suspected three things:

  1. HTTP connection requests were piling up in the http-client due to a limited number of threads, which created back pressure.
  2. The in-memory DB on the mocked server was getting overloaded: all requests were populating it and none were deleting data, so the same requests were taking longer and longer as the dataset grew linearly. Since we had set a timeout of 5s, a lot of requests started timing out, and this eventually created back pressure.
  3. The throttler library was not able to handle this level of throttling.

Debugging

To eliminate #1, we decided to review the pool stats published by the Apache HttpComponents HTTP client library that clj-http wraps. These stats are a rich source of information on metrics like available connections, leased connections, pending connections, etc. What we were interested in was the number of pending connections, accessible via .getPending(). This stat is also useful for graceful shutdown, since you don't want to stop a node that still has pending requests piled up.
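A small helper for snapshotting those stats might look like this; it assumes conn-manager is one of Apache HttpClient's pooling connection managers, which expose getTotalStats:

    ;; Snapshot the connection pool; each field comes from the PoolStats object.
    (defn pool-stats [conn-manager]
      (let [stats (.getTotalStats conn-manager)]
        {:available (.getAvailable stats)   ;; idle connections ready for reuse
         :leased    (.getLeased stats)      ;; connections currently in use
         :pending   (.getPending stats)     ;; requests waiting for a connection
         :max       (.getMax stats)}))      ;; pool capacity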

We plotted the number of pending connections on the HTTP client and noticed that there were none, so this hypothesis was ruled out.

(Graph: no. of pending HTTP client connections, which stayed at zero)

To eliminate #2, even though it was highly unlikely to be the cause of the plateau, we added a middleware that would clean up the in-memory DB after every 10K requests. We also removed an endpoint that was deliberately returning a large response. As expected, nothing changed. We kept the changes though, since cleaning up the in-memory DB seemed sensible considering the benchmark would eventually run for hours.
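A Ring-style sketch of such a cleanup middleware; reset-db! is a hypothetical function that empties the in-memory store:

    ;; Wipe the in-memory DB after every 10K requests so response times don't
    ;; degrade just because the dataset keeps growing.
    (defn wrap-periodic-cleanup [handler reset-db!]
      (let [counter (atom 0)]
        (fn [request]
          (when (zero? (mod (swap! counter inc) 10000))
            (reset-db!))
          (handler request))))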

The only thing left was to remove the library and introduce custom throttling. We did this and it worked. We replaced the throttler with a configurable, parallel event-producing loop (using Clojure's pmap). The code looked something like this:
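A reconstruction along those lines; fetch-batch and produce-event! are hypothetical stand-ins for the real producer code:

    ;; Hypothetical stand-ins for the real producer code.
    (defn fetch-batch [n] (range n))                ;; pull n sample payloads
    (defn produce-event! [payload] (comment "send payload to Kafka"))

    ;; Produce roughly `rps` events per second for `duration-secs` seconds,
    ;; firing each one-second batch in parallel with pmap.
    (defn produce-at-rate! [rps duration-secs]
      (dotimes [_ duration-secs]
        (let [started (System/currentTimeMillis)]
          (doall (pmap produce-event! (fetch-batch rps)))
          (let [elapsed (- (System/currentTimeMillis) started)]
            (when (< elapsed 1000)
              (Thread/sleep (- 1000 elapsed)))))))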

Tuning http-client parameters

Once the throttled event production problem was out of the way, we noticed that as the number of requests increased, the response time was going over 5s, even though the timeouts were set at 5s. How was this possible? From the graphs, we also noticed that the response time was increasing gradually over time. On further research, we realised that connections were getting queued up by the connection manager, and that waiting time counted towards our measured response time. This meant that in order to maintain the system invariant (processing an event should take no more than 5s) we had to:

  • Time out the thread waiting for a connection to open up (because from the point of view of the system, this was time spent processing the event). Note that this is different from conn-timeout, which fires when the connection manager has allocated a connection but the endpoint provider could not be reached within the specified time.
  • Ensure that we don't pull events off the queue if connections are not available.

To fix the problem of threads waiting forever for the http-client connection manager to provide a connection, we introduced conn-request-timeout in the HTTP config and ran the benchmark again. Performance was much better, and the lag dropped to under 1K. This was an acceptable number, but the lag, even though small, had to be debugged. A lot of the debugging centered around checking how the clj-http client works and reviewing the benchmarking environment setup.
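In clj-http this is the :conn-request-timeout option; the updated request options looked roughly like this, where the 5000 ms values mirror our SLA and the connection manager carries over from the earlier sketch:

    (def http-opts
      {:async? true
       :connection-manager conn-manager   ;; pooled async connection manager from earlier
       :socket-timeout 5000               ;; ms waiting for data on an open connection
       :conn-timeout 5000                 ;; ms establishing a connection
       :conn-request-timeout 5000})       ;; ms waiting for the pool to lease a connection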

In the benchmarking environment we had set up, the endpoints being hit were limited (compared to what we would have in production), and default-per-route was configured keeping in mind that we had an upper limit on the number of requests we would make to a single endpoint in parallel. This restricted the number of simultaneous connections that could be made per route. Increasing this value to equal the number of threads reduced the lag considerably, as it used up threads that had been lying idle earlier. We also wanted to max out the requests being made from the system under test.

Simulating production traffic

Now, we started benchmarking with realistic RPS to get the figures we would expect in production. Since our expected traffic on nodes in this subsystem was 60 RPS, we tried that. Also, since this would be a new service and we didn't expect traffic on it immediately, we reduced the number of threads in the HTTP client config to 20 and increased default-per-route to 20 (to avoid the problem described above). With this, the HTTP response time was under 600ms and the consumer lag was about 100.
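The corresponding connection-manager settings for those runs, in the same sketch form as before:

    ;; 20 pooled connections, with default-per-route raised to match so no
    ;; thread sits idle waiting on the per-route limit.
    (def conn-manager
      (conn-mgr/make-reusable-async-conn-manager
        {:threads 20
         :default-per-route 20}))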

So, putting the numbers together:

Average processing time: 600ms
No. of parallel requests: 20
Average consumer lag: 100
Requests are picked up in batches of 20, so a new event will be processed in the 5th batch (100 / 20)
Wait time in queue = 4 * 600ms = 2.4s
Processing time for the new event = 600ms
Total processing time = 2.4s + 600ms = 3s

3s was acceptable for us. Increasing the RPS proportionately increased the lag, which is an expected side effect.

Conclusions

Some things to keep in mind while benchmarking a consumer that makes async HTTP requests:

  1. If you're benchmarking the response time of async (or sync) HTTP requests, keep things realistic: either have a middleware that adds random latency per route in the HTTP endpoint provider, or, if you have an endpoint that is made to time out, pick routes with a probability-based selection instead of uniformly at random (see the sketch after this list). This avoids situations where the response time is unexpectedly high simply because the timeout route gets picked more often.
  2. It is useful to have a good mix of HTTP endpoints (high/low response times, for example) if you want to play around and benchmark client configurations on a per-route basis.
  3. Ensure your internal services are not the bottleneck: several times we realised that having too many services on the same node was causing slowdowns. Have clear boundaries between these services to avoid one overwhelming the other; it also makes debugging simpler.
  4. Apart from the consumer metrics, capturing JVM metrics (in the case of a JVM application) like memory usage, thread count, etc. can also be useful for seeing which part of the application is slowing down the process.
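A minimal sketch of such a probability-based route selection; the routes and weights are illustrative:

    ;; Pick a route according to its weight, so a deliberately slow or timing-out
    ;; route is exercised occasionally without dominating the run.
    (def routes
      [{:path "/fast"    :weight 70}
       {:path "/slow"    :weight 25}
       {:path "/timeout" :weight 5}])

    (defn pick-route [routes]
      (let [total (reduce + (map :weight routes))
            r     (rand total)]
        (loop [[route & more] routes
               acc 0]
          (let [acc' (+ acc (:weight route))]
            (if (or (< r acc') (empty? more))
              (:path route)
              (recur more acc'))))))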

I want to extend my gratitude to my team, who helped me in this benchmarking process: Pardeep Shergill, Vedang Manerikar, Dinesh Chhatani, Neha Mishra, Samuel Chase and Rubal Jabbal.
