Kafka and Protobuf — Ingesting Data into Our Platform

Published in 99P Labs · 5 min read · Nov 22, 2021
Photo by @wildmax on Unsplash

Author: Ben Davis

When we ingest a source of data, it can go through multiple stages of transformation before it is ready to use. Typically, each stage passes data to the next by writing it out to storage and signaling the next stage that new data is available. Recently we introduced Kafka into our environment to eventually serve as the backbone of our pipeline. In a previous post I discussed using gRPC and Protocol Buffers as a way to deliver large datasets to a client. With this in mind, it seemed like a good idea to store the messages in Kafka as Protobuf messages. We already know from previous experiments that gRPC significantly speeds up data transfer. I reasoned that it should be even faster for reading messages, since the server can skip serialization and send the Kafka message payload as-is. But is that assumption true? Is it really worth the trouble of dealing with a binary protocol and all the associated code for the potential performance gain?
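
To make the pass-through idea concrete, here is a minimal sketch, assuming a hypothetical generated gRPC module (telemetry_pb2 / telemetry_pb2_grpc), a hypothetical topic name, and a confluent-kafka consumer. It is not our production server, just the shape of it: because the topic already stores Protobuf-encoded bytes, the handler can forward each Kafka payload without decoding JSON and re-encoding it.

```python
# Sketch only: telemetry_pb2 / telemetry_pb2_grpc, the topic name, and the
# RecordChunk message (a single `bytes payload` field) are hypothetical.
from confluent_kafka import Consumer

import telemetry_pb2
import telemetry_pb2_grpc


class TelemetryService(telemetry_pb2_grpc.TelemetryServiceServicer):
    def StreamRecords(self, request, context):
        consumer = Consumer({
            "bootstrap.servers": "localhost:9092",
            "group.id": "api-server",
            "auto.offset.reset": "earliest",
        })
        consumer.subscribe(["telemetry-proto"])  # topic already holds Protobuf bytes
        try:
            while context.is_active():
                msg = consumer.poll(1.0)
                if msg is None or msg.error():
                    continue
                # No JSON decode / re-serialize step: the stored bytes are
                # wrapped as-is and handed to gRPC.
                yield telemetry_pb2.RecordChunk(payload=msg.value())
        finally:
            consumer.close()
```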

The setup for the first experiment looks like this. It’s one server and one client with neither one competing for resources with anything else.

There are two topics containing the same messages but one is JSON and the other Protobufs. The timestamps on the messages are spread out over a year and published roughly in order. I ran five requests of increasing size several times and averaged the elapsed time in seconds from when the server receives the request to when it closes the stream.
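
The timing itself is nothing fancy; something like the wrapper below (a sketch, with hypothetical handler names) captures elapsed seconds from the moment the server receives the request until it closes the stream.

```python
import time


def timed_stream(handler):
    """Wrap a server-streaming handler and report elapsed seconds from
    request receipt to stream close. Illustrative only."""
    def wrapper(self, request, context):
        start = time.monotonic()
        try:
            yield from handler(self, request, context)
        finally:
            print(f"elapsed: {time.monotonic() - start:.2f}s")
    return wrapper
```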

The results show the efficiency gain by the API server when it doesn’t have to spend time deserializing/serializing messages. Another side benefit is that the Protobuf topic was about 5 times smaller than the JSON topic.
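
The size difference is easy to reproduce on a single record: JSON repeats field names and punctuation in every document, while Protobuf keeps field names in the schema and packs numbers in binary. A quick check along these lines (the Record fields and the telemetry_pb2 module are hypothetical) shows the effect.

```python
import json

import telemetry_pb2  # hypothetical generated Protobuf classes

record = {"vehicle_id": "v-123", "speed_kph": 88.5, "ts": 1622548800000}

json_bytes = json.dumps(record).encode("utf-8")
pb_bytes = telemetry_pb2.Record(
    vehicle_id="v-123", speed_kph=88.5, ts=1622548800000
).SerializeToString()

# JSON carries field names and text-encoded numbers in every message;
# Protobuf does not, which is where the roughly 5x topic-size gap comes from.
print(f"JSON: {len(json_bytes)} bytes, Protobuf: {len(pb_bytes)} bytes")
```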

The purpose of the next experiment is to compare the elapsed time for the clients. We had really only planned on having Python gRPC clients but I was surprised to see that the Python gRPC client was much slower than the equivalent REST client. This is what the setup looks like:

The Go gRPC and Python REST clients are pretty even, and as expected, the gRPC client improves as the record count increases.

The expectation that gRPC would be faster than REST comes from the results of a previous experiment that focused on comparing the data transfer rates. As it turns out, the clients in that previous test were mostly waiting on data from the backend datasource, which masked the limits of the client. Now with a local instance of Kafka as our datasource, the API server is able to stream data fast enough that the clients hit their limit. So what’s up with Python?

The gRPC documentation explains it this way:

Streaming RPCs create extra threads for receiving and possibly sending the messages, which makes streaming RPCs much slower than unary RPCs in gRPC Python, unlike the other languages supported by gRPC.

The default maximum gRPC message size is 4MB. This is the main reason why we are using the streaming RPC and not the unary RPC. The returned message size is unbounded.
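
In practice that means the Python client iterates over a server-streaming call rather than receiving one giant unary response. A rough sketch follows, again with hypothetical generated stubs and a placeholder server address; the max_receive_message_length option is how you would raise the per-message cap if you ever needed to, but with streaming each individual message stays small.

```python
import grpc

import telemetry_pb2
import telemetry_pb2_grpc  # hypothetical generated stubs

channel = grpc.insecure_channel(
    "api.example.internal:50051",  # placeholder address
    # Optional: raise the per-message cap above the 4 MB default.
    options=[("grpc.max_receive_message_length", 64 * 1024 * 1024)],
)
stub = telemetry_pb2_grpc.TelemetryServiceStub(channel)

request = telemetry_pb2.RangeRequest(
    start_ts=1612137600000,  # 2021-02-01 UTC, in ms
    end_ts=1614556800000,    # 2021-03-01 UTC, in ms
)

# Server streaming: the total result size is unbounded, but each message
# on the stream is small, so no single message hits the limit.
count = sum(1 for _ in stub.StreamRecords(request))
print(f"received {count} records")
```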

Not willing to give up on gRPC for Python just yet, in this last experiment I split each single request into ten smaller requests and submit them concurrently using Python’s multiprocessing package. The number 10 is an arbitrary choice.
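
The splitting looks roughly like this (a sketch under the same hypothetical stub names): slice the requested time window into ten sub-ranges and fetch each one in its own process, each worker opening its own channel.

```python
from datetime import datetime, timedelta
from multiprocessing import Pool

import grpc

import telemetry_pb2
import telemetry_pb2_grpc  # hypothetical generated stubs


def fetch_range(bounds):
    """Drain one server-streamed sub-range in a worker process."""
    start_ms, end_ms = bounds
    channel = grpc.insecure_channel("api.example.internal:50051")
    stub = telemetry_pb2_grpc.TelemetryServiceStub(channel)
    request = telemetry_pb2.RangeRequest(start_ts=start_ms, end_ts=end_ms)
    # Return serialized bytes so the records pickle cheaply back to the parent.
    return [r.SerializeToString() for r in stub.StreamRecords(request)]


def fetch_concurrently(start: datetime, end: datetime, workers: int = 10):
    # Ten slices was an arbitrary choice; split the window evenly.
    step = (end - start) / workers
    slices = [
        (int((start + i * step).timestamp() * 1000),
         int((start + (i + 1) * step).timestamp() * 1000))
        for i in range(workers)
    ]
    with Pool(workers) as pool:
        chunks = pool.map(fetch_range, slices)
    return [rec for chunk in chunks for rec in chunk]
```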

Multiprocessing brings the elapsed time closer to being comparable to the REST client.

There is one other issue with the multiprocessing approach. The only information the client has that it can use to split up the request is time. The API server takes that time constraint and translates it into partitions and offsets. Kafka can tell you the offset of the oldest message for a given timestamp, but it cannot guarantee that you won’t miss some late-arriving messages. For example, suppose you want the messages from the month of February of last year. Kafka can tell you the offsets for the first messages from February. You can also get the offsets for the first messages in March. That gives you a range of offsets to read from Kafka. The possibility exists that some events from February didn’t arrive until March, or even later. Those messages would be outside the range of offsets you are requesting. As a countermeasure you could pad the end date so that you read messages past the end of February. How much padding to use would depend on what you know about how the data is collected and what your tolerance is for missing late messages.
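
On the server side, the timestamp-to-offset translation with padding could look something like this sketch using confluent-kafka's offsets_for_times; the seven-day padding is only an illustrative default, not a recommendation.

```python
from datetime import datetime, timedelta

from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({"bootstrap.servers": "localhost:9092",
                     "group.id": "offset-lookup"})


def offset_range(topic, partition, start: datetime, end: datetime,
                 padding=timedelta(days=7)):
    """Translate a time window into a start/end offset pair, padding the
    end so late-arriving events just past the window are still read."""
    def first_offset_at(ts: datetime) -> int:
        ts_ms = int(ts.timestamp() * 1000)
        # offsets_for_times returns, per partition, the earliest offset whose
        # timestamp is >= ts_ms (offset is -1 if no such message exists yet).
        result = consumer.offsets_for_times(
            [TopicPartition(topic, partition, ts_ms)], timeout=10)
        return result[0].offset

    return first_offset_at(start), first_offset_at(end + padding)
```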

Conclusions

Originally I set out to answer two questions.

1) Should we store messages as Protobufs in Kafka or not?

Even though these tests were run with a single server and client, they can give you an idea of how the system might perform under the load of hundreds of simultaneous gRPC calls. Storing the messages in a Kafka topic as Protobufs saves a significant amount of storage space, and the API server has less work to do because it no longer needs to serialize messages. The performance gain from not serializing will be more apparent on a server under heavy load.

2) Should we use a Streaming API or stay with REST and JSON?

On the client side, the answer is less clear. Written in languages such as Go, C, or Java, gRPC clients can take full advantage of streaming. With a Python client, if the network bandwidth is limited, gRPC will help, but you still have to deal with the added complexity of concurrent processes and late-arriving data considerations. I don’t think it’s an either-or proposition. Both methods can coexist. The one to choose will depend on how much data you need and how long you are willing to wait.

Stay connected with us at the 99P Labs Developer Portal.
