Building a lossless Kafka Producer

Pavan Yekbote
Hevo Data Engineering
3 min read · Mar 8, 2021

Hevo is a no-code data pipeline as a service. Customers across various geographical regions rely extensively on Hevo to move their data across different data stores. As a result, maintaining data integrity and low latency of transfer becomes crucial. Hevo relies on Apache Kafka as a message queue. Kafka is an open-source streaming platform that is fast, scalable, durable, and fault-tolerant.

Kafka and Data Integrity at a large scale: What is the challenge?

Kafka works by persisting published messages on a broker and having subscribers read and derive their own representations of the data. Kafka provides a producer (publisher) and a consumer (subscriber) library for this purpose. Since every message written to the broker is persisted to disk, a message is safe from application or broker crashes once it reaches the broker. However, when you write to Kafka via the producer, messages are first written to an in-memory buffer, and a separate Kafka sender thread reads them from that buffer and syncs them to the broker. There is therefore a brief window during which messages have not yet been synced to disk: any failure that causes this buffer to lose its contents, such as a producer crash, results in data loss.
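To make the window concrete, here is a toy model of the producer's internals: `send()` only appends to an in-memory buffer, and a background sender thread drains the buffer to the "broker" (a list standing in for the broker's disk). The class and method names are illustrative, not the real Kafka client API.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;

// Simplified model of a buffering producer: records sit in memory
// until the sender thread syncs them, so a crash in that window
// loses whatever is still in the buffer.
public class BufferedProducer {
    private final BlockingQueue<String> buffer = new ArrayBlockingQueue<>(1024);
    private final List<String> broker = new CopyOnWriteArrayList<>();

    public BufferedProducer() {
        Thread sender = new Thread(() -> {
            try {
                while (true) {
                    broker.add(buffer.take()); // "sync" one record to the broker
                }
            } catch (InterruptedException e) { /* shut down */ }
        });
        sender.setDaemon(true);
        sender.start();
    }

    // Returns immediately: the record is only in memory at this point.
    public void send(String record) throws InterruptedException {
        buffer.put(record);
    }

    public int unsynced() { return buffer.size(); }
    public List<String> synced() { return broker; }
}
```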

To tackle this problem, client applications can wait for each message to be synced before proceeding, and republish the message in case of failure. However, for applications like Hevo that operate at a large scale and require sub-millisecond latencies, this becomes very expensive because of the wait time added to every message.
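The cost comes from serializing on each acknowledgement. The toy comparison below models a producer whose ack arrives about 5 ms after each send: blocking per message costs roughly 5 ms times the message count, while firing all sends and waiting once lets the acks overlap. The names here are illustrative, not the Kafka client API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Toy comparison of "wait per message" vs "send all, then wait once".
public class AckStrategies {
    static final ScheduledExecutorService acks =
            Executors.newScheduledThreadPool(4, r -> {
                Thread t = new Thread(r);
                t.setDaemon(true); // let the JVM exit after the demo
                return t;
            });

    // Models an async send whose acknowledgement arrives ~5 ms later.
    static CompletableFuture<Void> sendAsync(String record) {
        CompletableFuture<Void> ack = new CompletableFuture<>();
        acks.schedule(() -> ack.complete(null), 5, TimeUnit.MILLISECONDS);
        return ack;
    }

    // Safe but slow: block on every acknowledgement (~5 ms * n total).
    static long sendAndWaitEachMillis(int n) throws Exception {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) sendAsync("msg-" + i).get();
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Fire all sends, then wait once: the acks overlap (~5 ms total).
    static long sendAllThenWaitMillis(int n) throws Exception {
        long start = System.nanoTime();
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (int i = 0; i < n; i++) pending.add(sendAsync("msg-" + i));
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).get();
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```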

How did Hevo tackle this problem?

Kafka invokes callbacks when messages are finally synced, i.e., persisted to the broker. Before publishing messages to Kafka, Hevo writes them to a local memory-mapped write-ahead log, and periodically checkpoints message offsets based on the success or failure callbacks. Messages that have been successfully synced are periodically discarded from the local write-ahead log. After a non-graceful shutdown, Hevo recovers potentially lost messages by republishing everything written to the write-ahead log after the latest committed checkpoint. On failure callbacks, records are pushed to a queue and republished periodically.
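The scheme above can be sketched as follows, with plain lists standing in for the memory-mapped log and the retry queue. All class and method names here are illustrative, not Hevo's actual API, and for simplicity the checkpoint logic assumes acknowledgements arrive in order.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the write-ahead-log scheme: append locally, publish,
// advance the checkpoint on success callbacks, queue failures for
// republishing, and replay from the checkpoint after a crash.
public class WalProducer {
    private final List<String> wal = new ArrayList<>();        // write-ahead log
    private final List<String> retryQueue = new ArrayList<>(); // failed sends
    private long checkpoint = 0; // offset past the last confirmed-synced message

    public interface KafkaLike {
        void send(String msg, java.util.function.Consumer<Boolean> callback);
    }

    // 1. Append to the local WAL, then hand the message to Kafka.
    public void publish(String msg, KafkaLike kafka) {
        wal.add(msg);
        long offset = wal.size() - 1;
        kafka.send(msg, ok -> {
            if (ok) onSuccess(offset);
            else retryQueue.add(msg); // republished periodically
        });
    }

    // 2. Success callbacks advance the checkpoint; WAL entries below it
    //    can be discarded during periodic cleanup.
    private void onSuccess(long offset) {
        checkpoint = Math.max(checkpoint, offset + 1);
    }

    // 3. After a non-graceful shutdown, replay everything written to the
    //    WAL after the last committed checkpoint.
    public List<String> recover() {
        return new ArrayList<>(wal.subList((int) checkpoint, wal.size()));
    }

    public List<String> pendingRetries() {
        return new ArrayList<>(retryQueue);
    }
}
```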

Since Hevo needs low latencies at a large scale, it is important that this added mechanism does not itself become a bottleneck.

Lossless Kafka Producer Workflow

Persistence and Speed: How is this achieved?

Hevo relies on BigQueue for fast, persistent writes to both the write-ahead log and the retry queue. BigQueue is an out-of-the-box implementation of memory-mapped queues and arrays that guarantees low latency. Internally, it relies on memory-mapped files, which bridge the gap between persistence and in-memory speed.
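The underlying primitive is available directly in the JDK via `java.nio`. The minimal round trip below shows the idea BigQueue builds on: writes go to a page-cache-backed buffer at memory speed, and `force()` (or the OS, in the background) persists the mapped pages to disk. This is a demonstration of memory mapping itself, not of BigQueue's API.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Write through a memory-mapped buffer, flush it, and read the bytes
// back from the file to show they were persisted.
public class MmapDemo {
    public static String roundTrip() throws IOException {
        Path file = Files.createTempFile("mmap-demo", ".dat");
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 64);
            byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
            buf.put(payload); // in-memory write to the mapped region
            buf.force();      // flush mapped pages to disk
        }
        byte[] back = Files.readAllBytes(file);
        return new String(back, 0, 5, StandardCharsets.UTF_8);
    }
}
```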

BigQueue boasts the following features:

  • Fast — Performance Benchmarks
  • Reliable
  • Persistent
  • Memory efficient
  • Lightweight
  • Big — the total size of a BigQueue is limited only by available disk space.

These features make BigQueue a strong candidate for the problem of fast, persistent writes. In practice, the entire mechanism adds only a few microseconds per message written to Kafka, although the exact overhead depends on factors such as thread count and message size.

BigQueue Write Latency

What is the Recoverable Kafka Producer?

We at Hevo understand that data integrity at scale is a widespread problem, so we have open-sourced our solution. The Recoverable Kafka Producer is an open-source Kafka producer built at Hevo to achieve data integrity with sub-millisecond latencies. It is easily configurable, so you can tune the producer to fit your application's needs.
