How we use Kafka

Humio is a log analytics system built to run both On-Prem and as a Hosted offering. It is designed “On-Prem first” because, in many logging use cases, you need the privacy and security of managing your own logging solution, and because volume limitations can often be a problem in Hosted scenarios.

From a software provider’s point of view, fixing issues in an On-Prem solution is inherently problematic, so we have strived to keep the solution simple. To that end, a Humio installation consists of just a single process per node running Humio itself, with a dependency on Kafka running nearby. (We recommend deploying one Humio node per physical CPU, so a dual-socket machine typically runs two Humio nodes.)

We use Kafka for two things: buffering ingest and as a sequencer of events among the nodes of a Humio cluster.

Ingest Queue/Commit Log

The ingest queue is used for buffering ingest, to absorb peak loads and similar situations. But the ingest queue also plays the role of a commit log: whenever we finish a block and append it to one of Humio’s segment files, we also write the offset of the last message batch received into the on-disk block. That way we can restart with no loss by reading the last block of the Humio segment under construction and resuming ingest from the Kafka offset recorded in that block.
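In outline, the recovery logic works like this (a simplified Python model, not Humio’s actual JVM implementation; the block layout and field name are invented for illustration):

```python
# Simplified sketch of the commit-log recovery described above.
# Each block appended to a segment file records the Kafka offset of
# the last ingest batch it contains (field name is illustrative).

def resume_offset(segment_blocks):
    """Return the Kafka offset to restart consuming from.

    segment_blocks: list of dicts, each with a 'last_kafka_offset' key,
    in the order they were appended to the segment under construction.
    """
    if not segment_blocks:
        return 0  # nothing persisted yet: start from the beginning
    # The last durable block tells us what has been safely written,
    # so ingest resumes at the very next offset.
    return segment_blocks[-1]["last_kafka_offset"] + 1

# Example: two blocks flushed, the last one covering batches up to offset 41.
blocks = [{"last_kafka_offset": 17}, {"last_kafka_offset": 41}]
print(resume_offset(blocks))  # -> 42
```

Because the offset is written atomically with the block itself, a crash between blocks simply causes Kafka to redeliver the uncommitted batches.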

Event Sourcing

The primary coordination mechanism for our cluster state also goes via Kafka. The ‘shared cluster state’ is maintained using a straightforward event-sourcing model, where we push updates through a single-partition/multiple-replica topic called global-events. All nodes periodically dump a snapshot of their internal state to local disk, and when a node starts up, it reads its own snapshot and replays updates starting at the offset written in the snapshot. This has proven to be a very simple mechanism for dealing with distributed state.
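The snapshot-plus-replay scheme can be sketched as follows (a toy Python model; the state shape, snapshot fields, and apply function are all invented for illustration, and a list index stands in for the Kafka offset):

```python
# Toy model of event-sourced cluster state: a snapshot carries the state
# plus the offset of the last update applied; on startup we replay only
# the updates after that offset.

def apply_update(state, update):
    # Illustrative: each update sets one key of the shared state.
    new_state = dict(state)
    new_state[update["key"]] = update["value"]
    return new_state

def restore(snapshot, log):
    """Rebuild state from a snapshot and the global-events log.

    snapshot: {'state': dict, 'offset': int} -- offset of last applied update
    log: list of updates; the list index stands in for the Kafka offset.
    """
    state = dict(snapshot["state"])
    for offset in range(snapshot["offset"] + 1, len(log)):
        state = apply_update(state, log[offset])
    return state

log = [
    {"key": "node-1", "value": "alive"},
    {"key": "node-2", "value": "alive"},
    {"key": "node-1", "value": "dead"},
]
snap = {"state": {"node-1": "alive"}, "offset": 0}  # taken after the first update
print(restore(snap, log))  # -> {'node-1': 'dead', 'node-2': 'alive'}
```

Since every node applies the same updates in the same single-partition order, all nodes converge on the same state without any further coordination.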

To be on the safe side, there are a few more details to this, such as recording the Kafka cluster id as the epoch of the Kafka offsets. So if you wipe your Kafka cluster, or point Humio at a different one, some manual intervention is required.
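The epoch guard amounts to a check like this (field names are hypothetical):

```python
# Sketch of the epoch guard: a snapshot is only valid for the Kafka
# cluster whose offsets it refers to, so the cluster id is recorded
# and replay is refused against a different cluster.

def can_replay(snapshot, current_cluster_id):
    return snapshot["kafka_cluster_id"] == current_cluster_id

snap = {"kafka_cluster_id": "abc-123", "offset": 7, "state": {}}
print(can_replay(snap, "abc-123"))  # -> True
print(can_replay(snap, "xyz-999"))  # -> False: offsets are meaningless here
```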

Humio/Kafka configurations

As for ways to run Kafka, Humio supports three different configurations:

  1. Single node: We provide one Docker container with Zookeeper, Kafka, and Humio in it. This makes it easy to try it out; just say docker run humio/humio and go.
  2. Run a dedicated Kafka/Zookeeper to service your Humio cluster. We provide specialized containers for this scenario that deal with some management issues.
  3. Bring-your-own Kafka. We try to stay out of the way of managing it.

The first option is great for a small-to-medium instance running on a single server. Having both options 2 and 3, however, does complicate things somewhat.

On the other hand, many users want to run both Humio and Kafka as a cluster for scale and resilience. Unless the customer already has a dedicated team managing Kafka, they’ll typically run Humio with our humio/zookeeper and humio/kafka containers. In this scenario, we need to help users balance the Kafka cluster, configure replication, and so on. Humio uses a handful of topics that need varying configurations, and if the customer isn’t knowledgeable about Kafka, it can easily be misconfigured (too few replicas, all replicas ending up on a single node in the Kafka cluster, etc.). So we ship the entire Kafka admin client with Humio and perform these re-configurations for the user. This mode is enabled in Humio by setting KAFKA_MANAGED_BY_HUMIO=true.
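The kind of sanity-checking involved can be sketched like this (illustrative only; the numbers and the shape of the replica assignment are assumptions for the example, not Humio’s real configuration):

```python
# Sketch of two checks a managed setup would perform: clamping the
# replication factor to the broker count, and detecting the case where
# all replicas of a partition landed on one broker.
# All values here are illustrative, not Humio's actual settings.

def desired_replication(broker_count, wanted=3):
    # You cannot have more replicas than brokers.
    return min(wanted, broker_count)

def badly_placed(replica_assignment):
    """True if any partition has all of its replicas on a single broker.

    replica_assignment: {partition: [broker ids hosting its replicas]}
    """
    return any(len(set(brokers)) == 1 and len(brokers) > 1
               for brokers in replica_assignment.values())

print(desired_replication(broker_count=5))   # -> 3
print(desired_replication(broker_count=1))   # -> 1
# Partition 0 has both replicas on broker 2: a misconfiguration.
print(badly_placed({0: [2, 2], 1: [1, 3]}))  # -> True
```

In the real system, checks like these run against the live cluster metadata via the Kafka admin client that ships with Humio.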

For customers that are already managing a Kafka installation, on the other hand, we need to stay out of their way. A customer who ‘knows what they are doing’ will just be annoyed if we try to reconfigure their Kafka. To make sure our topic names don’t conflict with existing ones, you can set HUMIO_KAFKA_TOPIC_PREFIX to namespace the topics we need.

Some details on our Kafka usage

We’ve found some issues that might be interesting for a Kafka’esque audience.

  • We don’t use Kafka’s built-in compression, but provide our own. The issue is that in a high-performance scenario, the stock LZ4 compression uses JNI, which can block the garbage collector. With many consumers running in the same JVM, we can easily get into a situation where there is always a thread preventing GC from happening, leading to long doing-nothing pauses. We ran into the same issue with our web framework’s use of gzip and replaced that with com.jcraft.jzlib.
  • We started using the deleteRecords functionality in Kafka (once ingested data is safely replicated to storage nodes, we can delete old entries in the ingest topic), but have since come to regret it somewhat. It took a few versions of Kafka before it worked properly, and at the end of the day, the user needs the assigned retention space anyway in case Humio has trouble keeping up for some reason. It saves on typical disk space usage, but not on worst-case disk space needs. If we were deciding today, we probably wouldn’t use it; but now we may have customers depending on the disk space that’s released this way.
  • We use the endOffsets / beginningOffsets methods in the consumer to determine the size of the inbound queue and to determine if ‘we’re running behind’ in ingest processing. These calls can sometimes take a very long time, so it’s hard to say whether we’re behind because the queue is long, or because it takes a long time to determine if the queue is long.
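The ‘are we running behind?’ check itself is simple arithmetic; a minimal sketch, with stubbed offset maps standing in for the actual (and potentially slow) Kafka calls:

```python
# Sketch of the lag check: total lag is the sum over partitions of
# (end offset - current consumer position). In the real system the end
# offsets come from the consumer's endOffsets call, which is the part
# that can itself be slow.

def total_lag(end_offsets, positions):
    """end_offsets and positions map partition -> offset."""
    return sum(end_offsets[p] - positions.get(p, 0) for p in end_offsets)

def running_behind(end_offsets, positions, threshold):
    return total_lag(end_offsets, positions) > threshold

ends = {0: 1000, 1: 800}   # stubbed result of endOffsets
pos = {0: 990, 1: 500}     # stubbed current positions
print(total_lag(ends, pos))             # -> 310
print(running_behind(ends, pos, 1000))  # -> False
```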

Anyhow, we love Kafka. It has made implementing a distributed system significantly easier than building the distributed queue from scratch would have been. Using Kafka as a building block has allowed us to focus our attention on building a fast and pleasant search experience.