Apache Kafka Guide #19 Segment and Indexes

Paul Ravvich
Apache Kafka At the Gates of Mastery
4 min read · Feb 1, 2024


Apache Kafka Segment Partition

Hi, this is Paul, and welcome to part #19 of my Apache Kafka guide. Today we will discuss how Apache Kafka segments and indexes work, and which settings control them.

Segments

Alright, let’s talk about the structure of your topics. Topics consist of partitions, which we’re already familiar with. Each partition, in turn, is made up of segments, and each segment is stored as a log file on disk. That is the lowest level: there is nothing smaller than a segment file. Every segment covers a specific range of offsets. For example, Segment 0 might cover offsets 0 to 934, Segment 1 offsets 935 to 1567, and so on. The last segment is the active segment, the one currently being written to, so its final offset is not yet known. Writes to the active segment are sequential: new messages are always appended to the end.
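To make the offset ranges concrete, here is a minimal sketch of how a segment can be located from an offset. It is an illustration, not Kafka's actual code: the function name find_segment is made up, and the base offset 1568 for the active segment is an assumed value that simply continues the example above.

```python
from bisect import bisect_right

# Base (first) offset of each segment in one partition, sorted ascending.
# 0 and 935 come from the example above; 1568 is an assumed base offset
# for the active segment.
SEGMENT_BASE_OFFSETS = [0, 935, 1568]


def find_segment(base_offsets, offset):
    """Return the index of the segment whose offset range contains `offset`."""
    if offset < base_offsets[0]:
        raise ValueError("offset is below the first retained segment")
    # bisect_right counts the base offsets that are <= offset;
    # the segment we want is the last one of those.
    return bisect_right(base_offsets, offset) - 1


for wanted in (934, 935, 2000):
    print(wanted, "-> segment", find_segment(SEGMENT_BASE_OFFSETS, wanted))
```

Because the base offsets are sorted, a binary search is enough; no segment file has to be opened just to decide which one holds the message.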

Apache Kafka Segment Partition

Each partition has only one active segment. This is the most recent segment and the one where data is currently being written. Two crucial settings govern when a segment is closed. The first is log.segment.bytes, which sets the maximum size of a single segment in bytes. By default, this is one gigabyte (1073741824 bytes), so each segment file can grow to at most one gigabyte. Once the active segment reaches that limit, it is closed and a new segment is started. The second is the time limit: log.roll.ms on the broker (log.roll.hours is used when it is not set), or segment.ms at the topic level. It determines how long Kafka waits before closing a segment even if it hasn't reached one gigabyte. The default period is one week (168 hours). So, if the data sent to Kafka over a week doesn't amount to one gigabyte, Kafka will still close the segment after a week and start a new one. In other words, Kafka keeps rolling to new segment files based on size or time, whichever limit is hit first.

  • Topic contains Partitions
  • Partition contains Segments
  • Only one Segment can be active for write operations.

Segment settings:

  • log.segment.bytes (topic-level: segment.bytes): the maximum size of a Segment in bytes
  • log.roll.ms (topic-level: segment.ms): the time after which Kafka closes the Segment and creates a new one, even if it is not full
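The size-or-time rule can be sketched in a few lines. This is a simplified model under stated assumptions, not the broker's real logic: the function should_roll and its parameters are hypothetical, the defaults mirror the one-gigabyte and one-week values described above, and the real broker checks further conditions (for example, an index file filling up).

```python
# Defaults as described in the text: 1 GiB and 7 days.
SEGMENT_BYTES = 1024 ** 3
SEGMENT_MS = 7 * 24 * 60 * 60 * 1000


def should_roll(segment_size, incoming_batch_size, segment_created_ms, now_ms,
                max_bytes=SEGMENT_BYTES, max_age_ms=SEGMENT_MS):
    """Return True when the active segment should be closed and a new one started."""
    # Size rule: appending this batch would push the file past the limit.
    too_big = segment_size + incoming_batch_size > max_bytes
    # Time rule: the segment has been open for the maximum allowed time.
    too_old = now_ms - segment_created_ms >= max_age_ms
    return too_big or too_old


# A small, week-old segment is rolled by time even though it is far from full.
print(should_roll(segment_size=5_000_000, incoming_batch_size=1_000,
                  segment_created_ms=0, now_ms=SEGMENT_MS))
```

Whichever limit is reached first wins, which is why a low-traffic topic still gets a fresh segment every week.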

Segment and Indexes

These segments do not stand alone; each one comes with two index files. The first is an offset-to-position index, which lets Kafka find the byte position in the segment file where it should start reading to retrieve the message at a given offset. The second is a timestamp-to-offset index, which lets Kafka find the first message at or after a given timestamp. These indexes contribute significantly to Kafka’s efficiency.
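On disk, the segment's log file and its two index files share one name: the segment's first offset, zero-padded to 20 digits, with the extensions .log, .index and .timeindex. The helper below is a small sketch of that naming scheme; segment_file_names is a hypothetical function written for this article, not part of any Kafka client.

```python
def segment_file_names(base_offset):
    """Return the log, offset-index and time-index file names for a segment."""
    # The file name stem is the segment's first offset, zero-padded to 20 digits.
    stem = f"{base_offset:020d}"
    return [f"{stem}.log", f"{stem}.index", f"{stem}.timeindex"]


# Files you would expect for the segment that starts at offset 935.
for name in segment_file_names(935):
    print(name)
```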

In this structure, we have a partition. Within the partition, every segment has its own position index and its own timestamp index, as illustrated.
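Here is a rough sketch of how an offset-to-position lookup works. The index is sparse: it holds an entry only for some offsets, so Kafka finds the closest entry at or below the wanted offset and scans forward in the log file from that byte position. The entries and the function lookup_position are made up for illustration; the real index is a binary file that stores offsets relative to the segment's first offset.

```python
from bisect import bisect_right

# Hypothetical sparse index for the segment starting at offset 935:
# (offset, byte position in the .log file), sorted by offset.
OFFSET_INDEX = [(935, 0), (1010, 4200), (1102, 8450), (1190, 12700)]


def lookup_position(index, target_offset):
    """Return the byte position from which to start scanning for target_offset."""
    offsets = [offset for offset, _ in index]
    # Last index entry whose offset is <= target_offset.
    i = bisect_right(offsets, target_offset) - 1
    if i < 0:
        return 0  # target precedes every entry: scan from the start of the file
    return index[i][1]


# Offset 1100 has no entry of its own, so the scan starts at the entry for 1010.
print(lookup_position(OFFSET_INDEX, 1100))
```

The timestamp index works the same way, except that it maps a timestamp to an offset, which is then resolved through the offset index.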

Apache Kafka Segment Partition
Apache Kafka Partition

Why should segments matter to you?

Consider this: setting log.segment.bytes lower than the default of one gigabyte results in more segments per partition. Log compaction, which I’ll explain in a future lecture, then happens more often, because it runs only on closed segments. The trade-off is that Kafka has to keep more files open, which could lead to a ‘too many open files’ error.

How often new segments are created depends on throughput. With high throughput, segments fill up quickly, so there is usually no need to adjust this setting. With low throughput, consider lowering it so that segments are closed more often and log compaction can run more frequently.

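Setting the segment time limit (log.roll.ms on the broker, segment.ms on a topic) below one week raises the maximum frequency at which log compaction can happen, because only closed segments can be compacted. More segments mean more compaction triggers. For instance, with a one-day setting, compaction can run daily rather than weekly, so old data is cleaned up more often.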
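To get a feel for the trade-off, here is a back-of-the-envelope estimate of how many segments, and therefore files, one partition ends up with. Everything in it is an assumption for illustration: the function estimate_segment_files is hypothetical, the 100 MiB per day and 7-day retention figures are invented, and it counts three files per segment (log, offset index, timestamp index) while ignoring details such as the active segment.

```python
import math

DAY_MS = 24 * 60 * 60 * 1000


def estimate_segment_files(bytes_per_day, retention_days, segment_bytes, segment_ms):
    """Rough (segments, files) estimate for a single partition."""
    # Time needed to fill one segment by size alone, in milliseconds.
    fill_ms = segment_bytes / bytes_per_day * DAY_MS
    # A segment rolls at whichever limit is reached first.
    roll_ms = min(fill_ms, segment_ms)
    segments = math.ceil(retention_days * DAY_MS / roll_ms)
    return segments, segments * 3  # .log + .index + .timeindex per segment


low_traffic = 100 * 1024 ** 2  # assumed: 100 MiB written per day

# Default limits: 1 GiB or 7 days.
print(estimate_segment_files(low_traffic, 7, 1024 ** 3, 7 * DAY_MS))
# Rolling every day instead of every week.
print(estimate_segment_files(low_traffic, 7, 1024 ** 3, 1 * DAY_MS))
```

For this low-traffic partition, moving from weekly to daily rolls takes the estimate from 1 segment to 7. Multiply that by the number of partitions on a broker and you can see where the open-file pressure comes from.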

Conclusion

I’m sharing this so you understand what segments and indexes are and how they work. If you look in Kafka’s data directory, you’ll find these files inside the partition folders, and now you know what they are for. That said, in most cases you should leave these settings at their defaults, at least until you have a specific reason to change them.

