Topic Log Compaction in Apache Kafka
Topic Compaction
- Unlike the delete retention policy, where Kafka removes messages from the brokers once a time or size limit is reached, log compaction cleans up messages based on their keys. It is not a manual step: once compaction is enabled for a topic, Kafka's log cleaner runs it in the background.
- In other words, log compaction selectively removes records from each topic partition when a newer record with the same key exists, so only the latest record per key is guaranteed to remain.
Time-Based Retention
- Time-based retention is specified by setting the cleanup.policy to delete and setting the retention.ms to some number of milliseconds. With this set, events will be kept in the topics at least until they have reached that time limit.
- Once they have hit that limit, they may not be deleted right away. This is because event deletion happens at the segment level.
- A segment will be marked for deletion once its youngest event has passed the time threshold.
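The segment-level rule above can be sketched in a few lines of Python. This is only a toy model, not Kafka's implementation: the `Segment` class, the `deletable_segments` function, and the sample timestamps are all made up for illustration, and treating the last segment as the active one that is never deleted is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    base_offset: int          # offset of the first record in the segment
    timestamps_ms: List[int]  # timestamps of the records it holds


def deletable_segments(segments: List[Segment], now_ms: int, retention_ms: int) -> List[Segment]:
    """Return closed segments whose newest record is older than retention_ms."""
    # Assumption for this sketch: the last segment is still being written to
    # (the "active" segment) and is skipped.
    closed = segments[:-1]
    return [s for s in closed if now_ms - max(s.timestamps_ms) > retention_ms]


segments = [
    Segment(0, [1_000, 2_000, 3_000]),
    Segment(3, [4_000, 9_000]),   # holds one old record and one recent record
    Segment(5, [10_000]),         # active segment
]

expired = deletable_segments(segments, now_ms=12_000, retention_ms=5_000)
print([s.base_offset for s in expired])
```

Notice that the record with timestamp 4_000 is already past the limit, yet its segment survives because a newer record (9_000) sits in the same segment. That is why events can outlive retention.ms.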
Key-Based Retention
- Compaction is a key-based retention mechanism. To set a topic to use compaction, set its cleanup.policy to compact. The goal of compaction is to keep the most recent value for a given key.
- This might work well for maintaining the current location of a vehicle in your fleet, or the current balance of an account. However, historical data will be lost, so it may not always be the best choice.
- Compaction also provides a way to completely remove a key: append an event with that key and a null value. If this null value, also known as a tombstone, is the most recent value for that key, then all older occurrences of that key are removed during compaction. The tombstone itself is kept for a while (controlled by delete.retention.ms) so that consumers can see the delete, and is then removed as well. This could be necessary for things like GDPR compliance.
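To make the key-based behavior concrete, here is a minimal Python sketch of what compaction does to a single partition. It is a simplified model under stated assumptions, not how the broker is implemented: the `compact` function, the `drop_tombstones` flag, and the vehicle data are hypothetical, and the real log cleaner works on closed segments and only drops tombstones after a delay.

```python
from typing import List, Optional, Tuple

# (offset, key, value); a value of None stands for a tombstone
Record = Tuple[int, str, Optional[str]]


def compact(log: List[Record], drop_tombstones: bool = False) -> List[Record]:
    """Keep only the most recent record for each key, in offset order."""
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)  # a later record replaces an earlier one
    survivors = sorted(latest.values())     # offsets are unique, so this restores log order
    if drop_tombstones:
        # Models the later cleanup in which tombstones themselves disappear.
        survivors = [r for r in survivors if r[2] is not None]
    return survivors


log = [
    (0, "truck-1", "Berlin"),
    (1, "truck-2", "Paris"),
    (2, "truck-1", "Hamburg"),
    (3, "truck-2", None),      # tombstone: delete truck-2
    (4, "truck-3", "Rome"),
]

first_pass = compact(log)
final = compact(log, drop_tombstones=True)
print(first_pass)
print(final)
```

After the first pass, truck-1 keeps only its latest location and truck-2 is represented only by its tombstone. Once tombstones are dropped, truck-2 is gone from the log entirely. Surviving records keep their original offsets; compaction never renumbers them.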
Usage and Benefits of Topic Compaction
- Because compaction always keeps the most recent value for each key, it is a natural fit for backing ksqlDB tables or Kafka Streams KTables.
- For updatable datasets, for example, database table data coming into Kafka via CDC, where the current value is all that matters, compacted topics allow you to continue receiving updates without topic storage growing out of control.
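The reason a compacted topic can back a table is that replaying it yields the same current state as replaying the full history. The following Python sketch illustrates the idea with plain lists; the `materialize` function and the account data are hypothetical, and the compacted list is written by hand to show what might remain after compaction.

```python
def materialize(changelog):
    """Replay (key, value) records into a table of current values."""
    table = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)  # tombstone: remove the key if present
        else:
            table[key] = value    # upsert: the newest value wins
    return table


# Full history of account balances, e.g. as captured via CDC.
full = [
    ("acct-1", 100),
    ("acct-2", 50),
    ("acct-1", 80),
    ("acct-2", None),   # account closed
    ("acct-1", 120),
]

# The same topic after compaction: one record per key remains.
compacted = [
    ("acct-2", None),
    ("acct-1", 120),
]

print(materialize(full))
print(materialize(compacted))
```

Both replays end with the same table, but the compacted log holds two records instead of five. For a dataset with a bounded number of keys, this is what keeps storage from growing with every update.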