Understanding Kafka Tiered Storage

Dec 28, 2023

Apache Kafka has become a cornerstone in the world of distributed data streaming, providing a robust platform for real-time event processing. As Kafka usage continues to grow, so do the challenges associated with managing and storing the vast amounts of data generated by these streaming applications. To address these challenges, the concept of Kafka Tiered Storage has emerged, offering a solution that enhances scalability and cost efficiency.

The Challenge of Scale

Kafka’s ability to handle high-throughput, fault-tolerant, and scalable event streaming has made it a popular choice for organizations across various industries. However, as data volumes increase, managing the storage requirements becomes a critical consideration. In a traditional Kafka deployment, each broker keeps every log segment of the partitions it hosts on its own disks, whether direct-attached or network-attached. This works well for many use cases, but it couples storage to compute: retaining more data means adding disks or brokers, which is rarely the most cost-effective or scalable way to hold massive data sets.

Kafka Tiered Storage

Kafka Tiered Storage addresses the challenges associated with scaling Kafka clusters by introducing a tiered approach to data storage. In Apache Kafka, the feature was proposed in KIP-405 and first shipped as an early-access feature in version 3.6.0. The idea is to keep recent data on the brokers’ local disks and move older data to cheaper remote storage, allowing organizations to optimize costs while maintaining performance.

Hot and Cold Data

In a Kafka cluster, not all data is created equal. Some data, referred to as “hot” data, is frequently accessed and needs to be readily available for quick retrieval. On the other hand, “cold” data, which is less frequently accessed, can be moved to a more cost-effective and scalable storage solution.

Tiered Storage Architecture

Kafka Tiered Storage involves two storage tiers:

Hot Tier (local broker disks, typically SSD/NVMe): This tier is optimized for high-performance and low-latency access. Kafka’s own documentation calls it the local tier. It holds the active segment of every partition plus recently written segments, ensuring that the data most critical for real-time processing is readily available.
Cold Tier (remote storage such as S3): The cold tier is designed for cost efficiency and scalability. Kafka’s documentation calls it the remote tier, and brokers reach it through a pluggable RemoteStorageManager. It stores historical and less frequently accessed data. By moving data to this tier, organizations can reduce storage costs while maintaining the ability to access historical information.
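To make the split between the two tiers concrete, here is a rough sizing sketch in Python. The function name `tier_footprint` and its parameters are made up for illustration. It assumes a steady ingest rate, that every replica keeps the local retention window on disk, and that the remote tier holds a single copy of the full retention window (the object store provides its own durability). Treat the result as a back-of-the-envelope estimate, not a capacity plan.

```python
def tier_footprint(ingest_bytes_per_sec, total_retention_days,
                   local_retention_days, replication_factor):
    """Estimate steady-state bytes held per tier (illustrative model only)."""
    if local_retention_days > total_retention_days:
        raise ValueError("local retention cannot exceed total retention")

    seconds_per_day = 86_400
    daily_bytes = ingest_bytes_per_sec * seconds_per_day

    # Hot tier: each replica keeps the local retention window on broker disks.
    local_bytes = daily_bytes * local_retention_days * replication_factor

    # Cold tier (assumption): one copy of the full retention window.
    remote_bytes = daily_bytes * total_retention_days

    # Baseline without tiering: each replica keeps the full window locally.
    untiered_local_bytes = daily_bytes * total_retention_days * replication_factor

    return {
        "local": local_bytes,
        "remote": remote_bytes,
        "untiered_local": untiered_local_bytes,
    }


if __name__ == "__main__":
    # Hypothetical workload: 10 MB/s, 30 days total, 7 days local, RF 3.
    for tier, size in tier_footprint(10_000_000, 30, 7, 3).items():
        print(f"{tier}: {size / 1e12:.2f} TB")
```

Under these assumptions, the hypothetical workload needs roughly 18 TB of broker disk instead of roughly 78 TB, with about 26 TB held in the remote tier.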

Implementation

Let’s explore a simplified example of how Kafka Tiered Storage might be implemented:

Configuring Hot Tier

The Hot Tier is designed to store recent and frequently accessed data, ensuring that this critical information is readily available for quick retrieval. This tier typically utilizes high-performance storage solutions like SSDs (Solid State Drives) or NVMe (Non-Volatile Memory Express) devices to provide low-latency access to the data.

1. Specify Hot Tier Directory

In the Kafka server properties (server.properties), define the directory where the Hot Tier data will be stored using the log.dirs property. This is a broker-wide setting that accepts a comma-separated list of local directories, so each path should point to a location on a storage device that offers high-speed access, such as SSDs or NVMe devices.

# Server Properties for Hot Tier
log.dirs=/path/to/hot/tier
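Because log.dirs only accepts local file system paths, a quick sanity check of the properties file can catch mistakes early. The sketch below is a minimal example: `parse_properties` and `invalid_log_dirs` are hypothetical helper names, the parser handles only simple key=value lines, and the path check assumes POSIX-style absolute paths.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def invalid_log_dirs(props):
    """Return log.dirs entries that are URIs or not absolute local paths."""
    raw = props.get("log.dirs", "")
    entries = [entry.strip() for entry in raw.split(",") if entry.strip()]
    return [e for e in entries if "://" in e or not e.startswith("/")]


if __name__ == "__main__":
    sample = """
    # Server Properties for Hot Tier
    log.dirs=/path/to/hot/tier
    """
    print(invalid_log_dirs(parse_properties(sample)))
```

An entry such as s3://bucket/path would be reported by this check, since remote locations are not configured through log.dirs.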

2. Topic Configuration

Every topic hosted on a broker stores its local data under that broker’s log.dirs, so there is no per-topic directory setting for the Hot Tier. What you can control per topic is how long its data stays on local disk. By default, Kafka automatically creates topics if they do not exist. However, in a tiered storage scenario, you might want to disable automatic topic creation to have more control over how topics are configured.

# Disable Automatic Topic Creation
auto.create.topics.enable=false

# log.dirs is a broker-wide setting and cannot be overridden per topic.
# All topics on this broker keep their Hot Tier segments under log.dirs.
# Topic-specific settings that affect the Hot Tier (for example local.retention.ms)
# are supplied when the topic is created or altered, not in server.properties.

3. Additional Tuning

Depending on your specific use case and requirements, you may need to tune other Kafka configurations for the Hot Tier. For example, you might adjust parameters related to retention policies, replication factor, and log segment sizes to optimize performance. The values below are broker-level defaults that apply to any topic without its own override.

# Adjusting Retention Policy (example: retain data for 7 days)
log.retention.hours=168

# Adjusting Replication Factor (example: set replication factor to 3 for fault tolerance)
default.replication.factor=3

# Log Segment Size (example: set log segment size to 1 GiB)
log.segment.bytes=1073741824
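Segment size matters for tiering because only closed (rolled) segments can leave the Hot Tier. The small sketch below shows the arithmetic behind the values above. The helper names are illustrative, and the estimate ignores time-based rolling (log.roll.hours), which can close a segment before it fills up.

```python
def hours_to_ms(hours):
    """Convert a retention period in hours to milliseconds."""
    return hours * 60 * 60 * 1000


def seconds_to_fill_segment(segment_bytes, partition_bytes_per_sec):
    """Estimate how long one partition takes to fill a segment by size alone."""
    if partition_bytes_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return segment_bytes / partition_bytes_per_sec


if __name__ == "__main__":
    # 168 hours expressed in the unit used by the *.ms settings.
    print(hours_to_ms(168))
    # A partition receiving 1 MiB/s fills a 1 GiB segment in about 17 minutes.
    print(seconds_to_fill_segment(1_073_741_824, 1_048_576) / 60)
```

A low-throughput partition can take a long time to fill a 1 GiB segment, so its data may stay in the Hot Tier longer than expected unless the segment size or roll interval is reduced.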

4. Monitoring and Maintenance

Implement monitoring mechanisms to keep track of the Hot Tier’s performance, disk usage, and other relevant metrics. Regularly review these metrics to ensure that the Hot Tier is effectively handling the high-frequency data, and make adjustments as needed.
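As one small piece of such monitoring, the following sketch checks how full the volume behind a Hot Tier directory is, using only the Python standard library. The function names and the 80% threshold are arbitrary choices for illustration. A real deployment would more likely rely on broker metrics and an existing monitoring stack.

```python
import shutil


def disk_used_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the volume containing path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def over_threshold(path, threshold=0.8):
    """Return True when the volume containing path is at or above threshold."""
    return disk_used_fraction(path) >= threshold


if __name__ == "__main__":
    # Replace "." with the directory configured in log.dirs.
    hot_tier_dir = "."
    print(f"used: {disk_used_fraction(hot_tier_dir):.1%}")
    if over_threshold(hot_tier_dir):
        print("Hot Tier volume is above 80% full; review local retention settings")
```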

5. Data Movement Policies

Define when data leaves the Hot Tier. Kafka Tiered Storage does not use a separate policy object for this, and it does not track access frequency. Instead, once a log segment is rolled (closed), the broker copies it to the Cold Tier, and the local copy is deleted when it exceeds the local retention time or size. This ensures that only recent data remains in the Hot Tier.

# Keep segments on local disk for 7 days (broker-wide default)
log.local.retention.ms=604800000

In this example, the local copy of data older than 7 days is removed from the Hot Tier, while the copy in the Cold Tier stays available until the overall retention period (log.retention.hours) expires. The overall retention must therefore be longer than the local retention for the Cold Tier to hold any additional history.
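As a rough mental model of age-based movement, the sketch below classifies a log segment by where its data would live. It is a simplification and not Kafka's actual implementation: the `Segment` class and `placement` function are hypothetical, and the model assumes that rolled segments are uploaded promptly and successfully.

```python
from dataclasses import dataclass

DAY_MS = 86_400_000


@dataclass
class Segment:
    base_offset: int
    age_ms: int    # time since the newest record in the segment was written
    active: bool   # True for the segment that is still being appended to


def placement(segment, local_retention_ms, total_retention_ms):
    """Return where a segment's data lives in this simplified model."""
    if segment.active:
        return "local"          # the active segment is never offloaded
    if segment.age_ms >= total_retention_ms:
        return "expired"        # past overall retention, removed everywhere
    if segment.age_ms >= local_retention_ms:
        return "remote"         # local copy deleted, remote copy remains
    return "local+remote"       # rolled and copied, still within local retention


if __name__ == "__main__":
    segments = [
        Segment(0, 40 * DAY_MS, False),
        Segment(1000, 10 * DAY_MS, False),
        Segment(2000, 1 * DAY_MS, False),
        Segment(3000, 0, True),
    ]
    for seg in segments:
        print(seg.base_offset, placement(seg, 7 * DAY_MS, 30 * DAY_MS))
```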

Configuring Cold Tier

Configuring the Cold Tier in Kafka Tiered Storage involves setting up Kafka to store historical and less frequently accessed data in a cost-effective and scalable cloud storage service such as Amazon S3. Here’s a step-by-step guide:

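1. Enable the Cold Tier (S3)

The Cold Tier is not a log.dirs directory: log.dirs accepts only local file system paths, so an s3:// URI does not work there. Instead, enable remote log storage in the Kafka server properties and point the broker at a RemoteStorageManager plugin that can write to S3. Apache Kafka itself does not bundle an S3 implementation, so you need to install one separately. The bucket name, path prefix, and region are set through that plugin’s own properties.

# Server Properties for Cold Tier (S3)
remote.log.storage.system.enable=true
# Placeholder: replace with the class name of the S3 RemoteStorageManager plugin you installed
remote.log.storage.manager.class.name=your.plugin.S3RemoteStorageManager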

Make sure you have the necessary AWS credentials and permissions set up for Kafka to write to the specified S3 bucket.

2. Topic Configuration

When creating a topic that will use the Cold Tier (S3), enable remote storage for that specific topic with the remote.storage.enable topic configuration. The S3 location itself comes from the broker's plugin settings, not from the topic. In this example, we'll create a topic named "my_s3_cold_topic".

# Create a topic named "my_s3_cold_topic" with remote (Cold Tier) storage enabled.
bin/kafka-topics.sh --create --topic my_s3_cold_topic --partitions 3 --replication-factor 3 --config remote.storage.enable=true --bootstrap-server localhost:9092
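When topics are created from scripts, it can help to assemble the command from validated settings. The sketch below builds a kafka-topics.sh argument list using the topic-level tiered storage settings remote.storage.enable, retention.ms, and local.retention.ms. The function name `topic_create_args` is hypothetical, and the retention values are examples. The function only builds the arguments and does not run anything.

```python
def topic_create_args(topic, partitions, replication_factor, bootstrap_server,
                      retention_ms, local_retention_ms):
    """Build a kafka-topics.sh --create argument list for a tiered topic."""
    if local_retention_ms > retention_ms:
        raise ValueError("local.retention.ms must not exceed retention.ms")

    configs = {
        "remote.storage.enable": "true",
        "retention.ms": str(retention_ms),              # total, both tiers
        "local.retention.ms": str(local_retention_ms),  # Hot Tier only
    }

    args = [
        "bin/kafka-topics.sh", "--create",
        "--topic", topic,
        "--partitions", str(partitions),
        "--replication-factor", str(replication_factor),
        "--bootstrap-server", bootstrap_server,
    ]
    for key, value in configs.items():
        args += ["--config", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Example: 30 days total retention, 7 days on local disk.
    print(" ".join(topic_create_args(
        "my_s3_cold_topic", 3, 3, "localhost:9092",
        retention_ms=2_592_000_000, local_retention_ms=604_800_000,
    )))
```

Checking the two retention values before invoking the tool avoids submitting a combination in which the Hot Tier would be asked to keep data longer than the topic as a whole.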

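Adjust Kafka configurations specific to the Cold Tier based on your use case and requirements, considering the characteristics of S3 storage. With tiered storage enabled, log.retention.hours sets the total retention across both tiers, so the Cold Tier keeps data until this limit is reached.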

# Adjusting Retention Policy for Cold Tier (example: retain data for 30 days)
log.retention.hours=720

Benefits

Cost Efficiency: By leveraging lower-cost storage options for historical data, organizations can significantly reduce storage costs without compromising on data availability.
Scalability: Kafka Tiered Storage enables organizations to scale their Kafka clusters more effectively by optimizing storage resources based on data access patterns.
Performance: The hot tier ensures that frequently accessed data remains on high-performance storage, maintaining low-latency access for critical applications.
Flexible Configuration: Organizations have the flexibility to configure and fine-tune the tiered storage strategy based on their specific needs and use cases.

Conclusion

As organizations continue to embrace real-time data streaming, the scalability and cost efficiency of storage become paramount. Kafka Tiered Storage provides a solution to these challenges by introducing a tiered approach that optimizes storage resources based on data characteristics. By strategically utilizing high-performance and cost-effective storage tiers, organizations can ensure optimal performance while managing storage costs effectively.