Apache Kafka as a centralized state management platform — Part 1

Nicola Atorino
DraftKings Engineering
8 min read · Jul 7, 2023

Introduction

Overview & Capabilities of Apache Kafka

The main mission of the Sportsbook domain at DraftKings (DK) is to provide a comprehensive, near real-time sports offering that lets customers enjoy their experience with a constantly growing set of products, events, and capabilities. To handle this, the underlying infrastructure that powers the entire sportsbook system must adhere to strict criteria in terms of performance, stability, and fault tolerance. Most of DK's infrastructure is designed to be horizontally scalable, in order to accommodate high volumes and to leave room for further growth when necessary.

To reach this desired state, the DK architecture team needed a way to manage the data flow that would allow for high flexibility, low latency, and high stability. While the infrastructure uses several different technologies in different contexts, the focus of this article is one of the most heavily used tools: Kafka.

Apache Kafka is an open source, distributed streaming platform that is used for building real-time data pipelines and streaming applications. It is designed for high scalability and high throughput, and can handle millions of events per second. Kafka is used for various use cases such as logging, data streaming, event sourcing, and more. It also provides durable, fault-tolerant storage and can be integrated with other systems and tools.

Kafka supports a wide variety of patterns. The first one that usually comes to mind is data streaming for communication between two different services, and that is indeed one of the patterns widely used at DK. However, Kafka also has high potential as a platform for caching and state management. We're going to explore three different patterns:

  • distributor of data stored in a SQL database for read-only purposes
  • centralized persistent layer for distributed applications
  • data sharding in order to allow for distributed processing

Before diving into these use cases, let’s have a small recap of some of the concepts leveraged by the architecture team when designing systems around Kafka.

Partitioning and Consumer Groups

Kafka works by allowing producers to publish a stream of messages to a topic, which is a logical grouping of different files (called 'partitions') where the actual messages are stored. A partition is essentially an append-only log where each message contains a key, a value, and optionally some headers. When a producer pushes a message to a topic, the partition to which the message is written is defined by a partitioning mechanism, usually based on the message key, although it is possible to implement different partitioning logic. Typically, in our domain, partitioning is done based on the key, so messages with the same key will be stored in the same partition.
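To make this concrete, here is a minimal Java producer sketch publishing keyed messages; the broker address and the 'sport-events' topic name are placeholders, not actual DK names. Because the default partitioner hashes the key, both records below land in the same partition and keep their relative order:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The default partitioner hashes the key, so both records for
            // "event-42" are written to the same partition, in order.
            producer.send(new ProducerRecord<>("sport-events", "event-42", "{\"status\":\"open\"}"));
            producer.send(new ProducerRecord<>("sport-events", "event-42", "{\"status\":\"suspended\"}"));
        } // close() flushes any buffered records before returning
    }
}
```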

The concept of partitioning is important when it comes to distributing consumption: a set of consumers can leverage partitioning to evenly distribute the load. When a consumer subscribes to a topic, it provides the Kafka broker with a key called a 'consumer group' that identifies the group this specific consumer is part of. Multiple consumers sharing the same group distribute the consumption of the topic, sharing its partitions between them. This way, a topic that contains multiple partitions can be split between multiple consumers, with the upper limit being the total number of partitions the topic is composed of. Since a partition can only be consumed by a single consumer at a time within each consumer group, this guarantees ordered consumption of the messages while allowing horizontal scalability.

Each consumer group can optionally register an offset, or "pointer," in each partition, indicating the last message that was correctly consumed by any consumer of the group. When a partition switches consumers within the group (as a result of changes in the topology of the system, a redeployment, or scaling), the new instance resumes consumption from the last offset committed by the previous consumer in the group. This is made possible by a special node in the cluster that takes care of coordinating the consumer groups. If no committed offset exists, the consumer can be configured to read the topic from the beginning or from the end, based on whichever makes sense for that application's use case.
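As an illustration, the following sketch shows a consumer joining a group and committing its offsets explicitly; the broker address, group name, and topic are again placeholders. The auto.offset.reset setting decides where to start when the group has no committed offset yet:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
        props.put("group.id", "event-processors");         // the consumer group key
        props.put("enable.auto.commit", "false");          // commit offsets explicitly below
        props.put("auto.offset.reset", "earliest");        // no stored offset -> start from the beginning
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("sport-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s%n",
                            record.partition(), record.offset(), record.key());
                }
                // Register the group's offset so a replacement consumer
                // resumes from here after a rebalance.
                consumer.commitSync();
            }
        }
    }
}
```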

More information can be found in the official Kafka documentation: https://kafka.apache.org/documentation/#impl_offsettracking

Partition Balancing in Kafka

Retention Policies

Storage space is not infinite. One of the key features of Kafka is its ability to retain data only for a specified period of time, controlled by the topic's retention policy. Retention policies determine how long data is kept in a topic before it is deleted; they can be set on a per-topic basis and defined using either time- or size-based criteria. This way, our topics used for streaming data keep messages only for a certain period of time. This is known as the 'delete' retention policy.

The most interesting policy for our use cases, though, is the 'compact' policy. When this policy is set on a topic, the cluster periodically removes older messages that share a key with a newer one, keeping only the most recent version for each key. It is incredibly useful for handling the state management of an application, because in several cases we are interested only in the most recent version of the data, such as when implementing a caching mechanism.

The two policies can be combined: a key can have all its older duplicates compacted away, and the remaining, most recent version can then be deleted from the topic once the delete retention criteria are met.
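As a rough sketch of how these policies are applied, the following creates one topic per approach using the Kafka Java AdminClient. The topic names, partition counts, and the single-broker setup (replication factor 1) are placeholder choices:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class RetentionPolicyExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 'delete' policy: messages older than seven days are removed.
            NewTopic streaming = new NewTopic("sport-events", 6, (short) 1)
                    .configs(Map.of(
                            "cleanup.policy", "delete",
                            "retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));

            // 'compact' policy: only the latest value per key is kept, indefinitely.
            NewTopic reference = new NewTopic("reference-data", 1, (short) 1)
                    .configs(Map.of("cleanup.policy", "compact"));

            // Combined: duplicates are compacted away, and even the most recent
            // version of a key is deleted once retention.ms elapses.
            NewTopic cache = new NewTopic("market-cache", 1, (short) 1)
                    .configs(Map.of(
                            "cleanup.policy", "compact,delete",
                            "retention.ms", String.valueOf(30L * 24 * 60 * 60 * 1000)));

            admin.createTopics(List.of(streaming, reference, cache)).all().get();
        }
    }
}
```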

Compact retention policy in Kafka

There are many other configurations and concepts that would take too long to cover properly, and several considerations to weigh when choosing the proper partitioning and data-retention strategy. These are some of the main concerns when planning an architecture around this technology:

  • cluster performance
  • storage
  • producers’ speed
  • serialization
  • consumption strategies

More information about log compaction can be found in the official Kafka documentation: https://kafka.apache.org/documentation/#design_compactionbasics

Use of Kafka as a DB Cache Distributor

Several services at DK need to constantly evaluate incoming messages against the most recent version of data stored in the database. Given the number of services that require this information, a direct connection to the database from each of these components would put too much strain on the database instance; moreover, the amount of data that needs to be retrieved by each application would generate too much internal traffic. To solve this, Kafka is used as a push-based mechanism that allows applications to store the data they need in memory, be notified when data changes, and handle all their processing internally.

The distributor

A Cache Distributor is a microservice whose only responsibility is to query the database at regular intervals using predefined queries. When the application starts, the entire dataset is retrieved and pushed into a dedicated Kafka topic. After the initial load, only updates are retrieved and published, using the datetime of the last request that was successfully processed. The queries can involve complex joins of different tables, effectively generating a view that is then exposed to every consumer, so that consumers don't need to connect directly to the database (or maintain their own caching systems) as long as the staleness of the data is acceptable for their purpose. At less frequent intervals, the entire dataset is fetched and republished so that older data is not removed from the topic due to retention. The interval for re-fetching the entire dataset must be shorter than the topic's retention period to ensure that consumers do not miss infrequently updated data.
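The sketch below shows the shape of such a loop, assuming a hypothetical 'markets' table with 'id', 'payload', and 'updated_at' columns, a JDBC connection string, and the 'market-cache' topic from the earlier example. It is an illustration of the pattern, not DK's actual implementation:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CacheDistributor {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             Connection db = DriverManager.getConnection("jdbc:postgresql://db-host/sportsbook")) {

            // Starting from the epoch means the first cycle loads the entire dataset.
            Timestamp lastSuccessfulRun = new Timestamp(0);
            while (true) {
                // Record the request time before querying, so rows updated while
                // the query runs are picked up again on the next cycle.
                Timestamp cycleStart = new Timestamp(System.currentTimeMillis());
                try (PreparedStatement query = db.prepareStatement(
                        "SELECT id, payload FROM markets WHERE updated_at > ?")) {
                    query.setTimestamp(1, lastSuccessfulRun);
                    try (ResultSet rows = query.executeQuery()) {
                        while (rows.next()) {
                            // Key by the entity's primary key so compaction keeps
                            // only the most recent version of each row.
                            producer.send(new ProducerRecord<>("market-cache",
                                    rows.getString("id"), rows.getString("payload")));
                        }
                    }
                }
                producer.flush();
                lastSuccessfulRun = cycleStart;
                Thread.sleep(5_000); // update interval; a slower schedule re-publishes the full dataset
            }
        }
    }
}
```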

Topic configuration

The data is pushed to a compacted topic, with the key being the primary key of the main entity that was retrieved. This way, when compaction runs, only the most recent version of each item is kept in the topic. When a consumer needs to read the entire topic to populate its own internal state on startup, compaction spares it from consuming a huge number of stale messages only to ignore everything but the last version of each key, speeding up startup time.

The topic also has a delete retention policy, which removes data after a certain amount of time. This is necessary to purge the topic of entities that have been deleted from the database.

The cache topic has a single partition. This is because, in this particular use case, every consumer must consume the topic in full to maintain the entire cache in memory, so there is nothing to gain from distributing consumption. A single partition also improves broker performance, since there is less communication overhead to deal with.

The consumer

On the consumer side, the data is stored in memory. At startup, the topic is read from the beginning, and the messages are used to build an internal in-memory dictionary. If necessary, each application can create additional custom indexes to query the data based on its own needs. After the initial startup, the consumer keeps listening to the topic to receive the regular updates published by the distributor.
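A minimal version of this startup sequence might look like the following, reusing the hypothetical 'market-cache' topic. Because the topic has a single partition and every instance reads all of it, the consumer can assign the partition directly instead of joining a consumer group, replay the log from the beginning into a dictionary, and then keep polling for updates:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CacheConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        // No group.id: the single partition is assigned manually and no
        // offsets are committed, so group coordination is not needed here.

        Map<String, String> cache = new HashMap<>();
        TopicPartition partition = new TopicPartition("market-cache", 0);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(List.of(partition));
            consumer.seekToBeginning(List.of(partition));
            long end = consumer.endOffsets(List.of(partition)).get(partition);

            boolean warmedUp = false;
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    if (record.value() == null) {
                        cache.remove(record.key()); // treat a null value as a deletion, if published
                    } else {
                        cache.put(record.key(), record.value());
                    }
                }
                if (!warmedUp && consumer.position(partition) >= end) {
                    warmedUp = true; // cache is complete; serve reads while still listening
                }
            }
        }
    }
}
```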

Design Pattern of the DB Cache Distributors

Considerations

This system has several advantages, but also some drawbacks that need to be pointed out. First of all, it is memory-intensive for the consumers, but this is true of every state management approach built on Kafka, including the ones discussed next. The full reload that happens at regular intervals can cause memory spikes while the application processes the new dataset and discards the old values in favor of the updated ones. This can be mitigated by reducing the data the service stores from the topic to the minimum: if the cached object has many properties but not all of them are needed, it is better to define a lightweight object on the consumer side and store only what is needed.
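As a small illustration of that last point, here is a sketch assuming the cached values are JSON payloads and the consumer uses the Jackson library; the class and field names are hypothetical. Deserializing into a trimmed-down view means unneeded properties never occupy cache memory:

```java
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.databind.ObjectMapper;

public class LightweightProjection {
    // Hypothetical payload: the full market row may carry dozens of fields,
    // but this consumer only ever needs two of them.
    @JsonIgnoreProperties(ignoreUnknown = true)
    public static class MarketView {
        public String id;
        public String status;
    }

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static MarketView parse(String json) throws Exception {
        // Properties not declared on MarketView are dropped at
        // deserialization time instead of being held in the cache.
        return MAPPER.readValue(json, MarketView.class);
    }
}
```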

Secondly, the staleness of the data cannot be configured on the consumer side, only on the distributor side. If a consumer requires data that is more up to date than what the cache provides, it has to obtain it in other ways. One of the classic problems with this approach is retrieving information from the database that was changed as a result of the processing of the microservice itself. A possible workaround is to update the internal dictionary (if all the data is known) before submitting the database change, but this can lead to data inconsistencies (especially if the update needs to be validated downstream) and is generally regarded as a sub-optimal solution.

In short, this system is not suitable for components that are expected to update (directly or indirectly) the same set of data in the database that they retrieve from the cache.

Next

In the next article, we're going to explore two other patterns for state management that overcome some of the downsides encountered so far:

  • Kafka as a centralized store for distributed applications
  • Kafka used to handle state sharding

The second part can be found here: https://medium.com/draftkings-engineering/apache-kafka-as-a-centralized-state-management-platform-part-2-980631412a5a

Want to learn more about DraftKings’ global Engineering team and culture? Check out our Engineer Spotlights and current openings!
