Four Crucial Steps to Take Before Changing Kafka Partition Key at Scale

Harel Opler
AppsFlyer Engineering

--

You’re a software developer using Apache Kafka to process massive amounts of data in real time. Your application is working well, but you’re starting to notice that your current partition key is not quite optimal for your use case. Knowing that changing the partition key could potentially help with performance and scalability, you may be tempted to simply change it. However, changing a Kafka partition key at a high scale carries many risks and requires careful consideration to ensure that it is done correctly.

At AppsFlyer we use Kafka extensively — 100 billion (!) daily events, so you can imagine that changing a partition key was not trivial at all for us. But we did it and I’m here to share how — we’ll go through four crucial steps to take before changing your Kafka partition key at a high scale.

From validating the distribution of the new partition key to communicating the change across your organization, we’ll cover everything you need to know in order to make the transition as smooth as possible. By the end of this article, you’ll have a clear roadmap for how to successfully change your partition key and improve the performance of your application.

Not sure how to choose the right partition key for your application? Wait for my next blog post :)

Step 1: Validate the New Partition Key for Proper Data Distribution

As you probably know, the partition key is used by the producer to determine which partition a message will be written to, so it’s important to choose a key that evenly distributes messages across all partitions of the topic. Choosing a key that does not distribute well can result in skewness — uneven partition growth, where some partitions become overloaded with data while others remain underutilized. Skewness in data distribution causes some consumers to work harder than others, and this can lead to performance issues — including increased latency, slower processing times, and potential data loss.

To validate the new partition key, you can use a side consumer to read production data and simulate the target partition you would have had, had you used the new partition key. This can be achieved by subscribing to the original topic (which is partitioned by the old partition key), reading all messages produced to the topic, and simulating what would happen if the messages were produced with the new partition key instead.

As the side consumer reads messages and calculates target partitions, you can use metrics to track the number of messages that would have been written to each partition had the new partition key been used. You can then compare these metrics to the actual partition distribution of the topic, to ensure that the new partition key would result in a balanced distribution of messages.
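
To make this concrete, here is a minimal sketch of such a side consumer in Java. The topic name, partition count, and the extractNewKey() helper are placeholders for your own setup, and the partition calculation mirrors the murmur2-based hashing the Java client’s default partitioner applies to keyed records:

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.utils.Utils;

    import java.nio.charset.StandardCharsets;
    import java.time.Duration;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.Properties;

    public class PartitionKeySimulator {

        // Same hashing scheme the default partitioner uses for keyed records:
        // murmur2 over the serialized key, modulo the partition count
        static int simulatePartition(String newKey, int numPartitions) {
            byte[] keyBytes = newKey.getBytes(StandardCharsets.UTF_8);
            return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "partition-key-simulation");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            int numPartitions = 64;                  // partition count of the topic
            long[] counts = new long[numPartitions]; // messages per simulated partition

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("events"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        String newKey = extractNewKey(record.value()); // hypothetical helper
                        counts[simulatePartition(newKey, numPartitions)]++;
                    }
                    // In practice, report these counts as metrics rather than printing them
                    System.out.println(Arrays.toString(counts));
                }
            }
        }

        // Placeholder: extract the candidate partition key from the message payload
        static String extractNewKey(String payload) {
            return payload;
        }
    }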

Step 2: Check the Impact on Producer Optimizations

When changing the partition key of a Kafka topic, it’s important to consider the impact on producer optimizations — particularly batching. Different partition keys can result in different batching patterns, as the distribution patterns of the keys across the data may differ.

To better understand this, let’s imagine the following stream of events:

A stream of events

Every event in the stream has two properties — shape and color. Looking at the stream, it is clear that partitioning by color will not behave the same as partitioning by shape: the color property is more ‘clustered’, meaning that two consecutive events have a high probability of having the same color, while the shape property seems to be randomly distributed across the data.

A realistic example of this can be seen in an e-commerce application, where each purchase contains the user id and the time of purchase. The user id is expected to be randomly distributed in the data (similar to the shape property in our example), while the time of purchase is not expected to act the same — two purchases made at the same time will have the same value for that field, so we expect to see a distribution more like the color property in the above stream.

These different patterns can significantly impact the efficiency of producer batching, which is a risk — you’d be surprised at how much inefficient batching can increase the load on your Kafka cluster, and you definitely don’t want it to fail in production.

That’s why it’s so important to ensure that batching efficiency has not been negatively impacted under the new partition key. To do so, you may need to adjust three key producer configurations — linger.ms, batch.size, and buffer.memory.

linger.ms
The linger.ms property controls how long a producer waits before sending a batch of messages. If the new partition key results in quicker or slower filling of batches, you may need to adjust the linger.ms value to ensure that batches are still being sent at the optimal frequency.

batch.size
Similarly, the batch.size property controls the maximum size of a batch in bytes, so you may need to adjust this value to ensure that batches are not being sent too frequently or too infrequently.

buffer.memory
Finally, the buffer.memory property controls the amount of memory available to the producer for buffering unsent messages. If you changed the previous properties to increase the batching time or size of batches, you may need to increase the buffer.memory property to ensure that the producer has enough memory to buffer messages and avoid potential performance issues.
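
For reference, all three are plain producer configurations set when the producer is built. The values below are purely illustrative starting points rather than recommendations; the right numbers depend entirely on your traffic and message sizes:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.util.Properties;

    public class TunedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            // Illustrative values to tune on a canary, not recommendations
            props.put(ProducerConfig.LINGER_MS_CONFIG, 20);                    // wait up to 20 ms for a batch to fill
            props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);            // up to 64 KB per partition batch
            props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 64L * 1024 * 1024); // 64 MB for buffering unsent records

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            // ... produce as usual, then close the producer on shutdown
            producer.close();
        }
    }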

In order to tune these configurations, it is recommended that you spin up a canary instance of your service with the new partition key and observe how different values impact batching.
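
One way to judge the canary, assuming you are on the Java client, is to watch the producer’s built-in batching metrics (the same values are also exposed over JMX). A minimal sketch; the metric names below come from the client’s producer-metrics group:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.common.Metric;
    import org.apache.kafka.common.MetricName;

    import java.util.Map;

    public class BatchingReport {
        // Print the average batch size and records per request of a running producer,
        // so different linger.ms / batch.size / buffer.memory values can be compared
        static void reportBatching(KafkaProducer<String, String> producer) {
            for (Map.Entry<MetricName, ? extends Metric> entry : producer.metrics().entrySet()) {
                MetricName name = entry.getKey();
                if (name.group().equals("producer-metrics")
                        && (name.name().equals("batch-size-avg")
                            || name.name().equals("records-per-request-avg"))) {
                    System.out.println(name.name() + " = " + entry.getValue().metricValue());
                }
            }
        }
    }

Roughly speaking, if either value drops sharply compared to the old key, that is a sign batching has degraded and the configurations need further tuning.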

By checking the impact of the new partition key on producer optimizations and adjusting the relevant properties, you can ensure that message production remains efficient and that the utilization of the Kafka cluster and downstream consumers will be under control once that key is changed.

Step 3: Check the Impact on Downstream Consumers

If you’re changing the partition key of a Kafka topic, it’s crucial to take into account the potential impact on downstream consumers, especially if these consumers use the old partition key with their producers.

Changing the partition key may pose challenges for downstream consumers when creating batches with their producers. Currently, the downstream producer can easily fill batches because its consumer polls batches of messages that were already batched by the same key. However, if the key is modified, the downstream producer receives data batched according to a different key, forcing it to re-group the data by the old key it is still using. This could lead to degraded batching and an unexpected increase in the load on downstream Kafka clusters.

To address this risk, the first step is to confirm whether a downstream service genuinely requires data partitioning based on the old partition key. If downstream consumers do require the old partition key, there are two potential solutions:

  1. The last producing service in your pipeline could continue to produce data according to the old partition key, preserving backward compatibility and preventing any disruptions to downstream consumers.
  2. Alternatively, downstream consumers could handle the changed partition key and make necessary adjustments to their systems. However, this approach carries risks as it assumes that all downstream consumers are known and are capable of handling the new partition key. It could also increase the load on downstream systems.

If downstream consumers do not require the old partition key, there are two possible approaches:

  1. Use the default sticky partitioner strategy, which evenly distributes messages across partitions, regardless of how they were previously batched. This partitioning strategy is highly optimized for batching (see the short sketch after this list).
  2. Consider whether the downstream service could benefit from the new partition key, and adopt it accordingly.
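
If you go with the first option, the producer-side change is simply to stop setting a key. A minimal sketch, with an illustrative topic name; with a null key the client’s default sticky partitioner picks the partition, filling one partition’s batch at a time:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class UnkeyedProducerExample {
        // No key: the default sticky partitioner chooses the partition, which keeps
        // batches full and distributes messages evenly across partitions over time
        static void produceUnkeyed(KafkaProducer<String, String> producer, String value) {
            producer.send(new ProducerRecord<>("downstream-events", null, value));
        }
    }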

Step 4: Communicate and Set Expectations

In the previous steps we’ve looked at a few technical risks when changing a partition key at scale and how to address them, and you can consider these risks as “known unknowns”.

The fourth step relates to the “unknown unknowns”.

Partitioning of in-flight data is fundamental to a high-scale system, with impact all across the board. Changing it could have a significant effect on everything along the way — including services and their applicative behaviors, Kafka clusters, and databases. That means many unexpected things can happen in the process.

That’s why it’s so important to ensure that all parties are aware of the change, understand the potential risks and benefits, and are prepared for any necessary adjustments. Making this process transparent reduces concerns in the organization and increases the chances of the process being successful.

You can achieve that by communicating the change to all stakeholders and setting expectations appropriately:

  • Reach out to key stakeholders in relevant teams, let them know about your planned changes and tell them about the potential benefits and risks. Get their feedback and address their concerns.
  • Publicly share the results of your preliminary steps. Elaborate on what looks good and what could use improvement.
  • Be proactive and demonstrate ownership when getting downstream services ready for the process (see Step 3 above).
  • Make preparations to minimize the time to revert the changes if needed. If possible, use a feature-flag to make it revertible without a deployment.
  • Publish a migration plan, outlining the actions on your side and which downstream services and infrastructure could be impacted at each stage.
  • Announce a timeline for the process, well in advance. Send a reminder before you start.
  • When making the change, monitor the system closely. Monitor both your services and those of other teams — basically anything that can be impacted by the change. Notify the relevant teams and stakeholders if you see anything unusual.
  • If something goes wrong and a service or piece of infrastructure is at risk, revert the changes. Remember, this is a complex process with many “unknown unknowns”. Things may pop up only once you make the changes, so it might require a few iterations to get everything working as it should.

Communicating the process in advance and setting realistic expectations in a transparent way ensures that everyone is aware of the changes and risks, while also increasing the confidence of your fellow workers in the change process. This is an important precondition to successfully changing the partition key.

In Conclusion

To sum it up, changing the partition key of a Kafka topic at a high scale can have a significant impact on performance and stability, so it’s crucial to have a solid and thorough plan. Validating the new partition key for proper data distribution, checking the impact on producer optimizations, considering the impact on downstream consumers, and communicating the change to all stakeholders are all essential for a smooth transition. In the end, this is what truly allowed us to move with confidence when changing the partition key for over 100 billion daily events.
