KAFKA UPGRADE — A HASSLE-FREE JOURNEY

Anirudha Karandikar
Published in T-Mobile Tech
Oct 21, 2021 · 9 min read

The goal of this article is to share our Kafka upgrade experience with others in the data ingestion world. We hope our journey from a self-managed infrastructure a couple of generations behind to a managed service that reduced the cognitive load on our operations team, while increasing security and unlocking newer features, will help you on your journey as well.

A brief introduction to D3

D3 is a data ingestion service used by multiple upstream systems for posting diagnostic, analytic, and telemetry data. The data is consumed from D3's data lake by data owners and users to materialize business use cases, helping customers and improving D3's services. D3 is a relatively high-throughput, low-latency system: we process about 3 billion events a day, roughly 500 GB of compressed data.

As with many clickstream/data ingestion services, D3 has Kafka at the core of its system.

Backstory on our Kafka upgrade

When D3 was originally built a few years ago, it was one of the forerunners in completely cloud-based infrastructure and used the latest stable Kafka at the time. As with any well-implemented Kafka-based solution, our system was still running great, but it was time to get current and solve a few accumulated tech-debt pain points. Our key drivers for this Kafka upgrade were:

· Stability — be on the latest stable version of Kafka.

· Security — use the opportunity to do a security review and implement the latest guidelines.

· Operational Ease — convert from a self-managed solution to a managed service.

· Unlocking the Latest Kafka Capabilities — support more use cases and more partners by using some of the latest features like Kafka Streams and Kafka Connect.

This is what set the course for the D3 Kafka Upgrade.

Searching for a Solution Path

We used the opportunity to learn from the experiences of other major Kafka users at T-Mobile and weighed self-managed vs. managed options, the latter being either the Amazon service or the Confluent service. We preferred a managed service due to the benefits of not having to set up and maintain Kafka infrastructure, easier future upgrades, out-of-the-box capabilities, and professional services support.

After reviewing costs and enterprise agreements, we settled on Amazon MSK as the managed service solution for our Kafka upgrade. Then the real work began.

Overall Kafka Upgrade Implementation Approach

With the goal of moving from an older version of Kafka (circa v0.10.0.1) deployed on our own cloud infrastructure to Amazon MSK v2.5.1 without impacting existing data ingestion, we adopted the approach outlined below:

· Set up a parallel Amazon MSK cluster in our AWS account, with its VPC connected to ours via ENIs.

· Upgrade the Kafka client libraries (v2.5.1) on producers and consumers.

· Implement 2-way SSL for all connections to MSK.

· Set up parallel producers and consumers for testing in the lower environments.

· Set up parallel consumers for managing the production cutover with no data loss.

Implementation Summary

Besides the obvious changes of updating the client library to be compatible with the target Kafka version, we covered the following aspects:

Security

We covered security at three different levels:

· Network: rules that allow connections from security groups corresponding to known producers and consumers.

· Encryption in transit: 2-way SSL for all connections to MSK, with the server chain in ACM-PCA (AWS Certificate Manager Private Certificate Authority) and client certs in a location accessible to our CI/CD pipeline.

· Encryption at rest: ensured data at rest is encrypted with a custom encryption key stored in KMS.

Below is a view into the adjusted CI/CD pipeline for handling client keys for 2-way SSL, followed by a sketch of the client-side SSL properties involved:

Updated pipeline — Consumers:

Updated pipeline — Producers:
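
For reference, here is a minimal, hedged sketch of the client-side properties a producer needs for the 2-way SSL connection to MSK. The endpoint, topic, store paths, and environment variable names are placeholders rather than our actual pipeline output; in our setup the pipeline injects the keystore/truststore material at deploy time.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SecureProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder TLS bootstrap broker; MSK exposes TLS listeners on port 9094.
        props.put("bootstrap.servers", "b-1.example.kafka.us-west-2.amazonaws.com:9094");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // 2-way (mutual) TLS: the truststore holds the ACM-PCA chain,
        // the keystore holds the client certificate issued to this producer.
        props.put("security.protocol", "SSL");
        props.put("ssl.truststore.location", "/etc/d3/kafka.client.truststore.jks");
        props.put("ssl.truststore.password", System.getenv("TRUSTSTORE_PASS"));
        props.put("ssl.keystore.location", "/etc/d3/kafka.client.keystore.jks");
        props.put("ssl.keystore.password", System.getenv("KEYSTORE_PASS"));
        props.put("ssl.key.password", System.getenv("KEY_PASS"));

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("diagnostic-events", "device-1", "{\"event\":\"ping\"}"));
        }
    }
}

Consumers use the same SSL properties, with the analogous deserializer settings.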

Amazon MSK provisioning with Terraform

With the goal of automating the creation and updating of the MSK cluster, we decided to use Terraform, primarily because Terraform is generic and cloud-provider agnostic.

Here is a view into our pipeline for automation:

(PS: It took about 40 minutes for our prod MSK cluster to come up with this automation).

Monitoring & System Telemetry

Although Amazon MSK does come with a basic monitoring setup, it was not detailed enough for our needs. Instead, we decided to use an in-house monitoring setup to export the metrics into our existing Prometheus instance, feeding multiple Grafana charts that cover data movement and the Kafka system itself.

Example — Kafka producer requests by topic:

We built charts/alerts for:

· Bytes In Per Sec

· Bytes Out Per Sec

· Bytes Rejected Per Sec

· Partition Counts per Broker

· Under Replicated Partitions per Broker

Example — System health metric (CPU):

We built charts for other key metrics such as:

· Available Filesystem Bytes per Broker

· Network Request Rate per Broker

· Error Rate per Broker

· Zookeeper Request Latency
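
These broker metrics come from Kafka's JMX metrics, exported into Prometheus by our monitoring setup. Purely as an illustration of what the "Under Replicated Partitions per Broker" alert watches for, a rough client-side equivalent using the Kafka AdminClient might look like the sketch below (placeholder endpoint, SSL settings omitted; this is not our monitoring code):

import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListTopicsOptions;
import org.apache.kafka.clients.admin.TopicDescription;

public class UnderReplicatedCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "b-1.example.kafka.us-west-2.amazonaws.com:9094");
        // SSL properties as in the producer sketch above

        try (AdminClient admin = AdminClient.create(props)) {
            Set<String> topicNames =
                    admin.listTopics(new ListTopicsOptions().listInternal(false)).names().get();
            Map<String, TopicDescription> topics = admin.describeTopics(topicNames).all().get();
            topics.values().forEach(topic -> topic.partitions().forEach(p -> {
                // A partition is under-replicated when its ISR is smaller than its replica set.
                if (p.isr().size() < p.replicas().size()) {
                    System.out.printf("Under-replicated: %s-%d%n", topic.name(), p.partition());
                }
            }));
        }
    }
}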

Application and Performance Testing

To ensure there was no loss of functionality and that our partners continued to get the same or better performance, we covered a range of tests.

· Functional Tests: We captured all data ingestion source/scenario combinations in Postman runners and externalized the environment configurations. The runners were executed after each build/fix cycle, with verifications to confirm that the intended payloads reached the target data lake.

· Performance Tests: We executed performance tests at 2x, 3x, etc. of current traffic using the Gatling load test tool, triggered via GitLab. The tests included a combination of soak and load scenarios, with 2-way SSL connections taken into account; a minimal simulation sketch follows.
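
To give a flavor of those tests, here is a stripped-down simulation in Gatling's Java DSL. The endpoint, payload, and injection profile are placeholders rather than our real test plan (our actual tests also layer in the 2-way SSL setup and are parameterized from GitLab):

import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.ScenarioBuilder;
import io.gatling.javaapi.core.Simulation;
import io.gatling.javaapi.http.HttpProtocolBuilder;
import java.time.Duration;

public class IngestionLoadSketch extends Simulation {

    // Placeholder ingestion endpoint.
    HttpProtocolBuilder httpProtocol = http.baseUrl("https://d3-ingest.example.com");

    ScenarioBuilder postEvents = scenario("ingestion load")
            .exec(http("post event")
                    .post("/v1/events")
                    .header("Content-Type", "application/json")
                    .body(StringBody("{\"deviceId\":\"test\",\"metric\":\"ping\"}")));

    {
        // Constant arrival rate for a soak-style run; raise the rate for 2x/3x load tests.
        setUp(postEvents.injectOpen(constantUsersPerSec(200).during(Duration.ofMinutes(30))))
                .protocols(httpProtocol);
    }
}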

Our observations

· Horizontal scaling (increasing/decreasing the count of MSK brokers): The process to create new brokers took a while (about 20–30 min). New topics got distributed across the expanded MSK cluster; however, topics created prior to the horizontal scaling had to be adjusted manually. We used the partition reassignment tool to redistribute existing topic partitions over the new broker nodes (see the sketch after this list).

· Vertical scaling (moving to a bigger instance type with the same count of broker nodes): In this case, brokers are updated in sequence. The incoming rate decreases slightly for a short period while the brokers are being updated, and the replication rate shows significant dips during broker updates. Under heavy load, JMX response times from the brokers can go as high as 15s. After the updates were done, we noticed that the message ingestion rate increased and CPU utilization decreased.

· Vertical downscaling (going to a smaller instance type with the same count of broker nodes): While this should not be a common scenario for well-forecasted throughput increases in production, it does happen. The general recommendation is to consider downscaling only when the CPU load is less than 30%. Based on the current instance types that Amazon offers, CPU load roughly doubles with each step down in instance size. For example, if the current CPU is 60% and we downscale to the next smaller MSK instance type, CPU could land in the range of 110–120%, i.e., beyond what the smaller brokers can handle.

· EBS storage volume scaling: Storage can be scaled independently of the brokers. There is also an out-of-the-box auto-scaling feature to expand broker storage at a specified storage utilization target. PS: storage size cannot be decreased.

· Plaintext vs 2-way TLS: We observed that the ingestion rate dropped by 20–30% using 2-way SSL and CPU increased by about 15–20%, with no significant change in message lag. So it's good to keep this in mind when sizing the MSK cluster.
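
On the horizontal scaling note above: we used Kafka's partition reassignment tooling for the redistribution. As a hedged sketch of what the same operation looks like through the AdminClient API (hypothetical topic, partition, and broker IDs, not a record of our exact commands):

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class ReassignSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "b-1.example.kafka.us-west-2.amazonaws.com:9094");

        try (AdminClient admin = AdminClient.create(props)) {
            Map<TopicPartition, Optional<NewPartitionReassignment>> plan = new HashMap<>();
            // Move partition 0 of a (hypothetical) hot topic onto a replica set
            // that includes the newly added brokers 4 and 5.
            plan.put(new TopicPartition("diagnostic-events", 0),
                     Optional.of(new NewPartitionReassignment(Arrays.asList(1, 4, 5))));
            admin.alterPartitionReassignments(plan).all().get();
        }
    }
}

The command-line reassignment tool expresses the same plan as a JSON file passed via --reassignment-json-file.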

And that led us into MSK sizing. Hang in there; we are getting closer to the Kafka upgrade in production 😊

MSK Production cluster sizing

Amazon provides a Sizing Tool which can be used to estimate cluster size based on traffic (IO). We used this tool to calculate preliminary numbers. The final numbers were established based on the legacy Kafka setup, anticipated throughputs in the near future, and our performance test outcomes.
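
As a rough back-of-the-envelope cross-check (using only the average figures from the introduction; real sizing has to account for peak rates, replication factor, and retention):

3,000,000,000 events/day ÷ 86,400 s/day ≈ 35,000 events/s on average
500 GB/day ÷ 86,400 s/day ≈ 5.8 MB/s of compressed ingress on average

Averages like these understate peaks, which is why the final numbers leaned on the legacy setup and the performance test outcomes rather than averages alone.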

Detailed cutover planning and execution

One of the most important phases is establishing a clear cutover plan for production, ideally tested at a smaller scale in a non-production environment first. Our main goal was not to lose any data being ingested and not to impact our partners and data users. So we set up parallel consumers against MSK (alongside the existing consumers on legacy Kafka) and, during cutover, pointed the producers at MSK. Our production cutover went smoothly, and we were very happy with the results.
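
For illustration, the parallel MSK consumers were ordinary Kafka consumers pointed at the new cluster. A minimal, hedged sketch of that side of the cutover (placeholder topic, group id, and endpoint; SSL properties as in the earlier sketch):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ParallelMskConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "b-1.example.kafka.us-west-2.amazonaws.com:9094");
        props.put("group.id", "d3-msk-cutover");     // consumer group on the MSK side
        props.put("auto.offset.reset", "earliest");  // pick up everything produced to MSK
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        // SSL properties omitted for brevity

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("diagnostic-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Hand records to the same downstream sink the legacy consumers feed.
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}

Once these consumers were draining MSK, the producers only needed their bootstrap.servers (and SSL settings) switched from the legacy cluster to MSK.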

A few good takeaways and things to consider:

Partitioning for high throughput topics

After the migration, we noticed that consumer lag grew on certain topics as their data volumes increased. So we increased the partition counts on the top 10–15 traffic-heavy topics, and within about an hour the consumer lag dropped to within a few seconds.

The backstory on this: the topic partition is the unit of parallelism in Kafka. Writes to different partitions can happen in parallel on the producer and broker side. On the consumer side, however, Kafka assigns each partition to exactly one consumer thread, so the degree of parallelism within a consumer group is bounded by the number of partitions being consumed. Our observation is that more partitions in the Kafka cluster lead to higher throughput. There is no single right number per se; it depends on the throughput of the topic and the number of consumers in the setup. A sketch of the partition increase follows.
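
Here is that partition increase as a sketch (hypothetical topic name and target count; the same can be done with the standard kafka-topics tooling):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;

public class IncreasePartitionsSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "b-1.example.kafka.us-west-2.amazonaws.com:9094");

        try (AdminClient admin = AdminClient.create(props)) {
            // Grow a (hypothetical) traffic-heavy topic to 48 partitions.
            // Partition counts can only be increased, never decreased.
            admin.createPartitions(
                    Collections.singletonMap("diagnostic-events", NewPartitions.increaseTo(48)))
                 .all().get();
        }
    }
}

Keep in mind that adding partitions changes the key-to-partition mapping for keyed topics, so per-key ordering guarantees only hold for data written after the change.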

Key configurations we looked at

While most of the Amazon MSK defaults serve well, we explored a few useful configurations and adjusted them to our needs (a sketch of applying topic-level overrides follows the list):

· default.replication.factor: The replication factor applied to automatically created topics.

· log.retention.hours: How long messages are kept before log segments become eligible for deletion.

· unclean.leader.election.enable: This is an important setting. If availability is more important than avoiding data loss, set this property to true; if preventing data loss is more important than availability, set it to false. It operates as follows:

o If set to true (enabled), an out-of-sync replica will be elected as leader when there is no live in-sync replica (ISR). This preserves the availability of the partition, but there is a chance of data loss.

o If set to false and there are no live in-sync replicas, Kafka returns an error and the partition will be unavailable.

· offsets.topic.replication.factor: Typically set to a higher value to ensure availability of the internal consumer offsets topic.

· offsets.retention.minutes: Controls how long committed consumer offsets are retained; if a group's offsets are not updated within this period (for example, because all of its consumers go down), they expire and consumers fall back to their auto.offset.reset behavior. Not a common situation, but good to consider.

· replica.lag.time.max.ms: Determines how long a replica may lag behind the leader before it is dropped from the ISR. This is particularly useful when a spike in traffic causes large batches of messages to be written on the leader: as long as a replica does not consistently remain behind the leader for more than replica.lag.time.max.ms, it will not shuffle in and out of the ISR, which in turn improves the resiliency of the system.
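
Most of these are cluster-level settings, which on MSK are applied through a custom cluster configuration attached to the cluster. For the ones that also exist as topic-level overrides (for example unclean.leader.election.enable, or retention.ms as the per-topic counterpart of log.retention.hours), a hedged AdminClient sketch with hypothetical topic and values:

import java.util.Arrays;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class TopicConfigSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "b-1.example.kafka.us-west-2.amazonaws.com:9094");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "diagnostic-events");
            // Favor durability over availability for this topic and keep 72 hours of data.
            admin.incrementalAlterConfigs(Collections.singletonMap(topic, Arrays.asList(
                    new AlterConfigOp(new ConfigEntry("unclean.leader.election.enable", "false"),
                                      AlterConfigOp.OpType.SET),
                    new AlterConfigOp(new ConfigEntry("retention.ms", "259200000"),
                                      AlterConfigOp.OpType.SET))))
                 .all().get();
        }
    }
}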

Our Kafka upgrade journey completed

We have been running Kafka on Amazon MSK for over a month now, and our partners are happy. In fact, they didn't even realize the upgrade had happened, and they would never have found out if the D3 team had not notified them.

We hope you enjoyed reading about our Kafka upgrade journey from a self-managed infrastructure a couple of generations behind to a managed service that reduced the cognitive load on our operations team, while increasing security and unlocking newer features.

Keep in mind that not all data ingestion systems are built the same, so what we did as part of our journey may not fully apply to every situation. Some measures could be excessive for simpler needs, while other features may deserve deeper exploration.

Good luck with whatever brought you to this topic. We are always happy to improve and to learn about your experiences, so please consider sharing your feedback and comments.
