How I started my journey of contributing to Apache Kafka

Phuc Tran
4 min read · Apr 29, 2024


Just a quick primer for anyone who is new to Apache Kafka: it is an open-source, real-time event-streaming platform (simply put, a pre-built solution for any backend service that needs to process and stream data at large scale and volume to one or many other backend services). Kafka was built internally at LinkedIn, and it then became an open-source project under the Apache Software Foundation in 2011.
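To make that description concrete, here is a minimal sketch of the producing side using Kafka's Java client (the broker address, topic, key, and value are hypothetical placeholders of my own):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class MinimalProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one event to the (hypothetical) "orders" topic; any number
            // of downstream services can consume it independently.
            producer.send(new ProducerRecord<>("orders", "order-42", "created"));
        }
    }
}
```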

I have one important piece of advice for any developer who is considering contributing to an open-source project: one should only start contributing to a project they are already using. The reason is that, at their core, many projects are open-sourced so that their users can submit a pull request for a new feature, look at the source code themselves to identify a problem they are facing, or simply understand the underlying logic of the software they have been using. Blindly trying to contribute to an open-source project without prior knowledge or usage goes against these values, and it helps neither the person trying to make a contribution nor the maintainers of the project.

With all that being said, here is how my contributing process started:

During the last quarter of 2023, I found myself getting curious about how the frameworks I use to develop software, such as Spring or ReactJS to name a couple, work under the hood. However, these two projects overwhelmed me with the number of components they have; Spring in particular has a plethora of libraries, each complicated in its own right. I looked through some of the most recent issues on their GitHub pages at the time, but after a while I gave up, since I could not fully grasp the problem behind any of them.

This is when I pivoted to Kafka. The team behind Kafka uses Atlassian’s Jira to track issue tickets, while the source code itself lives on GitHub. With Kafka I also experienced being overwhelmed by the project’s complexity, but what made it easier to start taking on tickets is that the project keeps a more up-to-date “beginner-friendly” ticket list than the two projects I mentioned earlier. I ended up taking a ticket from this list, which asked me to decide whether the string format specifier “%s” should be replaced by “%d” in a number of files to avoid locale-sensitive issues. After assigning myself the ticket, I created a fork of Kafka and started working on the issue.

Halfway through investigating the issue, I realized that this ticket was more of a means to get me to read through the files it mentioned: most of the data involved were of integer or long type, plain whole numbers, which would not cause any locale-sensitive issue, since such numbers are rendered the same across most locales. I think the reporter who opened this ticket wanted whoever assigned themselves to it to get a glimpse of the Apache Kafka codebase, though that is only my assumption, as I never asked the reporter myself. I considered this an easy ticket.
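To see why such replacements matter at all, here is a minimal, self-contained illustration (my own example, not code from the ticket): Java’s %d honors the formatting locale once grouping is involved, while %s simply calls toString(), which is locale-independent.

```java
import java.util.Locale;

public class LocaleFormatDemo {
    public static void main(String[] args) {
        long value = 1234567L;
        // With the grouping flag, %d picks up locale-specific separators:
        System.out.println(String.format(Locale.US, "%,d", value));      // 1,234,567
        System.out.println(String.format(Locale.GERMANY, "%,d", value)); // 1.234.567
        // %s just calls Long.toString(), which ignores the locale entirely:
        System.out.println(String.format(Locale.GERMANY, "%s", value));  // 1234567
    }
}
```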

After submitting a pull request against the main branch, I redirected myself to other tickets on Kafka’s Jira, as I didn’t get an immediate response that week regarding my ticket; after all, it was a low-priority issue. One of the tickets I took asked me to investigate why a duplicated byte array containing a Kafka event’s value appeared in the event’s headers. I struggled a lot with this ticket: I was never able to reproduce the issue the reporter described, and I never received the sample code I had asked for to aid the debugging process. Looking back on it, this ticket was a big jump from my previous one. It required knowledge of how events are constructed and sent to the Kafka broker, which I lacked at the time of picking it up.
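To make the reported symptom concrete, here is a hypothetical reproduction sketch (the topic, payload, and header name are my own; the reporter’s actual code was never shared). If application code ever copies a record’s value bytes into a header like this, the value ends up duplicated in the headers on the consumer side:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class HeaderDuplicationSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        String value = "{\"orderId\":42}"; // hypothetical payload
        ProducerRecord<String, String> record = new ProducerRecord<>("orders", "order-42", value);
        // The problematic pattern: the value bytes are attached a second time
        // as a record header, so every event carries its payload twice.
        record.headers().add("raw-value", value.getBytes(StandardCharsets.UTF_8));

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(record);
        }
    }
}
```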

I took a few months’ break from contributing to Kafka, since my last attempt at solving a problem on the issue board was not a success and my first pull request had still not been reviewed. During that break I focused on finding more tickets within my ability to solve, as my main weakness at the time was my lack of understanding of how the codebase works in general. Later on I found out that the Apache Kafka community keeps KIP (Kafka Improvement Proposal) records to track all major changes to the codebase; major changes include any major new feature, subsystem, or piece of functionality, as well as any change that impacts the public interfaces of the project (more details can be found here). I read through some of those records and decided to assign myself some tickets related to KIP-848: The Next Generation of the Consumer Rebalance Protocol. Details on those tickets and how I tackled them will be included in my next blog post.
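For context, KIP-848 moves group rebalancing from the classic client-side protocol to a broker-driven one. As a rough sketch (assuming a broker recent enough to support the new protocol, which shipped as early access around Kafka 3.7; the group and topic names are hypothetical), a consumer opts in through the group.protocol setting:

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class Kip848Consumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "demo-group");              // hypothetical group name
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // Opt in to the KIP-848 protocol; the default "classic" keeps the old behavior.
        props.put("group.protocol", "consumer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders")); // hypothetical topic
            consumer.poll(Duration.ofSeconds(1));
        }
    }
}
```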

If anyone is interested in this series, consider following me on Medium or LinkedIn. Thanks for reading this far.
