Kafka Series — 0. Introduction

With my growing interest in big-data infrastructure and distributed systems in general, I decided to start contributing to an open source project. In my search for a good project, I evaluated Hadoop, Spark, Redis, Memcached, Project Voldemort and Kafka. My criteria for selecting a project were as follows:

  1. Familiarity with the development workflow, which would enable me to start contributing to the project as soon as possible.
  2. The impact and ubiquity of the project.
  3. The breadth of distributed systems concepts and challenges which can be covered.
  4. Future possibilities.

Most of the projects I had in mind scored fairly high on these criteria. In the end, I decided to learn Kafka. Kafka has become an important piece of infrastructure in the big-data world. It enables real-time stream processing pipelines that scale horizontally and support a variety of ingestion sources and endpoints. The following were the important decision points:

  1. It is central to big-data infrastructure and interacts with many other components. This would give me a good breadth of knowledge from working with various projects, even if it is only configuring them.
  2. There is a definite move towards real-time stream processing of data, and Kafka is built for it. This suggests good future prospects.
  3. Following on from #2, the problem Kafka is trying to solve involves both high speed and high scale. I am personally drawn to that, because it could lead to some very interesting issues to solve.
  4. Finally, I have good familiarity with Git and Jira, which Kafka (along with the other projects I evaluated) uses.

In this series, I will write short blog posts about how a novice (with some programming experience in non-distributed systems) can navigate through the complexities of a distributed system and contribute code to an important project.

The approach that has worked for me in the past involves the following steps:

  1. What is the goal of the project? Understand the problem the project is trying to solve and, more importantly, the problems it is NOT trying to solve.
  2. Deploy the project and use it. Get some hands-on experience with the project.
  3. Familiarize yourself with the development workflow: How to get the code? How to make changes? How to review changes? How to submit changes? How to file bugs? What is the etiquette?
  4. Fix a simple bug (such as renaming a function, fixing a typo, or removing a log statement).
  5. Testing — Figure out the various tools available and learn to use them.
  6. Figure out the various channels for communication and the etiquette to follow in each.
  7. Get a good architectural overview of the project and, if possible, find the right people to talk to when in doubt.
  8. Finally, make notes, write wikis, write articles and talk to as many people as possible.

You may notice that I stress getting hands-on early and being productive before getting the architectural overview. This is because, over the years, I have realized that it is almost impossible for me to learn everything about a project in a short time. Also, knowing everything about the architecture does not necessarily translate into committed code, which is what I need to stay motivated. I prefer to be productive and continue exploring the architecture by picking up bugs in the areas I am interested in. You may want to follow a different route to contributing. If there is something that you feel I should include, feel free to comment!