Kafka Series — 0. Introduction
With my growing interest in big-data infrastructure and distributed systems in general, I decided to start contributing to an open source project. In my search for a good project, I evaluated Hadoop, Spark, Redis, Memcached, Project Voldemort and Kafka. My criteria for selecting a project were as follows:
- Familiarity with the dev workflow, which would let me start contributing to the project ASAP.
- The impact and ubiquity of the project.
- The breadth of distributed systems concepts and challenges which can be covered.
- Future possibilities.
With these criteria, most of the projects I had in mind scored fairly high. I decided to learn Kafka. Kafka has become an important piece of infrastructure in the big-data world. It enables real-time stream processing pipelines that scale horizontally and supports many different ingestion sources and endpoints. These were the deciding points:
- It sits at the center of big-data infrastructure and interacts with many other components. This would give me a good breadth of knowledge across various projects, even if it’s only from configuring them.
- There is a definite move towards more real-time stream processing of data, and Kafka is built for it. This suggests really good future prospects.
- Following on from the previous point, the problem Kafka is trying to solve involves both high speed and high scale. I am personally drawn to that, because it could lead to some very interesting issues to solve.
- Finally, I am already familiar with Git and Jira, which Kafka (along with the other projects I evaluated) uses.
In this series, I will write short blog posts about how a novice (with some programming experience in non-distributed systems) can navigate through the complexities of a distributed system and contribute code to an important project.
The approach that has worked for me in the past involves the following steps:
- What is the goal of the project? Understand the problem the project is trying to solve and, more importantly, what it is NOT trying to solve.
- Deploy the project and use it. Get some hands-on experience with the project.
- Familiarize yourself with the development workflow: How to get the code? How to make changes? How to review changes? How to submit changes? How to file bugs? What is the etiquette?
- Fix a simple bug (like changing a function name, fixing a typo, removing a log statement, etc.)
- Testing: figure out the various tools available and learn to use them.
- Figure out the various channels for communication and the etiquette to follow in each.
- Get a good architectural overview of the project and, if possible, find the right people to talk to when in doubt.
- Finally, make notes, write wikis, write articles and talk to as many people as possible.
You may notice that I stress getting hands-on early and becoming productive before getting the architectural overview. This is because, over the years, I have realized that it’s almost impossible for me to know everything about a project in a short time. Also, knowing everything about the architecture does not necessarily translate into committed code, which is what I need to stay motivated. I prefer to be productive and continue exploring the architecture by picking up bugs from the areas I am interested in. You may want to follow a different route to contributing. If there is something that you feel I should include, feel free to comment!