What is Apache Kafka?
Apache Kafka was initially designed and implemented at LinkedIn to serve as a message queue. It was subsequently open-sourced, graduated from the Apache Incubator in 2012, and has since evolved into a distributed streaming platform used to build real-time data pipelines and streaming applications.
It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies. A Kafka cluster can scale to hundreds of brokers and process millions of records per second. Kafka was designed to simplify data pipelines and handle streams of data, supporting both batch and real-time analytics, and it offers at-most-once, at-least-once and exactly-once processing guarantees.
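Which of the three guarantees you get is largely a matter of producer configuration. A minimal sketch of the trade-off (the keys below are standard Kafka producer settings; the broker address and transactional id are illustrative):

```python
# Illustrative producer configurations for Kafka's three delivery guarantees.
# Config keys are standard Kafka producer settings; values are examples only.

at_most_once = {
    "bootstrap.servers": "localhost:9092",  # hypothetical broker
    "acks": "0",       # fire-and-forget: no broker acknowledgement
    "retries": "0",    # never retry, so a failed send is simply lost
}

at_least_once = {
    "bootstrap.servers": "localhost:9092",
    "acks": "all",     # wait until all in-sync replicas have the record
    "retries": "2147483647",  # retry transient failures (may create duplicates)
}

exactly_once = {
    "bootstrap.servers": "localhost:9092",
    "acks": "all",
    "enable.idempotence": "true",       # broker de-duplicates retried sends
    "transactional.id": "orders-tx-1",  # enables atomic multi-partition writes
}
```

In short, the stronger the guarantee, the more coordination the producer asks of the brokers: at-most-once trades safety for latency, while exactly-once layers idempotence and transactions on top of at-least-once delivery.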
Why do organisations use Kafka?
Modern organisations have various data pipelines that facilitate communication between systems or services. Things get more complicated when a large number of services need to communicate with each other in real time.
The architecture becomes complex, since a separate integration is required for each pair of services that need to communicate. More precisely, an architecture with m source and n target services requires m x n distinct integrations. Every integration also comes with its own specification: one might require a different protocol (HTTP, TCP, JDBC, etc.) or a different data representation (Binary, Apache Avro, JSON, etc.), making things even more challenging. Furthermore, each source service must handle the extra load from all of these connections, which can hurt latency.
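To make the combinatorics concrete: point-to-point architectures need one integration per (source, target) pair, whereas putting a broker in the middle needs only one connection per service. A tiny illustration of the two growth rates:

```python
def integrations_point_to_point(m_sources: int, n_targets: int) -> int:
    # every source service talks directly to every target service
    return m_sources * n_targets

def integrations_with_broker(m_sources: int, n_targets: int) -> int:
    # every service connects exactly once, to the broker in the middle
    return m_sources + n_targets

# e.g. 6 sources and 5 targets: 30 direct integrations vs 11 broker connections
direct = integrations_point_to_point(6, 5)
brokered = integrations_with_broker(6, 5)
```

The point-to-point count grows multiplicatively as services are added, while the brokered count grows only linearly, which is the motivation for the next section.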
Decoupling data pipelines with Kafka
Apache Kafka leads to simpler and more manageable data pipelines by decoupling services and components. Kafka acts as a high-throughput distributed streaming platform: source services push streams of data into it, and target services pull those streams in real time. In contrast to legacy systems that still rely on batch processing, Kafka enables consumers to process events as they arrive. This is why Kafka is heavily used by companies as a platform serving real-time applications such as fraud detection, log monitoring and recommender systems.
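The decoupling works because Kafka stores each topic as an append-only log, and every consumer tracks its own read position (offset) in that log. A toy in-memory sketch of the idea (this is not Kafka's actual API, just an illustration of the model):

```python
class TopicLog:
    """Toy append-only log illustrating Kafka-style decoupling (not the real API)."""

    def __init__(self):
        self.records = []  # the immutable, ordered stream of events
        self.offsets = {}  # consumer name -> next offset to read

    def produce(self, record):
        # producers only append; they never know who will read the data
        self.records.append(record)

    def consume(self, consumer, max_records=10):
        # each consumer pulls at its own pace, from its own offset
        start = self.offsets.get(consumer, 0)
        batch = self.records[start:start + max_records]
        self.offsets[consumer] = start + len(batch)
        return batch

log = TopicLog()
log.produce({"event": "payment", "amount": 42})
log.produce({"event": "payment", "amount": 7})

# two independent consumers read the same stream without coupling to the producer
fraud_batch = log.consume("fraud-detector")
analytics_batch = log.consume("analytics")
```

Because the log is shared and offsets are per-consumer, adding a new target service means adding a new consumer, not a new integration with every source.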
Although Kafka was initially designed to serve as a message broker, we have already seen that it has evolved into a distributed streaming platform that can support various use cases including:
- Message brokers
- Data Integration
- Log collection and monitoring
- Real-Time Analytics
- Event processing
- Decoupling (micro)services
- Integration with other Big Data technologies (Spark, Hadoop, Storm, Flink)
Kafka is currently used in production environments by many Fortune 500 companies, including LinkedIn, Uber, Airbnb, Netflix, Twitter and Barclays. Netflix and LinkedIn use Kafka to provide real-time recommendations to their users, while Uber uses it to collect, transform and aggregate the data that feeds its pricing models.
About the author
Giorgos works as a Data Scientist and Quantitative Python Developer at Barclays in London. He is in charge of delivering high-performing fraud detection Machine Learning models on extremely large datasets for BUK portfolios. He is also an active contributor on Stack Overflow. In the past, he worked as a Data Engineer, responsible for the implementation of highly secured and scalable real-time data pipelines. He holds a BSc in Computer Science and an MSc in Data Science. You can find him on Stack Overflow, LinkedIn, Medium and Twitter.