A Guide to Big Data Technology Using HDFS, Kafka, and Data Lakes

Ian Gariando
Published in The Startup
4 min read · Sep 15, 2020

With the advent of the internet, mobile connectivity, and the Internet of Things (IoT), the data collected by organizations and generated by individuals has grown exponentially. Over 90% of the world’s data was created in the last two years alone. Every day, humans produce 2.5 quintillion bytes of data: 95 million photos and videos are shared on Instagram, 306.4 billion emails are sent, and 5 million Tweets are made.

With the volume, variety, and velocity of data being produced, big data is becoming business as usual for organizations. It is not surprising that companies are building data strategies around big data technologies to keep up with the surging data and reap the opportunities it provides.

What is Big Data Technology?

Simply put, big data technology is software used to manage big data on a commercial or organization-wide scale. It is designed to analyze, process, and extract information from data sets so large and complex that traditional data processing software cannot handle them. It covers both operational big data (day-to-day operations) and analytical big data (business intelligence). Big data technologies help store huge amounts of data, process it, manage access to it across teams, and analyze it at scale.

In this article, we compare Hadoop, Kafka, and data lakes to build a clearer understanding of these commonly used big data technologies.

Hadoop

Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines [1]. The Hadoop Distributed File System (HDFS) is Hadoop’s storage layer: it splits large files into blocks and distributes them across many nodes in a cluster, allowing local computation and storage. HDFS also replicates each block within the cluster, providing high availability and a safeguard against data loss. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
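To make this concrete, here is a minimal sketch of writing to and reading from HDFS using the third-party `hdfs` Python package (a WebHDFS client). The NameNode address, user, and file path are assumptions made for illustration, not details from the article.

```python
# A minimal HDFS sketch using the third-party `hdfs` package (pip install hdfs).
# The NameNode URL, user, and paths below are illustrative assumptions.
from hdfs import InsecureClient

# Connect to the NameNode's WebHDFS endpoint (port 9870 by default in Hadoop 3.x).
client = InsecureClient("http://namenode:9870", user="hadoop")

# Write a small file; HDFS splits large files into blocks and replicates
# each block across DataNodes (3 copies by default).
client.write("/data/example.txt", data=b"hello, hadoop", overwrite=True)

# Read it back.
with client.read("/data/example.txt") as reader:
    print(reader.read())

# Inspect replication: status() reports the per-file replication factor.
print(client.status("/data/example.txt")["replication"])
```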

Kafka

Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol [2]. It is open-source software used to process real-time data streams, designed as a distributed transaction log. Simply put, Kafka is a messaging system and an event streaming platform. Like Hadoop, Kafka replicates its data: every topic (similar to a folder in a filesystem) can be replicated, even across geo-regions or datacenters, so that multiple brokers always hold a copy of the data in case something goes wrong or a broker needs maintenance.
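Here is a similar hedged sketch using the `kafka-python` client: it creates a replicated topic, produces an event, and consumes it back. The broker address, topic name, and replication factor of three are assumptions chosen for the example (a replication factor of three requires at least three brokers).

```python
# A minimal Kafka sketch using kafka-python (pip install kafka-python).
# Broker address, topic name, and replication factor are illustrative assumptions.
from kafka import KafkaProducer, KafkaConsumer
from kafka.admin import KafkaAdminClient, NewTopic

BOOTSTRAP = "localhost:9092"

# Create a topic whose partitions are replicated across 3 brokers,
# so the data survives a broker failure or a maintenance window.
admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
admin.create_topics([NewTopic(name="events", num_partitions=3, replication_factor=3)])

# Produce an event to the topic.
producer = KafkaProducer(bootstrap_servers=BOOTSTRAP)
producer.send("events", key=b"user-42", value=b'{"action": "click"}')
producer.flush()

# Consume events from the beginning of the topic.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers=BOOTSTRAP,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5s with no messages
)
for message in consumer:
    print(message.key, message.value)
```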

Data Lakes

A data lake is a centralized repository that allows the storage of structured and unstructured data at any scale. Data can be stored as-is, without being structured first, and analyzed in many different ways — from dashboards and visualizations to big data processing, real-time analytics, and machine learning that guides better decisions. A data lake is an architecture of which Hadoop and Kafka can be components, running side by side in the same ecosystem.
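To illustrate the “store as-is, analyze later” idea, the sketch below uses PySpark to read raw JSON straight out of an assumed lake path and query it with no upfront schema work. The path and field names are made up for the example.

```python
# A schema-on-read sketch with PySpark (pip install pyspark).
# The lake path and column names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-demo").getOrCreate()

# Read raw JSON events exactly as they landed in the lake;
# Spark infers the schema at read time (schema-on-read).
events = spark.read.json("hdfs://namenode:8020/lake/raw/events/")

# Run an ad-hoc aggregation for a dashboard or report.
(events
    .groupBy("action")
    .agg(F.count("*").alias("n"))
    .orderBy(F.desc("n"))
    .show())
```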

Which one to use?

Hadoop, Kafka, and data lakes overlap in functionality and can be hard to tell apart. The list below summarizes common big data needs and which technology is the most appropriate for each.

1. Data storage — Hadoop and data lakes are the strongest and most commonly used options for storing large data at scale. Although Kafka can also store data (via topics), it is better suited to connecting and streaming data.

2. Data replication — Replication is a smart way to ensure data is backed up and data loss is avoided. All three technologies provide it.

3. Data/event streaming — Kafka is best for streaming events across multiple platforms (and can feed a data lake, as the sketch after this list shows). Hadoop can move data too, but mostly within its own cluster, whereas Kafka streams data between systems.

4. Data distribution — all three technologies are distributed by design.

5. Data ecosystem — a data lake runs like an ecosystem, a collection of different technologies in one architecture. If a company plans to run a multi-software system, a data lake is probably the better choice.
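As a closing sketch of how these pieces can run in one ecosystem, the snippet below consumes events from Kafka and appends them to an HDFS-backed lake path, reusing the clients from the earlier examples. In practice, a tool such as Kafka Connect would usually do this job; this loop, and all the addresses and paths in it, are illustrative assumptions.

```python
# Illustrative glue: stream Kafka events into an HDFS lake path.
# In production, a connector framework like Kafka Connect would typically
# handle this; this loop is only a sketch, and all names are assumptions.
from hdfs import InsecureClient
from kafka import KafkaConsumer

hdfs_client = InsecureClient("http://namenode:9870", user="hadoop")
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)

# Append each event as one JSON line in the raw zone of the lake.
with hdfs_client.write("/lake/raw/events/batch-0001.jsonl",
                       overwrite=True, encoding="utf-8") as writer:
    for message in consumer:
        writer.write(message.value.decode("utf-8") + "\n")
```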

[1] https://hadoop.apache.org/

[2] https://kafka.apache.org/intro

I write about fintech, sustainability, and topics happening around me.