Big Data Concepts: Overview

Abhinav Vinci
3 min read · Jan 17, 2023


Terms Discussed:

Part 1:

  • Concepts: MapReduce, HDFS
  • Tools: Hadoop, Spark

The Concepts:

MapReduce

The MapReduce model is composed of two main functions: the map function and the reduce function.

  • The map function takes an input dataset and applies a user-defined operation to each element, producing a set of intermediate key-value pairs.
  • The reduce function then takes the intermediate key-value pairs and combines all the values that share the same key, producing the final key-value pairs that form the output of the MapReduce job (see the sketch below).
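
To make the model concrete, here is a minimal single-machine sketch of the classic word-count example in Python. The names map_fn, shuffle, and reduce_fn are illustrative, not part of any framework; a real engine would run the map and reduce phases across many machines.

```python
from collections import defaultdict

def map_fn(document):
    # Map: emit an intermediate (word, 1) pair for every word.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_fn(key, values):
    # Reduce: combine all values that share a key.
    return (key, sum(values))

documents = ["the quick brown fox", "the quick dog"]
intermediate = [pair for doc in documents for pair in map_fn(doc)]
print(sorted(reduce_fn(k, vs) for k, vs in shuffle(intermediate)))
# [('brown', 1), ('dog', 1), ('fox', 1), ('quick', 2), ('the', 2)]
```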

Benefits of MapReduce?

MapReduce is particularly useful for processing large datasets in a distributed computing environment because the computation can be split into smaller, independent tasks and distributed across multiple machines. This parallelism shortens processing time and also helps with fault tolerance: because the tasks are independent, a failed task can simply be re-executed on another machine.
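
The same idea can be sketched on a single machine with Python's multiprocessing module. Each worker runs an independent "map task" over its own chunk and the partial counts are merged at the end; on a real cluster, a framework would ship these tasks to different machines rather than local processes.

```python
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    # One independent "map task" over a slice of the input.
    return Counter(word.lower() for line in chunk for word in line.split())

if __name__ == "__main__":
    lines = ["the quick brown fox"] * 1000 + ["the lazy dog"] * 1000
    chunks = [lines[i::4] for i in range(4)]   # split the work into 4 chunks
    with Pool(processes=4) as pool:
        partials = pool.map(count_words, chunks)
    # "Reduce": merge the partial counts from every worker.
    totals = sum(partials, Counter())
    print(totals.most_common(3))
```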

Drawbacks of MapReduce?

MapReduce is powerful for handling big data, but it is not the best fit for every case: because each job writes its results back to disk, it is inefficient for iterative algorithms that reread the same data many times, and its batch-oriented design makes it unsuitable for real-time processing.

MapReduce in practice?

Some popular open-source implementations of MapReduce include Apache Hadoop, which implements the model directly, and Apache Spark, which generalizes it into a broader set of operations.
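
With Hadoop Streaming, for example, the two functions can be written as plain scripts that read from stdin and write to stdout. A minimal word-count sketch (the file names mapper.py and reducer.py are illustrative):

```python
# mapper.py: emit one tab-separated (word, 1) pair per input word
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word.lower()}\t1")
```

```python
# reducer.py: Hadoop sorts map output by key, so equal words arrive adjacent
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

These scripts would be submitted to the cluster with the hadoop-streaming JAR's -input, -output, -mapper, and -reducer options.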

HDFS

Hadoop Distributed File System (HDFS) is a distributed file system that is designed to store and manage large amounts of data across a cluster of commodity servers. It is the primary storage system used in the Hadoop ecosystem.

Benefits of HDFS?

HDFS is designed to work with large files, typically in the range of gigabytes to terabytes. It uses a block-based storage approach: files are split into fixed-size blocks (128 MB by default in recent Hadoop versions), and each block is replicated across several DataNodes. This lets HDFS store files far larger than any single disk and allows the blocks to be processed in parallel.
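
As a back-of-the-envelope sketch of the block math (the 1 GB file is hypothetical; 128 MB blocks and 3-way replication are the common defaults):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # default HDFS block size: 128 MB
REPLICATION = 3                  # default replication factor

file_size = 1 * 1024 ** 3        # a hypothetical 1 GB file
blocks = math.ceil(file_size / BLOCK_SIZE)
print(f"{blocks} blocks, {blocks * REPLICATION} replicas stored in total")
# -> 8 blocks, 24 replicas stored in total
```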

How it works?

HDFS uses a master/slave architecture. The NameNode acts as the master: it manages the file system namespace, keeps the metadata that maps each file to its blocks, and controls client access to files. The DataNodes act as the slaves and store the actual data blocks.
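
A toy sketch of that division of labor, with the NameNode holding only metadata while the DataNodes hold the bytes. All names and structures here are invented for illustration; real clients talk to HDFS through its RPC API, not dictionaries.

```python
# The "NameNode": maps each file path to its blocks and their locations.
namenode = {
    "/logs/app.log": [
        {"block": "blk_001", "datanodes": ["dn1", "dn2", "dn3"]},
        {"block": "blk_002", "datanodes": ["dn2", "dn3", "dn4"]},
    ],
}

# The "DataNodes": hold the actual bytes, keyed by block id.
datanodes = {
    "dn1": {"blk_001": b"first 128MB..."},
    "dn2": {"blk_001": b"first 128MB...", "blk_002": b"rest..."},
}

def read_file(path):
    # A client asks the NameNode where the blocks live, then
    # fetches each block directly from one of its DataNodes.
    data = b""
    for entry in namenode[path]:
        dn = next(d for d in entry["datanodes"] if d in datanodes)
        data += datanodes[dn][entry["block"]]
    return data

print(read_file("/logs/app.log"))  # b'first 128MB...rest...'
```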

(Image source: https://sharmalalita.medium.com/configuring-hadoop-and-start-cluster-services)

The Tools:

Hadoop

Hadoop consists of two main components: the Hadoop Distributed File System (HDFS) for storage and the MapReduce programming model for processing.

Benefits of Hadoop?

Hadoop’s main advantage is its ability to handle big data: it can store and process large datasets distributed across a cluster of commodity hardware, and it processes data in parallel, which can significantly improve performance.

The Hadoop ecosystem also includes several other projects, such as:

  • Pig: A high-level platform for writing scripts that compile into MapReduce programs on Hadoop
  • Hive: A data warehousing and SQL-like querying tool for Hadoop
  • HBase: A NoSQL database that runs on top of Hadoop

Spark

Apache Spark is an open-source, distributed computing system that is designed for fast and flexible big data processing.

It integrates with the Hadoop ecosystem (it can read data from HDFS and run on YARN) and provides an alternative to the Hadoop MapReduce programming model.

Benefits of Spark?

Spark’s main advantage over Hadoop MapReduce is in-memory processing: it can keep intermediate data in memory rather than writing it to disk between steps, which can significantly improve performance compared to Hadoop’s disk-based approach. Spark is also commonly used to process real-time data streams, and its in-memory caching makes it well suited to the iterative and interactive workloads that MapReduce handles inefficiently.
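
A minimal PySpark sketch of that in-memory reuse, assuming a local Spark installation (pip install pyspark); the file name data.txt is illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Parse the file once, then pin the result in memory.
numbers = sc.textFile("data.txt").map(float).cache()

# Every pass below reuses the cached data instead of rereading disk;
# this is exactly where Spark beats MapReduce on iterative workloads.
total = numbers.count()
for _ in range(10):
    mean = numbers.sum() / total

print(f"mean = {mean}")
spark.stop()
```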

Spark provides a number of libraries for different types of data processing tasks, such as:

  • Spark SQL: Allows querying structured data with SQL syntax and the DataFrame API (see the sketch after this list).
  • Spark Streaming: Allows for processing real-time data streams.
  • MLlib: A library of machine learning algorithms that can run on top of Spark.
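
A minimal Spark SQL sketch, again assuming a local installation; the table name, columns, and rows are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").master("local[*]").getOrCreate()

# Build a small DataFrame and query it with plain SQL.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 28), ("carol", 41)],
    ["name", "age"],
)
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()
spark.stop()
```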
