Apache Hadoop (HDFS, YARN, MapReduce) and the Apache Hive Ecosystem

M. Cagri AKTAS
12 min read · Sep 19, 2023


https://hadoop.apache.org/

Hello everyone. We will be discussing Hadoop, Hive, and some SQL queries in Hive, with a special focus on YARN and on Hive's external and internal table types.

Hive is essential for us because, within the Hadoop ecosystem, it forms part of the foundation of big data. Although Hadoop is not as commonly used in today's technology stacks, it served as the inspiration for many next-generation technologies. That's why it's important for us to understand the Hadoop ecosystem.

Fundamentals of Data Engineering — Joe Reis & Matt Housley (2022)

If you look at the graph, you can see how rapidly big data is growing. Alongside this growth, it’s crucial to understand new technologies and the Hadoop ecosystem, which forms the foundation of big data.

Volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2020, with forecasts from 2021 to 2025
https://www.statista.com/statistics/871513/worldwide-data-created/

Hadoop

Hadoop is a helpful computing framework that can handle big data. It is free to use because it is open source, meaning lots of people can contribute to it and make it better. Doug Cutting and Mike Cafarella came up with the idea for Hadoop, and they named it after a toy elephant owned by Doug Cutting's son.

If you wish, you can check out Hoa Nguyen's article to understand the connection between the cloud and Hadoop. She built the worker nodes in the cloud and ran the manager node locally. However, I'll be discussing everything in a local context.

Hadoop Distributed File System (HDFS): HDFS is a fundamental part of Hadoop. It’s like a super-sized file storage system designed to hold enormous amounts of data. HDFS takes your large files and chops them into smaller pieces, usually about 128MB or 256MB each. These pieces are then spread across many computers in a cluster. This approach makes sure your data is safe and can be accessed really fast.

https://www.quora.com/How-does-the-huge-amount-of-data-get-stored-in-HDFS
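If you're curious how a particular file was actually split and where its blocks ended up, HDFS can report that. A minimal check (replace the path with a file you have already put into HDFS):

# Lists each block of the file and the DataNodes holding its replicas
hdfs fsck /path/to/your/file -files -blocks -locations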

MapReduce: MapReduce is a way of processing data in Hadoop. It’s like a recipe for handling big datasets. When you want to work on a massive pile of data, you break it down into smaller tasks. MapReduce has two main steps:

  • Map Phase: In this step, you take your big dataset and divide it into smaller chunks. Each chunk is processed separately by different computers in the cluster. This phase creates a bunch of key-value pairs based on the work done.
  • Reduce Phase: After the Map phase, you gather and organize all the key-value pairs produced. Then, you perform some kind of calculation or analysis on these pairs to get the final result. This might be counting things, finding averages, or any other kind of computation.
https://stackoverflow.com/questions/29769356/does-execution-of-map-and-reduce-phase-happen-inside-each-datanode-by-node-manag

MapReduce helps you work on huge datasets efficiently by dividing the work across many computers.
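As a mental model, a simple aggregation query maps cleanly onto these two phases: the map phase emits a (key, value) pair per row, and the reduce phase aggregates the values per key. A minimal HiveQL sketch, assuming a hypothetical visits table:

-- Map phase: each mapper scans its block of the table and emits (country, 1) pairs
-- Reduce phase: all pairs with the same country are shuffled to one reducer and summed
SELECT country, COUNT(*) AS visit_count
FROM visits   -- hypothetical table, used only to illustrate the idea
GROUP BY country;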

YARN (Yet Another Resource Negotiator): YARN is like the manager of the Hadoop cluster. When you have different tasks or programs running on the cluster, YARN decides how much of the cluster’s resources (like CPU and memory) each task can use. It ensures that all the tasks run smoothly without fighting for resources.
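On a running cluster you can see what YARN is managing from the command line (assuming the yarn client is installed on the node you're logged into):

# List the NodeManagers (worker nodes) and the resources they report
yarn node -list
# List the applications (MapReduce jobs, Hive queries, Spark apps, ...) known to the ResourceManager
yarn application -list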

In simple terms, Hadoop is a powerful system for dealing with massive amounts of data. It has its own way of storing files (HDFS), a method for processing data (MapReduce), and a manager (YARN) to make sure everything runs smoothly. It’s a critical part of the big data world and helps organizations make sense of their enormous datasets.

PS: One thing we need to explain is that while cloud systems are generally preferred over Hadoop in today's world, Hadoop still holds a significant place for various reasons. Because of privacy concerns, government regulations, and corporate policies, some organizations cannot use cloud systems, so there is still a need for locally operated distributed systems.

Hive: Apache Hive is a data warehousing and SQL-like query language tool that is built on top of the Hadoop ecosystem. It was developed by Facebook and is now an open-source project under the Apache Software Foundation. Hive is designed to make it easier to work with and query large datasets stored in the Hadoop Distributed File System (HDFS) or other compatible distributed storage systems. The query syntax feels almost the same as MySQL.

DEPLOYMENT

We have two worker (storage) nodes and one YARN manager. When we write a query in Hive, YARN distributes the work as MapReduce tasks to the two worker nodes, which process the query and return the results.

The manager node mentioned above doesn't process the data itself but rather distributes the work. Our YARN manager is named "cluster-master" and our worker nodes are named "cluster-slave-1" and "cluster-slave-2."

YARN manager: cluster-master serves as the entry point for external clients and users to interact with the cluster. This node hosts client applications and tools, making it essential for submitting jobs, transferring data, and managing the Hadoop environment.

Worker Nodes: cluster-slave-1 and cluster-slave-2 are at the core of the Hadoop cluster. In HDFS, they store and manage the distributed data blocks, ensuring data availability and fault tolerance. Worker nodes execute data processing tasks, such as MapReduce or Spark jobs, handling the processing logic and producing results.

Let's explain that:

Big data processing frameworks like Hive and Hadoop use parallel processing methods to perform big data processing tasks in a distributed manner. These operations are carried out by MapReduce or other parallel processing frameworks.

When a query is executed in Hive, or Hive commands are used to process data, Hive uses the underlying Hadoop MapReduce engine (or another parallel processing framework) to process the data and obtain results. The process works as follows:

Query Analysis: The Hive query is analyzed, and based on the logic of the query, MapReduce tasks are generated.

Data Splitting: Since the data is stored in HDFS in the form of blocks, the MapReduce work is split along those same block boundaries. The blocks are stored on different nodes (the worker machines).

Task Distribution: The generated MapReduce tasks are distributed to the nodes where the data is located, so that, as far as possible, each task runs on the node that already holds its data (data locality).

Execution: Each node processes the data distributed to it. Each task works on the respective data fragment. The processes occur in parallel.

Result Aggregation: When the processes are completed, the results are collected and merged. Hive returns the results to the client or to a results table.
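You can inspect the plan Hive builds for a query, including the map/reduce stages, with EXPLAIN before running it. A quick sketch, using the university_rankings table we create further below:

-- Prints the stage plan (map and reduce operators) Hive will execute for this query
EXPLAIN
SELECT country, COUNT(*) AS universities
FROM medium_db.university_rankings
GROUP BY country;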

We can check our YARN interface to understand how the process is working in the background. When you execute the query,

INSERT INTO test_yarn(id, name) VALUES (2, 'dahbest')

ACCEPTED: When the application moves to the ACCEPTED state in YARN, several checks are performed: validation, authorization, queue priority, available resources for the request, and data location. If these checks pass, YARN accepts the request.

RUNNING: After YARN accepts the application, the RUNNING state begins. Containers are requested on the worker nodes, and during this time the worker nodes communicate with each other, so the fragmented task results are collected and returned to the ResourceManager.

FINISHED: In the final step, it completes the process and returns the result.

By the way, this process is the same for Apache Spark. Please don’t limit your thinking to just Hive. :)
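You can follow these state transitions from the command line as well as from the web UI. A small sketch (the application ID below is a placeholder; copy the real one from the output of yarn application -list):

# Shows the state (ACCEPTED / RUNNING / FINISHED) and progress of a single application
yarn application -status application_1695100000000_0001   # placeholder ID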

Let’s create a simple example to understand this ecosystem better:

Firstly, we need to create a directory in HDFS. I'll create this directory as the root user; when working with HDFS, we can specify whatever path we like for our directory.

In Hadoop’s HDFS system, you can use the <hdfs dfs> command to interact with the distributed file system. To create a directory in HDFS, the basic command is indeed <hdfs dfs -mkdir -p>, which is similar to the <mkdir -p> command in Linux for creating directories, including parent directories if they don't exist. This is a useful command when setting up directories for Hadoop and Hive operations.

hdfs dfs -mkdir -p /user/root/datasets_medium
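You can confirm the directory was created with a quick listing of its parent path:

hdfs dfs -ls /user/root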

Secondly, we need to obtain the "cwurData.csv" dataset (world-university-rankings) from Erkan Şirin's GitHub repository and then place it in our "datasets_medium" directory. To achieve this, you can use commands like <wget> or <curl> to download the dataset from the GitHub URL and then use <hdfs dfs -put> to move it to the "datasets_medium" directory in HDFS. Here's an example of how you can do this:
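A minimal sketch, assuming the URL below is replaced with the actual raw-file link to cwurData.csv in that repository (USERNAME/REPO are placeholders):

# Download the CSV to the local filesystem (substitute the real raw-file URL)
wget https://raw.githubusercontent.com/USERNAME/REPO/master/cwurData.csv
# Copy the local file into the HDFS directory we created earlier
hdfs dfs -put cwurData.csv /user/root/datasets_medium/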

The <hdfs dfs -put> command is used to move a file into HDFS storage in a Hadoop environment.

Hive

When you type <hive> in your command line, you can access Hive.

If you are familiar with the MySQL interface, you will easily understand Hive. :)

CREATE DATABASE medium_db;

SHOW DATABASES;

USE medium_db;

Then we must create our table.

CREATE TABLE IF NOT EXISTS medium_db.university_rankings (
world_rank INT,
institution STRING,
country STRING,
national_rank INT,
quality_of_education INT,
alumni_employment INT,
quality_of_faculty INT,
publications INT,
influence INT,
citations INT,
broad_impact STRING,
patents INT,
score FLOAT,
year INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS textfile
tblproperties('skip.header.line.count'='1');

You can check your table’s information:

DESCRIBE medium_db.university_rankings;

Explanation:

  • ROW FORMAT DELIMITED: This clause specifies that the data in the table is delimited, meaning it uses a specific character to separate fields.
  • FIELDS TERMINATED BY ‘,’: This line specifies that the columns in the data are separated by commas (,). It tells Hive how to split the data into columns.
  • LINES TERMINATED BY ‘\n’: This line specifies that lines (rows) in the data are terminated by newline characters (\n). It tells Hive how to separate rows in the data.
  • STORED AS textfile: This line specifies that the table's data format is plain text. This is suitable for CSV data where fields are separated by commas.
  • tblproperties('skip.header.line.count'='1'): This property tells Hive that the first line (the column names) in the data source (usually a CSV or text file) should be skipped when loading data into the table.

We have a database and a table. Now we can load the dataset into our table.

LOAD DATA INPATH '/user/root/datasets_medium/cwurData.csv' INTO TABLE medium_db.university_rankings;

SELECT * FROM university_rankings LIMIT 10;

I believe Hive can be quite challenging when using the <LOAD DATA INPATH…> command, as even minor errors can result in significant problems during data insertion. Your table might return all null values!
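A quick sanity check right after loading can catch that all-NULL situation early. A small sketch:

-- A large count here usually means the delimiter or column types don't match the file
SELECT COUNT(*) AS bad_rows
FROM medium_db.university_rankings
WHERE institution IS NULL;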

Now let’s take a look at the dataset we transferred from the container to Hadoop again using the <hdfs dfs -ls> command.
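For reference, the listing for our original directory looks like this (after <LOAD DATA INPATH…> on a managed table, the CSV will no longer be there, because Hive moves it):

hdfs dfs -ls /user/root/datasets_medium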

When we use the <LOAD DATA INPATH…> command on a managed table, Hive moves the .CSV file into its warehouse. Let's check Hive's warehouse.

hdfs dfs -ls /user/hive/warehouse/DATABASE_NAME.db/TABLE_NAME

Why did Hive move our CSV file to the warehouse?

In Apache Hive, the term "Hive Warehouse" refers to a specific directory or location where Hive manages and stores its data, both for internal (managed) tables (we'll talk about internal and external table types shortly) and for temporary data storage during query processing. The Hive warehouse plays a crucial role in Hive's operation. Here's what it does:

Data Management and Organization: Data warehouses are used to consolidate data from various sources into a single central repository. This enables better management and organization of data. Data from different sources, such as .CSV files, can be merged and organized within the data warehouse.

Performance and Ease of Operations: Data warehouses are optimized for managing large datasets. Data can be indexed, partitioned, and compressed to accelerate queries and simplify analysis. This allows for faster access to and processing of data.

Data Integration: Data warehouses facilitate data integration by bringing together data from different sources. This makes it easier to consolidate data from different systems and enables easier access to data across your entire organization.

Data Storage and Security: Data warehouses can provide data backup, security, and recovery services. This prevents data loss or corruption.

Creating Metadata Accompanying Data: Data warehouses can generate metadata related to the data. Metadata is data that describes the meaning and structure of the data. Metadata facilitates data discovery and usage.

Optimized for Querying and Analysis: Data warehouses are optimized for fast querying and analysis of data. Loading data into a data warehouse enables these data to be queried more quickly and efficiently.

You might still ask why Hive behaves this way. Let's work out why it needs to remove the .CSV file from its original location in the Hadoop Distributed File System (HDFS) and move it to the warehouse.

Because Hive has two table types:

  • Internal tables
  • External tables

When we use the <CREATE TABLE…> command, Hive's default table type is internal; that's why Hive removes the .CSV file from its original location and takes it into the warehouse. Let's go through internal tables to better understand our question.

Internal Table:

Managed by Hive: Internal tables are also known as managed tables. When you create an internal table, Hive takes full control of managing both the metadata (table structure) and the data itself.

Storage Location: The data for internal tables is stored in a Hive-managed directory in Hadoop Distributed File System (HDFS). Hive decides where to store this data, and it’s typically stored in a directory within the Hive warehouse directory.

Data Deletion: If you drop an internal table, Hive will also delete the associated data from the HDFS location. This means that the data is tightly coupled with the table, and it can be lost if the table is dropped.
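You can verify the table type and its storage location directly from Hive:

-- "Table Type" shows MANAGED_TABLE for internal tables, and "Location" points into the Hive warehouse directory
DESCRIBE FORMATTED medium_db.university_rankings;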

That’s why our .CSV file is moved to the Hive warehouse. :) However, we should also learn about External Tables; let’s explain those too.

External Table:

Not Managed by Hive: External tables are tables where Hive manages only the metadata (table structure) but not the data itself. The data is stored externally in a location specified by the user.

Storage Location: The data for external tables is stored outside of the Hive-managed warehouse directory. Users define the storage location, which can be in HDFS, a different Hadoop-compatible file system, or even on a local file system.

Data Persistence: When you drop an external table, only the metadata is deleted, not the data itself. This means the data remains intact in the specified external location. External tables provide a layer of separation between the metadata and data.

Let's make a simple example:

Creating External Table:

CREATE EXTERNAL TABLE IF NOT EXISTS medium_db.university_rankings_external (
world_rank INT,
institution STRING,
country STRING,
national_rank INT,
quality_of_education INT,
alumni_employment INT,
quality_of_faculty INT,
publications INT,
influence INT,
citations INT,
broad_impact STRING,
patents INT,
score FLOAT,
year INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS textfile
LOCATION '/user/root/datasets_medium';

P.S. If you use Amazon Athena, you'll come across these table types there as well! :)

Load the dataset:

LOAD DATA INPATH '/user/root/datasets_medium/cwurData.csv' INTO TABLE medium_db.university_rankings_external;

Most of the column values came back as NULL; the difficulty I mentioned earlier with the <LOAD DATA…> command was exactly this. Anyway, let's get back to our discussion about external tables. :)

Remember that for internal tables, Hive moves our .CSV file into the warehouse when we load it. Let's check our path right now.

As you can see, we performed the storage operation not in the warehouse but in our own specified location.
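A sketch of how the two table types differ when dropped (run these only if you are done with the data):

-- Dropping the external table removes only the metadata; cwurData.csv stays in /user/root/datasets_medium
DROP TABLE medium_db.university_rankings_external;

-- Dropping the internal table also deletes its data from the Hive warehouse directory
DROP TABLE medium_db.university_rankings;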

When you try to combine the <LOCATION…> clause with an internal table this way, you will encounter an error. (Note that INTERNAL is not actually a Hive keyword; a plain <CREATE TABLE…> already creates a managed, i.e. internal, table, so the statement below fails.)

CREATE INTERNAL TABLE IF NOT EXISTS medium_db.university_rankings_internal (
world_rank INT,
institution STRING,
country STRING,
national_rank INT,
quality_of_education INT,
alumni_employment INT,
quality_of_faculty INT,
publications INT,
influence INT,
citations INT,
broad_impact STRING,
patents INT,
score FLOAT,
year INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS textfile
LOCATION '/user/root/datasets_medium'

I provided a general explanation, and some parts became detailed, but I wanted to touch on different aspects in this article. If you have any new perspectives or notice any logical errors anywhere, please contact me.

I hope you now have a clearer understanding of the Hadoop ecosystem, Hive, and the differences between Hive table types. If you want to ask me anything about the Hadoop ecosystem, please feel free to contact me. :)

Thank you…
