Hive Architecture Demystified: How It Works Internally and Its Components

Maan Singh
4 min read · Apr 27, 2023


As a big data enthusiast, I am always looking for ways to store, manage, and analyze large volumes of data. One tool that has caught my attention is Apache Hive. It is a data warehousing tool built on top of Hadoop that provides a SQL-like interface to query and analyze data stored in Hadoop Distributed File System (HDFS). In this blog post, I will provide you with a comprehensive guide to Hive architecture, how it works internally, and its components.

What is Hive Architecture?

Hive architecture is a combination of the following main components:

1. Hive Client

2. Hive Services

3. Processing and Resource Management

4. Distributed Storage

[Image: Hive architecture and its components]

Hive Client

The Hive client is the interface through which users interact with Hive. Applications written in languages such as Python, Java, C++, or Ruby can run queries against Hive using the JDBC, ODBC, and Thrift drivers; a minimal JDBC example follows the list of client types below.

Hive clients are categorized into three types:

1. Thrift Clients — based on Apache Thrift

2. JDBC Clients — use JDBC drivers to connect Java applications to Hive

3. ODBC Clients — use ODBC drivers to connect applications that speak the ODBC protocol
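For example, a Java application can connect over JDBC to HiveServer2 (described in the next section). Here is a minimal sketch, assuming HiveServer2 is reachable on localhost:10000 and the hive-jdbc driver is on the classpath; the host, port, and credentials are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 is addressed via the hive2 JDBC URL scheme.
        // Host, port, database, user, and password are placeholders.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            // Each row of the result set is one table name in the database.
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```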

Hive Services

This layer provides the services Hive uses to accept, plan, and run queries.

The main services offered by Hive are:

1. Beeline

Beeline is a CLI (Command Line Interface) supported by HiveServer2, through which users can submit their queries to the system. It is a JDBC client based on the SQLLine CLI.

2. Hive Server 2

HiveServer2 is the successor of HiveServer1 (the original Thrift server). It enables clients to execute queries against Hive, and multiple clients can submit requests at the same time.

It also provides first-class support for open-API clients such as JDBC and ODBC.

3. Hive Driver

The Hive driver receives the HQL statements submitted by the client (for example, via Beeline).

The driver creates a session handle for the query, hands the query to the compiler for parsing and plan generation, coordinates optimization of the plan, and finally submits the generated plan to the execution engine.

4. Hive Compiler

The Hive compiler parses the query. It performs semantic analysis and type checking on the query blocks and query expressions, using the metadata stored in the metastore, and generates an execution plan.

The execution plan is a DAG (Directed Acyclic Graph) of stages, where each stage is typically a map/reduce job.
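A convenient way to see the plan the compiler produces is to prefix a query with EXPLAIN. Below is a small sketch, assuming a HiveServer2 connection obtained as in the earlier JDBC example; the sales table and its columns are hypothetical:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: print the plan Hive's compiler generates for a query.
// `conn` is a HiveServer2 connection as in the earlier JDBC example;
// the `sales` table and its columns are hypothetical.
static void printPlan(Connection conn) throws Exception {
    try (Statement stmt = conn.createStatement();
         ResultSet plan = stmt.executeQuery(
             "EXPLAIN SELECT region, SUM(amount) FROM sales GROUP BY region")) {
        // Each row is one line of the textual plan, listing the stages
        // of the DAG and the dependencies between them.
        while (plan.next()) {
            System.out.println(plan.getString(1));
        }
    }
}
```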

5. Optimizer

The optimizer performs transformation operations on the execution plan, for example splitting tasks and reordering or pruning work, to improve efficiency and scalability.
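Partition pruning is one concrete example of such a transformation. A minimal sketch, assuming a HiveServer2 Statement obtained as in the JDBC example above; the sales_part table and its columns are hypothetical:

```java
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: a partitioned table lets the optimizer prune partitions.
// `stmt` comes from a HiveServer2 connection as in the JDBC example;
// table and column names are hypothetical.
static double sumForDay(Statement stmt, String day) throws Exception {
    stmt.execute(
        "CREATE TABLE IF NOT EXISTS sales_part (id INT, amount DOUBLE) " +
        "PARTITIONED BY (sale_date STRING)");

    // Because the filter is on the partition column, the optimizer can
    // restrict the scan to that single partition instead of reading the
    // whole table.
    try (ResultSet rs = stmt.executeQuery(
            "SELECT SUM(amount) FROM sales_part WHERE sale_date = '" + day + "'")) {
        return rs.next() ? rs.getDouble(1) : 0.0;
    }
}
```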

6. Execution Engine

After the compilation and optimization steps, the execution engine is responsible for running the physical/execution plan generated by the compiler.

7. Metastore

The Hive Metastore is a central repository that stores metadata about Hive tables and partitions, including column names and column types.

It also stores the serializer and deserializer (SerDe) information required for read/write operations, and the locations of the HDFS files where the actual data is kept.

The metastore is generally backed by a relational database management system.

It provides a Thrift interface for querying and manipulating Hive metadata.
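For illustration, the metastore can be queried programmatically over this Thrift interface using the HiveMetaStoreClient class. A minimal sketch, assuming a hive-site.xml with the metastore URI is on the classpath and that a hypothetical default.sales table exists:

```java
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.metastore.api.Table;

public class MetastoreExample {
    public static void main(String[] args) throws Exception {
        // Reads the metastore URI from hive-site.xml on the classpath.
        HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
        try {
            // List the tables registered in the 'default' database.
            for (String name : client.getAllTables("default")) {
                System.out.println(name);
            }
            // Fetch one table's metadata: its columns, types, and the
            // HDFS location where its data files live.
            // The 'sales' table is hypothetical.
            Table t = client.getTable("default", "sales");
            System.out.println("location: " + t.getSd().getLocation());
            for (FieldSchema col : t.getSd().getCols()) {
                System.out.println(col.getName() + " : " + col.getType());
            }
        } finally {
            client.close();
        }
    }
}
```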

Processing Framework and Resource Management

Hive uses the MapReduce framework as its internal engine for executing queries, while resource management on the cluster is handled by YARN.

MapReduce is a software framework for writing applications that process massive amounts of data in parallel on large clusters of commodity hardware. A MapReduce job works by splitting the input data into chunks, which are processed in parallel by map tasks and then aggregated by reduce tasks; the sketch below shows roughly how a grouped aggregation maps onto this model.
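As a rough, hand-written illustration (not the code Hive actually generates, and with an assumed CSV column layout), a query such as SELECT region, SUM(amount) FROM sales GROUP BY region corresponds to a map phase and a reduce phase like this:

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each mapper reads one split (chunk) of the input files
// and emits (region, amount) pairs. Assumes lines like "id,region,amount".
class GroupByMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] cols = value.toString().split(",");
        ctx.write(new Text(cols[1]), new DoubleWritable(Double.parseDouble(cols[2])));
    }
}

// Reduce phase: all amounts for the same region arrive at one reducer,
// which emits the aggregated sum, i.e. one row of the query result.
class GroupBySumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text region, Iterable<DoubleWritable> amounts, Context ctx)
            throws IOException, InterruptedException {
        double sum = 0;
        for (DoubleWritable a : amounts) {
            sum += a.get();
        }
        ctx.write(region, new DoubleWritable(sum));
    }
}
```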

Distributed Storage

Hive is built on top of Hadoop; hence it uses HDFS (Hadoop Distributed File System) as its distributed storage.

Hive Working Internally

When a user submits a query to Hive, several steps happen internally, including:

executeQuery: The user interface calls the execute interface on the driver.

getPlan: The driver accepts the query, creates a session handle for it, and passes it to the compiler to generate an execution plan.

Parsing: The Hive compiler parses the SQL-like query and converts it into an internal representation called an Abstract Syntax Tree (AST). It also sends a metadata request to the metastore, which returns the required metadata to the compiler.

Semantic Analysis: The compiler performs semantic analysis on the AST to ensure that the query is semantically correct, for example that the referenced tables and columns exist and that the types are consistent.

Query Optimization: The compiler optimizes the query by applying various optimization techniques, such as predicate pushdown, join reordering, and column pruning. The compiler then sends the plan to the driver.

Query Execution: The driver submits the optimized logical plan to the execution engine. The execution engine converts the logical plan into a physical plan and sends the stages of the DAG to the appropriate components of Hadoop.

Result Set Processing: Once the execution is complete, the result set is returned to the user through the client interface.
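Putting these steps together, here is a minimal end-to-end sketch from the client's point of view: the query is submitted over JDBC, Hive parses, compiles, optimizes, and executes it on the cluster, and the rows come back as a result set. The host, table, and column names are assumptions for illustration:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveEndToEnd {
    public static void main(String[] args) throws Exception {
        // Placeholder HiveServer2 URL and credentials.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // executeQuery -> driver -> compiler/optimizer -> execution engine.
             ResultSet rs = stmt.executeQuery(
                 "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")) {
            // Result set processing: rows are streamed back to the client
            // once the underlying job(s) have finished.
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}
```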

Conclusion

In conclusion, Hive is a powerful data warehousing solution that provides an SQL-like interface to query large datasets stored in Hadoop. The working of Hive involves parsing and optimizing queries, compiling them into physical plans, executing the plans in the Hadoop cluster, and returning the results to the user. The Hive Metastore plays a crucial role in storing metadata information about the data, which is used by the query engine to validate and optimize the queries. With this knowledge, you can now dive deeper into Hive and use it to analyze your big data.
