Clarifying the Flow: Hive Architecture and Queries
Introduction
Hello all, I’m going to walk you through “Clarifying the Flow: Hive Architecture and Queries”. As prerequisites:
1- An ECS (a Windows machine image) will be installed in Huawei Cloud (region X)
2- A VPC will be installed in Huawei Cloud (it must be in the same region X)
3- Hadoop will be installed on the machine in Huawei Cloud
4- Hive will be installed on the machine in Huawei Cloud
and then we will get started with Hive and investigate the architecture details. Enjoy the read. ☕
What Is Hive?
Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale. The Hive Metastore (HMS) is a critical component of many data lake architectures because it provides a central repository of metadata that can be easily analyzed to make data-driven decisions. Hive is built on Apache Hadoop and supports storage on HDFS as well as on S3, ADLS, GS, and other platforms. Hive users can read, write, and manage petabytes of data using SQL.
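To make that last point concrete, here is a minimal HiveQL sketch; the demo database, the events table, and its columns are illustrative names I made up for this example, not part of the setup described in this article:

-- create a database and a simple managed ORC table (illustrative names)
CREATE DATABASE IF NOT EXISTS demo;
CREATE TABLE IF NOT EXISTS demo.events (
  event_id BIGINT,
  event_type STRING,
  event_day STRING
)
STORED AS ORC;

-- write and read data with plain SQL
INSERT INTO demo.events VALUES (1, 'click', '2023-01-01');
SELECT event_type, COUNT(*) AS cnt FROM demo.events GROUP BY event_type;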
Advantages Of Using Hive
✅HS2 (HiveServer2): HS2 supports authentication and multi-client concurrency. It is designed to provide better support for open-API clients such as JDBC and ODBC.
✅Hive Metastore Server (HMS): The Hive Metastore (HMS) is a central repository of metadata for Hive tables and partitions, stored in a relational database. Clients (such as Hive, Impala, and Spark) access this metadata through the metastore service API. It has become a building block for data lakes that use a wide range of open-source tools, including Apache Spark and Presto. In fact, the Hive Metastore is surrounded by a vast ecosystem of tools, both open-source and proprietary.
✅Hive ACID: Hive provides full ACID support for ORC tables out of the box, and insert-only support for all other formats (a small HiveQL sketch follows this list).
✅Hive Data Compaction: Query-based and MR-based data compactions for Hive are supported out of the box.
✅Hive Replication: For backup and recovery, Hive offers bootstrap and incremental replication.
✅Security and Observability: For security and observability, Apache Hive interfaces with Apache Ranger and Apache Atlas and supports Kerberos authentication.
✅Hive LLAP: Low Latency Analytical Processing (LLAP), introduced in Hive 2.0, makes Hive faster by using a persistent query infrastructure and improved data caching, allowing Apache Hive to support interactive and sub-second SQL.
✅Query planner and Cost based Optimizer: To optimize SQL queries, Hive makes use of Apache Calcite’s cost-based query optimizer (CBO) and query execution framework.
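To make the ACID and compaction items concrete, here is a minimal HiveQL sketch. The demo.orders table is a hypothetical example, and it assumes ACID support is already enabled on the cluster (transaction manager, concurrency settings, and so on):

-- a full-ACID table must be stored as ORC and flagged as transactional
CREATE TABLE IF NOT EXISTS demo.orders (
  order_id BIGINT,
  status STRING
)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- row-level UPDATE and DELETE only work on ACID tables
INSERT INTO demo.orders VALUES (42, 'created');
UPDATE demo.orders SET status = 'shipped' WHERE order_id = 42;

-- ask Hive to compact the delta files produced by the writes above
ALTER TABLE demo.orders COMPACT 'major';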
So, how did we go about it?
We have a few steps for the demo of the Hive architecture; they are listed below:
Step 1: Check Java — Hadoop — Hive versions
Step 2: Understand what kind of architecture we will look at closely
Step 3: Case Study for Hive (Running a Query on Hive)
Let’s start 👇
Step 1: Make sure that the Java, Hadoop, and Hive versions available on our machine meet the minimum requirements:
For Hadoop:
The Hadoop version should be at least 3.x.x
Open the cmd and run:
hadoop version
For Java:
The Java version should be at least 1.8.x.xxx
Open the cmd and run:
java -version
Here we should see that both installations meet the minimum requirements.
For Hive:
The Hive version should be at least 3.x.x
Open the cmd and run:
hive --version
If everything is OK, move on to Step 2 🙂
Step 2: It is important to understand what kind of architecture we are going to demo so that we can interpret it correctly.
There are five parts to Hive, one of which depends on the Hadoop framework. Now let’s examine these elements.
1-UI (User Interface): Users submit their queries to the system via the UI. This is the end where users communicate with Hive directly and where we type our commands. A Hive query can be sent in three different ways.
The first is the Hive CLI (command line interface). The second is the Hive web interface. The third is the Thrift server, which means interacting with Hive through any application’s JDBC or ODBC connection.
2-Driver: This component receives the query from the UI. Its first responsibility is to provide the execute and fetch APIs for the query, modelled on the JDBC or ODBC interfaces. Its second task is to get the Hive query converted into a MapReduce program.
Since Hive queries are ultimately run as MapReduce programs, the driver, with a little assistance from the compiler, transforms Hive queries into MapReduce programs.
3-Compiler: The compiler plays its part in converting Hive queries into MapReduce programs. It also performs a semantic analysis of the query and then uses the metastore to generate an execution plan (we will inspect such a plan with EXPLAIN in Step 3).
4-Metastore: The metastore stores the structural information about our tables: partitions, number of columns, data types, serializers and deserializers, and so on. It is not a large database, because it holds only this metadata rather than the data itself; all of this structural information lives in the metastore (see the small DESCRIBE FORMATTED sketch after this breakdown).
Hive uses the Apache Derby SQL database as the metastore by default (https://db.apache.org/derby/#What+is+Apache+Derby%3F). We do not use Derby for real-time projects because it offers only single-process storage, which means that with Derby we cannot run two Hive CLI instances at the same time.
In a real-time context we use MySQL or another robust database as the metastore, which allows several Hive CLI instances to run at once.
5-Execution Engine: This is the part connected to the Hadoop framework. It executes the execution plan that the compiler produced. To obtain the desired output from HDFS, it talks to the NameNode and the Resource Manager, and then returns the query results to the user.
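To see parts 3 and 4 from the outside, any Hive session can ask the metastore what it knows about a table. A minimal sketch, reusing the hypothetical demo.events table from the earlier example:

-- ask the metastore for the table's metadata: columns, data types, location, SerDe, table properties
DESCRIBE FORMATTED demo.events;

-- for a partitioned table, the metastore also tracks the partition list
-- SHOW PARTITIONS demo.events;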
Step 3: Case Study for Hive (Running a Query on Hive)
This article is built on a detailed examination of how a query works; I will try to convey and explain it step by step.
Note: The example data and any other personal information used here are dummy.
SELECT * FROM Huawei.Employees;
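So that the query above has something to run against, here is a minimal HiveQL setup; the Huawei database, the Employees table, and every row in it are dummy, in line with the note above:

-- dummy database and table for the case study
CREATE DATABASE IF NOT EXISTS huawei;
CREATE TABLE IF NOT EXISTS huawei.employees (
  employee_id INT,
  full_name STRING,
  department STRING
)
STORED AS ORC;

-- a couple of dummy rows
INSERT INTO huawei.employees VALUES
  (1, 'Jane Doe', 'Data Platform'),
  (2, 'John Doe', 'Analytics');

-- the query we will trace through the architecture
SELECT * FROM huawei.employees;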
According to the architecture, the steps are as follows (a small EXPLAIN sketch follows the list):
1-executeQuery: The command line or web UI sends the query to the driver (through any Hive interface, such as a JDBC or ODBC database driver) to execute the query.
2-getPlan: The driver takes the help of the query compiler, which parses the query to check the syntax and to build the query plan, i.e., the requirements of the query.
3-getMetaData: The compiler sends a metadata request to the Metastore (any database).
4-sendMetaData: Metastore sends the metadata to the compiler in response.
5-sendPlan: The compiler checks the requirements and resends the plan to the driver. At this point, the parsing and compiling of the query is complete.
6-executePlan: The driver sends the execution plan to the execution engine.
6.1-metaDataOps (on Hive): During execution, the execution engine can carry out metadata operations with the metastore.
6.1-executeJob (on Hadoop): Internally, the execution of the plan is a MapReduce job. The execution engine sends the job to the JobTracker (on the name node), which assigns it to TaskTrackers (on the data nodes). Here, the query is executed as a MapReduce job.
6.2-jobDone: After the MapReduce job in Hadoop is finished, a message is sent back to the execution engine to signal that the job is done.
6.3-dfsOperations: DFS operations are performed between the execution engine and the NameNode, using the client’s reported user and group permissions.
7-fetchResults: The Hive interface (UI) calls fetchResults on the driver to collect the query results.
8-sendResults: The execution engine sends the resultant values to the driver, and the driver sends the results back to the Hive interfaces.
9-fetchResults: For a query, the execution engine reads the contents of the temporary output files directly from HDFS as part of this fetch call.
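To watch steps 2 through 6 for this exact query, you can ask Hive to print the plan that the compiler hands to the execution engine, and optionally force the classic MapReduce engine described in this flow. A small sketch, assuming the dummy huawei.employees table created earlier:

-- run the query on the MapReduce execution engine described above
SET hive.execution.engine=mr;

-- print the plan the compiler builds using the metastore metadata;
-- a plain full-table scan stays simple, while joins and aggregations show MapReduce stages
EXPLAIN SELECT * FROM huawei.employees;

-- finally, run the query itself
SELECT * FROM huawei.employees;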
Conclusion
Using this SQL query as the running example, I tried to explain in a simple way what each step does, going into the details of the internal mechanisms that take place in Hive and Hadoop.
If you have any thoughts or suggestions please feel free to comment or if you want, you can reach me at guvezhakan@gmail.com, I will try to get back to you as soon as I can.
You can reach me through LinkedIn too.
Hit the clap button 👏👏👏 or share it ✍ if you like the post.