FUNDAMENTALS OF APACHE HIVE

Mert Goktas
Published in Huawei Developers · Jul 3, 2023

Introduction

Hello everyone. In this article, I am going to talk about Apache Hive. I hope it helps you when using Hive in your Big Data operations. If your coffee or tea is ready, we can start. Enjoy reading.

What is Hive?

Hadoop stores data in HDFS, the Hadoop Distributed File System. Hive makes it possible to query this data stored in HDFS using a language very similar to SQL. To examine and process data held in HDFS and run calculations over it, a SQL-like query layer, Hive, was developed. Hive was originally developed by Facebook and was later taken over by the Apache Software Foundation.
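
To give a feel for how SQL-like Hive queries look, here is a minimal sketch; the customers table and its columns are hypothetical names for illustration, not from this article:

-- A Hive query: reads data laid out as files in HDFS
-- as if it were a relational table.
SELECT name, city
FROM customers
WHERE city = 'Istanbul'
LIMIT 10;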

Which Data Does Hive Process?

Well, we can store all kinds of data in HDFS, but we do not process all of it with Hive. Hive can only process structured data, that is, data that can be arranged into tables, rows, and columns. Since query processing on Hive is not very fast, it should be chosen for the amount of data it can handle rather than for speed. Large volumes of data can be processed through Hive, which makes it effective for batch processing. Hive is essentially a middleman between MapReduce and HDFS.

Hive is not a database. So, why use Hive? Because writing complex functions directly as MapReduce code is hard, and Hive lets us express them much more easily. That is why Hive can be so effective in Big Data processing. Hive abstracts away the HDFS platform and the MapReduce code, allowing us to query with familiar SQL commands without needing any Java knowledge.
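
As a sketch of how much Hive condenses, here is an aggregation that would otherwise require a hand-written MapReduce job in Java; the orders and customers tables are assumed names for illustration:

-- A join plus aggregation: Hive compiles this into MapReduce
-- (or Tez) jobs behind the scenes, no Java required.
SELECT c.city, SUM(o.amount) AS total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY c.city;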

Apache Tez has replaced MapReduce as the default Hive execution engine. Executing Hive queries on Tez improves performance by expressing them as directed acyclic graphs (DAGs) and using efficient data transfer primitives.
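
The execution engine can be switched per session through a standard Hive configuration property. A minimal sketch:

-- Run subsequent queries on Tez instead of classic MapReduce.
SET hive.execution.engine=tez;

-- Print the current value to verify the setting.
SET hive.execution.engine;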

SQL queries you submit to Hive are executed as follows:

  • Hive compiles the query
  • Tez executes the query
  • YARN allocates resources for applications across the cluster and enables authorization for Hive jobs in YARN queues
  • Hive updates the data in HDFS or the Hive warehouse, depending on the table type
  • Hive returns query results over a JDBC connection.
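
You can observe the compile step yourself with Hive's EXPLAIN command, which prints the execution plan (on Tez, a DAG of stages) without running the query. The table name below is a placeholder:

-- Show the compiled plan instead of executing the query.
EXPLAIN
SELECT city, COUNT(*) AS cnt
FROM customers
GROUP BY city;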

Architecture of Hive


The architecture of Hive is described below. We start with the Hive client, which could be a programmer proficient in SQL who needs to look up data.

1- Hive Clients

Thrift is a software framework for cross-language service development. The Hive Server is based on Thrift, so it can serve requests from any programming language that supports Thrift.

We have the JDBC (Java Database Connectivity) application and Hive JDBC Driver. The JDBC application is connected through the JDBC Driver.

We have an ODBC (Open Database Connectivity) application connected through the ODBC Driver. All these client requests are submitted to the Hive server.

2- Hive Services

We have the Hive web interface, or GUI, where programmers execute Hive queries; commands can also be executed directly in the CLI. Next is the Hive driver, which is responsible for all the queries submitted. Internally, it performs three steps:

  • Compiler: The Hive driver passes the query to the compiler, where it is checked and analyzed.
  • Optimizer: The logical plan is optimized into a graph of MapReduce and HDFS tasks.
  • Executor: In the final step, the tasks are executed.

The metastore is a repository for Hive metadata. It stores metadata for Hive tables, and you can think of it as your schema. By default, it is kept in an embedded Apache Derby database.
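
It is the metastore that answers catalog-style commands such as the following (the table name is a placeholder):

-- All of these are served from metastore metadata,
-- not from the data files in HDFS.
SHOW DATABASES;
SHOW TABLES;
DESCRIBE FORMATTED customers;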

3- Processing and Resource Management

Hive uses the MapReduce framework to process queries. For storage, we have HDFS, the distributed file system. If you have read our other Hadoop blogs, you'll know that HDFS runs on commodity machines and scales linearly, which makes it very affordable.

Data Flow in Hive


Data flows through Hive in the following order:

  • A query is submitted and goes to the driver for execution.
  • Next, the driver asks the compiler for a plan describing how the query will be executed.
  • The compiler then requests the metadata it needs from the metastore, and the metastore responds with that metadata.
  • The compiler collects this information and sends the plan back to the driver.
  • The driver sends the execution plan to the execution engine.
  • The execution engine acts as a link between Hive and Hadoop to process the query.
  • The execution engine also communicates bi-directionally with the metastore to perform various operations such as creating and dropping tables.
  • Finally, we have a two-way communication to receive the results and send them back to the client.

Hive Data Modelling

Hive data modeling consists of tables, partitions, and buckets:


Tables

Tables in Hive are created in much the same way as in an RDBMS.
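
A minimal, hypothetical CREATE TABLE, to show how familiar the DDL is (the table and column names are assumptions):

-- Standard RDBMS-style DDL, plus Hive storage clauses.
CREATE TABLE employees (
  id     INT,
  name   STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;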

Partitions

Tables are organized into partitions for grouping similar types of data based on the partition key.
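
A sketch of a partitioned table (names are illustrative): each distinct partition-key value maps to its own HDFS subdirectory, so filters on the key can skip whole directories.

CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (view_date STRING);

-- Only the matching partition directory is scanned.
SELECT url FROM page_views WHERE view_date = '2023-07-03';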

Buckets

Data present in partitions can be further divided into buckets for efficient querying.
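
A hedged example of bucketing, with assumed table and column names:

-- Rows are hashed on user_id into a fixed number of files (buckets),
-- which helps with sampling and bucketed map-side joins.
CREATE TABLE page_views_bucketed (
  user_id   BIGINT,
  url       STRING,
  view_date STRING
)
CLUSTERED BY (user_id) INTO 32 BUCKETS;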

Hive Data Types

Hive provides two groups of data types, primitive and complex:

Primitive Data Types

✅ Numeric Data types — Data types like integral, float, decimal

✅ String Data type — Data types like char, string

✅ Date / Time Data type — Data types like timestamp, date, interval

✅ Miscellaneous Data type — Data types like Boolean and binary
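
A small sketch pulling the primitive types together in one hypothetical table:

CREATE TABLE sensor_readings (
  sensor_id   INT,            -- integral
  temperature FLOAT,          -- floating point
  price       DECIMAL(10,2),  -- fixed-precision decimal
  label       STRING,         -- string
  code        CHAR(4),        -- fixed-length char
  reading_ts  TIMESTAMP,      -- date/time
  reading_day DATE,
  is_valid    BOOLEAN,        -- miscellaneous
  raw_payload BINARY
);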

Complex Data Types

✅ Arrays — A collection of the same entities. The syntax is: array<data_type>

✅ Maps — A collection of key-value pairs and the syntax is map<primitive_type, data_type>

✅ Structs — A collection of named fields, each of which can carry a comment. Syntax: struct<col_name : data_type [COMMENT col_comment], …>

✅ Union types — A collection of heterogeneous data types. Syntax: uniontype<data_type, data_type, …>
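
And a hypothetical table combining the complex types, with a query showing how each one is accessed (uniontype is omitted here since its query-side support is limited):

CREATE TABLE employee_profiles (
  name    STRING,
  skills  ARRAY<STRING>,
  phones  MAP<STRING, STRING>,
  address STRUCT<street: STRING, city: STRING>
);

-- Index into the array, look up by map key,
-- and use dot notation for struct fields.
SELECT name,
       skills[0],
       phones['home'],
       address.city
FROM employee_profiles;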

Conclusion

In this article, we covered the fundamentals of Apache Hive. We sought answers to questions such as “What is Hive?” and “How are Hive queries executed?”.
