Understanding Hadoop Hive

ELMASLOUHY Mouaad · Published in The Startup · May 13, 2020

Hive is a data warehouse system used for querying and analysing large datasets stored in HDFS. It processes structured and semi-structured data in Hadoop.

Hive Architecture

Metastore: stores the metadata for Hive tables and partitions (such as their schema and location) in a relational database (a traditional RDBMS).

Driver: acts like a controller that receives the HiveQL statements. It monitors the life cycle and the progress of the execution of each HiveQL statement, and it stores the metadata generated along the way. It coordinates the compilation, optimization, and execution of HiveQL statements.

Compiler: performs the compilation of the HiveQL query, converting it into an execution plan that consists of tasks (MapReduce jobs).

Optimizer: performs various transformations on the execution plan to produce an optimized plan, for example aggregating transformations together, such as converting a pipeline of joins into a single join.

Executor: once compilation and optimization are complete, the executor runs the tasks.

Thrift application: a software framework that allows external clients to interact with Hive over a network, similar to the JDBC or ODBC protocols.

Beeline: a command shell supported by HiveServer2, where users can submit their queries and commands to the system.

Hive Server 2: an enhanced version of Hive Server 1 which allows multiple clients to submit requests to Hive and retrieve the final results. It is designed to provide the best support for open API clients like JDBC, ODBC, and Thrift.

The steps to execute an HQL statement

1. executeQuery: The user interface calls the driver to execute the HQL statement (query).

2. getPlan: The driver accepts the query, creates a session handle for it, and passes it to the compiler to generate the execution plan.

3. getMetaData: The compiler sends the metadata request to the metastore.

4. sendMetaData: The metastore sends the metadata to the compiler.

The compiler uses this metadata to perform type-checking and semantic analysis on the expressions in the query tree. It then generates the execution plan, a directed acyclic graph (DAG). For MapReduce jobs, the plan contains map operator trees (operator trees executed on the mappers) and reduce operator trees (operator trees executed on the reducers).

5. sendPlan: The compiler then sends the generated execution plan to the driver.

6. executePlan: After receiving the execution plan from the compiler, the driver sends it to the execution engine for execution.

7. submit job to MapReduce: The execution engine then sends these stages of the DAG to the appropriate components. For each task, either mapper or reducer, the deserializer associated with the table or intermediate output is used to read the rows from HDFS files. These are then passed through the associated operator tree.

Once the output is generated, it is written to a temporary HDFS file through the serializer. These temporary HDFS files are then used to provide data to the subsequent MapReduce stages of the plan. For DML operations, the final temporary file is moved to the table’s location.

8, 9, 10. sendResults: For queries, the execution engine reads the contents of the temporary files directly from HDFS as part of a fetch call from the driver. The driver then sends the results to the Hive interface.
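To see the plan that the compiler produces for a query, you can prefix it with EXPLAIN. A minimal sketch, assuming a hypothetical employees table:

EXPLAIN
SELECT dept, COUNT(*) AS cnt
FROM employees
GROUP BY dept;

-- the output describes the DAG of stages: a MapReduce stage with its
-- map and reduce operator trees, followed by a fetch stage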

Hive Data Model

Data in Apache Hive can be categorized into:

Table

Hive tables are the same as the tables present in a relational database. When we create a table, Hive by default manages the data, meaning it moves the data into its warehouse directory (we talk about managed tables). We can also create an external table, which tells Hive to refer to data that is at an existing location outside the warehouse directory (we talk about external tables).
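A minimal sketch of both variants in HiveQL (the table names, columns, and HDFS path are hypothetical):

CREATE TABLE managed_logs (ts STRING, msg STRING);
-- managed: loaded data is moved under Hive's warehouse directory,
-- and DROP TABLE deletes both metadata and data

CREATE EXTERNAL TABLE external_logs (ts STRING, msg STRING)
LOCATION '/data/raw/logs';
-- external: Hive only references the files at this location,
-- and DROP TABLE deletes the metadata but leaves the data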

Partition

Hive organizes tables into partitions, grouping the same type of data together based on one or more partition keys that identify a particular partition.

Bucket

Tables or partitions are subdivided into buckets based on a hash function of a column in the table, to give extra structure to the data that may be used for more efficient queries.

Example:
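A minimal sketch in HiveQL combining both ideas, with hypothetical names; the table is partitioned by country and its rows are hashed into four buckets on user_id:

CREATE TABLE users (
user_id INT,
name STRING
)
PARTITIONED BY (country STRING)
CLUSTERED BY (user_id) INTO 4 BUCKETS;

-- each partition becomes its own HDFS subdirectory (e.g. .../users/country=MA/),
-- and within a partition, rows are split across 4 bucket files by hash(user_id)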

Hive Data Types

Hive Primitive Data Types
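Hive’s primitive types cover numbers (TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL), text (STRING, VARCHAR, CHAR), BOOLEAN, BINARY, and dates/times (TIMESTAMP, DATE). A minimal sketch using a few of them (the table and columns are hypothetical):

CREATE TABLE products (
id INT,
name STRING,
price DECIMAL(10,2),
in_stock BOOLEAN,
added_on DATE
);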

Hive Complex Data Types

struct: similar to a struct in the C language; a collection of named fields.

STRUCT<col_name1 : data_type1, col_name2 : data_type2,...>

union: a value that holds exactly one of several heterogeneous data types.

UNIONTYPE<data_type1, data_type2, ...>
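(Hive also provides the ARRAY and MAP complex types.) A minimal sketch of a table with a struct column, and the dot notation used to read a nested field; the names are hypothetical:

CREATE TABLE employees (
name STRING,
addr STRUCT<street:STRING, city:STRING>
);

SELECT name, addr.city FROM employees;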

Different modes of Hive

Hive operates in two modes, depending on the number of data nodes and the size of the data.

Local Mode: used when Hadoop has a single data node and the data is small. Processing is very fast on smaller datasets that are present on the local machine.

MapReduce Mode: used when Hadoop has multiple data nodes and the data is spread across them. Processing large datasets is more efficient in this mode.
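Hive can also pick local mode automatically per query when the input is small. A minimal sketch using Hive’s configuration properties (the threshold shown is just an example value):

SET hive.exec.mode.local.auto=true;
SET hive.exec.mode.local.auto.inputbytes.max=134217728; -- 128 MB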

Hive Cheat Sheet for SQL Users

Query
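Everyday SQL carries over to HiveQL almost unchanged. A minimal sketch (table and column names are hypothetical):

SELECT dept, AVG(salary) AS avg_salary
FROM employees
WHERE hire_year >= 2015
GROUP BY dept
ORDER BY avg_salary DESC
LIMIT 10;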

Metadata
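Where an RDBMS exposes system catalogs, Hive has dedicated commands; for example:

SHOW DATABASES;
SHOW TABLES;
DESCRIBE employees; -- column names and types
DESCRIBE FORMATTED employees; -- full metadata, including the HDFS location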

Command Line
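Queries can also be run non-interactively from the shell; a minimal sketch (the connection URL and file name are hypothetical):

beeline -u jdbc:hive2://localhost:10000 -e "SHOW TABLES;"
hive -f my_script.hql # run all the statements in a script file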

Thanks for Reading!
