Now HIVE concept made easy !

R RAMYA
9 min readApr 30, 2022

--

Let’s dig deep into the concept of hive…

image

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.

In simple words….

Data processing done through query language is called hive.

image

Before further learning, do refer my blog on hadoop and big data https://medium.com/@uec18332/hadoop-architecture-big-data-c56022e40d91 to get the flow of the concept.

You might have certain ideas on what hive is !

But you may not know what hive is not.

image

Hive Architecture

image

HiveQL

The Hive Query Language (HiveQL) is a query language for Hive to process and analyze structured data in a Metastore.

SELECT statement is used to retrieve the data from a table. WHERE clause works similar to a condition.

→ It filters the data using the condition and gives you a finite result.

The built-in operators and functions generate an expression, which fulfils the condition.

HiveQL Commands

  1. Create Database Statement

Create Database is a statement used to create a database in Hive. A database in Hive is a namespace or a collection of tables.

Syntax:

CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>

Here, IF NOT EXISTS is an optional clause, which notifies the user that a database with the same name already exists.

We can use SCHEMA in place of DATABASE in this command. The following query is executed to create a database named userdb:

hive> CREATE DATABASE [IF NOT EXISTS] userdb;

The following query is used to verify a databases list:

hive> SHOW DATABASES;
default
userdb

2. Drop Database Statement

Drop Database is a statement that drops all the tables and deletes the database.

The following queries are used to drop a database.

The following query drops the database using CASCADE. It means dropping respective tables before dropping the database.

3.Create Table Statement

Create Table is a statement used to create a table in Hive.

Syntax

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name

4. Alter Table Statement

It is used to alter a table in Hive.

Syntax

ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])

5. Rename To… Statement

The following query renames the table from employee to emp.

hive> ALTER TABLE employee RENAME TO emp;

6. Drop Table Statement

When you drop a table from Hive Metastore, it removes the table/column data and their metadata.

It can be a normal table (stored in Metastore) or an external table (stored in local file system)

→ Hive treats both in the same manner, irrespective of their types.

Syntax

DROP TABLE [IF EXISTS] table_name;

HiveQL — Select-Where

SELECT statement is used to retrieve the data from a table.

WHERE clause works similar to a condition. It filters the data using the condition and gives you a finite result.

→ The built-in operators and functions generate an expression, which fulfils the condition.

Syntax

SELECT [ALL | DISTINCT] select_expr, select_expr, ... 
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];

Example

Let us take an example for SELECT…WHERE clause.

Assume we have the employee table as given below, with fields named Id, Name, Salary, Designation, and Dept.

Generate a query to retrieve the employee details who earn a salary of more than Rs 30000.

+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+------+--------------+-------------+-------------------+--------+

Below query retrieves the employee details

hive> SELECT * FROM employee WHERE salary>30000;+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
+------+--------------+-------------+-------------------+--------+

HiveQL — Select-Order By

The ORDER BY clause is used to retrieve the details based on one column and sort the result set by ascending or descending order.

Syntax

SELECT [ALL | DISTINCT] select_expr, select_expr, ... 
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]]
[LIMIT number];

Example

Let us take an example for SELECT…ORDER BY clause.

Assume employee table as given below, with the fields named Id, Name, Salary, Designation, and Dept.

Generate a query to retrieve the employee details in order by using Department name.

+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+------+--------------+-------------+-------------------+--------+

The below query retrieves the employee details

hive> SELECT Id, Name, Dept FROM employee ORDER BY DEPT;+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1205 | Kranthi | 30000 | Op Admin | Admin |
|1204 | Krian | 40000 | Hr Admin | HR |
|1202 | Manisha | 45000 | Proofreader | PR |
|1201 | Gopal | 45000 | Technical manager | TP |
|1203 | Masthanvali | 40000 | Technical writer | TP |
+------+--------------+-------------+-------------------+--------+

HiveQL — Select-Group By

The GROUP BY clause is used to group all the records in a result set using a particular collection column.

→ It is used to query a group of records.

Syntax

SELECT [ALL | DISTINCT] select_expr, select_expr, ... 
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]]
[LIMIT number];

Example

Let us take an example of SELECT…GROUP BY clause.

Assume employee table as given below, with Id, Name, Salary, Designation, and Dept fields.

Generate a query to retrieve the number of employees in each department.

+------+--------------+-------------+-------------------+--------+ 
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 45000 | Proofreader | PR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+------+--------------+-------------+-------------------+--------+

The below query retrieves the employee details

hive> SELECT Dept,count(*) FROM employee GROUP BY DEPT;+------+--------------+ 
| Dept | Count(*) |
+------+--------------+
|Admin | 1 |
|PR | 2 |
|TP | 3 |
+------+--------------+

Tables in Hive:

1. Internal Tables

Hive owns the data for the internal tables.

Internal tables also called as managed tables

It is the default table in Hive. When the user creates a table in Hive without specifying it as external, then by default, an internal table gets created in a specific location in HDFS.

→ By default, an internal table will be created in a folder path similar to /user/hive/warehouse directory of HDFS.

Syntax to Create Internal Table :

CREATE TABLE IF NOT EXISTS stocks (exchange STRING,
symbol STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;

LOAD file from LOCAL file system.

LOAD DATA LOCAL INPATH ‘/home/hive/data.txt’ INTO TABLE emp.employee;

Unlike loading from HDFS, source file from LOCAL file system won’t be removed.

2. External Tables

Hive does not manage the data of the External tables.

→ We create an external table for external use as when we want to use the data outside the Hive.

→ External tables are stored outside the warehouse directory. They can access data stored in sources such as remote HDFS locations or Azure Storage Volumes.

Now, let’s LOAD the data from local file system

LOAD DATA LOCAL INPATH ‘/home/hive/data.txt’ INTO TABLE emp.employee_external;

This will create a folder /user/hive/data/employee_external/ on HDFS and creates a file data.txt.

Let’s check if the file created at Hive external location on HDFS.

hdfs dfs -ls /user/hive/data/employee_external/
-rw-r

Difference between Internal and external tables in hive:

Hive Load CSV File into Table

Use the LOAD DATA command to load the data files like CSV into Hive Managed or External table.

Hive LOAD Command Syntax

LOAD DATA

Hive LOAD CSV File from HDFS

Hive LOAD DATA statement is used to load the text, CSV, ORC file into Table.

→ Load statement performs the same regardless of the table being Managed/Internal vs External Table.

load a data file into the Hive table

  1. Create a data file
  2. Now use the Hive LOAD command to load the file into the table.

“ Note that after loading the data, the source file will be deleted from the source location, and the file loaded to the Hive dataware house location or to the LOCATION specified while creating a table.”

-rw-r

Use SELECT command to get the data from a table and confirm data loaded successfully without any issues.

LOAD CSV File from the LOCAL filesystem

Use LOCAL optional clause to load CSV file from the local filesystem into the Hive table without uploading to HDFS.

Unlike loading from HDFS, source file from LOCAL file system won’t be removed.

Use OVERWRITE clause

Use optional OVERWRITE clause of the LOAD command to delete the contents of the target table and replaced it with the records from the file referred.

LOAD DATA LOCAL INPATH ‘/home/hive/data.csv’ OVERWRITE INTO TABLE emp.employee;

Use PARTITION clause

If you have a partitioned table, use PARTITION optional clause to load data into specific partitions of the table.

→ You can also use OVERWRITE to remove the contents of the partition and re-load.

LOAD DATA LOCAL INPATH ‘/home/hive/data.csv’ OVERWRITE INTO TABLE emp.employee PARTITION(date=2020);

Use INSERT INTO

Like SQL, you can also use INSERT INTO to insert rows into Hive table.

Features of Hive:

The End…

Happy Learning Friends🧡

Cheer ✨

See you in next blog !!!

Ramya R❤…

--

--