A Walkthrough of Hive

Berselin C R
8 min read · May 1, 2022


Hello everyone! Today, in this article, let me take you on a new journey through the topic of Hive.

I understand that many people are unfamiliar with this topic, but don’t worry: we’ll have a good understanding of it by the end of this post. Let’s get going : )

What is Hive?

Hive is a data warehouse infrastructure tool used for processing structured data in Hadoop. It summarizes Big Data and makes querying and analysis easier.

What is the relation between Hadoop and Hive?

Hive is a Hadoop-based program that provides a SQL-like interface for data processing and querying. In other words, Hive runs on top of Hadoop.

If you want to understand more about Hadoop, go to this page.

Benefits of Hive:

→ Fast

→ Provides a familiar, SQL-like interface

→ Scalable

→ Fault tolerant

Ways to process Hive data:

  1. MapReduce application
  2. HiveQL

In this blog we are going to discuss HiveQL in depth.

HiveQL:

HiveQL is the Hive Query Language. It is the query language of Hive, used to process and analyze structured data whose schema is kept in the metastore.

It is very helpful because it shields users from the complexity of MapReduce.

It uses familiar concepts from relational databases, such as tables, rows, and schemas.
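To give a feel for that familiarity, here is a minimal sketch of a HiveQL query; the table and column names are hypothetical and exist only for illustration:

```sql
-- Hypothetical table and columns: a standard SQL-style aggregation works in HiveQL.
SELECT department, AVG(salary) AS avg_salary
FROM employee
GROUP BY department
ORDER BY avg_salary DESC;
```

Anyone who has written SQL against a relational database can read this immediately, even though Hive may execute it as a MapReduce job behind the scenes.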

Limitations of Hive:

→ It is not designed for OLTP workloads.

→ Row-level update and delete operations are not supported by default (later Hive versions add them only for ACID transactional tables).

→ Query latency in Apache Hive is very high.

→ Subquery support is limited.

Commands in Hive:

Before starting with the commands, make sure you have downloaded Hive.

Step 1: Download Apache Hive 3.1.2 from the official Apache Hive downloads page.

Step 2: Locate the apache-hive-3.1.2-bin.tar.gz file in your system.

Step 3: Extract this tar file using the below command:

==> tar -xzf apache-hive-3.1.2-bin.tar.gz

That’s it.

Now, before getting started with Hive commands, we must start our Hadoop daemons, i.e. the NameNode, DataNode, ResourceManager, NodeManager, and Secondary NameNode.

>>start-all.sh

Using the >>jps command, we can check whether all the daemons have started.

Thus our daemons are started.

To start the Hive shell:

>>hive

Now we can type our Hive commands.

1. To create a database:

>>create database database_name;

Note: Don’t forget to end your commands with a semicolon.

We have created a database named “demo”.

To check the list of databases we have:

>>show databases;

This displays the list of databases.

2. To delete a particular database:

>>drop database database_name;

Now let’s check our database list:

Thus we were able to delete our database.

3. Create a table:

In Hive we have two types of tables:

→Internal Table

→External Table

Internal Table: The internal tables are also known as managed tables because the Hive controls the lifecycle of their data. These tables are typically stored in a subdirectory under the directory specified by hive.metastore.warehouse.dir (i.e. /user/hive/warehouse). Internal tables are not adaptable enough to be shared with other tools such as Pig. When we attempt to delete the internal table, Hive deletes both the table schema and the data.

Let’s create a managed table:

>>use database_name;

>>create table table_name (column_name datatype, ...) row format delimited fields terminated by ',' stored as textfile;

Row format – specifies the format of the row.

fields terminated by ',' – by default, Hive uses ^A as the field separator; use this option to load a file with a custom separator such as a comma, pipe, or tab.

Location – you can specify a custom HDFS location where the data should be stored (optional).

Stored as textfile – specifies the format in which our data has to be stored.

Thus a table named “student_data” is created.

Using the below command, we can get the list of tables that have been saved.

>>show tables;
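Putting the steps above together, here is a sketch of creating the managed table end to end; the database and table names match this walkthrough, but the column names and types are assumptions for illustration:

```sql
-- Create and select the database from the earlier step.
CREATE DATABASE IF NOT EXISTS demo;
USE demo;

-- A managed (internal) table; column names and types are hypothetical.
CREATE TABLE IF NOT EXISTS student_data (
  id    INT,
  name  STRING,
  marks FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Verify the table was created.
SHOW TABLES;
```

Because no LOCATION clause is given, Hive stores the data under its warehouse directory (typically /user/hive/warehouse/demo.db/student_data).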

4. To get detailed table information:

>>desc table_name;

Example: for a table named employee whose column names are as given below.

If we need to get more detailed information about the table.

>>desc extended table_name;

5. To load data from the local file system into a table in Hive:

We can use the insert method, or pass a file in the local directory that contains the data.

Let us consider that we have created a table named employee with the columns employee id, name, department, state, and salary.

Now we are going to load a file named emp_details.

>>LOAD DATA LOCAL INPATH '/home/hive/emp_details' INTO TABLE emp.employee;

To view the table:

>>select * from employee;

If a value has an unmatched datatype, it is automatically saved as NULL in the table.

Insert command:

>>INSERT INTO TABLE <table_name> VALUES (<add values as per column entity>);
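For example, filling in the template above for the employee table described earlier; the concrete values here are made up for illustration:

```sql
-- Hypothetical row matching the employee columns
-- (employee id, name, department, state, salary).
INSERT INTO TABLE employee
VALUES (101, 'Asha', 'Sales', 'TN', 45000.0);

-- Confirm the row was written.
SELECT * FROM employee;
```

Note that INSERT ... VALUES triggers a small job per statement, so for bulk data the LOAD DATA approach shown above is the usual choice.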

6. Alter:

It is used to alter a table in Hive.

ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec …])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec …])

  • Now, change the name of the table by using the following command:

>>Alter table emp rename to employee_data;

Now let’s check the table:
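The other ALTER forms from the syntax list work the same way; a short sketch, with hypothetical column names:

```sql
-- Add a new column to the renamed table.
ALTER TABLE employee_data ADD COLUMNS (city STRING);

-- Rename a column and change its type in one statement.
ALTER TABLE employee_data CHANGE salary monthly_salary FLOAT;
```

These statements change only the table metadata in the metastore; the underlying data files are not rewritten.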

7.Drop the table:

When you remove a table from Hive Metastore, you are removing the table/column data as well as their metadata.

>>drop table table_name;

Now let’s use the following command:

>>drop table new_employee;

External Table: An external table is one that Hive does not manage storage for. When you delete an external table, Hive only deletes the definition. The data is still there. A Hive internal table is one that Hive manages. When you delete an internal table, both the Hive definition and the data are removed.

To create an external table, we use the keyword external:

>>create external table if not exists [external-table-name] ( [column1-name] [column1-type], [column2-name] [column2-type], ...) comment '[comment]' row format [format-type] fields terminated by '[termination-character]' stored as [storage-type] location '[location]';

Let’s create an external table.
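Filling in the template for the countries data used below; the column names and the comment are assumptions based on the sample file (id, country, capital, population in millions):

```sql
-- External table over the HDFS directory used later in this walkthrough.
CREATE EXTERNAL TABLE IF NOT EXISTS countries_list (
  id         INT,
  country    STRING,
  capital    STRING,
  population INT
)
COMMENT 'Countries with capital and population in millions'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/test_countries';
```

Because of the LOCATION clause, Hive reads whatever CSV files already sit in /user/test_countries instead of managing its own warehouse directory.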

Now let’s load the data file countries.csv from HDFS into the table countries_list.

Step 1: Let us prepare a data file with the below content in it.

1,USA,Washington,328

2,France,Paris,67

3,Spain,Madrid,47

4,Russia,Moscow,145

5,Indonesia,Jakarta,267

6,Nigeria,Abuja,196

Save the file and make a note of its location.

Step 2: Import the File to HDFS

1. Create an HDFS directory. You will use this directory as an HDFS location of the file you created.

hdfs dfs -mkdir [hdfs-directory-name]

2. Import the CSV file into HDFS:

hdfs dfs -put [original-file-location] [hdfs-directory-name]

3. Use the -ls command to verify that the file is in the HDFS folder:

hdfs dfs -ls [hdfs-directory-name]

The output displays all the files currently in the directory.

Step 3: Using the load command, we get the data from HDFS and store it in the table.

>>LOAD DATA INPATH '/user/test_countries/countries.csv' INTO TABLE countries_list;

Now let’s check our table.

Thus our data is loaded into the table.

We can also create an external table by directly mentioning the location.

Now, we can clearly see in the above image that the file contains data, but we haven’t uploaded anything to it, so how is this possible? It is possible because EXTERNAL tables are designed to take data from the specified location. In our case, the location is /user/test_countries, and we have a countries.csv file inside, so Hive external tables will automatically pull data from this location.

Dropping an external table:

1. Dropping an external table in Hive is performed using the same drop command used for managed tables:

drop table [table_name];

The output will confirm the success of the operation:

2. Querying the dropped table will return an error:

However, the data from the external table remains in the file system and can be retrieved by creating another external table at the same location.
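A sketch of that recovery, reusing the countries example from above (the column definitions are the same illustrative assumptions as before):

```sql
-- Drop removes only the table definition from the metastore;
-- the files in /user/test_countries are untouched.
DROP TABLE countries_list;

-- Recreate an external table at the same location
-- and the existing data is visible again.
CREATE EXTERNAL TABLE countries_list (
  id INT, country STRING, capital STRING, population INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/test_countries';

SELECT * FROM countries_list;
```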

Difference between internal and external tables:

Apart from what happens on drop, the other operations that work on internal tables can also be applied to external tables.

Thus, friends, we were able to discuss the basics of Hive and its commands. For further information, catch me in my next blog.

Thank you:)

BERSELIN C R

Resources:

https://www.javatpoint.com/hive-drop-table

https://www.linkedin.com/pulse/hands-demo-hive-managed-table-external-rishi-sarawagi
