Hadooping: Hive

Hive for High Fives.

History

Basically, Facebook grew quickly. They moved to Hadoop, but some people struggled with the change. You see, at that time Hadoop required that you write all your commands in Java, but the reporting and dashboarding people were used to writing SQL. As such, some clever sticks built Hive. It was quickly open-sourced.

Why Hive?

Because of its history, Hive is well suited to running SQL-style queries over large datasets. You say what you want and Hive works out how to fetch it, so you don't have to.

For example, word counting is a common task when learning Hadoop. Written in Java, it takes at least 63 lines. In Hive, it takes around 7.
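
To make that concrete, here is a minimal sketch of a Hive word count, assuming the raw text has been loaded into a table called docs with a single STRING column called line (both names are placeholders):

CREATE TABLE docs (line STRING);
LOAD DATA LOCAL INPATH 'input.txt' OVERWRITE INTO TABLE docs;

-- split each line into words, then count each word
CREATE TABLE word_counts AS
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, '\\s+')) AS word FROM docs) w
GROUP BY word;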

What is Hive?

It is essentially a library of Java archives (.jar files), corresponding Hive scripts and config files, with an architecture that lets you contact it from many angles.

How to Hive?

Hive allows you to interact with it in many ways: web browser, CLI, API (e.g. JDBC) or other clients.

You can also use third-party programs/platforms. I have used Apache Zeppelin notebooks, which are essentially coding notebooks for Hadoop. What is a coding notebook? They are online notebooks which allow text and code to be placed side by side, so you can write about what you want, explain all the functions, and have you or your viewers run code directly below the explanation. I cannot recommend them enough!

Data separation

In Hive, data is separated first by database. Inside databases are tables. Partitions can then be used to separate/index the data; for example, partitions could be used to split data by year, making date-bound queries quicker. Each table or partition can then be further divided into buckets (also called clusters), based on the hash of a chosen column. This hierarchy is useful because you can target a whole database/table/partition/bucket and hence pick exactly the data you want.

We can create these databases as simply as CREATE DATABASE nameOfDatabase. We can also see what databases already exist with SHOW DATABASES. To delete a database we type DROP DATABASE IF EXISTS nameOfDatabase CASCADE. Note the CASCADE part: by default you can only drop an empty database, and CASCADE drops all the tables inside it first.
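
Put together (nameOfDatabase is just a placeholder), that whole lifecycle looks like this:

CREATE DATABASE nameOfDatabase;
SHOW DATABASES;
DROP DATABASE IF EXISTS nameOfDatabase CASCADE;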

All of these commands are run in the Hive CLI and are known as DDL (Data Definition Language) commands. They are used to define, build and modify the structure of databases and tables, rather than the data inside them.

DDL — Add data

So yes, you can create a database with the command above, but this b***h is empty. Imagine we have a data file that we need in it! We would run a command like this to build a table.

CREATE TABLE myFirstTable (
  id INT,
  name STRING,
  city STRING,
  items ARRAY<STRING>
)
PARTITIONED BY (age INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ':'
STORED AS TEXTFILE;

This code is readable but worth explaining. Our data file is delimited by | symbols, as opposed to the typical , of a CSV, and we know the items within a single field are separated by :. We can change these delimiters as our data may be different. We also note the part in brackets, which lays out the column names and data types for our table. Simple!

All simple except the partition. Here we pick the column we want to split the data by, and its data type. Note that age isn't listed in the bracketed column definitions: partition columns are declared separately, but Hive still treats them as columns you can query. That is it.
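
Buckets, mentioned earlier, are declared in the same statement. A hedged sketch (table and column names are made up) of a table partitioned by year and split into four buckets by the hash of id:

CREATE TABLE bucketedTable (
  id INT,
  name STRING
)
PARTITIONED BY (year INT)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS TEXTFILE;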

A call for help.

I have also heard of something new called an Index. I am struggling to see the difference between a partition and an index and would love some help — please comment.

Next question: where are these tables stored?

Well, the choice is yours: they can be external or "managed". By default, they are managed.

Managed tables are fully controlled by Hive: it owns both the table metadata and the underlying files, so dropping the table deletes the data too.

The problem is that they can't easily be shared outside of Hive, so it is necessary to make an external table if you want other applications to have access to the data. To do this, use CREATE EXTERNAL TABLE nameOfTable ... LOCATION 'location/of/data'. With an external table, dropping the table removes only Hive's metadata; the files stay where they are.
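
As a sketch (the column names and path are placeholders), an external table pointing at files that already live in HDFS might look like this:

CREATE EXTERNAL TABLE myExternalTable (
  id INT,
  name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION '/user/shared/myExternalTable';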

We have finished setting things up. That means we have finished defining things. That means we have finished with Data Definition Language (DDL).

We have defined, now let's manipulate

Manipulate data that is. It sounds worse than it is.

DML — Moving data in?

If we have a local data file (on our PC, not on HDFS):

LOAD DATA LOCAL INPATH 'path/goes/here/data.dat' OVERWRITE INTO TABLE nameOfTable;

OVERWRITE here is self-explanatory, but if you don't specify it, Hive will just append the data.

If you don't write LOCAL, Hive assumes the file is already within HDFS and will look for it there.

If we have data that needs to go into a particular partition:

LOAD DATA LOCAL INPATH 'path/goes/here/data.dat' OVERWRITE INTO TABLE nameOfTable PARTITION (age = 25);

Here age is the partition column we declared earlier, and 25 is just an example value.

With both examples, the file will be loaded into the HDFS area which Hive can control.

DML — Moving data out?

Similar to above:

INSERT OVERWRITE LOCAL DIRECTORY 'please/put/it/here/foldername'
SELECT sale_id, product, date
FROM sales
WHERE date = '2019-01-01';

If you are moving large datasets, don't try to append to existing files: that cannot be done easily in HDFS. Use OVERWRITE instead to speed things up, but be careful not to overwrite things you want to keep.

DML — Give me the data

Now that we have loaded it, we want to be able to access it. We use the SELECT command here.

SELECT * FROM nameOfTable LIMIT 3;

This will pull the first three rows of data in our table. We can be more specific too and ask for particular columns.

SELECT id, name FROM nameOfTable LIMIT 3;

I want to be mooooorree specific though. Well, you can also use WHERE.

SELECT id, name FROM nameOfTable 
WHERE name = 'acgoff' ;

We can set up a variety of if-esque conditions; the Hive language manual has the full list. In short, it has everything you would expect and more, and Stack Overflow will help with any odd cases. LIKE and RLIKE are interesting ones to check out. Hive also lets you get information out of arrays and other complex types (search for complex operators). Oh, and you can create your own functions too, but you have to write them in Java.
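
As a quick sketch of a couple of these (assuming nameOfTable has the same columns as myFirstTable above; the patterns are made up): LIKE uses SQL-style wildcards, RLIKE takes a Java regular expression, and array_contains looks inside an ARRAY column.

SELECT id, name FROM nameOfTable
WHERE name LIKE 'acg%'                -- SQL wildcard: starts with acg
   OR name RLIKE '^acg.*f$'           -- Java regex: starts acg, ends f
   OR array_contains(items, 'book');  -- look inside the items array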

You can also group your results using GROUP BY (to sort them, use ORDER BY instead). Note that every column you SELECT must either appear in the GROUP BY or sit inside an aggregate function:

SELECT city, count(*) AS people
FROM nameOfTable
WHERE name = 'acgoff'
GROUP BY city;

Partitions finally show up!

So now that we are querying our data, we can see why partitions help: they reduce the amount of data scanned during a query. For example, if we partition on date and query for a particular date range, Hive can ignore all files outside that range. If we didn't partition, it would have to look through every file individually, which could be a huge task.
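
As a sketch (assuming a table salesByDay partitioned by a dt string column; both names are made up), a query like this only reads the files under the dt='2019-01-01' partition directory:

SELECT product, amount
FROM salesByDay
WHERE dt = '2019-01-01';  -- other partitions are never read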

Join your data

Sometimes you may have data you want to query across two different tables. In this case we use a JOIN command. The two tables need a common ID to join the data correctly. For example, Tutorials Point gives this one (I would also recommend a read of their tutorial).

SELECT c.ID, c.NAME, c.AGE, o.AMOUNT 
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);

In this example you have a table called CUSTOMERS, aliased as c, and another called ORDERS, aliased as o. As quick as that, you get multi-table responses.

If the IDs don't join neatly, there are a huge number of things you can do, but that is beyond what I am going into. If you want them, look at Map Joins or Bucket Map Joins.

You can also convert data types. For example, some numbers may come in as STRING but you want them as FLOAT. In the WHERE part of your query, simply add WHERE cast(numberString AS FLOAT) = 1.5 (= 1.5 is just an example). If the conversion fails (say the string holds the word one instead of the value), it is returned as NULL.
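
Spelled out as a full query (numberString is a hypothetical column name), that looks like this:

-- rows where numberString can't be parsed become NULL and are filtered out
SELECT id, name
FROM nameOfTable
WHERE cast(numberString AS FLOAT) = 1.5;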

What is a View?

So, sometimes you may find you are always running the same sort of queries. These can be saved as views. Views aren't saved tables of your results; they are the query itself, codified and ready to run again on demand. This is helpful because you can query a view further, which saves the complexity of writing a query within a query.
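
A sketch using the CUSTOMERS/ORDERS tables from the join example above (the view name and threshold are made up):

-- save the join as a view
CREATE VIEW bigSpenders AS
SELECT c.ID, c.NAME, sum(o.AMOUNT) AS total
FROM CUSTOMERS c JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID)
GROUP BY c.ID, c.NAME;

-- then query the view like any table
SELECT NAME FROM bigSpenders WHERE total > 1000;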

Let’s Review

Hive does a few things:

  • Allows you to build databases, tables, partitions and buckets easily.
  • Allows you to move data in and out easily.
  • Allows you to query the data easily.
  • Means you don't need to know Java to interact with Hadoop!