In this post you will learn about:
- What Apache Hive is, the Hive architecture, and Hive use cases.
- Using Data Definition Language (DDL) to create new Hive databases and tables with a variety of different data types.
- Creating partitioned tables that are optimized for Hadoop.
- Creating and running a variety of useful Data Manipulation Language (DML) queries against the Hive data warehouse.
- Using built-in Hive operators and functions to get work done.
- Creating your own user-defined functions in Hive.
Hive was initially developed by Facebook in 2007 to help the company handle massive amounts of new data.
- At the time Hive was created, Facebook had a 15TB dataset they needed to work with. A few short years later, that data had grown to 700TB. Their RDBMS data warehouse was taking too long to process daily jobs so the company decided to move their data into the scalable open-source Hadoop.
- The company found that creating MapReduce programs was not easy and was time-consuming for many users. When they created Hive, their vision was to bring familiar database concepts to Hadoop, making it easier for all users to work with. In 2008 Hive was open sourced. Facebook has since used Hive for reporting dashboards and ad-hoc analysis.
So what exactly is Hive?
Hive is a data warehouse system built on top of Hadoop. Hive facilitates easy data summarization, ad-hoc queries, and the analysis of very large datasets that are stored in Hadoop.
Built on top of Apache Hadoop™, Hive provides the following features:
- Tools to enable easy access to data via SQL, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
- A mechanism to impose structure on a variety of data formats
- Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™
- Query execution via Apache Tez™, Apache Spark™, or MapReduce
- Procedural language with HPL-SQL
- Sub-second query retrieval via Hive LLAP, Apache YARN and Apache Slider.
Hive provides a SQL interface, better known as HiveQL or HQL for short, which allows for easy querying of data in Hadoop. HQL has its own Data Definition and Data Manipulation languages, which are very similar to the DDL and DML many of us already have experience with. In Hive, HQL queries are implicitly translated into one or more MapReduce jobs, shielding the user from much more advanced and time-consuming programming. Hive provides a mechanism to project structure (like tables and partitions) onto the data in Hadoop and uses a metastore to map file structure to tabular form.
Hive is not a real-time processing system and is best suited for batch jobs and huge datasets. Think heavy analytics and large aggregations. Latencies are often much higher than in a traditional database system. Hive is schema-on-read, which provides fast loads and flexibility at the sacrifice of query time. Hive lacks full SQL support and does not provide row-level inserts, updates, or deletes. Hive does not support transactions and has limited subquery support. Query optimization is still a work in progress, too. If you are interested in a distributed and scalable data store that supports row-level updates, rapid queries, and row-level transactions, then HBase is also worth investigating.
Let’s compare Hive to a couple of common alternatives.
An example often used is that of the Word Count program. The Word Count program reads documents stored in Hadoop and returns a listing of all the words encountered, along with the number of occurrences of each word. Writing a custom MapReduce program to do this takes 63 lines of code. Having Hive perform the same task takes only 7 easy lines of code!

Another Hive alternative is Apache Pig. Pig is a high-level programming language, best described as a "data flow language" rather than a query language. Being a custom language means there is a higher learning curve for SQL programmers to become comfortable with Pig. Pig has powerful data transformation capabilities and is great for ETL, but it is not so good for ad-hoc querying. Pig is a nice complement to Hive, and the two are often used in tandem in a Hadoop environment.
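The seven-line Hive version of Word Count is commonly sketched along these lines (the `docs` table, its `line` column, and the input path are assumptions for illustration):

```sql
CREATE TABLE docs (line STRING);
LOAD DATA INPATH '/tmp/docs' OVERWRITE INTO TABLE docs;

-- Split each line into words, then count occurrences of each word
CREATE TABLE word_counts AS
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
```

Behind the scenes, Hive compiles this into the same kind of MapReduce job the 63-line hand-written version implements.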
Now let’s take a look at Hive’s architecture. There are a variety of different ways that you can interface with Hive. You can use a web browser to access Hive via the Hive Web Interface. You could also access Hive using an application over JDBC, ODBC, or the Thrift API, each made possible by Hive’s Thrift Server, referred to as HiveServer. HiveServer2 was released in Hive 0.11 and serves as a replacement for HiveServer1, though you still have the choice of which HiveServer to run, or can even run them concurrently. HiveServer2 brings many enhancements, including the ability to handle concurrent clients.
Hive also comes with some powerful command-line interfaces (often referred to as the “CLI”). The introduction of HiveServer2 brings with it a new Hive CLI called Beeline, which can be run in embedded mode or thin client mode. In thin client mode, the Beeline CLI connects to Hive via JDBC and HiveServer2. The original CLI is also included with Hive and can run in embedded mode or as a client to HiveServer1.

Hive comes with a catalog known as the Metastore. The Metastore stores the system catalog and metadata about tables, columns, partitions, and so on. The Metastore makes mapping file structure to a tabular form possible in Hive.
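As a sketch, connecting with Beeline in thin client mode might look like the following (the host, user, and database are assumptions; 10000 is HiveServer2’s default port):

```
$ beeline -u jdbc:hive2://localhost:10000/default -n hiveuser
```

Once connected, you can issue HiveQL statements at the Beeline prompt just as you would in the original CLI.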
A newer component of Hive is called HCatalog. HCatalog is built on top of the Hive metastore and incorporates Hive’s DDL. HCatalog provides read and write interfaces for Pig and MapReduce and uses Hive’s command-line interface for issuing data definition and metadata exploration commands. Essentially, HCatalog makes it easier for users of Pig, MapReduce, and Hive to read and write data on the grid.

The Hive Driver, Compiler, Optimizer, and Executor work together to turn a query into a set of Hadoop jobs. The Driver manages the lifecycle of a HiveQL statement as it moves through Hive; it maintains a session handle and any session statistics. The Query Compiler compiles HiveQL queries into a DAG of MapReduce tasks. The Execution Engine executes the tasks produced by the compiler in proper dependency order, interacting with the underlying Hadoop instance (the NameNode, JobTracker, and so on).
In Hive, data is organized into the following units:
- First, data is organized into databases, which are namespaces that separate tables and other data units to avoid naming conflicts.
- Next, data is organized into tables, which are homogeneous units of data that share the same schema.
- Data can then be organized into partitions, though this is not a requirement.
A partition in Hive is a virtual column that defines how data is stored on the file system based on its values. A table can have zero or more partitions. Finally, within each partition, data can be organized into buckets based on the hash value of a column in the table. Again, note that it isn’t necessary for tables to be partitioned or bucketed in Hive; however, these abstractions allow the system to prune large quantities of data during query processing, resulting in faster query execution and reduced latencies.
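As an illustrative sketch, a table that is both partitioned and bucketed can be declared like this (the `page_views` table, its columns, and the bucket count are hypothetical):

```sql
-- Rows are stored in a subdirectory per view_date value,
-- and within each partition hashed on user_id into 32 bucket files.
CREATE TABLE page_views (
  user_id   BIGINT,
  page_url  STRING
)
PARTITIONED BY (view_date STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;
```

A query that filters on `view_date` can then skip every partition directory that doesn’t match, which is exactly the pruning described above.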
Data Files in Hive
Data files in Hive are stored right in the Hadoop file system. A variety of different storage and record formats can be used for your data. The internal format of your data can vary from table to table, depending on how you want to set things up.
In Hive your data is stored in the warehouse directory as specified in your configuration file unless you override this location.
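The warehouse location is controlled by the `hive.metastore.warehouse.dir` property (by default `/user/hive/warehouse`), and a table can be placed elsewhere with a `LOCATION` clause. A small sketch, with a hypothetical `logs` table:

```sql
-- Show the warehouse directory currently in effect
SET hive.metastore.warehouse.dir;

-- Hypothetical table stored outside the warehouse directory
CREATE TABLE logs (line STRING)
LOCATION '/data/custom/logs';
```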
Hive has a variety of primitive data types available — though not as many as the typical RDBMS has.
There are multiple integer types, including TINYINT, SMALLINT, INT, and BIGINT. There is a BOOLEAN type, which can be true or false. There are three floating-point number types: FLOAT, DOUBLE, and DECIMAL. Hive has two string types: STRING and VARCHAR. There are two date/time types: TIMESTAMP and DATE. Hive also has a BINARY type.
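As a sketch, a single table exercising most of these primitive types (the `pet` table and its columns are assumptions, not part of any standard schema):

```sql
CREATE TABLE pet (
  name        STRING,
  species     VARCHAR(20),    -- VARCHAR takes a maximum length
  age         TINYINT,
  weight      FLOAT,
  price       DECIMAL(8,2),   -- DECIMAL takes precision and scale
  vaccinated  BOOLEAN,
  birth_date  DATE,
  registered  TIMESTAMP,
  photo       BINARY
);
```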
Hive Example Query Catalog
| Function | MySQL | Hive |
| --- | --- | --- |
| Retrieving information (general) | `SELECT from_columns FROM table WHERE conditions;` | `SELECT from_columns FROM table WHERE conditions;` |
| Retrieving all values | `SELECT * FROM table;` | `SELECT * FROM table;` |
| Retrieving some values | `SELECT * FROM table WHERE rec_name = "value";` | `SELECT * FROM table WHERE rec_name = "value";` |
| Retrieving with multiple criteria | `SELECT * FROM table WHERE rec1 = "value1" AND rec2 = "value2";` | `SELECT * FROM table WHERE rec1 = "value1" AND rec2 = "value2";` |
| Retrieving specific columns | `SELECT column_name FROM table;` | `SELECT column_name FROM table;` |
| Retrieving unique output | `SELECT DISTINCT column_name FROM table;` | `SELECT DISTINCT column_name FROM table;` |
| Sorting | `SELECT col1, col2 FROM table ORDER BY col2;` | `SELECT col1, col2 FROM table ORDER BY col2;` |
| Sorting reverse | `SELECT col1, col2 FROM table ORDER BY col2 DESC;` | `SELECT col1, col2 FROM table ORDER BY col2 DESC;` |
| Counting rows | `SELECT COUNT(*) FROM table;` | `SELECT COUNT(*) FROM table;` |
| Grouping with counting | `SELECT owner, COUNT(*) FROM table GROUP BY owner;` | `SELECT owner, COUNT(*) FROM table GROUP BY owner;` |
| Maximum value | `SELECT MAX(col_name) AS label FROM table;` | `SELECT MAX(col_name) AS label FROM table;` |
| Selecting from multiple tables (join using alias w/ "AS") | `SELECT pet.name, comment FROM pet, event WHERE pet.name = event.name;` | `SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name);` |
Here are the official Hive tutorial and language manual.