Lessons Learned: Using Hive

(How I feel writing HiveQL)

1. Denormalization

This first tip applies to how you create tables in the Hadoop/HDFS environment. In a traditional database environment, schemas are more complex and designed using Third Normal Form. With Hadoop you’re less concerned with the size of the tables and more concerned with reducing I/O and joins. To optimize for the Hadoop environment, flatten out your table structure. As an example, if you’re creating an aggregate table for purchases, materialize the product names in the table instead of storing only the product_ids. The key here is to reduce the number of foreign keys, and with them the number of joins. For further detail, explore the Hadoop architecture.
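As a sketch of the idea (the table and column names here are hypothetical, not from the original post), a denormalized purchase table pays the join cost once at write time instead of on every read:

```sql
-- Denormalized aggregate table: product_name is materialized,
-- so downstream queries need no join back to a products table.
CREATE TABLE purchase_agg (
  purchase_id   BIGINT,
  product_name  STRING,        -- materialized value, not a foreign key
  amount        DECIMAL(10,2),
  purchase_date STRING
)
STORED AS ORC;

-- Join once, at load time, rather than in every downstream query.
INSERT OVERWRITE TABLE purchase_agg
SELECT p.purchase_id, pr.product_name, p.amount, p.purchase_date
FROM purchases p
JOIN products pr ON p.product_id = pr.product_id;
```

Reads against `purchase_agg` are then simple scans, which is the I/O- and join-reduction the tip is after.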

2. Patience

Running queries takes time. If you start typing “Why is Hive…” into Google, the autofill suggests “Why is Hive so slow?” This isn’t absolute, but compared to an MPP database, you may find that Hive queries are slower. The logs are helpful here: they show your query’s progress in terms of mappers and reducers completed. Naturally, you need 100% of mappers and 100% of reducers to finish before you receive results. Whether you’re running queries through Hue (Hadoop User Experience), the Hadoop UI, or the command line, keep an eye on the mapper and reducer percentages to know whether you have time to get up and grab a snack.

3. Partitions

In strict mode, inserts into a partitioned table won’t run without a static partition spec unless you include this statement:

SET hive.exec.dynamic.partition.mode=nonstrict;

If you’re viewing a table through the Metastore Manager, the partition columns usually appear at the end of the table. You can also use the command SHOW PARTITIONS <table name>, or DESCRIBE <table name> to see all columns, including partition columns.
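A minimal sketch of a dynamic-partition insert that needs the nonstrict setting (the table and column names are hypothetical):

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical table partitioned by date
CREATE TABLE sales_by_day (
  sale_id BIGINT,
  amount  DECIMAL(10,2)
)
PARTITIONED BY (sale_date STRING);

-- The sale_date values in the SELECT choose the partitions dynamically;
-- in strict mode this insert would fail without a static partition spec.
INSERT OVERWRITE TABLE sales_by_day PARTITION (sale_date)
SELECT sale_id, amount, sale_date
FROM raw_sales;

SHOW PARTITIONS sales_by_day;  -- lists the partitions just created
```

Note that the partition column, sale_date, is declared in PARTITIONED BY rather than the column list, which is why it shows up at the end when you describe the table.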

4. Congrats

Learning a new environment can take some time. But once you run your first query in Hive, you’ll feel amazing.