Big Data from A to Z. Part 5.2: Advanced Options of Hive

This article continues our review of the capabilities of Hive, an engine that translates SQL-like queries into MapReduce jobs.

The previous article covered Hive's basic capabilities: creating tables, loading data, and executing simple SELECT queries. Now it's time to talk about the more advanced features that help you get the most out of Hive.

User Defined Functions

One of the main obstacles when working with Hive is the rigidity of standard SQL. This problem can be solved with language extensions, the so-called User Defined Functions (UDFs). Quite a few useful functions are already built into Hive. Here are some of the most interesting ones in my opinion (the information is taken from the official documentation).


When working with Big Data, you quite often have to process unstructured data stored in JSON format. To handle JSON, Hive provides a special function, get_json_object, which extracts values from JSON documents. Values inside an object are addressed using a limited version of JSONPath notation. The following operations are supported:

  • $: return root object
  • .: return child object
  • []: addressing by index in an array
  • *: wildcard for []

Examples of working with JSON from the official documentation:

There is a table src_json consisting of a single column (json) and a single row:
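For illustration, the json column can be assumed to hold a document like the one used in the Hive documentation example:

{"store":
  {"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
   "bicycle":{"price":19.95,"color":"red"}},
 "email":"amy@only_for_json_udf_test.net",
 "owner":"amy"}

With such a document, the three queries below return amy, the first element of the fruit array, and NULL, respectively.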


Examples of queries to the table:

hive> SELECT get_json_object(src_json.json, '$.owner') FROM src_json;

hive> SELECT get_json_object(src_json.json, '$.store.fruit[0]') FROM src_json;

hive> SELECT get_json_object(src_json.json, '$.non_exist_key') FROM src_json;


Similarly, if the data you need to process with Hive are stored in XML rather than JSON, they can be handled with the xpath function, which parses XML using XPath expressions. Here's an example of parsing XML data with xpath:

hive> select xpath('<a><b>b1</b><b>b2</b></a>', 'a/*/text()') from sample_table limit 1;
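The query above returns the array ["b1","b2"]. Besides xpath, which returns an array of strings, Hive also provides typed variants such as xpath_string, xpath_int, and xpath_double; a couple of sketches based on the examples from the Hive documentation:

hive> select xpath_string('<a><b>bb</b><c>cc</c></a>', 'a/b') from sample_table limit 1;
-- returns 'bb'
hive> select xpath_int('<a><b>1</b><b>2</b></a>', 'sum(a/b)') from sample_table limit 1;
-- returns 3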

Other useful built-in functions:

The built-in library contains a great selection of functions, which fall into several groups:

  • Mathematical functions (sin, cos, log, …)
  • Date and time functions (from_unixtime, unix_timestamp, to_date, current_date, hour(string date), datediff, …): a great selection of functions for date and time conversion overall
  • String functions. Both generally applicable functions like length, reverse, or regexp, and specific ones like parse_url or the get_json_object reviewed above are supported
  • Lots of various system functions, like current_user, current_database, …
  • Cryptographic functions, like sha, md5, aes_encrypt, aes_decrypt…

Follow the link to see the full list of functions built into Hive.

Writing your own UDF

The functions built into Hive don't always suffice for the task at hand. If there is no suitable built-in function, you can write your own UDF. This is done in Java.

Let's walk through creating a custom UDF using a simple function that converts a string to lowercase as an example:

1. Create the package com/example/hive/udf and create a class file within it:

mkdir -p com/example/hive/udf
edit com/example/hive/udf/Lower.java

2. Implement the class Lower:

package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Lower extends UDF {
    public Text evaluate(final Text s) {
        if (s == null) { return null; }
        return new Text(s.toString().toLowerCase());
    }
}

3. Add the necessary libraries to the CLASSPATH (the paths to the jar files may differ somewhat in your Hadoop distribution):

export CLASSPATH=/opt/cloudera/parcels/CDH/lib/hive/lib/hive-exec.jar:/opt/cloudera/parcels/CDH/lib/hadoop/hadoop-common.jar

4. Compile your UDF and assemble a jar archive:

javac com/example/hive/udf/Lower.java
jar cvf my_udf.jar *

5. To use the function in Hive, you have to declare it explicitly:

hive> ADD JAR my_udf.jar;
hive> create temporary function my_lower as 'com.example.hive.udf.Lower';
hive> select my_lower('HELLO') from sample_table limit 1;
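A temporary function registered this way only lives for the duration of the session. Starting with Hive 0.13 you can also register a permanent function backed by a jar in HDFS; a sketch, with an illustrative HDFS path:

hive> CREATE FUNCTION my_lower AS 'com.example.hive.udf.Lower' USING JAR 'hdfs:///user/hive/udf/my_udf.jar';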

Transforming a table with scripts

Another way to extend Hive's standard functionality is the TRANSFORM method, which lets you transform data with custom scripts written in any language (this might suit those who dislike Java and don't want to write UDFs in it).

The syntax of the command is as follows:

SELECT TRANSFORM(<columns>) USING <script> AS <new_columns> FROM <table>

<script> here is a program that receives data on stdin, transforms them, and writes the result to stdout in transformed form. In essence, it is very similar to the streaming interface for launching MapReduce jobs described in “Big Data from A to Z. Part 2: Hadoop”.


Say we have a table with users' salaries specified in different currencies:

| user_salary.name | user_salary.salary | user_salary.currency |
|------------------|--------------------|----------------------|
| alexander        | 100000             | RUB                  |
| evgeniy          | 4000               | EUR                  |
| alla             | 50000              | RUB                  |
| elena            | 1500               | EUR                  |

Our task is to produce a table where everyone's salary is specified in rubles. To do that, we'll write a Python script that transforms the data:

import sys

# EUR-to-RUB exchange rate used for the conversion (matches the sample output below)
EXCHANGE_RATE = 75

for line in sys.stdin:
    name, salary, currency = line.rstrip("\n").split('\t')
    if currency == 'EUR':
        print(name + "\t" + str(int(salary) * EXCHANGE_RATE))
    else:
        print(name + "\t" + salary)

The script assumes that the input data arrive in TSV format (columns separated by tabs). If the table contains a NULL value, the script receives the string '\N' on its input.
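Before the query can call the script, it has to be shipped to the nodes executing the job with ADD FILE; assuming the script above is saved as transform_to_rub.py (the file name is illustrative):

hive> ADD FILE transform_to_rub.py;

The USING clause would then reference 'python transform_to_rub.py'.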

Now we use the script to transform the table.

0: jdbc:hive2://localhost:10000/default> select 
transform(name, salary, currency)
using 'python' as (name, rub_salary)
from user_salary;
| name      | rub_salary |
|-----------|------------|
| alexander | 100000     |
| evgeniy   | 300000     |
| alla      | 50000      |
| elena     | 112500     |

In fact, using TRANSFORM allows you to completely replace classic MapReduce with Hive.


Map Join

As we wrote in the article on MapReduce techniques and strategies, JOINing two tables generally requires several MapReduce jobs. Since Hive runs on MapReduce, JOIN is an expensive operation here as well.

However, if one of the two tables being joined fits entirely into the memory of every node, we can get by with a single MapReduce job by loading that table into memory. This pattern is called Map Join. To make Hive use it, we have to give it a clue (a 'hint' in Hive terminology).


SELECT /*+ MAPJOIN(time_dim) */ COUNT(*) from
store_sales JOIN time_dim on (ss_sold_time_sk = t_time_sk)

In this example, we assume that store_sales is a big table, while time_dim is small and fits in memory. /*+ MAPJOIN(time_dim) */ is the hint telling Hive to run a MapJoin task.
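Note that in recent Hive versions the hint is often unnecessary: when automatic join conversion is enabled, Hive decides on its own whether the smaller table fits under the configured size threshold. A sketch of the relevant settings (the threshold value is illustrative):

hive> set hive.auto.convert.join=true;
hive> set hive.mapjoin.smalltable.filesize=25000000;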

Transaction model

The ACID transaction model implies support for four basic properties:

  • Atomicity: an operation either executes completely and changes all data, or fails and leaves nothing behind
  • Consistency: once an application performs an operation, its result becomes visible to all subsequent operations
  • Isolation: operations performed by one user have no side effects on other users
  • Durability: changes made by a successful operation remain valid even in case of a system failure

Generally speaking, Hive is not well suited for working with mutable data. However, there are a few cases where support for changing data is required, chiefly the following:

  • Data added through streaming (from systems like Flume or Kafka). It's desirable that the data become available for analysis in Hive as soon as they arrive
  • Schema updates, for instance adding a new column to a Hive table. It's desirable that the column is either successfully added to every record or fails and is added to none
  • Sometimes individual records still have to be updated

For these purposes, Hive has supported a transaction model since version 0.14, implemented via three operations: INSERT, UPDATE, and DELETE.

Support for those operations is very limited:

  • Currently, only ORC files are supported
  • Transaction support is disabled by default; you have to enable it explicitly in Hive's configuration (see the sketch after this list)
  • The BEGIN, COMMIT, and ROLLBACK commands that are standard for relational databases are not supported
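A minimal sketch of what this might look like, assuming Hive 0.14+ with the transaction-related properties (hive.support.concurrency, hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager, and the hive.compactor.* settings) enabled in hive-site.xml; the table layout and values below are illustrative:

hive> CREATE TABLE user_salary_acid (name string, salary int, currency string)
      CLUSTERED BY (name) INTO 4 BUCKETS
      STORED AS ORC
      TBLPROPERTIES ('transactional'='true');
hive> UPDATE user_salary_acid SET salary = 120000 WHERE name = 'alexander';
hive> DELETE FROM user_salary_acid WHERE name = 'elena';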

Transaction support is implemented with delta files. That is, when an update is executed, the data in the original file are not modified; instead, a new file recording the changed rows is created. Later on, Hive merges them using the compaction operation (HBase uses a similar mechanism).
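Compactions normally run automatically in the background, but they can also be requested manually; a sketch, using the illustrative table from above:

hive> ALTER TABLE user_salary_acid COMPACT 'major';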

All in all, since transaction support is severely limited, you should think carefully before relying on this functionality in Hive. Alternatives might be HBase or traditional relational databases.


Conclusion

This and the previous article of the series covered the main capabilities of Hive, a powerful tool that makes working with MapReduce tasks easier. Hive is a great choice for analysts accustomed to SQL; it integrates easily into existing infrastructure thanks to JDBC driver support; and, with support for User Defined Functions and custom transformations, it lets you move data processing entirely away from classic MapReduce. However, Hive isn't a silver bullet: when it comes to frequently updated data, you may want to consider tools like HBase or classic relational databases.

The next articles in the series will continue our review of Big Data tools and methods for processing such data.