Writing UDF To Parse JSON In Hive
Sometimes we need to perform data transformation in ways too complicated for SQL (even with the Custom UDF’s provided by hive). Let’s take JSON manipulation as an example.
JSON is widely used to store and transfer data. Hive comes with a built-in json_tuple() function that can extract values for multiple keys at once. But if you have a nested JSON the query using json_tuple() can get messy very quickly. Say we have the following JSON:
If we were to use the json_tuple() function of Hive we would write something like:
Which, you have got to agree, looks really ugly!
Instead, we could quickly write a script (python being my fav.) that does the transformation for us. The input to the script is a single record of json from the table, and the output of the script should be tab separated values. The values can be operated upon or inserted into another table using hive.
Here’s the script. The script calls transform_json() method for each line, which extracts the values we are interested in and prints them with tab as the separation character.
In the hive script we first add the file and then call it using the transform() command. Here’s a neat looking script that does what our previous hive script did using json_tuple().
That’s it! Neat and simple!
In my particular case the keys were variables and I just could not have used json_tuple() to extract info from json.