How To Create Your Own Hive SerDe — Hive Custom Data Serialize-Deserialize Mechanism

Patraporn Kuhavattana
Published in Analytics Vidhya
Aug 23, 2020

As mentioned in my earlier blog post, SerDe is an interface that Hive uses to deserialize data (read it from the table's HDFS location and convert it to a Java object) and to serialize data (convert the Java object representing each record, with the help of an ObjectInspector, to a Writable that can be written to the HDFS location).

Figure: How Hive reads and writes records using a SerDe. Source: http://www.dummies.com/programming/big-data/hadoop/defining-table-record-formats-in-hive/

Types of SerDe

By default, when you specify TEXTFILE in the STORED AS clause at table creation, Hive uses a SerDe called LazySimpleSerDe:

org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

However, there are other types of SerDe, including the OpenCSV SerDe:

org.apache.hadoop.hive.serde2.OpenCSVSerde

You can use this one when you have to store your table in CSV format. If you use Cloudera, it also comes with a custom SerDe called JSONSerDe:

com.cloudera.hive.serde.JSONSerDe

This SerDe can read and write your data in JSON format.

However, you get more flexibility if you are able to create your own SerDe. This can prove useful when you have special use cases that the built-in ones do not cover.

So the purpose of this article is to help you develop your own custom SerDe.

How to Implement A Custom SerDe

To create a custom SerDe, you first have to define a class that implements the SerDe interface.

Then declare its member variables. Each variable has its own purpose:

  1. separatorChar: The field separator. When you store a text file at the Hive table location, Hive needs to know which separator to use in order to split each line into fields correctly. If this parameter is not set, the separator defaults to tab.
  2. rowTypeInfo: An object of type StructTypeInfo, which holds the name and type (TypeInfo) of each field.
  3. rowOI: An object of type ObjectInspector, which describes the structure of each row.
  4. colNames: The list of column names.
  5. row: A list of Java objects holding the value of each field in a row.

The SerDe interface also requires you to implement the methods below so that your class works properly when Hive executes it.

SerDe Required Methods

To get your custom SerDe class to work, you have to implement these methods first. Let's go through them one by one.

initialize

This method sets up all the important properties, such as the list of column names, the list of column types, and the ObjectInspector object that describes the rows and columns.
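As a rough illustration of what initialize does, here is a minimal sketch in plain Java with no Hive dependency. It assumes the table properties arrive as a java.util.Properties object carrying the "columns" and "columns.types" keys, and that the separator is configured through a property named "separatorChar" (a hypothetical name chosen to match the variable above). The class InitializeSketch is a stand-in for illustration, not the real SerDe class; a real implementation would also build rowTypeInfo and rowOI with Hive's own factories.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hive-free stand-in for the state a SerDe sets up in initialize().
final class InitializeSketch {
    char separatorChar = '\t';                        // default separator: tab
    final List<String> colNames = new ArrayList<>();  // column names, in table order
    final List<String> colTypes = new ArrayList<>();  // type names such as "int" or "string"

    // Reads the table properties, roughly as initialize() would.
    void initialize(Properties tbl) {
        colNames.clear();
        colTypes.clear();
        String names = tbl.getProperty("columns", "");
        String types = tbl.getProperty("columns.types", "");
        if (!names.isEmpty()) {
            colNames.addAll(Arrays.asList(names.split(",")));
        }
        if (!types.isEmpty()) {
            // Assumption: type names are colon-separated, e.g. "int:string".
            colTypes.addAll(Arrays.asList(types.split(":")));
        }
        if (colNames.size() != colTypes.size()) {
            throw new IllegalArgumentException("columns and columns.types disagree");
        }
        // "separatorChar" is a hypothetical property name used only in this sketch.
        String sep = tbl.getProperty("separatorChar");
        separatorChar = (sep == null || sep.isEmpty()) ? '\t' : sep.charAt(0);
    }
}
```

Keeping the tab default inside initialize means a table created without the separator property still reads correctly.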

deserialize

This method converts a record in Writable (byte array) format into a Java object.
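The splitting step of deserialize can be sketched in plain Java as follows. This is only an approximation under stated assumptions: in Hive the record arrives as a Writable (typically a Text), while here a UTF-8 byte array stands in for it, and the class name DeserializeSketch and the numCols parameter are hypothetical. The sketch returns raw strings; converting them to typed values is left to a separate step.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hive-free stand-in for the field-splitting part of deserialize().
final class DeserializeSketch {
    // Splits one record into exactly numCols raw fields.
    // Missing trailing fields become null; extra fields are ignored.
    static List<String> deserialize(byte[] record, char separatorChar, int numCols) {
        String line = new String(record, StandardCharsets.UTF_8);
        // Pattern.quote treats separators like '|' literally;
        // the -1 limit keeps trailing empty fields.
        String[] parts = line.split(Pattern.quote(String.valueOf(separatorChar)), -1);
        List<String> row = new ArrayList<>(numCols);
        for (int i = 0; i < numCols; i++) {
            row.add(i < parts.length ? parts[i] : null);
        }
        return row;
    }
}
```

Padding short records with null rather than failing is a design choice in this sketch; it keeps one malformed line from aborting a whole query.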

parseField

This method is the most important one here, as it is used to read each data field and validate its data type. It is therefore the method that lets you define your own custom data validation rules.
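A validation rule of this kind might look like the sketch below. It is written in plain Java and is not the original implementation: the class name ParseFieldSketch, the supported type names, and the choice to return null for values that do not match the declared type are all assumptions made for illustration.

```java
// Hive-free stand-in for a parseField() helper with simple validation rules.
final class ParseFieldSketch {
    // Converts one raw field to a Java object of the declared type.
    // Returns null when the value is missing or does not match the type.
    static Object parseField(String raw, String typeName) {
        if (raw == null) {
            return null;
        }
        try {
            switch (typeName) {
                case "int":
                    return Integer.valueOf(raw.trim());
                case "double":
                    return Double.valueOf(raw.trim());
                case "boolean": {
                    String b = raw.trim().toLowerCase();
                    if (b.equals("true")) return Boolean.TRUE;
                    if (b.equals("false")) return Boolean.FALSE;
                    return null; // anything else is treated as invalid
                }
                case "string":
                    return raw;
                default:
                    throw new IllegalArgumentException("Unsupported type: " + typeName);
            }
        } catch (NumberFormatException e) {
            return null; // invalid number: report as a missing value
        }
    }
}
```

With these rules, parseField("42", "int") yields the Integer 42, while parseField("abc", "int") yields null, so a bad value shows up as a missing field instead of failing the read. Stricter rules, such as range checks, would go in the same switch.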

serialize

This method converts a Java object into a Writable (byte array) format.
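The reverse direction can be sketched the same way. The plain-Java example below joins the values of a row with the separator and encodes the result as UTF-8 bytes, which stand in for the Writable that a real SerDe would return. The class name SerializeSketch is hypothetical, and writing null values as \N follows Hive's default null marker for text tables; treat both as assumptions of this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hive-free stand-in for serialize(): row values -> one delimited record.
final class SerializeSketch {
    static byte[] serialize(List<?> row, char separatorChar) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < row.size(); i++) {
            if (i > 0) {
                sb.append(separatorChar); // separator goes between fields only
            }
            Object value = row.get(i);
            // Assumption: nulls are written as \N, the usual text-table null marker.
            sb.append(value == null ? "\\N" : value.toString());
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }
}
```

A real serialize method would read the field values through the ObjectInspector passed in by Hive rather than from a plain list.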

getObjectInspector

As its name suggests, this method returns the ObjectInspector object that describes each data record.

You can find the complete code below.

Finally, you have to compile your Java class into a JAR file, then attach it to your Hive session (for example with ADD JAR) and reference the class in the ROW FORMAT SERDE clause when creating your table.

Conclusion

So that's all you need to implement your own SerDe. Hopefully, this article will help you create your own custom SerDe and add data validation whenever you need it. You can do many things with it, including implementing a JSON read and write mechanism, as you can see in other articles on the internet.

Patraporn Kuhavattana
Analytics Vidhya

A data scientist who is eager to know more about the world. A book lover interested in literature, science, and philosophy.