Limitations of Hive Data Validation

Patraporn Kuhavattana
Published in Analytics Vidhya
3 min read · Aug 22, 2020

In the big data world, Hive is one of the most popular data warehouse tools. It offers convenience and flexibility, including a SQL-like data manipulation language and an easy data import mechanism: a user can simply copy a data file to a specified HDFS location. However, it also has some disadvantages, which is why a "data validation" mechanism is necessary.

The Limitation of Hive's Data Validation Mechanism

In Hive, when you create a table schema, you must specify each column name along with its data type, just as in the relational databases that are widely used today. You might therefore think that you can simply import your data into the table's HDFS location and run a simple query such as the following:

SELECT * FROM [tablename];

But that is not the case. In this scenario you won't get an error message when your data does not match the data type specified in the schema, as you would in a relational database. Instead, Hive converts any value that is not compatible with the specified column type to NULL, which is sometimes undesirable.
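Here is a minimal sketch of that behavior, assuming a hypothetical comma-delimited table named users whose age column is declared as INT:

-- Hypothetical table expecting an integer in the second column.
CREATE TABLE users (
  name STRING,
  age  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Suppose the file copied into the table's HDFS location contains the row:
--   somchai,twenty-five
-- Hive accepts the file without complaint; the query below simply returns
-- NULL for the age column instead of raising an error.
SELECT name, age FROM users;
-- somchai    NULL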

So far, there is no tool designed specifically to fix this issue, but there are some methods you can use as workarounds.

  1. Preprocess your data before moving it into Hive. You can do this by writing an ETL process in MapReduce or by using Pig (an open-source tool that implements a higher-level MapReduce abstraction). However, if you have many tables, it becomes inconvenient to write a separate MapReduce program to handle validation for each table, which leads to our second method.
  2. Create a staging table in which every column has the STRING data type, then use Hive functions to validate and convert the data before loading it into a persistent table.

Creating a staging table and applying Hive functions for data validation

You can follow the steps below to create a temporary staging table for your raw data, then apply Hive functions to validate it before storing it in your persistent Hive table.

1. Create a staging table for storing your raw data, with every column typed as STRING.
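The sketch below reuses the hypothetical users data from above; because every column in the staging table is declared as STRING, nothing is silently converted to NULL when the raw file is read:

-- Hypothetical staging table: all columns are STRING.
CREATE EXTERNAL TABLE users_staging (
  name STRING,
  age  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/staging/users';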

2. Copy your text data file to the staging table's HDFS location.
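You can copy the file with the hdfs command line, or equivalently load it from the Hive shell; the path below is a placeholder:

-- Load a local text file into the staging table's HDFS location.
LOAD DATA LOCAL INPATH '/tmp/users.csv' INTO TABLE users_staging;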

3. Use the CAST and NVL functions when selecting data from the staging table, so that invalid values are converted to a default value before they reach the persistent table.
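A minimal sketch of this step, reusing the hypothetical users and users_staging tables from above: CAST turns any non-numeric string into NULL, and NVL then replaces that NULL with a default value instead of letting it slip silently into the persistent table.

-- Validate and convert while moving data from staging to the persistent table.
-- CAST(age AS INT) yields NULL for non-numeric values;
-- NVL(..., -1) replaces that NULL with the default value -1.
INSERT INTO TABLE users
SELECT
  name,
  NVL(CAST(age AS INT), -1) AS age
FROM users_staging;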

4. (Optional) Create your own SerDe, the serializer-deserializer interface that Hive uses to read and write data at a table's HDFS location, and have it validate the data types. You can find out how to write a custom Hive SerDe in my other blog post.
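As a sketch of how a custom SerDe is attached to a table (the jar path and class name here are placeholders for your own SerDe, not a real library):

-- Register the jar containing the custom SerDe, then point a table at it.
ADD JAR /tmp/validating-serde.jar;

CREATE TABLE users_validated (
  name STRING,
  age  INT
)
ROW FORMAT SERDE 'com.example.hive.ValidatingSerDe'
LOCATION '/data/validated/users';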

With a custom SerDe in place, an error will be displayed if there is any data type mismatch.

Conclusion

In conclusion, even though using Hive as a data warehouse tool gives you a lot of benefits, there are some issues you should be aware of, including "data validation on read", in order to handle your data properly.

This is in contrast to "data validation on write": when you call an INSERT command, Hive handles the data type validation for you automatically, so you don't have to worry about it. If you want more detail about how to create a custom Hive SerDe, you can go to my next blog post. I hope this blog helps you understand some of Hive's limitations and overcome them with an appropriate solution.

