Hive vs Impala Schema Loading Case: Reading Parquet Files

Kartik Gupta
Jan 15, 2019

Quite often in big data, a scenario arises where raw data is processed in Spark and then needs to be made available to the analytics team. A standard solution is to write the processed data from the Spark application as Parquet files in HDFS and then point a Hive/Impala table at this data, on which the analytics team can then fire SQL queries.

I came across one such scenario and faced an issue where

Hive was able to correctly query a table created in Impala, while Impala itself wasn't able to give the correct result.

Here is what I observed about one key difference in how Impala and Hive read Parquet files. I would like to share it with others in case it helps them.

PLEASE FEEL FREE TO SKIP TO THE END FOR THE FINAL SUMMARY

Now suppose the Spark dataframe's printSchema gives the following output:

root
|-- empId: integer (nullable = false)
|-- name: string (nullable = false)
|-- salary: integer (nullable = false)
|-- age: integer (nullable = false)

and the data is as follows:

+-------+--------+--------+-----+
| empId | name   | salary | age |
+-------+--------+--------+-----+
| 1     | Ramesh | 50000  | 32  |
| 2     | Khilan | 52345  | 25  |
| 3     | Hardik | 78541  | 27  |
+-------+--------+--------+-----+

Here, when we write the dataframe to Parquet using the

df.write.parquet(<hdfs path>)

syntax, the Parquet files will contain the columns in the same order as they appear in the dataframe.
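
As a minimal sketch of this step (the SparkSession setup, app name, and HDFS path below are illustrative assumptions, not taken from any real pipeline), the dataframe and the write could look like this in PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("employee-parquet").getOrCreate()

# schema matching the printSchema output shown above
schema = StructType([
    StructField("empId", IntegerType(), False),
    StructField("name", StringType(), False),
    StructField("salary", IntegerType(), False),
    StructField("age", IntegerType(), False),
])

df = spark.createDataFrame(
    [(1, "Ramesh", 50000, 32), (2, "Khilan", 52345, 25), (3, "Hardik", 78541, 27)],
    schema,
)

# the Parquet files get the columns in dataframe order: empId, name, salary, age
df.write.parquet("hdfs:///tmp/employee")  # hypothetical path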

Now suppose we create a table in Impala on top of this data as follows:

create table employee (
empId int,
salary int,
age int,
name string
)
stored as parquet
location '<hdfs path>'

Now, when we run a select operation on the table

impala> select salary from employee limit 5;

we get an error of the form

Error: required int, found string

However, if we run the same query in Hive, we get the data back.

(Note: since Hive and Impala share the same metastore, any table created in Impala is available in Hive and vice versa.)

beeline> select salary from employee;
+--------+
| salary |
+--------+
| 50000  |
| 52345  |
| 78541  |
+--------+

This leads to the conclusion that Hive first checks whether the column name is present in the file and then reads that particular column, provided the datatype also matches. In the case of Impala, however, the columns mentioned in the create table statement are assumed to be present in the Parquet file in the same order, so while reading, only the datatype of the column is checked and the column name is ignored. That is why "select salary" errors out: salary is the second column in the table definition, but the second column in the Parquet files turns out to be name, which is of string type.
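
A quick way to confirm the column order the files actually contain (assuming the same Spark session and hypothetical path as in the sketch above) is to read the Parquet back and print the schema:

# read the written files back and print the schema stored in them
spark.read.parquet("hdfs:///tmp/employee").printSchema()
# root
#  |-- empId: integer (nullable = true)
#  |-- name: string (nullable = true)
#  |-- salary: integer (nullable = true)
#  |-- age: integer (nullable = true)

This shows name sitting in the second position, which is exactly what Impala's positional resolution reads for "select salary".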

To play with this hypothesis further, I created another table in Impala as follows:

create table employee2 (
empId int,
salary string,
age int,
random_number int
)
stored as parquet
location '<hdfs path>'

Now, if we run a select query in Impala:

impala> select random_number from employee2;
+---------------+
| random_number |
+---------------+
| 32            |
| 25            |
| 27            |
+---------------+

However, Hive running the same query will give an error like:

beeline> select random_number from employee2;
Error: column 'random_number' not found in table employee2

Another difference in results comes out as follows:

impala> select salary from employee2;
+-------------+
| salary      |
+-------------+
| Ramesh      |
| Khilan      |
| Hardik      |
+-------------+

However, Hive will give the error:

beeline> select salary from employee2;
ERROR: required string, found int

(Note: I haven't reproduced the exact query outputs here; the result messages above are along the lines of the actual output and convey it semantically.)

Another common variation of this side effect occurs when all the columns are of the same type, but the create table statement lists them in a different order than they appear in the Parquet files.

For example, let's remove the name column from the example dataframe so that all columns have the same datatype. The dataframe data will now look as follows:

+-------+--------+-----+
| empId | salary | age |
+-------+--------+-----+
| 1     | 50000  | 32  |
| 2     | 52345  | 25  |
| 3     | 78541  | 27  |
+-------+--------+-----+
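
Continuing the earlier PySpark sketch, such a dataframe can be produced by dropping the name column and writing to a second hypothetical path:

# drop the string column so that all remaining columns are int
df_no_name = df.drop("name")

# Parquet column order again follows the dataframe: empId, salary, age
df_no_name.write.parquet("hdfs:///tmp/employee3")  # hypothetical path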

Parquet files written from this dataframe will also contain the columns in this order. Now, let's create the Impala table as follows:

create table employee3 (
empId int,
age int,
salary int
)
stored as parquet
location '<hdfs path>'

Here, as shown above, one can mistakenly put the column 'age' before the column 'salary' without suspecting anything harmful.

Now the select query in Impala gives:

impala> select * from employee3;
+-------+-------+--------+
| empId | age   | salary |
+-------+-------+--------+
| 1     | 50000 | 32     |
| 2     | 52345 | 25     |
| 3     | 78541 | 27     |
+-------+-------+--------+

But in Hive, the same query gives the following result:

beeline> select * from employee3;
+-------+-------+--------+
| empId | age   | salary |
+-------+-------+--------+
| 1     | 32    | 50000  |
| 2     | 25    | 52345  |
| 3     | 27    | 78541  |
+-------+-------+--------+

Notice that here Hive gives the expected result, while Impala gives an incorrect one; a person debugging this will realize that the data is coming through correctly but simply in the wrong order.

Hence, these small experiments lead to the following useful conclusions.

SUMMARY

When Parquet files are involved:

  1. Hive matches both the column name and the datatype when fetching data, and the ordering of the columns doesn't matter.
  2. Impala assumes that the columns in the create table statement are present in the Parquet file in the same order. Hence, at scan time, Impala only checks that the datatype ordering in the create table statement matches that of the Parquet file. This also means the column name doesn't matter, so much so that it doesn't even matter whether the column is present in the Parquet file at all!
  3. These two points can be summarized by saying that Hive pays attention to the schema stored in the Parquet files, while Impala simply imposes the create table schema on the Parquet data.

So, in case one observes different query output between Hive and Impala and Parquet files are involved, it is most likely that the Parquet file and the create table statement contain different columns, or the same columns in a different order. A best practice would be to simply keep the columns in the create table statement in the same order as they are present in the source Parquet files. I assume these observations should extend to other source file formats, but I haven't checked that myself.

Quick Note

For those who went through the entire article, I have a little surprise (not so much of a surprise for those who jumped straight to the summary).

We can make Impala look at column names just like Hive does. This can act as a quick bugfix when development is already done and one wants to avoid rework. It is done by setting the query option PARQUET_FALLBACK_SCHEMA_RESOLUTION to name, either on a per-query basis or as a default setting. For more info, please refer to the official Impala documentation on this option, which confirms the hypothesis tested above.
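
For example, in impala-shell the option can be set for the current session before re-running the query from the first example:

-- switch Impala to name-based schema resolution, like Hive
set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;

-- salary is now matched by name rather than by position
select salary from employee;

The default value of this option is position, which is what produces the behaviour observed above.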

So, I hope this article proves helpful to its readers.
