Hive — Partition Column Name — CamelCase — Bad Idea

[ This was originally posted on my blog on July 3, 2015 ]

Outside Java code I prefer snake_case over camelCase. This is mostly a preference without any strong good reason:

  • Without a proper IDE I find it easier to read snake_case words than camelCase words.
  • Python’s naming convention uses snake_case for variable names. They use camelCase only for class names.
  • Languages like MySQL, Hive, etc convert everything to lowercase. If you name a column orderDate then it’s actually stored as orderdate.

I never ran into any problems, until yesterday. The problem occurred because Hive converted everything to lowercase while the underlying filesystem was case sensitive causing Hive to not find the partitions. Let me go into the details by the help of an example.

Let’s create a table to hold order details. We have stored the data in HDFS partitioned by order date.

CREATE EXTERNAL TABLE orders (
order_id string,
item_id string,
quantity int,
checkout_price double,
address_id string
) PARTITIONED BY (orderDate string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/tables/orders/us';

I stuck to my convention and used snake_case for column names. But if you notice carefully, you’ll see that the partition name is in camelCase.

A little background on Hive partitions first. In Hive, partitions are essentially folders that contain data. They also double up as columns that can be used in queries. So when you say that the location is /tables/orders/us and that the partition column is orderDate then Hive searches for data in the HDFS locations: /tables/orders/us/orderDate=<>/.

The problem with this table definition is that the partition name is in camelCase. Since Hive converts everything into lowercase Hive stores the column name as orderdate instead of orderDate. You can confirm this by running the query desc orders;. So now when you fire the command msck repair table orders; to load all the partitions Hive searches for the folders: /tables/orders/us/orderdate=<>/ and because the underlying filesystem is case sensitive Hive doesn’t find any folders to load as partitions! Hive will happily tell you that the folders with name orderDate are not in it’s metastore but will not load them because Hive is actually only interested in folders with name orderdate.

So now I am in a nice soup! But at least through this experience I have learnt (the hard way) that when writing data into HDFS (or S3 as we do in Amazon) I should avoid using camelCase words and only use snake_case in the path.