Get Hive count in seconds

Kirill Bobrov
Analytics Vidhya
Published in
4 min readNov 30, 2021

--

To get an accurate count of the amount of data, when you can’t have any less or one more, COUNT(*) is the only way. But there are times when you don't need an exact number, but you need a rough estimate of the table size, for example, to understand that the table is not empty, or to roughly estimate the size of the data to be migrated. There is a faster way than COUNT(*) for such tasks. We can use Hive statistics.

Hive statistics

Hive can better optimize complex or multi-table queries if it has access to statistics about the amount of data and how the values are distributed. This information includes physical characteristics such as number of rows, number of data files, the total size of data files, and file format. For tables with partitions, the numbers are calculated for each partition and aggregated for the whole table.

This metadata is stored in the Hive metastore database and can be easily updated. If any statistic is unavailable, a value of -1 is used as a placeholder. Some numbers, such as the number and total size of data files, are always kept up to date because they can be computed cheaply as part of the HDFS block metadata collection.

There are many ways to use such statistics information:

  • As already been mentioned Hive’s cost-based optimizer uses statistics to generate a…

--

--

Kirill Bobrov
Analytics Vidhya

helping robots conquer the earth and trying not to increase entropy using Python, Data Engineering, ML. Check out my blog—luminousmen.com