Bucketing in Hive : Querying from a particular bucket

Mohamed Camara
3 min readApr 4, 2019
Photo Credit: DataFlair

Hive provides a feature that allows for the querying of data from a given bucket. The result set can be all the records in that particular bucket or a random sample data. Let’s say you added buckets when you were creating your table and you would like to fetch records from bucket 2, for example. This tutorial will go over that and also touch up on other important related information.

Just like partitioning, bucketing helps with optimization when working in Hive. Here are a few things to cover on buckets :

  • The CLUSTERED BY clause indicates the column on which the table is hash-partitioned/clustered on
  • The numbering order of buckets starts from 1.
  • When buckets are created they appear in HDFS as files, unlike partitions which show up as directories
  • Each file is identified by a number determined based on the hash_function(bucketing_column) mod num_buckets.
  • Buckets can be created on a table even without the table being partitioned beforehand.
  • If the table is partitioned then the bucketed files reside in each partitioned directory.
  • When querying from a bucket on a partitioned table, a partition value must be specified
  • Bucketed tables can allow for more efficiency in mapside join operations.

The syntax used to sample data from a bucket is tablesample and it is placed in the FROM clause in a query. In general, the tablesample clause allows for the querying of samples of data from a table whether bucketed or not. Different parameters are specified for each operation.

SELECT col_name FROM table_name TABLESAMPLE([param]);

In Block Sampling for example, one can randomly sample n rows or percentage of data from a table.

TABLESAMPLE (n PERCENT)

TABLESAMPLE (n ROWS)

To query records from a particular bucket, the syntax below can be used.

SELECT col_name FROM table_name TABLESAMPLE(BUCKET x out of n on bucket_col_name)

NOTE: This same syntax can be used on a non-bucketed table by specifying the sampling expression needed for the non-bucketed table. Sampling expression could be a column name or rand() function.

If the column indicated in the TABLESAMPLE clause is the same column that was specified in the CLUSTERED BY clause, TABLESAMPLE will only scan through the corresponding hash-partitions of the table.

Example 1

Example 2

--

--

Mohamed Camara

Big Data Fanatic. Interested in everything Data Engineering and Programming.