Comparison between Spark and Hive Bucketing

Deepa Vasanthkumar
3 min read · Feb 10, 2022

Hive Partitioning/Bucketing

Hive Partitioning is used for distributing the load horizontally. It is applied to low-cardinality columns.

For example:
partitioning a student table on the basis of state or gender distributes the load horizontally. Hive creates a separate directory for each partition.

Bucketing is applied to columns with high cardinality, such as student_id or similar primary-key columns, and the data can be bucketed into a user-specified number of buckets.

CREATE TABLE students (
  student_id STRING,
  name STRING,
  class STRING,
  dob STRING,
  pincode FLOAT)
PARTITIONED BY (
  gender STRING,
  state STRING)
CLUSTERED BY (student_id) INTO 5 BUCKETS;

This will be stored in:

/user/hive/warehouse/students/gender=M/state=KL/ file1 file2 file3 file4 file5
/user/hive/warehouse/students/gender=M/state=KA/ file1 file2 file3 file4 file5

Spark Bucketing/Partitioning

Just like in Hive, data in a Spark partitioned table are usually stored in different directories, with the partitioning column values encoded in the path of each partition directory. All built-in file sources (including Text/CSV/JSON/ORC/Parquet) are able to discover and infer partitioning information automatically.
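As a quick sketch (assuming an active SparkSession named spark; the path and column names are made up for illustration), writing with partitionBy produces the country=… directory layout, and reading the root path back recovers the partition column:

df = spark.createDataFrame([(1, "IN"), (2, "US")], ["id", "country"])
df.write.mode("overwrite").partitionBy("country").parquet("/tmp/students_by_country")

# Spark discovers the country=IN/ and country=US/ subdirectories and
# re-adds "country" as a column inferred from the directory path.
spark.read.parquet("/tmp/students_by_country").printSchema()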

For file-based data sources, it is also possible to bucket and sort, or partition, the output. Bucketing and sorting are applicable only to persistent tables (only with saveAsTable, not with save operations).

df = spark.read.parquet("/tmp/student.parquet")

(df.write
    .partitionBy("country")
    .bucketBy(50, "id")
    .saveAsTable("student_partition_bucketed"))
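To confirm that the bucket spec was recorded in the metastore, we can inspect the table (a sketch; the exact output layout varies by Spark version):

# Should list details such as the number of buckets (50) and the bucket column (id)
spark.sql("DESCRIBE EXTENDED student_partition_bucketed").show(truncate=False)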

There are differences in how Spark and Hive handle bucketing. Let us look at the differences:

Hive:
🔸A single file represents one bucket.
🔸Buckets are sorted.
🔸A reducer operation populates each bucket (one reducer per bucket), which requires a sort/shuffle.
🔸Exactly the number of buckets specified in the table definition is created.

Spark:
🔹A collection of files comprises one bucket.
🔹Buckets are not sorted.
🔹Multiple files can be associated with a bucket, and writes don't require a shuffle (see the sketch below).
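A quick way to see the collection-of-files behavior (a sketch; the table name is made up, and file counts depend on parallelism): writing from 4 tasks into 5 buckets can emit up to 4 × 5 files, with the bucket ID encoded as a suffix in each file name.

# Each of the 4 write tasks can emit one file per bucket it receives rows for,
# so the table directory can hold up to 20 files for 5 buckets.
spark.range(100).withColumnRenamed("id", "student_id").repartition(4) \
    .write.mode("overwrite") \
    .bucketBy(5, "student_id") \
    .saveAsTable("students_multi_file_buckets")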

One major difference is that Spark and Hive have different hash implementations.

Spark uses HashPartitioning, which relies on Murmur3Hash.

Implementation:

def hash(cols: Column*): Column = withExpr {
  new Murmur3Hash(cols.map(_.expr))
}

The bucket ID is then the non-negative modulo of that hash value by the number of buckets.

https://github.com/apache/spark/blob/d2bdd6595ec3495672c60e225948c99bfaeff04a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2464
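In practice this means the bucket ID Spark assigns can be reproduced from PySpark, since pyspark.sql.functions exposes the same Murmur3 hash; a minimal sketch (the 5-bucket count is just an assumption for illustration):

import pyspark.sql.functions as F

df = spark.range(10).withColumnRenamed("id", "student_id")

# Murmur3 hash of the column, then a non-negative modulo by the bucket count
df.select(
    "student_id",
    F.expr("pmod(hash(student_id), 5)").alias("bucket_id"),
).show()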

whereas in Hive,

The hash_function depends on the type of the bucketing column.

Example: for an int, hash_int(i) == i, so a student_id of 17 with 5 buckets lands in bucket 17 % 5 = 2.
Hive 2.1.0 added the mask_hash function for hashing string values.

In Hive 2.x it uses MD5 as the hashing algorithm; in Hive 3.x, this was changed to SHA-256.

That is, in short, Spark support for Hive bucketing is still in progress (SPARK-19256), and Spark reads a Hive bucketed table as a non-bucketed table.

Hive allows inserting data into a bucketed table without guaranteeing bucketed-ness and sorted-ness, based on these two configs:

hive.enforce.bucketing

hive.enforce.sorting

Spark writes an ORC/Parquet bucketed table as a non-bucketed table; the exception it would otherwise throw can be disabled by setting `hive.enforce.bucketing`=false and `hive.enforce.sorting`=false, after which it writes the output as a non-bucketed table.

By default, Spark disallows users from writing output to Hive bucketed tables (given that the output won't adhere to Hive's semantics).
If a user still wants to write to a Hive bucketed table, the only resort is to set hive.enforce.bucketing=false and hive.enforce.sorting=false, which means the user does NOT care about the bucketing guarantees.
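A hedged sketch of that escape hatch (hive_bucketed_students is a hypothetical existing Hive bucketed table, and this assumes a Hive-enabled SparkSession): with both configs set to false, Spark will perform the insert, but the written files will not respect Hive's bucket layout.

# WARNING: this gives up Hive's bucketing guarantees for the written data
spark.sql("SET hive.enforce.bucketing=false")
spark.sql("SET hive.enforce.sorting=false")

df.write.mode("append").insertInto("hive_bucketed_students")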

There are performance advantages in joins on bucketed fields, as bucketed tables are effectively pre-shuffled on the bucketing column.
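For example (a sketch with made-up table names), bucketing both sides of a join on the join key with the same bucket count lets Spark plan a sort-merge join without an Exchange on either side:

# Force a sort-merge join so the demo isn't short-circuited by a broadcast join
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

for name in ("students_a", "students_b"):
    spark.range(1000000).withColumnRenamed("id", "student_id") \
        .write.mode("overwrite") \
        .bucketBy(50, "student_id").sortBy("student_id") \
        .saveAsTable(name)

# With matching bucket counts on the join key, the physical plan should
# show no Exchange (shuffle) on either side of the join.
spark.table("students_a").join(spark.table("students_b"), "student_id").explain()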

For more detailed information, refer to the Spark documentation.

As part of SPARK-23803 and SPARK-12850, bucket pruning is supported from Spark 2.4 onwards.

#BucketPruning is a feature in Spark that reduces IO when filtering the dataframe on the field used for bucketing.
It is an optimization strategy that pushes some predicates into the scan to skip scanning files that definitely won't match the value.
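A small sketch against the hypothetical bucketed table from the join example above: an equality predicate on the bucketing column lets Spark scan only the files of the matching bucket.

# The scan node in the plan should report something like
# "SelectedBucketsCount: 1 out of 50".
spark.table("students_a").where("student_id = 42").explain()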

Please #comment if I missed anything, and don't forget to 👏🏾
