Partitioning vs Bucketing — In Apache Spark
Data partitioning is critically important when dealing with Big Data, because job performance largely depends on how the data is laid out. When we need to process terabytes or petabytes of data, splitting it into physical or logical partitions is a key technique: querying or processing partitioned data significantly improves performance.
Data Partitioning — What & Why?
Data partitioning is a way to split a large dataset into smaller logical chunks so that those chunks can be processed in parallel, improving overall job performance.
For example, suppose we need to perform some operation on a 100 GB file. Processing that single 100 GB file could take hours or even days, depending on the operation. If the same file is instead broken into 100 files of 1 GB each and those files are processed in parallel, the total time drops dramatically, since the work is spread across many workers at once.
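To make this concrete, here is a minimal PySpark sketch of both flavors of partitioning. The dataset, the `country` column, and the paths are hypothetical placeholders; the `repartition` and `partitionBy` calls themselves are standard Spark APIs.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()

# Hypothetical dataset; stands in for the 100 GB file from the example.
df = spark.read.parquet("/data/sales")

# In-memory partitioning: redistribute the data into 100 chunks so that
# downstream transformations run concurrently across executors.
df = df.repartition(100)

# On-disk partitioning: write one sub-directory per distinct `country`
# value, so later queries that filter on country read only the matching
# partitions instead of scanning the full dataset.
(df.write
   .partitionBy("country")
   .mode("overwrite")
   .parquet("/data/sales_partitioned"))
```

With this layout, a query such as `spark.read.parquet("/data/sales_partitioned").filter("country = 'US'")` can prune away every other country's directory and touch only a fraction of the data.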