Image Source: Pexels

GUIDE TO SPARK PARTITIONING

Building Partitions For Processing Data Files in Apache Spark

Continuing an earlier story on determining the number of partitions at critical transformations, this story describes the reasoning behind the number of partitions Spark creates from the data file(s) read in a Spark application.

Ajay Gupta
Jun 15, 2020 · 10 min read
File format specific APIs:
Dataset<Row> = SparkSession.read.csv(String path or List of paths)
Dataset<Row> = SparkSession.read.json(String path or List of paths)
Dataset<Row> = SparkSession.read.text(String path or List of paths)
Dataset<Row> = SparkSession.read.parquet(String path or List of paths)
Dataset<Row> = SparkSession.read.orc(String path or List of paths)
Generic API:
Dataset<Row> = SparkSession.read.format(String fileFormat).load(String path or List of paths)
'path' in the above APIs is either an actual file path or a directory path. It can also contain wildcards, such as '*'. There are more variants of these APIs which include the facility of specifying various options related to reading a specific file format. The full list can be referred to here.
(a) spark.default.parallelism (default: total number of CPU cores)
(b) spark.sql.files.maxPartitionBytes (default: 128 MB)
(c) spark.sql.files.openCostInBytes (default: 4 MB)
maxSplitBytes = Minimum(maxPartitionBytes, Maximum(openCostInBytes, bytesPerCore))
where:
bytesPerCore = (Sum of sizes of all data files + No. of files * openCostInBytes) / default.parallelism
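The computation above is pure arithmetic, so it can be sketched outside Spark. Below is a rough Python sketch of it; the default values are taken from the config properties listed above, and the exact rounding and clamping are assumptions based on the formula rather than a copy of Spark's source:

```python
# Sketch of the maxSplitBytes computation for file-based data sources.
# Assumed defaults: 128 MB maxPartitionBytes, 4 MB openCostInBytes.
def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=128 * 1024 * 1024,
                    open_cost_in_bytes=4 * 1024 * 1024):
    # Every file is padded with openCostInBytes to account for the cost of
    # opening it; the padded total is then spread over the default parallelism.
    total_bytes = sum(file_sizes) + len(file_sizes) * open_cost_in_bytes
    bytes_per_core = total_bytes // default_parallelism
    # The split size is capped at maxPartitionBytes and floored at openCostInBytes.
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))

# e.g. a single 320 MB file on a cluster with default parallelism of 8:
print(max_split_bytes([320 * 1024 * 1024], 8))
```

Note how the floor at openCostInBytes keeps tiny inputs from producing absurdly small splits, while the cap at maxPartitionBytes bounds partition size on large inputs.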
Illustration of the process of deriving the partitions for a set of data files: first the data files are split based on the computed value of maxSplitBytes, then the splits are packed into one or more partitions based on maxSplitBytes and openCostInBytes.

Number of partitions when RDD APIs are used for reading data files:

SparkContext.newAPIHadoopFile(String path, Class<F> fClass, Class<K> kClass, Class<V> vClass, org.apache.hadoop.conf.Configuration conf)
SparkContext.textFile(String path, int minPartitions)
SparkContext.sequenceFile(String path, Class<K> keyClass, Class<V> valueClass)
SparkContext.sequenceFile(String path, Class<K> keyClass, Class<V> valueClass, int minPartitions)
SparkContext.objectFile(String path, int minPartitions, scala.reflect.ClassTag<T> evidence$4)
minSize (mapred.min.split.size, default: 1 MB)
blockSize (dfs.blocksize, default: 128 MB)
splitSize = Math.max(minSize, Math.min(goalSize, blockSize))
where:
goalSize = (Sum of lengths of all files to be read) / minPartitions
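A quick Python sketch of this splitSize computation, with the defaults listed above filled in as assumptions:

```python
# Sketch of the Hadoop-style split-size computation used by the RDD
# file-reading APIs. Assumed defaults: 1 MB minSize, 128 MB block size.
def split_size(file_sizes, min_partitions,
               min_size=1 * 1024 * 1024, block_size=128 * 1024 * 1024):
    # goalSize spreads the total input evenly over the requested minPartitions.
    goal_size = sum(file_sizes) // min_partitions
    # The result is capped at the block size and floored at minSize.
    return max(min_size, min(goal_size, block_size))

# e.g. a 500 MB file read with minPartitions = 4 gives a 125 MB split size:
print(split_size([500 * 1024 * 1024], 4))
```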
Illustration of the process of deriving the partitions for a set of data files: first the data files are split based on the computed value of splitSize, then each of the non-zero splits is assigned to a single partition.
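Since each non-zero split becomes its own partition here, the partition count follows directly from splitSize. The sketch below also models the roughly 10% slack that Hadoop's FileInputFormat allows on the last split of a file (the 1.1 slop factor is an assumption about that behavior):

```python
# Sketch of the RDD-read partition count: each file is carved into splits
# of splitSize bytes, and every non-zero split maps to one partition. A
# trailing remainder of up to ~10% over splitSize is kept in the last
# split instead of starting a new one (the "slop" factor).
def num_partitions(file_sizes, split_size, slop=1.1):
    total = 0
    for size in file_sizes:
        remaining = size
        while remaining / split_size > slop:
            total += 1
            remaining -= split_size
        if remaining > 0:  # empty files contribute no partition
            total += 1
    return total

# e.g. a 280 MB file with a 128 MB split size yields three partitions:
print(num_partitions([280 * 1024 * 1024], 128 * 1024 * 1024))
```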

The Startup

Medium's largest active publication, followed by +775K people. Follow to join our community.

Ajay Gupta

Written by

Big Data Architect, Apache Spark Specialist, https://www.linkedin.com/in/ajaywlan/
