Spark Parallelism Deep Dive: Writing
This is one of my stories in the Spark deep dive series.
https://medium.com/@somanathsankaran
Spark is a distributed parallel processing framework, and its parallelism is defined by its partitions. Let us discuss Spark partitions in detail.
There are three types of parallelism in Spark:
- Parallelism in reading
- Parallelism in shuffle
- Parallelism in writing
Parallelism in reading is already discussed in a previous blog:
Spark Parallelism Deep Dive-I (Reading)
Parallelism in writing is divided into two parts:
- Controlling files while writing
- Controlling file size while writing
Controlling files while writing
Let us read a sample DataFrame and see how many partitions it has.
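A minimal sketch of this step, assuming a CSV source; the path and read options are placeholders, not the exact dataset from the post:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-parallelism").getOrCreate()

# Read a sample file into a DataFrame (path is a placeholder)
df = spark.read.csv("/data/sample.csv", header=True, inferSchema=True)

# The number of underlying RDD partitions drives write parallelism
print(df.rdd.getNumPartitions())
```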
Now let us write the df in Parquet format. On exploring the written directory, we find that the number of part files is equal to the number of input partitions.
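Roughly like this, assuming the output lands on the local filesystem (on HDFS you would list the directory with hdfs dfs -ls instead):

```python
import os

out_path = "/tmp/df_parquet"  # placeholder output path
df.write.mode("overwrite").parquet(out_path)

# Count the part files Spark produced
part_files = [f for f in os.listdir(out_path) if f.startswith("part-")]
print(len(part_files))  # matches df.rdd.getNumPartitions()
```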
Let us shuffle the data and see how it gets distributed. Spark writes the data into a number of part files equal to the number of partitions of the underlying RDD.
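A sketch of that experiment; some_column stands in for any column in your data, and note that on Spark 3.x adaptive query execution may coalesce the post-shuffle partitions:

```python
# A wide transformation such as groupBy forces a shuffle
shuffled = df.groupBy("some_column").count()

# After the shuffle, the partition count comes from spark.sql.shuffle.partitions
print(shuffled.rdd.getNumPartitions())  # 200 by default

# Writing now produces one part file per shuffled partition
shuffled.write.mode("overwrite").parquet("/tmp/shuffled_parquet")
```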
Let us check whether spark.sql.shuffle.partitions has any effect on writing. I changed spark.sql.shuffle.partitions to a smaller value to check whether anything changed.
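Something like the following reproduces the check, assuming the DataFrame being written was read directly and has no shuffle in its plan:

```python
# Lower the shuffle-partition setting
spark.conf.set("spark.sql.shuffle.partitions", "10")

# Re-write the DataFrame that was read directly (no shuffle in its plan)
df.write.mode("overwrite").parquet("/tmp/df_parquet_again")

# The part-file count is unchanged: it still follows df's RDD partitions,
# because no shuffle happens between the read and the write
print(df.rdd.getNumPartitions())
```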
As the output above shows, the shuffle-partition setting has no direct effect on writing, since the number of files written depends on the RDD partitions of the DataFrame at write time.
Controlling file size while writing
Once we write the data, all data under a single partition is written into a single file, which may result in files of uneven size.
So let us try to control the maximum file size under a single partition so that we get more uniformly sized files (for example, around 128 MB).
Spark has a property that controls the number of records written per file under each RDD partition: maxRecordsPerFile. Since we have 887 records, we will cap the records per file at 200, as shown below.
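A sketch using the post's numbers (887 rows, 200 records per file); maxRecordsPerFile is a DataFrameWriter option available since Spark 2.2, and coalesce(1) reproduces the single input partition from the example:

```python
(df.coalesce(1)                        # single input partition, as in the example
   .write
   .mode("overwrite")
   .option("maxRecordsPerFile", 200)   # cap records per output file
   .parquet("/tmp/df_capped"))

# ceil(887 / 200) = 5, so the single partition yields 5 part files
```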
From the output above, we can see that even though there is only one partition, we get 5 files (ceil(887 / 200) = 5), and we can tune maxRecordsPerFile to target files of around 128 MB, depending on your data.
Conclusion:
To conclude, the number of part files Spark writes is controlled by the RDD partitions of the DataFrame, and we can make use of maxRecordsPerFile to save the data as files of more uniform size.
That’s all for the day!! :)
Github Link: https://github.com/SomanathSankaran/spark_medium/tree/master/spark_csv
Please suggest Spark topics you would like me to cover, and share any suggestions for improving my writing :)
Learn and let others Learn!!