Spark Parallelism Deep Dive: Writing

somanath sankaran
Published in Analytics Vidhya
3 min read · Mar 7, 2020

This is one of the stories in my Spark deep-dive series.

https://medium.com/@somanathsankaran

Spark is a distributed parallel-processing framework, and its parallelism is determined by its partitions. Let us discuss Spark's partitions in detail.

There are three types of parallelism in Spark:

  1. Parallelism in reading
  2. Parallelism in shuffling
  3. Parallelism in writing

Parallelism in reading and shuffling was already discussed in a previous blog.

Parallelism in writing is divided into two parts:

  1. Controlling files while writing
  2. Controlling file size while writing

Controlling files while writing

Let us read a sample dataframe and see how many partitions it has.
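A minimal sketch of this step, assuming an active SparkSession named spark; the file path and header option are illustrative, not the exact dataset from the post:

    # Read a sample dataframe; the path and options are illustrative
    df = (spark.read
          .option("header", "true")
          .csv("/tmp/sample_data.csv"))

    # The partitions of the underlying RDD drive write parallelism
    print(df.rdd.getNumPartitions())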

Now let us write the df in parquet format
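A hedged sketch of the write, using an illustrative output path:

    # Write the dataframe as parquet; the output path is illustrative
    df.write.mode("overwrite").parquet("/tmp/df_parquet")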

On exploring the written directory, we can see that the number of part-files is equal to the number of input partitions.
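One quick way to confirm this, assuming the output went to the local filesystem (on HDFS or S3 you would list the directory with the corresponding tool):

    import glob

    # Count the part-files produced by the write above; the count should
    # match df.rdd.getNumPartitions()
    part_files = glob.glob("/tmp/df_parquet/part-*")
    print(len(part_files))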

Now let us shuffle the data and see how it gets distributed.
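A sketch of one way to trigger a shuffle; the groupBy column name is a placeholder and not necessarily the operation used in the original post:

    from pyspark.sql import functions as F

    # A groupBy aggregation forces a shuffle; the result has
    # spark.sql.shuffle.partitions partitions (200 by default)
    shuffled_df = df.groupBy("some_column").agg(F.count("*").alias("cnt"))
    print(shuffled_df.rdd.getNumPartitions())

    # Writing it produces one part-file per partition
    shuffled_df.write.mode("overwrite").parquet("/tmp/shuffled_parquet")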

So Spark writes the data into as many part-files as there are partitions in the underlying RDD.

Let us check whether spark.sql.shuffle.partitions has any effect on writing.

I changed spark.sql.shuffle.partitions to a smaller value to check whether anything changed.
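A sketch of that check, under the assumption that the dataframe being written has not been reshuffled after the setting was changed (paths are illustrative):

    # Lower the shuffle-partition count; this affects how many partitions
    # a *new* shuffle produces, not the partitions a dataframe already has
    spark.conf.set("spark.sql.shuffle.partitions", "10")

    # Re-writing the same dataframe still yields one file per existing
    # RDD partition, so the part-file count is unchanged
    print(df.rdd.getNumPartitions())
    df.write.mode("overwrite").parquet("/tmp/df_parquet_after_conf")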

From the output above, the shuffle-partition setting by itself has no effect while writing, because the data written depends on the RDD partitions of the dataframe.

Controlling file size while writing

Once we write the data, all the data within a single partition is written to a single file, and these files may be of uneven sizes.

So let us try to control the maximum file size within a single partition, so that the output files have a more uniform size (for example, around 128 MB).

Spark has a property, spark.sql.files.maxRecordsPerFile, that controls the number of records written per file within each RDD partition. So we can make use of maxRecordsPerFile to control how many records go into each output file.

Since we have 887 records, let us set the records per file to 200, as shown below.
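A sketch of that write, using an illustrative output path; the record count of 887 comes from the sample data in the post:

    # Cap the number of records per output file
    spark.conf.set("spark.sql.files.maxRecordsPerFile", 200)

    # Force a single partition and write; with 887 records this yields
    # ceil(887 / 200) = 5 part-files
    (df.repartition(1)
       .write
       .mode("overwrite")
       .parquet("/tmp/df_max_records"))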

From the output above, we can see that even though there is only one partition, we get 5 part-files (ceil(887 / 200) = 5), and we can tune the max-records value to produce files of around 128 MB, depending on your data.

Conclusion:

To conclude, the number of part-files Spark writes is controlled by the RDD partitions of the dataframe, and we can make use of maxRecordsPerFile to save the data as files of more uniform size.

That’s all for the day !! :)

Github Link: https://github.com/SomanathSankaran/spark_medium/tree/master/spark_csv

Please share topics in Spark you would like me to cover, and give me suggestions for improving my writing :)

Learn and let others Learn!!
