Spark Parallelism Deep Dive: Writing
This is one of my stories in the Spark deep dive series.
https://medium.com/@somanathsankaran
Spark is a distributed parallel processing framework, and its parallelism is defined by its partitions. Let us discuss Spark partitions in detail.
There are three types of parallelism in Spark:
- Parallelism in reading
- Parallelism in shuffle
- Parallelism in writing
Parallelism in reading is already discussed in a previous blog:
Spark Parallelism Deep Dive-I (Reading)
Parallelism in writing is divided into two parts:
- Controlling files while writing
- Controlling file size while writing
Controlling files while writing
Let us read a sample DataFrame and see how many partitions it has.
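A minimal sketch of this step, assuming a CSV source; the path and read options are placeholders, not the exact dataset from the post:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-parallelism").getOrCreate()

# Read a sample file into a DataFrame (path is a placeholder)
df = spark.read.csv("/data/sample.csv", header=True, inferSchema=True)

# The number of underlying RDD partitions drives write parallelism
print(df.rdd.getNumPartitions())
```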
Now let us write the df in Parquet format. On exploring the written directory, we find that the number of part files is equal to the number of input partitions.
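Roughly like this, assuming the output lands on the local filesystem (on HDFS you would list the directory with hdfs dfs -ls instead):

```python
import os

out_path = "/tmp/df_parquet"  # placeholder output path
df.write.mode("overwrite").parquet(out_path)

# Count the part files Spark produced
part_files = [f for f in os.listdir(out_path) if f.startswith("part-")]
print(len(part_files))  # matches df.rdd.getNumPartitions()
```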
Let us shuffle the data and see how it gets distributed. Spark writes the data into a number of part files equal to the number of partitions of the underlying RDD.
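A sketch of that experiment; some_column stands in for any column in your data, and note that on Spark 3.x adaptive query execution may coalesce the post-shuffle partitions:

```python
# A wide transformation such as groupBy forces a shuffle
shuffled = df.groupBy("some_column").count()

# After the shuffle, the partition count comes from spark.sql.shuffle.partitions
print(shuffled.rdd.getNumPartitions())  # 200 by default

# Writing now produces one part file per shuffled partition
shuffled.write.mode("overwrite").parquet("/tmp/shuffled_parquet")
```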
Let us check whether spark.sql.shuffle.partitions has any effect on writing. I changed spark.sql.shuffle.partitions to a smaller value to check whether anything changed.
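Something like the following reproduces the check, assuming the DataFrame being written was read directly and has no shuffle in its plan:

```python
# Lower the shuffle-partition setting
spark.conf.set("spark.sql.shuffle.partitions", "10")

# Re-write the DataFrame that was read directly (no shuffle in its plan)
df.write.mode("overwrite").parquet("/tmp/df_parquet_again")

# The part-file count is unchanged: it still follows df's RDD partitions,
# because no shuffle happens between the read and the write
print(df.rdd.getNumPartitions())
```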
As the output above shows, the shuffle-partition setting has no direct effect on writing, since the number of files written depends on the RDD partitions of the DataFrame at write time.
Controlling file size while writing
Once we write the data, all data under a single partition is written into a single file, which may result in files of uneven size.
So let us try to control the maximum file size under a single partition so that we get more uniformly sized files (for example, around 128 MB).
Spark has a property that controls the number of records written per file under each RDD partition: maxRecordsPerFile. Since we have 887 records, we will cap the records per file at 200, as shown below.
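A sketch using the post's numbers (887 rows, 200 records per file); maxRecordsPerFile is a DataFrameWriter option available since Spark 2.2, and coalesce(1) reproduces the single input partition from the example:

```python
(df.coalesce(1)                        # single input partition, as in the example
   .write
   .mode("overwrite")
   .option("maxRecordsPerFile", 200)   # cap records per output file
   .parquet("/tmp/df_capped"))

# ceil(887 / 200) = 5, so the single partition yields 5 part files
```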
From the output above, we can see that even though there is only one partition, we get 5 files (ceil(887 / 200) = 5), and we can tune maxRecordsPerFile to target files of around 128 MB, depending on your data.
Conclusion:
To conclude, the number of part files Spark writes is controlled by the RDD partitions of the DataFrame, and we can make use of maxRecordsPerFile to save the data as files of more uniform size.
That’s all for the day!! :)
Github Link: https://github.com/SomanathSankaran/spark_medium/tree/master/spark_csv
Please suggest Spark topics you would like me to cover, and share any suggestions for improving my writing :)
Learn and let others Learn!!