
Understanding Initial Partitions in Apache Spark

Harshavardhan Achyuta
Aug 1, 2023


Partitions are the fundamental units of data distribution in Spark. A DataFrame or RDD is logically divided into smaller chunks called partitions, and each partition holds a subset of the data. Spark processes partitions in parallel, one task per partition, enabling faster data processing on distributed systems.
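
A quick way to see this in action is to create a DataFrame and inspect how its rows are spread across partitions. Here is a minimal PySpark sketch, assuming a local session with four cores:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("partitions-demo").getOrCreate()

df = spark.range(0, 1_000_000)            # a simple DataFrame of 1M rows
print(df.rdd.getNumPartitions())          # number of partitions Spark created

# Each partition holds a subset of the rows; Spark runs one task per partition.
rows_per_partition = df.rdd.glom().map(len).collect()
print(rows_per_partition)
```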

Initial Partitions for a Single File

partition_size = min(128 MB, file_size / total_number_of_cores)
  • 128 MB: The default value of spark.sql.files.maxPartitionBytes. It ensures that each partition's size does not exceed 128 MB, limiting the size of each task for better performance.
  • file_size: The size of the single file being read into Spark.
  • total_number_of_cores: The total number of CPU cores available in the Spark cluster.
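
To make the arithmetic concrete, here is a small Python sketch of the formula above. The function name and constants are illustrative, not Spark APIs; Spark performs this calculation internally when planning the scan, and its actual formula also folds in spark.sql.files.openCostInBytes.

```python
import math

MAX_PARTITION_BYTES = 128 * 1024 * 1024   # default spark.sql.files.maxPartitionBytes

def initial_partitions(file_size_bytes: int, total_cores: int) -> int:
    # partition_size = min(128 MB, file_size / total_number_of_cores)
    partition_size = min(MAX_PARTITION_BYTES, file_size_bytes / total_cores)
    return math.ceil(file_size_bytes / partition_size)

# A 1 GB file on 16 cores: 1 GB / 16 = 64 MB per partition -> 16 partitions.
print(initial_partitions(1024 * 1024 * 1024, 16))      # 16

# A 2 GB file on 4 cores: 2 GB / 4 = 512 MB, capped at 128 MB -> 16 partitions.
print(initial_partitions(2 * 1024 * 1024 * 1024, 4))   # 16
```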

Initial Partitions for a Single File: Splittable vs. Non-Splittable Formats

  1. Compressed CSV Files (e.g., Gzip, Snappy): CSV files compressed with codecs such as Gzip or standalone Snappy are not splittable; Spark cannot start reading in the middle of the compressed stream, so it cannot process the file in parallel. As a result, when you read such a file into Spark, it is treated as a single partition, no matter how large it is.
  2. Compressed Parquet Files (e.g., Snappy): In contrast, Parquet remains splittable even when compressed, because compression is applied to individual column chunks and pages within each row group rather than to the file as a whole. Since Parquet stores data in a columnar format, Spark can also read only the columns a query needs. A large Snappy-compressed Parquet file can therefore be read in parallel across multiple partitions, with the number of partitions determined by the file size together with settings like spark.sql.files.maxPartitionBytes and the available parallelism.
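
A quick way to observe the difference is to compare partition counts after reading each format. This is a hedged sketch; the file paths are hypothetical stand-ins for a dataset large enough to span several 128 MB partitions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("splittable-demo").getOrCreate()

# Gzip-compressed text cannot be split mid-stream.
csv_df = spark.read.csv("/data/events.csv.gz", header=True)
print(csv_df.rdd.getNumPartitions())      # 1, regardless of file size

# Parquet splits at row-group boundaries even when Snappy-compressed.
parquet_df = spark.read.parquet("/data/events.parquet")
print(parquet_df.rdd.getNumPartitions())  # >1 for a large file
```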

Initial Partitions for Multiple Files

The spark.sql.files.openCostInBytes setting is Spark's estimated cost of opening a file, expressed in bytes. By default, it is set to 4MB. When Spark packs multiple files into partitions, it adds this cost to each file's size, which discourages partitions made up of many small files and shapes the initial partitioning when reading a folder of files.
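
Both of these are real Spark SQL options and can be set when building the session. A minimal sketch showing them at their default values (the app name is arbitrary):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("partition-config-demo")
    .config("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))  # 128 MB (default)
    .config("spark.sql.files.openCostInBytes", str(4 * 1024 * 1024))      # 4 MB (default)
    .getOrCreate()
)
```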

Let’s understand this with examples:

Suppose we have a folder containing three files:

Example 1:

File1.csv: 50 MB
File2.csv: 50 MB
File3.csv: 50 MB

Partition 1: 4 + 50 + 4 + 50 = 108 MB (File1 + File2)
Partition 2: 4 + 50 = 54 MB (File3)

Adding File3 to the first partition would bring it to 162 MB, past the 128 MB cap, so File3 starts a new partition. The number of initial partitions is 2.

Example 2:

File1.csv: 90 MB
File2.csv: 90 MB
File3.csv: 90 MB

Partition 1: 4 + 90 = 94 MB (File1)
Partition 2: 4 + 90 = 94 MB (File2)
Partition 3: 4 + 90 = 94 MB (File3)

Packing any two files together would total 188 MB, well past the 128 MB cap, so each file gets its own partition. The number of initial partitions is 3.
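
The packing behavior in both examples can be reproduced with a short simulation. This is a simplified sketch of Spark's internal file-packing loop, not the actual implementation: it adds the 4 MB open cost to each file and closes a partition when the next file would push it past the 128 MB limit.

```python
MB = 1024 * 1024
MAX_PARTITION_BYTES = 128 * MB
OPEN_COST = 4 * MB   # spark.sql.files.openCostInBytes default

def pack(file_sizes_mb):
    partitions, current, current_bytes = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        cost = size * MB + OPEN_COST
        # Close the current partition if this file would exceed the cap.
        if current and current_bytes + cost > MAX_PARTITION_BYTES:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += cost
    if current:
        partitions.append(current)
    return partitions

print(pack([50, 50, 50]))  # [[50, 50], [50]]   -> 2 initial partitions
print(pack([90, 90, 90]))  # [[90], [90], [90]] -> 3 initial partitions
```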

#BigData #DataEngineer
