Improve S3 write performance with the magic committer in Spark 3

Comparing the traditional and magic S3 committers, with a guide to enabling the magic committer

Rishika Idnani
Towards Data Engineering
4 min read · Mar 17, 2023


In data engineering, object stores like Amazon S3 are ubiquitous, serving as data lakes for storing both raw and transformed data. Consequently, many Spark Extract, Transform & Load (ETL) jobs write data back to S3, which makes speeding up these writes important for overall ETL pipeline efficiency. Moreover, mitigating failures during the write stage is imperative to ensure processes complete and to prevent incomplete or inaccurate data.

With the introduction of Spark 3.2, data engineers can easily take advantage of an enhancement called the magic committer (one of Hadoop's S3A committers). This committer not only improves the performance of Spark ETL jobs that write to S3 but also bolsters their reliability and scalability, marking a significant advancement in data engineering workflows.

The magic committer takes a different approach to writing data to S3 than traditional Hadoop output committers.

Traditional Committer vs Magic Committer

Below is how the traditional committer works: each task writes its output to a _temporary/ directory under the destination path, and at job commit those files are renamed into the final location. On S3, rename is not an atomic metadata operation; it is implemented as a copy followed by a delete.

Workflow of traditional output committer

While the write is in progress, listing the final S3 path shows a _temporary/ prefix:

aws s3 ls s3://<bucket_name>/prefix/
PRE _temporary/

Once the write succeeds, listing the final S3 path shows a zero-byte _SUCCESS file:

aws s3 ls s3://<bucket_name>/prefix/
2023-03-16 00:29:57 0 _SUCCESS

If a failure occurs while that copy operation is in progress, the output data can be left corrupted or incomplete. Besides being unsafe, commit-by-copy is also very slow, since every byte of output is copied a second time within S3.

The magic committer instead writes data directly to the final S3 destination, in parallel, rather than writing files to a temporary location and moving them afterwards. Under the hood it relies on S3 multipart uploads whose completion is deferred until job commit.

Below is how the magic committer works:

Workflow of magic committer

Since data is written directly to the final destination, the overhead of copying files to a temporary location and moving them later is avoided, which improves performance.

Since writes happen in parallel across multiple Spark tasks/executors, it scales horizontally to handle large datasets.

Either all data is written successfully, or none of it becomes visible: output only appears when the pending multipart uploads are completed at job commit. This helps prevent data inconsistencies and corruption.
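
To make this concrete, here is roughly how a task's write is redirected (the paths below are illustrative, not the committer's exact internal layout):

# a task writes under the magic path of the destination:
s3a://<bucket_name>/prefix/__magic/job-<id>/.../part-00000.parquet
# S3AFileSystem intercepts this and starts a multipart upload
# targeting the real destination instead:
s3a://<bucket_name>/prefix/part-00000.parquet
# the pending upload is completed (and becomes visible) only at job commit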

While the write is in progress, listing the final S3 path shows a __magic/ prefix:

aws s3 ls s3://<bucket_name>/prefix/
PRE __magic/

Once the write succeeds, listing the final S3 path shows a _SUCCESS file, this time with a non-zero size:

aws s3 ls s3://<bucket_name>/prefix/
2023-03-17 13:33:20 13756 _SUCCESS

The _SUCCESS file contains a small JSON summary of the commit, which includes "committer" : "magic".
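
A quick way to inspect it (assuming the AWS CLI is available; copying to - streams the object to stdout):

aws s3 cp s3://<bucket_name>/prefix/_SUCCESS -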

How to enable the magic committer in Spark

val spark: SparkSession = SparkSession
  .builder()
  .appName(appName.getOrElse(getClass.getName))
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // use the S3A filesystem implementation for s3a:// paths
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  // enable the magic committer and select it as the S3A committer
  // (prefixed with spark.hadoop. so the setting reaches the Hadoop configuration)
  .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
  .config("spark.hadoop.fs.s3a.committer.name", "magic")
  // let Hadoop resolve committers for s3a:// through the S3A committer factory
  .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a", "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
  // Spark-side bindings that route DataFrame/parquet writes to that committer
  .config("spark.sql.sources.commitProtocolClass", "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class", "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()
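
As a minimal sketch of a write that goes through the committer (df, the bucket, and the prefix below are placeholders):

// any DataFrame written to an s3a:// path now commits via the magic committer
df.write
  .mode("overwrite")
  .parquet("s3a://<bucket_name>/prefix/")

// optional sanity check: confirm the setting reached the Hadoop configuration
println(spark.sparkContext.hadoopConfiguration.get("fs.s3a.committer.name")) // magic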

To determine which committer is being used, check the Spark driver logs.

The log line below confirms that the magic committer is in use:

INFO AbstractS3ACommitterFactory: Using committer magic to output data to s3a://…

Additionally, an ls on the output S3 directory will show the __magic/ prefix while the job is running:

aws s3 ls s3://<bucket_name>/prefix/
PRE __magic/

Once the job finishes, an ls on the output S3 directory will show a non-zero-byte _SUCCESS file, and the content of the file will include "committer" : "magic", as shown earlier.

If the warning below appears in the Spark driver logs, the magic committer is not in use, which usually points to a configuration problem. The warning itself states that the fallback committer is slow and potentially unsafe.

WARN AbstractS3ACommitterFactory: Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe.
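
One common cause is that the Spark-side binding classes are missing: PathOutputCommitProtocol and BindingParquetOutputCommitter ship in Spark's optional hadoop-cloud module, not in core Spark. A sketch of pulling it in with sbt (adjust the version to match your Spark version):

libraryDependencies += "org.apache.spark" %% "spark-hadoop-cloud" % "3.2.0"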

Note: If the ETL pipeline employs a table format like Delta Lake, Apache Iceberg, or Apache Hudi to write data to S3, the choice of S3 committer is largely irrelevant, because these formats manage the commit process themselves.

Conclusion

The magic committer is recommended for ETL pipelines that write to S3, unless the pipeline already uses a table format like Delta, Hudi, or Iceberg. It provides better performance and fault tolerance than traditional committers because it avoids the commit-time copy of task output that a rename-based committer performs on S3.

Thank you for reading! If you found this interesting, follow me on Medium and subscribe for my latest articles. You can also catch me on LinkedIn.
