Improve S3 write performance with the magic committer in Spark 3
A comparison of the traditional and magic S3 committers, and a guide to enabling the magic committer
In data engineering, object stores like Amazon S3 are ubiquitous, serving as data lakes for both raw and transformed data. Consequently, many Spark Extract, Transform & Load (ETL) jobs write data back to S3, which makes speeding up these writes important for overall pipeline efficiency. Moreover, mitigating failures during the write stage is imperative to prevent incomplete or inaccurate data.
With Spark 3.2, data engineers gained an enhancement called the magic committer (one of the S3A committers). This committer not only improves the performance of Spark ETL jobs but also bolsters reliability and scalability, marking a significant advancement in data engineering workflows.
The magic committer takes a different approach to writing data to S3 than traditional Hadoop output committers.
Traditional Committer vs Magic Committer
Below is how the traditional committer works: each task writes its output under a _temporary/ directory, and at job commit those files are renamed into the final destination. On S3, a rename is not a cheap metadata operation but a full copy followed by a delete.
While the write is in progress, listing the final S3 path shows a _temporary/ prefix:
aws s3 ls s3://<bucket_name>/prefix/
PRE _temporary/
When the write has succeeded, listing the final S3 path shows a zero-byte _SUCCESS file:
aws s3 ls s3://<bucket_name>/prefix/
2023-03-16 00:29:57 0 _SUCCESS
Imagine a failure occurring while the copy operation is in progress: the output can be left corrupted. In addition to being unsafe, this rename-based commit can also be very slow.
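The rename-based flow above can be sketched on a local filesystem (a toy illustration, not Hadoop's actual FileOutputCommitter code; the function and file names are made up). It shows the two phases and why the second one hurts on S3, where the "rename" is a full copy of the data:

```python
import shutil, tempfile
from pathlib import Path

def traditional_commit(dest: Path, task_outputs: dict) -> None:
    """Toy model of the rename-based committer: tasks write into _temporary/,
    then job commit moves every file into the final location. On HDFS that
    move is a cheap metadata rename; on S3 it is a copy plus a delete."""
    tmp = dest / "_temporary"
    tmp.mkdir(parents=True)
    for name, data in task_outputs.items():       # task phase: output lands in _temporary/
        (tmp / name).write_bytes(data)
    for f in list(tmp.iterdir()):                 # job commit: move files into place,
        shutil.move(str(f), str(dest / f.name))   # one at a time -- O(data) on S3
    tmp.rmdir()
    (dest / "_SUCCESS").write_bytes(b"")          # zero-byte success marker

dest = Path(tempfile.mkdtemp()) / "prefix"
traditional_commit(dest, {"part-00000": b"a", "part-00001": b"b"})
print(sorted(p.name for p in dest.iterdir()))
# a failure midway through the move loop leaves a partial, corrupt output
```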
The magic committer writes data directly to the final S3 destination in parallel, instead of writing files to a temporary location and then moving them to the final destination.
Below is how the magic committer works: writes under the job's output path are redirected to a __magic/ path that the S3AFileSystem recognizes. The data is uploaded as incomplete S3 multipart uploads against the final file names, and job commit completes those uploads, making all files visible at once.
Since data is written directly to the final destination, the overhead of copying files to a temporary location and moving them later is avoided, which improves performance.
Since writes happen in parallel across multiple Spark tasks/executors, the committer scales horizontally to handle large datasets.
Either all data is committed successfully, or none of it becomes visible. This helps prevent data inconsistencies and corruption.
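The "upload now, publish at commit" idea behind these properties can be modeled locally (a sketch of the multipart-upload mechanism, not the real S3A code; the class and method names are invented for illustration):

```python
import tempfile
from pathlib import Path

class MagicStyleCommit:
    """Toy model of the magic committer: tasks upload bytes destined for the
    final path, but nothing becomes visible until job commit completes the
    pending (multipart-style) uploads."""
    def __init__(self, dest: Path):
        self.dest = dest
        self.pending = {}                         # stands in for open multipart uploads

    def task_write(self, name: str, data: bytes) -> None:
        self.pending[name] = data                 # data uploaded, file not yet visible

    def job_commit(self) -> None:
        self.dest.mkdir(parents=True, exist_ok=True)
        for name, data in self.pending.items():   # "complete" every pending upload:
            (self.dest / name).write_bytes(data)  # on S3, a cheap metadata operation
        (self.dest / "_SUCCESS").write_bytes(b'{"committer": "magic"}')

    def job_abort(self) -> None:
        self.pending.clear()                      # real pending multipart uploads must
                                                  # be aborted or they accrue storage cost

dest = Path(tempfile.mkdtemp()) / "prefix"
job = MagicStyleCommit(dest)
job.task_write("part-00000", b"a")
assert not dest.exists()                          # in-progress data is invisible
job.job_commit()
print(sorted(p.name for p in dest.iterdir()))
```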
While the write is in progress, listing the final S3 path shows a __magic/ prefix:
aws s3 ls s3://<bucket_name>/prefix/
PRE __magic/
When the write has succeeded, listing the final S3 path shows a _SUCCESS file (with a non-zero size):
aws s3 ls s3://<bucket_name>/prefix/
2023-03-17 13:33:20 13756 _SUCCESS
The _SUCCESS file is a JSON manifest that records "committer" : "magic".
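One way to check which committer produced an output is to read that manifest with the AWS CLI and pull out the committer field (the bucket and prefix are placeholders; the offline echo line below uses an abbreviated sample manifest, as real _SUCCESS files carry additional fields):

```shell
# Against a real bucket (placeholders left as-is):
#   aws s3 cp s3://<bucket_name>/prefix/_SUCCESS - \
#     | python3 -c 'import json,sys; print(json.load(sys.stdin)["committer"])'
# Offline illustration with an abbreviated sample manifest:
echo '{"committer": "magic"}' \
  | python3 -c 'import json,sys; print(json.load(sys.stdin)["committer"])'
```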
How to enable the magic committer in Spark
val spark: SparkSession = SparkSession
  .builder()
  .appName(appName.getOrElse(getClass.getName)) // appName: Option[String] defined elsewhere
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  // Route s3a output through the S3A committer factory and select the magic committer.
  // Note the spark.hadoop. prefix: only prefixed keys are copied into the Hadoop configuration.
  .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
  .config("spark.hadoop.fs.s3a.committer.name", "magic")
  .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a", "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
  // Spark-side bindings from the spark-hadoop-cloud module
  .config("spark.sql.sources.commitProtocolClass", "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class", "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()
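The same settings can be supplied at submit time instead of in code. A sketch for spark-submit, assuming a hypothetical entry point my_etl_job.py and a Spark build that includes the spark-hadoop-cloud module:

```shell
spark-submit \
  --conf spark.hadoop.fs.s3a.committer.magic.enabled=true \
  --conf spark.hadoop.fs.s3a.committer.name=magic \
  --conf spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a=org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory \
  --conf spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol \
  --conf spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter \
  my_etl_job.py
```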
To determine which committer is being used, check the Spark driver logs.
The log line below confirms the magic committer is in use:
INFO AbstractS3ACommitterFactory: Using committer magic to output data to s3a://…
Additionally, running ls on the S3 directory while the job is running will show the __magic/ prefix:
aws s3 ls s3://<bucket_name>/prefix/
PRE __magic/
Once the job finishes, running ls on the output S3 directory will show a non-zero-byte _SUCCESS file, and the content of that file will include "committer" : "magic".
If the warning below appears in the Spark driver logs, the magic committer is not in use, which likely signals a configuration issue. The warning itself states that the standard committer is slow and potentially unsafe.
WARN AbstractS3ACommitterFactory: Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe.
Note: If the ETL pipeline employs table formats like Delta, Apache Iceberg, or Apache Hudi to write data to S3, the S3 committer is irrelevant because these table formats manage the commit process differently.
Conclusion
The magic committer is recommended for ETL pipelines that write to S3, unless the pipeline already uses a table format like Delta, Hudi, or Iceberg. It provides better performance and fault tolerance than the traditional committer because it avoids the rename step entirely: files are uploaded once to their final destination instead of being written to a temporary location and copied into place.
Thank you for reading! If you found this interesting, follow me on Medium and subscribe to my latest articles. Also, you can catch me on LinkedIn