Apache Spark and Amazon S3 — Gotchas and best practices
subhojit banerjee

I ran into this at my last job and created this afterwards: https://github.com/EntilZha/spark-s3 (docs: http://spark-s3.entilzha.io/latest/api/ example: https://github.com/EntilZha/spark-s3/blob/master/src/test/scala/io/entilzha/spark/s3/S3ContextSpec.scala)

It implements the method shown in the post, plus balancing files across partitions with the Longest Processing Time (LPT) algorithm. I saw huge performance boosts since the cluster was no longer sitting idle waiting on the driver.
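For readers unfamiliar with the balancing step, here is a minimal sketch in plain Python, assuming it is the classic LPT (longest processing time first) greedy heuristic: sort files by size, largest first, and always hand the next file to the partition with the smallest total so far. The function name `balance_files` and the sample keys are hypothetical and are not taken from spark-s3, which may implement the idea differently.

```python
import heapq


def balance_files(file_sizes, num_partitions):
    """Greedy LPT: place each file, largest first, on the lightest partition.

    file_sizes is a dict of {key: size_in_bytes}; returns a list of key lists,
    one per partition. (Hypothetical helper, not part of spark-s3.)
    """
    # Min-heap of (total_bytes_assigned, partition_index).
    heap = [(0, i) for i in range(num_partitions)]
    heapq.heapify(heap)
    partitions = [[] for _ in range(num_partitions)]

    # Visit files from largest to smallest.
    by_size_desc = sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for key, size in by_size_desc:
        total, idx = heapq.heappop(heap)  # currently lightest partition
        partitions[idx].append(key)
        heapq.heappush(heap, (total + size, idx))
    return partitions


# Illustrative input only: five made-up keys with made-up sizes.
sizes = {"a": 8, "b": 7, "c": 6, "d": 5, "e": 4}
result = balance_files(sizes, 2)
```

Each inner list could then become one Spark partition, so that no single task ends up with most of the bytes while the others finish early.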
