Improving Your Apache Spark Application Performance

Simple Tips and Tricks to Improve the Performance of Your Spark Applications

Robert Sanders
Software Sanders

Apache Spark has quickly become one of the most heavily used processing engines in the Big Data space since it became a Top-Level Apache Project in February of 2014. Not only can it run in a variety of environments (locally, on a standalone Spark cluster, on Apache Mesos, on YARN, etc.), but it also provides a number of libraries that can help you solve just about any problem on Hadoop, including SQL queries, streaming, and machine learning, to name a few, all running on an optimized execution engine.

Over the years, we at Clairvoyant have built many data pipelines with Apache Spark, including both batch and streaming pipelines. You can find more information here. After building so many pipelines, we've found some simple ways to improve the performance of Spark applications. Here are a few of those tips and tricks:

Use DataFrames instead of RDDs

Instead of using the RDD API

val rdd = sc.textFile("/path/to/file.txt") // returns an RDD[String]

Use the DataFrames API

val df = spark.read.textFile("/path/to/file.txt") // returns a Dataset[String]
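
Why does this help? DataFrame operations are expressed as column expressions that Spark's Catalyst optimizer can analyze and rewrite, and they execute on the Tungsten engine with whole-stage code generation, whereas the function literals inside RDD transformations are opaque to Spark and cannot be optimized. As a minimal sketch of the difference, here is a word count written in both APIs, assuming a spark-shell session (so spark and sc are already in scope) and reusing the illustrative file path from above:

import org.apache.spark.sql.functions.{col, explode, split}

// RDD version: each transformation is a black-box function,
// so Spark cannot optimize across the steps.
val wordCountsRdd = sc.textFile("/path/to/file.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)

// DataFrame version: the same logic expressed as column
// expressions, which Catalyst can analyze and optimize.
val wordCountsDf = spark.read.text("/path/to/file.txt") // one "value" column
  .select(explode(split(col("value"), "\\s+")).as("word"))
  .groupBy("word")
  .count()

wordCountsDf.show()

The two jobs produce the same counts, but only the DataFrame version gives Spark a query plan it can optimize, which on non-trivial inputs usually translates into shorter runtimes.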
