Apache Spark: how to choose the correct data abstraction?

Federico Sala
Jun 26, 2019

Apache Spark offers three different APIs for handling sets of data: RDD, DataFrame, and Dataset. Picking the right data abstraction is fundamental to speeding up job execution and to taking advantage of Spark's internal optimizations. Moreover, choosing a suitable data structure speeds up the development process itself.

Photo by Tianyi Ma on Unsplash