Spark Series #2 : Evolution of Spark

Aruna Das
6 min read · Aug 10, 2023
Image Source — Created by Aruna Das

History of Spark

Apache Spark originated as a research project at UC Berkeley's AMPLab, focused on big data analytics. It introduced a programming model that supports a broader range of applications than MapReduce while retaining automatic fault tolerance.

MapReduce is inefficient for certain types of applications that involve low-latency data sharing across parallel operations. These applications are common in analytics and include iterative algorithms (e.g., machine learning and graph algorithms like PageRank), interactive data mining, and streaming applications with aggregate state maintenance.

Conventional MapReduce and DAG engines are suboptimal for these applications because they follow an acyclic data flow: each job reads data from stable storage, performs its computation, and writes the results back to replicated storage, incurring significant data loading and writing costs at every step.

Spark addresses these challenges with its resilient distributed datasets (RDDs) abstraction. RDDs can be stored in memory without replication, rebuilding lost data on failure using lineage information. This approach enables Spark to outperform existing models by up to 100x in multi-pass analytics.
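To make the lineage idea concrete, here is a toy sketch (plain Python, not Spark's actual API; the `LineageDataset` class and its methods are invented for illustration). A dataset records how it was derived, its parent and the transformation applied, instead of replicating its data; if the in-memory copy is lost, it is rebuilt by replaying the lineage chain:

```python
# Toy illustration of lineage-based recovery, the idea behind RDDs.
# NOT Spark's API: LineageDataset is a hypothetical class for this sketch.

class LineageDataset:
    def __init__(self, data=None, parent=None, transform=None):
        self._cache = list(data) if data is not None else None
        self._parent = parent        # upstream dataset in the lineage chain
        self._transform = transform  # function that derives this dataset

    def map(self, fn):
        # Lazily record the transformation; nothing is computed yet.
        return LineageDataset(
            parent=self,
            transform=lambda rows: [fn(r) for r in rows],
        )

    def collect(self):
        # If the cached partition is missing, rebuild it from lineage.
        if self._cache is None:
            self._cache = self._transform(self._parent.collect())
        return self._cache

base = LineageDataset(data=[1, 2, 3])
squares = base.map(lambda x: x * x)
print(squares.collect())  # [1, 4, 9]

# Simulate losing the in-memory partition: lineage recomputes it,
# with no replicated copy needed.
squares._cache = None
print(squares.collect())  # [1, 4, 9]
```

This is the trade-off RDDs make: instead of paying for replication up front on every write, Spark pays the (usually rare) cost of recomputation only when a partition is actually lost.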

The initial version of Spark supported only batch processing. However, due to early…


Aruna Das

Fremont, CA | Senior Data Engineer | Interested in ML, AI