Spark Series #6: Data Ingestion From Files

Aruna Das
5 min read · Nov 18, 2023
https://www.sothebys.com/en/buy/auction/2022/art-impressionniste-et-moderne-day-sale/le-cabinet-anthropomorphique

In Spark, data is transient: because Spark processes data in memory, a dataset is available only while the Spark session is active, and once the session ends everything held in memory is lost. Anything you want to work with therefore has to be loaded from an external source first. Whether you are building an ETL pipeline or conducting analysis, data ingestion is the first crucial step.

Aruna Das

Fremont, CA | Senior Data Engineer | Interested in ML, AI