Data Engineering concepts: Part 6, Batch processing with Spark

Mudra Patel
8 min read · Feb 25, 2024

This is Part 6 of my 10-part series on Data Engineering concepts. In this part, we will discuss batch processing with Spark.

Contents:
1. Batch processing
2. Apache Hadoop
3. Apache Spark
4. Use cases

Here is the link to my previous part on Data Orchestration:

What is batch processing?

Batch processing is a method of executing high-volume, repetitive data jobs in groups (batches) at set intervals, such as hourly or daily. It is usually run at off-peak times, like the end of the day or overnight. If a task requires minimal human intervention and is more efficient to run in bulk than record by record, such as backups, filtering, and sorting, we can batch-process the collected data and gather insights on a regular schedule.
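To make the idea concrete, here is a minimal sketch in plain Python (not Spark) of a batch job: records collected over a period are split into fixed-size batches and each batch is processed as a unit. The function and data names here are illustrative, not from any particular library.

```python
def process_in_batches(records, batch_size, handler):
    """Split records into fixed-size batches and apply a handler to each batch."""
    results = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        results.append(handler(batch))
    return results

# Example: a nightly job summing transaction amounts per batch of 3
transactions = [120, 45, 300, 80, 60, 210, 15]
batch_totals = process_in_batches(transactions, batch_size=3, handler=sum)
print(batch_totals)  # [465, 350, 15]
```

A scheduler or orchestrator (as covered in the previous part) would trigger a job like this at the chosen interval.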

Apache Hadoop

Hadoop is an open-source framework used to store big data and process it in parallel across a cluster of machines. There are 3 main components of Hadoop:

1. HDFS (Hadoop Distributed File System) — distributed storage
2. MapReduce — parallel processing engine
3. YARN — cluster resource management
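The parallel processing model Hadoop popularized is MapReduce: each node maps its chunk of data into key-value pairs, the framework shuffles pairs by key, and reducers aggregate each group. Below is a minimal single-process Python sketch of that pattern (a word count), purely for illustration; real Hadoop runs these phases distributed across the cluster.

```python
from collections import defaultdict
from itertools import chain

def map_phase(lines):
    # Map: each "node" turns its chunk of lines into (word, 1) pairs
    return [(word, 1) for line in lines for word in line.split()]

def shuffle(pairs):
    # Shuffle: group pairs by key, as Hadoop does between map and reduce
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate the values for each key (here, sum word counts)
    return {word: sum(counts) for word, counts in grouped.items()}

# Two chunks standing in for data blocks on two different nodes
chunks = [["big data batch"], ["batch processing", "big batch"]]
mapped = chain.from_iterable(map_phase(c) for c in chunks)
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'big': 2, 'data': 1, 'batch': 3, 'processing': 1}
```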
