Apache Spark vs Hadoop: Can I Replace One with the Other?

Sephinreji
3 min read · Dec 7, 2022


Both Spark and Hadoop are open-source distributed frameworks for storing and processing big data. In addition to what Hadoop offers, Spark provides in-memory storage and processing. Both use the concept of divide and conquer: instead of processing data on a single server, the data is processed across multiple servers (worker nodes or data nodes), and the final result is consolidated. However, the two differ in how they approach data processing.
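The divide-and-conquer idea can be sketched in a few lines of plain Python. This is only a toy illustration (the "workers" here are ordinary function calls, not real cluster nodes), but the shape is the same: split the data, process each slice independently, consolidate the partial results.

```python
# Toy illustration of divide and conquer: split the data into chunks,
# let each "worker" process its chunk independently, then consolidate.

def split_into_chunks(data, num_workers):
    """Divide the data across the workers (data nodes)."""
    return [data[i::num_workers] for i in range(num_workers)]

def worker_sum(chunk):
    """Each worker processes only its own slice of the data."""
    return sum(chunk)

def distributed_sum(data, num_workers=3):
    chunks = split_into_chunks(list(data), num_workers)
    partial_results = [worker_sum(chunk) for chunk in chunks]
    # Consolidate the partial results into the final answer.
    return sum(partial_results)

print(distributed_sum(range(1, 101)))  # → 5050
```

In a real cluster, each chunk would live on a different machine and the partial results would travel over the network, but the logic is this simple at heart.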

Hadoop:

Hadoop is an open-source, highly fault-tolerant distributed ecosystem mainly focused on batch processing of data. Hadoop comprises four main components:

HDFS (Hadoop Distributed File System): A distributed file system for storing large volumes of data.

MapReduce: The processing layer of the Hadoop ecosystem, which processes the data stored in HDFS in a distributed manner. MapReduce consists of a Map phase and a Reduce phase. Map (roughly what Spark calls a transformation) transforms the data on the various distributed nodes; Reduce (roughly what Spark calls an action) collects the output of the Map phase from each node and combines it into the final result.
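The classic example of this model is word count. The sketch below is a single-process toy, not real Hadoop code, but it follows the same three steps a MapReduce job performs: Map emits key/value pairs, the framework shuffles them by key, and Reduce combines each group into a final value.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit (word, 1) pairs — runs independently on each node."""
    return [(word, 1) for word in document.split()]

def shuffle(mapped_pairs):
    """Shuffle: group values by key, as the framework does between
    the Map and Reduce phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into the final result."""
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data big results", "big clusters"]
mapped = chain.from_iterable(map_phase(doc) for doc in documents)
counts = reduce_phase(shuffle(mapped))
print(counts["big"])  # → 3
```

On a real cluster, the Map calls run on different nodes and the shuffle moves data across the network; here everything happens in one process purely to show the data flow.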

YARN (Yet Another Resource Negotiator): YARN handles resource assignment and management, distributing the available resources based on demand.

Hadoop Common: The core libraries of Hadoop that support the other Hadoop modules.

The major difference between Hadoop and Spark is the way they process data.

In Hadoop, each intermediate result is written back to hard disk (HDFS/S3) between processing stages.

Spark:

Spark is a distributed data processing framework. Spark is not a replacement for Hadoop as a whole; it is a replacement for MapReduce within Hadoop. The MapReduce model reads from and writes to disk at every stage, which hurts performance. Spark supports in-memory (RAM) processing of data: it reduces the number of read/write cycles to disk by keeping intermediate data in memory, hence the faster processing speed.
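The difference can be made concrete with a toy two-stage pipeline (double every value, then sum the values above 10). The first version imitates MapReduce by writing the intermediate result to a file and reading it back; the second imitates Spark by chaining lazy, in-memory stages. Both compute the same answer; only the number of disk round-trips differs.

```python
import json
import os
import tempfile

data = list(range(1, 11))

def disk_pipeline(values):
    """MapReduce-style: the intermediate result goes to disk between stages."""
    stage1 = [v * 2 for v in values]
    fd, path = tempfile.mkstemp()
    os.close(fd)
    with open(path, "w") as f:       # stage 1 writes its output to "HDFS"
        json.dump(stage1, f)
    with open(path) as f:            # stage 2 reads it back from disk
        stage1 = json.load(f)
    os.remove(path)
    return sum(v for v in stage1 if v > 10)

def memory_pipeline(values):
    """Spark-style: intermediate data stays in memory; stages chain lazily."""
    doubled = (v * 2 for v in values)           # transformation (lazy)
    filtered = (v for v in doubled if v > 10)   # transformation (lazy)
    return sum(filtered)                        # action triggers execution

print(disk_pipeline(data), memory_pipeline(data))  # → 80 80
```

Real jobs have many more stages and far larger data, which is why eliminating those per-stage disk round-trips gives Spark its speed advantage.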

Spark supports different types of big data workloads like:

  • Batch processing.
  • Real-time stream processing.
  • Machine learning.
  • Graph computation.
  • Interactive queries.
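To see how batch and streaming workloads differ in shape, here is a toy comparison in plain Python (not Spark APIs): a batch computation sees the whole dataset at once, while a streaming computation processes events as they arrive, keeping only a bounded window of state and emitting a result per event.

```python
from collections import deque

def batch_average(readings):
    """Batch: the full dataset is available before computing."""
    return sum(readings) / len(readings)

def streaming_windowed_average(readings, window_size=3):
    """Streaming: readings arrive one at a time; keep only a bounded
    window of recent state and emit a running result per event."""
    window = deque(maxlen=window_size)
    averages = []
    for reading in readings:
        window.append(reading)
        averages.append(sum(window) / len(window))
    return averages

readings = [10, 20, 30, 40]
print(batch_average(readings))               # → 25.0
print(streaming_windowed_average(readings))  # → [10.0, 15.0, 20.0, 30.0]
```

Spark's batch and Structured Streaming APIs let you express both styles over distributed data with a largely shared programming model.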

Conclusion:

Hadoop and Spark are both open-source distributed frameworks for big data storage and processing. Spark adds in-memory storage and processing on top of what Hadoop offers, and neither is a full replacement for the other.
