Why is Spark at least 10 times faster than Hadoop?

Sanjay Singh
Published in Sanrusha
4 min read · Dec 22, 2021


Hadoop distributed processing vs Spark parallel processing

Photo by Abbas Tehrani on Unsplash

Data is everywhere, and big data has become the de facto standard for storing and processing large datasets. Everything from Netflix to the digitization of simple manual forms has become possible because of big data. Big data has made data storage and processing not only faster but also cheaper, yet it still needs continuous improvement. In this article, I am going to take you through different types of data processing: distributed, parallel, and in-memory.

Distributed Data Processing

Hadoop follows the approach of distributing a task among different computers. It uses the memory, storage, and processors of those computers to execute the task. This is called distributed processing.

If you give a task to a distributed processing application, it will divide that task among the computers available in the cluster, and those computers will perform their parts of the task simultaneously.

And that’s how the overall task is distributed among different computers, and all the computers contribute towards processing that task.
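To make the idea concrete, here is a minimal sketch in Python of the divide, process, and combine pattern described above. It is not Hadoop code: a thread pool on a single machine stands in for the computers of a cluster, and the function names (`split_into_chunks`, `count_words`, `distributed_word_count`) and the sample lines are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, num_workers):
    """Divide the input among the workers, round-robin."""
    return [lines[i::num_workers] for i in range(num_workers)]


def count_words(chunk):
    """Work done independently by one worker on its share of the data."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def distributed_word_count(lines, num_workers=3):
    # Assumption: each pool thread plays the role of one computer in a cluster.
    chunks = split_into_chunks(lines, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Every worker processes its own chunk at the same time as the others.
        partial_counts = list(pool.map(count_words, chunks))
    # Combine the partial results into the final answer.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


lines = [
    "spark and hadoop",
    "hadoop stores data",
    "spark processes data",
    "data is everywhere",
]
result = distributed_word_count(lines)
print(result.most_common(3))
```

In a real cluster the chunks would live on different machines and the partial results would travel over the network, but the shape of the work is the same: split the task, let every worker handle its part, then merge the results.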

The image below shows how distributed data processing works.
