Hadoop or Spark? The Most Trending Big Data Debate

Bharvi Vyas
Published in DataFlair · 5 min read · May 16, 2019

Are you torn between the top Big Data technologies, Hadoop and Spark, wondering which one will be better for your career? If yes, then you have come to the right place.

Here I will give you a point-by-point comparison of Hadoop and Spark. I promise that after completing this tutorial, you will have no confusion about choosing the right technology.

Let’s start with an introduction to Hadoop and Spark -

Introduction

Hadoop is an open-source framework used for batch data processing. MapReduce is the part of the Hadoop framework responsible for processing large amounts of data. For example, in the well-known 100 TB sort benchmark, Hadoop MapReduce needed 2,100 nodes and 72 minutes to sort the data. That is much slower than Spark.

Spark is also open source, and it runs up to 100x faster than Hadoop in memory and about 10x faster on disk. For example, Spark sorted 100 TB with only 206 nodes in 23 minutes in the same benchmark, which means it is much faster than Hadoop.

You can master both Big Data frameworks through the online Hadoop and Spark training provided by DataFlair.

Hadoop vs Spark

Many experts say Spark is better than Hadoop; let’s find out the truth behind this claim. Here is a feature-wise comparison of Hadoop and Spark.

Speed

Hadoop uses batch processing, which means we can collect data in the morning and process it at night.

Spark can work in real time: it collects data and processes it immediately, without needing to wait or store it first.
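To see what real-time processing looks like in practice, here is a minimal Structured Streaming sketch in Scala. It is only an illustration, assuming Spark 2.x or later and a test socket source on localhost:9999 (started, for example, with `nc -lk 9999`); it keeps a running word count as lines arrive instead of waiting for a nightly batch:

```scala
import org.apache.spark.sql.SparkSession

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("StreamingWordCount")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Read lines as they arrive instead of waiting for a nightly batch.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Split lines into words and keep a running count per word.
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Print the updated counts to the console whenever new data arrives.
    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```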

You can learn everything about Spark through the FREE DataFlair Spark Tutorial Series.

Easy Management

In the Hadoop ecosystem, we depend on separate systems for different tasks, such as Storm for streaming, Giraph for graph processing, and Impala for interactive SQL. This is not the case with Spark.

Spark covers batch, interactive, iterative, and streaming workloads, all in the same cluster.
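As a small illustration, here is a sketch of one SparkSession handling both a batch load and an interactive SQL query, with no separate system needed. It assumes a local Spark install and a hypothetical sales.csv file:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("UnifiedEngine")
  .master("local[*]")
  .getOrCreate()

// Batch: load a CSV file as a DataFrame.
val sales = spark.read.option("header", "true").csv("sales.csv")

// Interactive: query the same data with SQL on the same cluster,
// with no separate system like Impala required.
sales.createOrReplaceTempView("sales")
spark.sql("SELECT COUNT(*) AS total FROM sales").show()
```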

Language Developed

Hadoop MapReduce is developed in Java, so you need to know how to write Java code to create a MapReduce program, and even a simple task can take hundreds of lines.

Spark is developed in Scala, so deep Java knowledge is not required. For example, we can execute specific tasks with a line or two of code using Spark SQL, Scala, or Python, with no need to write many lines of Java code.
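Here is the classic word count as a short Spark job in Scala; the input.txt and output paths are hypothetical placeholders. The point is that the whole job is a few lines, while the equivalent Java MapReduce program needs separate Mapper, Reducer, and driver classes:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("WordCount")
  .master("local[*]")
  .getOrCreate()

// The entire job is one short chain of transformations; the equivalent
// Java MapReduce program typically runs to a hundred lines or more.
val counts = spark.sparkContext
  .textFile("input.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("output")
```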

Difficulty

With Hadoop, every task has to be handled in hundreds of lines of code.

Spark uses RDDs (Resilient Distributed Datasets), and each RDD can be customized for a specific task according to the business logic.

Applications

Hadoop MapReduce supports batch data processing only; it does not serve other kinds of applications.

Spark supports streaming (Spark Streaming), interactive queries (Spark SQL), SparkR, graph processing (GraphX), and machine learning (MLlib). That means Spark is a comprehensive framework.

Latency

Hadoop MapReduce has to read from and write to disk at each stage, which is costly and takes a long time, so it shows high latency.

Spark can process data in memory, which makes it much faster, so it shows low latency.

Data Processing

Hadoop cannot process the data interactively.

Spark can process data interactively.

Caching

MapReduce cannot cache data for future use, so it is not as fast as Spark.

Spark can cache data for further processing, which makes it faster than Hadoop.
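Here is a small sketch of how caching works in practice (the server.log path is a hypothetical placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("CachingDemo")
  .master("local[*]")
  .getOrCreate()

// Keep the filtered RDD in memory after it is first computed.
val logs = spark.sparkContext.textFile("server.log")
val errors = logs.filter(_.contains("ERROR")).cache()

// The first action reads from disk and fills the cache ...
println(errors.count())
// ... later actions reuse the in-memory data instead of re-reading the file.
println(errors.filter(_.contains("timeout")).count())
```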

Hardware Requirements

Hadoop MapReduce can run on commodity hardware.

Spark cannot run on low-end commodity hardware; it needs mid-to-high-end machines, because in-memory processing demands a large amount of RAM.

Machine Learning

Hadoop requires an external machine learning tool such as Apache Mahout.

Spark has its own machine learning library, MLlib.
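As a quick illustration, here is a minimal MLlib sketch that trains a logistic regression model on a toy DataFrame; the data is made up purely for demonstration:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("MLlibDemo")
  .master("local[*]")
  .getOrCreate()

// A toy training set: (label, feature vector).
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

// Train a logistic regression model directly on the cluster.
val model = new LogisticRegression().setMaxIter(10).fit(training)
println(s"Coefficients: ${model.coefficients}")
```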

Recommended Reading — Comparison of Hive and Pig

In the end, there is no outright winner between Spark and Hadoop, because each technology has its own unique features.

You should check out the Hadoop and Spark use cases below to understand the concepts in a better way -

Spark Use cases

Apache Spark at eBay

eBay is one of the world’s leading e-commerce sites and has dominated the industry for a long time. eBay uses Apache Spark to provide offers to targeted customers based on their earlier shopping behavior, leaving no stone unturned in enhancing the customer experience. This not only lets eBay proactively offer customers what they might need, but also helps the company handle customers’ time on the site efficiently and smoothly.

Apache Spark at TripAdvisor

TripAdvisor, a mammoth of an organization in the travel industry, helps users plan their perfect trips, whether official or personal, and has used the capabilities of Apache Spark to speed up its customer recommendations. Spark queries thousands of providers for rates on a specific route, helping users identify the best services at the best available prices from the plethora of service providers.

It is the right time to update your skills and take your Big Data career to new heights. Spark is the most in-demand skill, so start learning it with industry experts.

Hadoop Use Cases

Hadoop at British Telecom (BT)

British Telecom (BT) uses a Cloudera enterprise data hub powered by Apache Hadoop to cut down on engineer call-outs. By analyzing the characteristics of its network, BT can identify whether slow internet speeds are caused by a network or customer issue. It can then evaluate whether an engineer would be likely to repair the problem. The Cloudera hub provides a unified view of customer data stored in a Hadoop environment. BT earned a return on investment of between 200 and 250 percent within one year of the deployment. BT has also used the hub to create new services such as “View My Engineer”, an SMS and email alerting system that lets customers track the location of engineers. The company now wants to use predictive analytics to improve vehicle maintenance.

Hadoop at CERN

The Large Hadron Collider in Switzerland is one of the largest and most powerful machines in the world. It is equipped with around 150 million sensors, producing a petabyte of data every second, and the volume of data delivered is growing all the time. CERN researcher Manuel Martin Marquez said: “This data has been scaling in terms of amount and complexity, and the role we have is to serve to these scaleable requirements, so we run a Hadoop cluster.” In the Large Hadron Collider, particles are accelerated to very high speeds and made to collide with each other. The millions of sensors inside capture data from each experiment, and this captured data is then stored and analyzed in a Hadoop cluster. By using Hadoop, CERN has been able to limit hardware costs and maintenance complexity.

It is the right time to start learning, and if you want to become a master in Big Data, you need to learn both Hadoop and Spark from industry experts.

Clear on the concepts? Tell me your thoughts in the comments.

Don’t forget to give me your 👏 !

Related Topics —

How Hadoop Works

How Spark Works
