Hadoop vs Spark: Which one to use?

Harinath Selvaraj
Published in coding&stuff
5 min read · Sep 18, 2018

I’ve seen quite a lot of people ask me the same question.

Hadoop or Spark? Which one do I need to learn or adopt for my business?

What am I gonna study?

Of course, it depends on various parameters; however, at the end of this article I will tell you which one is easier to pick up and which can be better in the long run.

We will be comparing these two frameworks on the following parameters. Let’s start with

Performance

Spark — Spark is fast because it processes data in memory. It also uses the disk for data that doesn’t fit into memory. Spark’s in-memory processing delivers near real-time analytics, which makes it suitable for credit card processing systems, machine learning, security analytics and processing data from IoT sensors.

Hadoop — Hadoop was originally designed to continuously gather data from multiple sources, without worrying about the type of data, and to store it across a distributed environment; MapReduce then handles the processing in batches. MapReduce was never built for real-time processing. The main idea behind YARN is parallel processing over a distributed data set.
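
To make the batch model concrete, here is a minimal word-count sketch in the Hadoop Streaming style. It is only an illustration: the script name and the idea of passing it as both mapper and reducer to the streaming jar are assumptions for the example, not a fixed recipe.

```python
#!/usr/bin/env python3
# Minimal word-count sketch in the Hadoop Streaming style (illustrative).
# Assumption: this one script is supplied as both the mapper ("wordcount.py map")
# and the reducer ("wordcount.py reduce") of a streaming job.
import sys

def mapper():
    # Emit "word<TAB>1" for every word on stdin; Hadoop sorts these
    # pairs by key before handing them to the reducer.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Lines arrive grouped by word; sum the counts for each word.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word != current and current is not None:
            print(f"{current}\t{count}")
            count = 0
        current = word
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```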

The problem with comparing the two is that they process data in different ways, and the ideas behind their development are also divergent.

Ease of Use

Spark — Spark comes with user-friendly APIs for Scala, Java, Python and Spark SQL. Spark SQL is very similar to SQL, so it becomes very easy for SQL developers to learn. Spark also provides an interactive UI where users can run queries and perform other actions and get immediate feedback.
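
As a quick illustration of how familiar this feels to a SQL developer, here is a minimal PySpark sketch; the file name sales.csv and its columns (region, amount) are assumptions made up for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

# Load a CSV into a DataFrame and register it as a temporary SQL view.
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)
sales.createOrReplaceTempView("sales")

# Anyone comfortable with SQL can query the data directly.
top_regions = spark.sql("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
""")
top_regions.show()
```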

Hadoop — You can easily ingest data either through the shell or by integrating Hadoop with tools like Sqoop and Flume. YARN is just a processing framework that can be integrated with tools like Hive and Pig for analytics. Hive is a data warehousing component which performs reading, writing and managing of large datasets in a distributed environment using an SQL-like interface.

Both of them have their own ways to make themselves user friendly.

Costs

Hadoop and Spark are both Apache open source projects, so there is no cost for the software; cost is only associated with the infrastructure. Both products are designed to run on commodity hardware with a low TCO (Total Cost of Ownership). Now you might be wondering how they differ, since cost-wise they look the same. Storage and processing in Hadoop is disk-based, and Hadoop uses standard amounts of memory, so it requires a lot of disk space as well as fast transfer speeds. Hadoop also requires multiple systems to distribute the disk I/O. Apache Spark, due to its in-memory processing, requires a lot of memory, but it can deal with a standard speed and amount of disk. Disk space is a relatively inexpensive commodity, and since Spark doesn’t use disk I/O for processing, it instead needs a large amount of RAM to execute everything in memory. So a Spark system incurs more cost per machine but reduces the number of required systems: it needs fewer systems that each cost more.

Hadoop costs less overall, but Spark is advisable for companies with smaller data volumes

Data Processing

There are two types of data processing —

a) Batch processing

b) Stream (or online) processing

Batch processing has been crucial to the big data world. In simple terms, batch processing involves working with high volumes of data collected over a period of time. The data is first collected, then processed, and the results are produced at a later stage. It is an efficient way of processing large, static data sets, and it is generally used for archived datasets, e.g. calculating the average income of a country for the current year.
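
Sticking with that example, here is a minimal batch-style PySpark sketch that computes an average income over an archived dataset; the file name incomes_2018.csv, its income column and the output path are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-average-income").getOrCreate()

# Read the full, static dataset in one go — batch processing works on data
# that has already been collected, not on data arriving live.
incomes = spark.read.csv("incomes_2018.csv", header=True, inferSchema=True)

# Compute a single aggregate over the whole dataset and write the result out.
avg_income = incomes.agg(F.avg("income").alias("avg_income"))
avg_income.write.mode("overwrite").csv("output/avg_income_2018")
```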

Stream processing is the current trend in the big data world. The need of the hour is speed and real-time information, which is exactly what stream processing delivers. Batch processing doesn’t allow businesses to react quickly to changing needs in real time, and stream processing has seen rapid growth in demand as a result.
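
For contrast, here is a minimal Spark Structured Streaming sketch that keeps a running word count over lines arriving on a local socket; the host/port source is an assumption for illustration (in practice this would more likely be Kafka or a similar source).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-word-count").getOrCreate()

# The stream is treated as an unbounded table: new rows keep arriving.
lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Update the counts continuously as new events come in, instead of
# waiting for an end-of-day batch job.
words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```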

YARN is a batch-processing framework for Hadoop. When we submit a job to YARN, it reads data from the cluster, performs the operation and writes the results back to the cluster. It then reads the data again for the next operation, writes the results back to the cluster, and so on. Spark, on the other hand, is designed to cover a wide range of workloads: batch applications, iterative algorithms, interactive queries and streaming.

Fault Tolerance

Hadoop and Spark both provide fault tolerance, but they follow different approaches. Unlike Hadoop, Apache Spark is built around RDDs (Resilient Distributed Datasets), which can refer to any dataset present in an external storage system. RDDs can persist data in memory across operations, which makes future actions up to 10 times faster. If an RDD is lost, it is recomputed using the original transformations that produced it.
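
Here is a minimal sketch of what that looks like in practice, assuming a plain text file called events.txt: the filtered RDD is persisted in memory, and if a cached partition is lost Spark rebuilds it from the recorded transformations (its lineage) rather than from a replicated copy.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-fault-tolerance").getOrCreate()
sc = spark.sparkContext

raw = sc.textFile("events.txt")                     # lineage step 1: read
errors = raw.filter(lambda line: "ERROR" in line)   # lineage step 2: filter

# persist() keeps the filtered partitions in memory across actions,
# so later actions reuse them instead of re-reading the file.
errors.persist()

print(errors.count())                                     # computes and caches
print(errors.filter(lambda l: "timeout" in l).count())    # reuses cached data
```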

There has been tough competition between Hadoop and Spark. Read on to find out their use cases and which technology is trending now!

When you’re processing archival data, YARN processes it in parallel on different data nodes and gathers the results from each node manager. In cases where instant results are not required, Hadoop is a good and economical solution for batch processing. However, it is incapable of processing data in real time.

Spark is used for real-time data analysis, i.e. data that comes from real-time event streams arriving at the rate of millions of events per second. Spark’s strength lies in its ability to support streaming of data along with distributed processing, and it claims to process data up to 100 times faster than MapReduce in memory, or 10 times faster when using disks.

Spark is also used for graph processing; it contains a graph computation library called GraphX which simplifies our life.

Finally, Spark is used for iterative machine learning algorithms. Almost all machine learning algorithms work iteratively, and iterative algorithms involve I/O bottlenecks in their MapReduce implementations, because MapReduce uses coarse-grained tasks that are too heavy for them. Spark caches the intermediate dataset after each iteration and runs multiple iterations on the cached dataset, which reduces the I/O overhead and executes the algorithm faster in a fault-tolerant manner.
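
To see why that caching matters, here is a minimal sketch of an iterative loop (a toy one-parameter gradient descent) over a cached RDD; the points.txt file of "x,y" lines, the learning rate and the iteration count are all illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-cache").getOrCreate()
sc = spark.sparkContext

# Parse "x,y" lines once and keep the result in memory for every iteration.
points = (sc.textFile("points.txt")
          .map(lambda line: tuple(float(v) for v in line.split(",")))
          .cache())

w, lr = 0.0, 0.01
for _ in range(10):
    # Each pass scans the cached data instead of re-reading and re-parsing
    # it from disk, which is exactly where MapReduce pays heavy I/O.
    grad = points.map(lambda p, w=w: (w * p[0] - p[1]) * p[0]).mean()
    w -= lr * grad

print("fitted weight:", w)
```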

Spark cannot completely replace Hadoop, but the demand for Spark is currently at an all-time HIGH! 😃
