What’s really going on between Spark and Hadoop?

Alivia Smith
It’s a data world
5 min read · Dec 23, 2015

I wrote previously about the story behind distributed data processing, and how Spark changes the game. The conclusion of those articles is basically that Spark is awesome because:

  • it can process super large volumes of data
  • super fast,
  • in a resilient manner,
  • and it is really good for Machine Learning on very large datasets (which is what we all want to be doing these days).

I originally published this article over here btw.

So why wouldn’t everyone just quit everything and use Spark?!

Spark is fast. Yes. Overall, it’s faster than other technology stacks, for sure. However, it isn’t optimal for all the use cases out there. It’s slower than MapReduce for certain specific operations, and each query takes some time to set up. You should always choose your system based on what you’re going to be doing with it, so as not to waste resources.

If you just want to run SQL-type queries on distributed data, you can go with Impala, for example; no need for Spark. (The good news is you can use BOTH in Data Science Studio! I know, crazy that I’m mentioning it.) Spark is useful when you want to do lots of different things with your data. It’s like a Swiss Army knife. One that’s really good at Machine Learning.
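To give you an idea of that versatility, here’s a purely hypothetical PySpark sketch (the dataset path and the user_id, spend, and converted columns are all made up) that ingests distributed data, queries it with SQL, and trains a model on the result, all in one job:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("swiss-army-knife").getOrCreate()

# One engine, several jobs: read distributed data (hypothetical path)...
events = spark.read.parquet("hdfs:///events")
events.createOrReplaceTempView("events")

# ...query it with plain SQL...
per_user = spark.sql("""
    SELECT user_id,
           COUNT(*)       AS n_events,
           MAX(spend)     AS max_spend,
           MAX(converted) AS converted
    FROM events
    GROUP BY user_id
""")

# ...and train a Machine Learning model on the result, in the same job.
assembled = VectorAssembler(
    inputCols=["n_events", "max_spend"], outputCol="features").transform(per_user)
model = LogisticRegression(labelCol="converted").fit(assembled)
```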

Do you really really need Spark?

Let’s face it: you probably don’t need all of those super fancy features that come with a Hadoop and Spark stack. I’m guessing you’re not Google or Amazon, and you don’t have that much data to work with. (If you are Amazon or Google, everyone knows you’ve worked out your stack at this point. Don’t brag.)

Yes, everyone these days is sorta obsessed with Spark Streaming. And they should be; it’s amazing! I mean, you can process your data as it comes in to get live analytics, and even train algorithms on the fly. That’s crazy! It’s VERY useful for monitoring the performance of an energy grid or for cybersecurity, for instance. But be honest: most of the time, you don’t need to get insights from your data LIVE.
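For a taste of what that looks like, here is about the smallest possible Spark Streaming sketch. It assumes a socket feed on localhost:9999 (purely hypothetical) and simply counts incoming events in ten-second batches:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "LiveAnalytics")
ssc = StreamingContext(sc, 10)  # process the incoming stream in 10-second batches

# Hypothetical live event source: one event per line on a socket.
events = ssc.socketTextStream("localhost", 9999)

# A "live analytics" metric: how many events arrived in each batch.
events.count().pprint()

ssc.start()
ssc.awaitTermination()
```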

Basically, unless you work in trading, fraud detection, or network monitoring, or you’re a search or recommendation engine, you do not need “real-time” big data infrastructures. Analysing your customers’ behaviours each evening to send them really cool notifications or emails, or to recommend products to them, is already pretty awesome. And you can do that without Spark.

There’s a chance you don’t even need HDFS

Also, Spark is generally faster than MapReduce, yes. But if your data can fit in a single server’s RAM, it will always be faster not to distribute it, and to process it all in one place. In that case, using non-distributed Machine Learning libraries like scikit-learn and running queries in a SQL database is still the most efficient way to go. More often than not, your data is not as big as you think and can actually fit in RAM.
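To make the comparison concrete, here is a minimal non-distributed sketch, assuming your data fits in a CSV that one machine can load; the file name and the churned column are made up:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical dataset small enough for a single server's RAM.
df = pd.read_csv("customers.csv")
X = df.drop(columns=["churned"])
y = df["churned"]

# Train and evaluate entirely in memory: no cluster involved.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier().fit(X_train, y_train)
print(model.score(X_test, y_test))
```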

Truth bomb: big data is a little bit of a fraud. (But you didn’t hear it from me.)

In many cases, even if you are storing massive amounts of data, you can aggregate that distributed data in a smaller dataset that will not need to be distributed.

Efficient predictive modelling is based on enriching your data and choosing the right features, not on pure volume. If you do that right, you save a lot of storage space and can very often process your aggregated data in RAM, without having to go through distributed systems.

This is important because today, non-distributed algorithm libraries in Python or R are much more developed and offer more possibilities than distributed projects. And even if you want to predict churn on a really large volume of logs stored on a cluster, for instance, after cleaning your data you’ll actually be processing one line per customer, which is already a much smaller volume of data. So unless you have billions of users (which I gladly wish you), you can run a very large variety of algorithms once that dataset fits in memory.
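Here is a hypothetical sketch of that pattern: let the cluster collapse a huge log table into one feature row per customer, then pull the small result into local memory for modelling. The table path and column names are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("churn-features").getOrCreate()

# Billions of raw log lines, too big for one machine (hypothetical path).
logs = spark.read.parquet("hdfs:///logs/events")

# Collapse them into one feature row per customer.
features = logs.groupBy("customer_id").agg(
    F.count("*").alias("n_events"),
    F.countDistinct("session_id").alias("n_sessions"),
    F.max("timestamp").alias("last_seen"),
)

# One row per customer usually fits in RAM: hand it to pandas/scikit-learn.
customers = features.toPandas()
```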

Open source

Moreover, these are still the early days of Apache Spark. It’s an open source project, so it moves fast, but that also means the support infrastructure and security around it aren’t very mature yet. Also, a lot changes with each new version of Spark, and you may have to edit a lot of your code. That means you’re going to have to invest a lot in maintenance, and you can’t necessarily have an application that easily works across multiple Spark versions.

Hadoop versus Spark is BS

So in the end, Hadoop and Spark aren’t actually enemies at all, and they work together very well. Spark was originally designed as an improvement over MapReduce that runs on HDFS, and it loses some of its effectiveness when deployed otherwise. You can even think of Spark as a feature to add to a Hadoop infrastructure to enable machine learning, stream processing, and interactive data mining.

The two products are open-source projects anyway, which makes it less relevant to talk about competition. The companies making money on these infrastructures today offer both, and they advise customers on which one to use based on their needs. And if you think about it, a bit of competition is actually a good thing for open source projects, since it makes them that much more dynamic!

Let’s conclude

The truth is, Spark is ahead of its time. A lot of companies don’t need it now, and it’s still a young technology. But aren’t you glad to know that when you scale and get billions of users, there will be great solutions out there so you can keep doing what you want with your data? That’s why it’s important for data software to integrate it, to make it easier to use, and to help it interact with all the other technologies out there. And, oh wait, that’s what we do!
