What’s the buzz about Spark?
We have all heard about Spark and about how big data has become a thing lately, though it has been around for more than a decade. Recently, after some intriguing sessions organised by Qubole on big data engines, and driven by my own curiosity about all the buzz, I have come to understand (or would like to believe so) how Hadoop, Spark, Presto and Hive are similar as well as how they differ.
Spark is an open source big data engine built with speed, developer friendliness and analytics in mind. If you have a large dataset that needs to be processed with low latency and you have failed to achieve this with MapReduce, Spark is the way to go! It is the result of the hard work and dedication of professors at the University of California, Berkeley, with the support of some students. Back in 2006, when Hadoop was released, the technology world first experienced the concept of an enterprise parallel processing engine. Though Hadoop had not yet evolved to its best form, it opened the gates for hungry souls to dive deep and explore the field.
Of all a developer's delights, the most alluring is a library with attractive APIs that make life easier. Spark was built with this in mind and has an extensive set of APIs that simplify operations on a distributed system. Every developer tends to have an attachment to some language, one they would happily go to the grave with. While many big data engines primarily target Java, Spark offers good support for Scala, Java, Python and R. Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
The secret sauce that makes Spark different is its way of processing data in memory. While engines like Hadoop MapReduce write the intermediate data of each step to disk, Spark keeps these results in RDDs (Resilient Distributed Datasets), which may or may not be on disk. Spark makes that decision by taking multiple factors into account, such as how much memory is available.
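A toy sketch in plain Python can make this idea concrete. It is not Spark's API: `ToyDataset` and its methods are hypothetical names invented for illustration, and everything runs on a single machine. It only mimics two behaviours described above, namely that transformations record a recipe instead of running immediately, and that an intermediate result can be kept in memory and reused instead of being recomputed or written out between steps.

```python
from typing import Callable, Iterable, List, Optional


class ToyDataset:
    """Hypothetical stand-in for an RDD: stores a recipe for the data, not the data."""

    def __init__(self, compute: Callable[[], Iterable]):
        self._compute = compute              # recipe that produces the elements
        self._cache: Optional[List] = None   # in-memory copy, filled only if requested
        self._should_cache = False
        self.compute_count = 0               # how many times the recipe actually ran

    def map(self, fn: Callable) -> "ToyDataset":
        # Nothing runs here; we only describe the next step.
        return ToyDataset(lambda: (fn(x) for x in self._materialize()))

    def filter(self, pred: Callable) -> "ToyDataset":
        return ToyDataset(lambda: (x for x in self._materialize() if pred(x)))

    def cache(self) -> "ToyDataset":
        # Ask for the result to be kept in memory once it has been computed.
        self._should_cache = True
        return self

    def _materialize(self) -> List:
        if self._cache is not None:
            return self._cache               # reuse the in-memory intermediate result
        self.compute_count += 1
        data = list(self._compute())
        if self._should_cache:
            self._cache = data
        return data

    def collect(self) -> List:
        # An "action": this is what finally triggers the computation.
        return list(self._materialize())

    def count(self) -> int:
        return len(self._materialize())


numbers = ToyDataset(lambda: range(1, 11))
squares = numbers.map(lambda x: x * x).cache()
evens = squares.filter(lambda x: x % 2 == 0)

ran_before_action = squares.compute_count   # still 0: nothing has executed yet
result = evens.collect()                     # first action computes squares once
total = squares.count()                      # second action reuses the cached squares

print(result)
print(total, squares.compute_count)
```

Without the `cache()` call, the second action would rerun the `map` step from scratch. Real Spark adds partitioning across machines, fault recovery and spilling to disk on top of this basic idea.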
The authors of Spark feel that even though it is one of the fastest engines, there is still scope for improvement. The next blog in this series discusses the scope for development in Spark 1.6 and the problems that were overcome in Spark 2.0.
I have intentionally kept this article simple and generic, and will try to dive deeper into the technical aspects from the next one. This is my first blog, so please correct me if you spot any technical or grammatical errors.