Big Data Analytics Generations

M Haseeb Asif
Big Data Processing
2 min read · Jan 31, 2020

Big data has many definitions, but generally it refers to data sets that don't fit into your system's memory. Big data is often broken down into the Vs: volume, velocity, value, variety and veracity. Each of these Vs describes a particular property of the data.

Data is now being generated at a volume and pace that conventional data processing systems cannot cope with. In response, the community has gone through several generations of big data processing tools to match the needs of the industry.

Apache Hadoop is considered the first generation. It introduced the concept of distributed processing with Map and Reduce: MapReduce reads data from disk, performs an operation on it and writes the result back to disk.
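
To give a flavour of the model, here is a minimal word-count sketch written in Scala against Hadoop's Java MapReduce API. It is only an illustration: the class and variable names are my own, and a real job would also need a driver that configures and submits the `Job`.

```scala
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}

// Map phase: read each input line and emit (word, 1) pairs.
// The intermediate output is written to local disk before the shuffle.
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      context.write(word, one)
    }
  }
}

// Reduce phase: sum the counts for each word and write the result back to disk.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it  = values.iterator()
    while (it.hasNext) sum += it.next().get()
    context.write(key, new IntWritable(sum))
  }
}
```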

A couple of frameworks introduced improvements over Hadoop and are considered the second generation. Apache Tez is one of them: it added interactive processing in addition to batch processing.

Generations of Big Data Analytics

Apache Spark is the third-generation tool for big data analytics. It is a general-purpose engine for both batch and stream (micro-batch, near-real-time) processing. It also supports iterative processing, which is especially helpful for machine learning and similar workloads. The RDD (Resilient Distributed Dataset) is at the core of Spark. It is much faster than MapReduce because it computes in memory and optimizes its processing. Spark provides high-level APIs in Java, Scala and Python.
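
For a feel of the RDD API, here is a minimal word-count sketch in Scala. The input path and the local master are placeholders for illustration; a real job would point at a cluster and real data.

```scala
import org.apache.spark.sql.SparkSession

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("rdd-word-count")
      .master("local[*]")
      .getOrCreate()

    val lines = spark.sparkContext.textFile("input.txt") // placeholder path
    val counts = lines
      .flatMap(_.split("\\s+"))   // split lines into words
      .map(word => (word, 1))
      .reduceByKey(_ + _)         // shuffle + reduce, executed across in-memory stages

    counts.cache()                // intermediate results can be cached for iterative jobs
    counts.take(10).foreach(println)

    spark.stop()
  }
}
```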

Apache Flink is the next-generation, or fourth-generation, stream processing framework. It is open source and provides true real-time stream processing, in contrast to the micro-batch approach of earlier frameworks. It was conceptually designed to run everything in a streaming fashion, even batch workloads. It also supports iterative processing and stateful stream processing. DataSet and DataStream are its core APIs, with several higher-level APIs built on top. Flink provides Java, Scala, Python and SQL APIs for developing applications.
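
Below is a minimal sketch of a streaming word count using Flink's Scala DataStream API. The socket host and port are placeholders (you could feed it with `nc -lk 9999`); a real pipeline would typically read from a source such as Kafka.

```scala
import org.apache.flink.streaming.api.scala._

object FlinkWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Unbounded source for illustration: read lines from a local socket.
    val lines = env.socketTextStream("localhost", 9999)

    val counts = lines
      .flatMap(_.toLowerCase.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(_._1)   // partition the stream by word
      .sum(1)        // stateful, continuously updated running count per word

    counts.print()
    env.execute("streaming-word-count")
  }
}
```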

