Spark is widely used in today's cluster computing arena for its fast and scalable application model. Compared with traditional MapReduce, it provides in-memory computing, which can be around 10x faster for some workloads, and near real-time data processing with Spark Streaming.

RDDs

Spark provides an immutable, distributed collection object called a Resilient Distributed Dataset (RDD). RDDs are one of the core components of Spark. Each RDD is split into multiple partitions, and those partitions are processed on multiple nodes of the cluster.
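To make partitioning and immutability concrete, here is a minimal plain-Python sketch of the idea. It is an analogy only and does not use Spark: the helper names `split_into_partitions` and `map_partitions` are hypothetical, and everything runs in one process rather than across a cluster. In PySpark, the rough equivalents would be `sc.parallelize(data, numSlices)` to create a partitioned RDD and `rdd.map(...)` to derive a new one.

```python
from typing import Callable, List, Sequence, Tuple


def split_into_partitions(data: Sequence[int], num_partitions: int) -> List[Tuple[int, ...]]:
    """Split data into contiguous chunks; tuples stand in for read-only partitions."""
    size, extra = divmod(len(data), num_partitions)
    partitions = []
    start = 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element each.
        end = start + size + (1 if i < extra else 0)
        partitions.append(tuple(data[start:end]))
        start = end
    return partitions


def map_partitions(
    partitions: List[Tuple[int, ...]], fn: Callable[[int], int]
) -> List[Tuple[int, ...]]:
    """Apply fn to each partition independently and return NEW partitions.

    The input is never modified, mirroring how a transformation on an
    immutable RDD yields a new RDD instead of changing the old one.
    """
    return [tuple(fn(x) for x in part) for part in partitions]


numbers = list(range(1, 11))
partitions = split_into_partitions(numbers, 3)

# Each partition could be handled by a different node; here it is a simple loop.
squared = map_partitions(partitions, lambda x: x * x)

# Combine the per-partition results into a single value.
total = sum(sum(part) for part in squared)

print(partitions)
print(squared)
print(total)
```

Because each partition is processed without touching the others, the work can be spread across machines, and a lost partition can be recomputed from its source data without redoing the rest.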

Spark contains four main integrated components:

  1. Spark Core (RDDs)
  2. Spark SQL
  3. Spark Streaming
  4. MLlib (machine learning library)

Instead…

Ramesh Ganesan

Data Engineering enthusiast | Big Data | Python | SQL
