Apache Spark — Bits and Bytes

A Sparking Framework for Big Data Processing

Prasadi Abeywardana
5 min read · Jun 28, 2020

Apache Spark is a technology that holds a significant place in the overall big data tech stack as well as in the Hadoop ecosystem. There is a high chance that even a beginner in big data engineering is familiar with the word “Spark”, given the spark it has created within big data communities. But do you know the fundamentals behind it and its capabilities? To be honest, I did not when I started, even though I was very familiar with the term “Apache Spark”.

Apache Spark is a unified analytics engine for large-scale data processing.

This definition makes it clear that Apache Spark is neither a programming language nor a simple library, but a sophisticated framework that generalizes much of the functionality required for big data processing. Let’s have a look at the capabilities Apache Spark offers big data engineers.

Components of Apache Spark

The above image depicts Apache Spark in a nutshell. Core functionalities of Apache Spark can be categorized as follows.

  • MLlib: This is Apache Spark’s machine learning library. Like the rest of Spark, anything implemented with this library is scalable and fits easily into Hadoop-based workflows, and Hadoop data sources like HDFS and HBase can be used with Spark MLlib directly. MLlib also cooperates well with other pieces of Spark like Streaming and SQL. Because Spark excels at iterative computation, MLlib runs its algorithms fast. MLlib contains a wide range of ML algorithms and utility functions (for data pre-processing, transformation, parameter tuning, etc.).
  • Spark Streaming: This is Apache Spark’s stream processing library. Like MLlib, it can be easily integrated with other pieces of the Hadoop ecosystem and used to build interactive applications on streaming data. Spark Streaming provides an out-of-the-box fault-tolerance mechanism. Streaming jobs can be written in Java, Python or Scala using Spark’s language-integrated API (https://spark.apache.org/docs/latest/streaming-programming-guide.html); a minimal word-count sketch appears after this list.
  • Spark SQL: This is Apache Spark’s library for working with structured data. Spark SQL allows us to mix SQL queries with Spark programs written in Java, Python, R or Scala, using either standard SQL or the DataFrame API (https://spark.apache.org/docs/latest/sql-programming-guide.html). It integrates with a variety of data sources like HDFS, Hive, JSON, JDBC, Parquet, etc.
  • GraphX: This is Apache Spark’s library for graph processing. It allows fast processing of graphs and unifies different graph operations, such as exploratory analysis, ETL and iterative graph computation, into a single system.
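
To give a taste of these APIs, here is a minimal Spark Streaming sketch in Python, modelled on the network word-count example in the streaming programming guide linked above; the host, port and one-second batch interval are arbitrary choices for illustration.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")  # two local threads: one receives, one processes
ssc = StreamingContext(sc, 1)                      # 1-second micro-batches

# Count words arriving on a TCP socket (e.g. fed by `nc -lk 9999`).
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()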

If you pay attention to Spark Core, you can see that programming can be done in the language of your choice: Python, Java, R or Scala. Many prefer to write Spark programs in Python or Java, but Scala can never be forgotten when Spark is mentioned. Spark itself is written in Scala, and Scala brings out many of Spark’s advantages, so many production-level systems consume Spark through Scala. So I would say it is a good idea to start learning Scala; you will see how similar it is to Python once you start.

Now let’s dig a little deeper into some fundamental building blocks of Apache Spark.

Resilient Distributed Datasets (RDDs)

Spark is centered around a concept called RDDs. RDDs are fault-tolerant collections of elements that can be operated on in parallel. Each Spark program has a SparkContext (https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html), which is the main entry point to Spark functionality, and RDDs are created through this SparkContext.
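
The snippets that follow assume a SparkContext named sc already exists. The PySpark shell creates one for you automatically; in a standalone script you would build it yourself, roughly as in this sketch (the app name and master URL are illustrative):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("rdd-basics").setMaster("local[*]")  # run locally, using all cores
sc = SparkContext(conf=conf)

With a SparkContext in hand, RDDs can be created in two ways.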

  1. Parallelizing an existing collection: This can be achieved using SparkContext’s parallelize method. Once a normal collection is converted into an RDD, its elements can be operated on in parallel.
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)  # distData is now a distributed RDD built from the Python list

  2. Referencing an external data set: RDDs can be created from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat. Text file RDDs can be created using SparkContext’s textFile method.

distFile = sc.textFile("data.txt")  # an RDD where each element is one line of the file

Once an RDD is created, various parallel operations can be applied to its elements. RDD operations can be split into two categories.

  1. Transformations: These operations create a new dataset from an existing one; applying a transformation to an RDD produces a new RDD. map, flatMap, filter, distinct and sample are some of the transformation operations.
  2. Actions: These operations return a value to the driver program after performing a computation on an RDD. collect, count, reduce, take and top are some examples of actions. A small sketch combining both follows this list.
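
As a quick sketch (reusing the sc from above and some made-up numbers), the filter and distinct calls below are transformations that only record what should happen, while collect and count are actions that actually trigger the computation:

nums = sc.parallelize([1, 2, 3, 4, 5, 4, 3])

evens = nums.filter(lambda x: x % 2 == 0)   # transformation: nothing runs yet
unique = evens.distinct()                   # transformation: still lazy
print(unique.collect())                     # action: triggers a job, e.g. [2, 4]
print(nums.count())                         # action: 7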

RDD transformations are lazily executed, meaning they do not compute any result right away. Instead, Spark builds and remembers a graph of transformations, and the actual computation happens only when an action is called. This design enhances the overall efficiency of Spark.

This is how a transformation and an action are applied to an RDD.

lines = sc.textFile("data.txt")                       # RDD of the file's lines
lineLengths = lines.map(lambda s: len(s))             # transformation: length of each line (lazy)
totalLength = lineLengths.reduce(lambda a, b: a + b)  # action: sums the lengths and returns the result

More details about RDDs can be found here: https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds

DataFrames

DataFrames are an extension of RDDs that makes them easier to work with. A DataFrame can be described as a dataset organized into named columns; in other words, it contains row objects with a schema, which is why we can run SQL queries on DataFrames. A DataFrame is conceptually equivalent to a table in a relational database, and it can be constructed from a wide range of data sources such as structured data files, tables in Hive, external databases, or existing RDDs.

From Spark 2.0 onwards, SparkSession is the recommended entry point, taking the place of SparkContext in application code; it can create both RDDs (through the SparkContext it wraps) and DataFrames.
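
Here is a minimal sketch that puts this together, assuming a local Spark installation and a hypothetical people.json file with name and age fields:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("dataframe-basics") \
    .getOrCreate()

# Build a DataFrame from a structured file (people.json is a made-up example).
df = spark.read.json("people.json")
df.printSchema()

# Register the DataFrame as a temporary view and mix plain SQL into the program.
df.createOrReplaceTempView("people")
adults = spark.sql("SELECT name FROM people WHERE age >= 18")
adults.show()

# The underlying SparkContext is still available for RDD work.
sc = spark.sparkContext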

Even though Apache Spark is almost a decade old, it is continuously evolving and widely embraced by industry giants. Apache Spark 3.0 was released this month with a lot of enhancements, including improved support for deep learning and better Kubernetes support. Undoubtedly it will remain a giant in the world of Big Data for a long time, so it is always a good idea to add it to your skill set. Why don’t you start now? https://spark.apache.org/


Prasadi Abeywardana

A software engineer by profession. A Technical Lead. Data Science Enthusiast. Expert in Big Data Privacy. A writer by hobby.