A Beginner's Guide to Big Data with Apache Spark

Bagavathy Priya · Published in Analytics Vidhya · Sep 12, 2020 · 4 min read

In this blog, we are going to look at what big data is, what Apache Spark is, and why we need it.

Big data:

As the name says, big data refers to a massive amount of data that cannot be stored and processed with a traditional computer system. But how do we decide that a dataset is big data? It depends on three components:

  • Volume
  • Velocity
  • Variety

Volume: refers to the size of the data, e.g., 10 GB, 1 TB, and so on.

Velocity: the rate at which the data is produced, e.g., 1 KB/microsecond, 1 MB/s.

Variety: refers to the type of the data, e.g., structured, unstructured, semi-structured.

Based on these components, we can decide whether a dataset counts as big data or not.

For example: if you want to attach a 50 MB document to an email but the attachment limit is 25 MB, then relative to that system, the 50 MB file is "big data."

So, in real-world usage, there are situations where we simply cannot read and process some data with our traditional computer systems. To solve these kinds of problems, Google released the game-changing paper describing the "Google File System" (GFS). After that, the Hadoop Distributed File System (HDFS) and many more came to the industry.

You can view the Google File System paper at the link below:

https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf

Okay, let's see what the role of Apache Spark is here.

Apache Spark:

According to the definition on the Apache Spark documentation page, "Apache Spark™ is a unified analytics engine for large-scale data processing."

In other words: "Apache Spark is an open-source data processing engine, used to process large amounts of data for data analysis, machine learning, etc., in a distributed manner."

It is not a programming language itself, but it provides APIs for the Python, Java, Scala, and R programming languages.

But why should I go for Apache Spark? Here are a few reasons.

Why do we need it?

In-memory execution: Apache Spark performs computations in RAM rather than on disk, which makes it much faster than disk-based frameworks such as Hadoop MapReduce.

Lazy evaluation: in Spark, execution does not start until an action is performed; data is computed only when the driver actually needs it, which allows Spark to optimize the whole processing pipeline.

Architecture:

The architecture of the Spark system looks like this:

[Figure: Spark master-slave architecture]

Spark works on a master-slave architecture, where the driver is called the master and the workers are called slaves. Simply put, the master does not perform any computation itself; it maintains, tracks, and instructs the slaves, which do the actual processing. The central piece is the SparkContext (sc), an instance of the SparkContext class, which drives the program. Spark Core's fundamental data structure is the Resilient Distributed Dataset (RDD), a dataset partitioned across the nodes of a cluster and processed in parallel, which speeds up computation. Working with RDDs, we can do two types of operations: transformations and actions. Let's see when and why we use each.
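As a minimal sketch, here is one way to get an sc to work with (assuming PySpark is installed locally; the application name "beginner-demo" is made up for this example):

from pyspark import SparkConf, SparkContext

# Run Spark locally, using all available cores
conf = SparkConf().setAppName("beginner-demo").setMaster("local[*]")
sc = SparkContext(conf=conf)

(In the PySpark shell or a managed notebook environment, sc is usually created for you already.)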

Transformation and action

Transformation:

A transformation does not perform any computation on the data; it just describes how to derive one dataset from another. The result of a transformation on an RDD is another RDD.

Example:

Creating an RDD:

a = sc.parallelize(range(100))

It creates an RDD containing [0, 1, 2, …, 99] (note that range(100) stops at 99).

b = a.map(lambda x: x * 2)

This does not perform any computation; it only records the transformation x => 2x on the dataset, which results in another RDD.

Action:

As the name says, an action actually performs the computation and returns a result.

Example:

a.count() => It returns the number of elements in the RDD (here, 100).

b.collect() => It returns all the elements of the RDD as a list.
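Putting it together in one small runnable sketch (reusing the sc created earlier):

a = sc.parallelize(range(100))   # RDD of the numbers 0..99
b = a.map(lambda x: x * 2)       # lazy: nothing is computed yet

print(a.count())         # 100; an action, so the job actually runs
print(b.collect()[:5])   # [0, 2, 4, 6, 8]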

Wrap up….

This is how we code using PySpark, the Python API for Apache Spark. You can see an example of building a linear regression model with PySpark in the GitHub link below.

https://github.com/bagavathypriyanavaneethan/BigData/raw/master/Pyspark/LinearRegwithSpark.ipynb
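To give a flavor of what the notebook covers, here is a minimal sketch of linear regression with PySpark's ML API. This is not the notebook's exact code; the toy data and column names are made up for illustration.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("linreg-demo").getOrCreate()

# Toy data roughly following y = 2x + 1 (made-up values)
df = spark.createDataFrame(
    [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 9.0)],
    ["x", "y"])

# MLlib expects all features packed into a single vector column
assembler = VectorAssembler(inputCols=["x"], outputCol="features")
train = assembler.transform(df)

model = LinearRegression(featuresCol="features", labelCol="y").fit(train)
print(model.coefficients, model.intercept)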

Thanks for reading.

It would be great to hear your thoughts on the blog in the comments.

Regards,

Bagavathy Priya N

Reference: https://spark.apache.org/
