Getting started with Apache Spark - Part 1

Published in Big Apps Tech, Apr 7, 2020

Big Apps is a consulting company specialized in Big Data. Our work mostly involves manipulating huge volumes of data, which requires a wide range of techniques and tools.

Today, I’ll give a general overview of Apache Spark, show how to install it on Windows and Linux, and then present some of the Spark APIs, illustrated with simple examples.

What is Spark?

Apache Spark is an open source framework that allows developers to process large volumes of data in a distributed way (cluster computing). It exposes a functional API that lets us apply transformations such as map and reduce, as well as aggregations.

Scala and Spark installation:

1- Windows:

Scala:

1. Download Scala from: https://www.scala-lang.org/download/

2. Define the environment variables as follows:

SCALA_HOME => C:\Program Files (x86)\scala
PATH => C:\Program Files (x86)\scala\bin

Spark:

1. Download Spark from: https://spark.apache.org/downloads.html

2. Download the Windows utilities (winutils.exe) from: https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1

3. Copy winutils.exe into the bin folder of the directory that HADOOP_HOME points to, then define the environment variables as follows:

HADOOP_HOME => D:\spark-2.0.1-bin-hadoop2.7
SPARK_HOME => D:\spark-2.0.1-bin-hadoop2.7
PATH => D:\spark-2.0.1-bin-hadoop2.7\bin

2- Linux (Ubuntu):

Scala:

sudo apt-get update
sudo apt-get install scala

Spark:

Using the Linux command line:

wget http://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz
tar xvf spark-2.2.0-bin-hadoop2.7.tgz
sudo mv spark-2.2.0-bin-hadoop2.7 /usr/local/spark

Set the environment variables:

export PATH=$PATH:/usr/local/spark/bin

SparkSession:

The SparkSession is the entry point of Spark, so in order to use Spark we first create a Spark session as follows:
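A minimal sketch of this step, assuming the spark-sql library is on the classpath (for example in a Scala script or in spark-shell's :paste mode). The application name "getting-started" is a placeholder of mine, not a required value:

```scala
import org.apache.spark.sql.SparkSession

// Build a new session, or reuse the one that already exists in this JVM.
val spark: SparkSession = SparkSession
  .builder()
  .appName("getting-started") // placeholder name, shown in the Spark UI
  .master("local")            // run locally; a cluster URL would go here instead
  .getOrCreate()
```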

where builder() returns the Builder used to construct the Spark session,

master: defines the Spark master URL to connect to; in our example it is “local”, which runs Spark locally.

getOrCreate: returns the existing SparkSession if there is one, or creates a new one.

Spark APIs

1. RDD (Resilient Distributed Dataset): a fault-tolerant data structure that is distributed across the Spark executors. We can apply transformations and computations to it in memory, across the cluster. An RDD can also be cached and partitioned manually: caching is beneficial when the same RDD is used several times, and manual partitioning is important to balance the partitions properly.
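As a rough illustration of these ideas, here is a small sketch that partitions, caches and reduces an RDD. The numbers and the partition counts are arbitrary choices for the example, and it assumes spark-sql is on the classpath:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-sketch").master("local").getOrCreate()
val sc = spark.sparkContext

// Distribute a local collection over 4 partitions (manual partitioning).
val numbers = sc.parallelize(1 to 10, numSlices = 4)

// Cache the transformed RDD because two actions below reuse it.
val squares = numbers.map(n => n * n).cache()

val total = squares.reduce(_ + _) // first action: computes and caches the squares
val count = squares.count()       // second action: reads the cached partitions

// Rebalance the data into 2 partitions; repartition triggers a shuffle.
val rebalanced = squares.repartition(2)

println(s"sum of squares = $total, count = $count, partitions = ${rebalanced.getNumPartitions}")
```

Without the call to cache(), the map would be recomputed for each of the two actions.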

2. DataFrame: in a DataFrame, the data is organized in named columns and we manipulate rows. A DataFrame can be seen as an improvement over the RDD because, in addition to the data, it carries a schema.

It can process data in different formats (Avro, CSV, JSON) and from different storage systems (HDFS, Hive tables, MySQL).
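To give an idea of what this looks like, the sketch below writes a tiny DataFrame as CSV and as JSON into a temporary directory and reads both back. The sample rows and folder names are made up for the example. I leave Avro out because it needs the separate spark-avro module:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("formats-sketch").master("local").getOrCreate()
import spark.implicits._

// Hypothetical sample data, used only for illustration.
val people = Seq(("Alice", "Paris"), ("Bob", "Lyon")).toDF("name", "city")

// Write the same rows as CSV and as JSON into a temporary directory.
val dir = Files.createTempDirectory("spark-formats").toString
people.write.option("header", "true").csv(s"$dir/people_csv")
people.write.json(s"$dir/people_json")

// Read both back: the DataFrame API is the same whatever the source format.
val fromCsv = spark.read.option("header", "true").csv(s"$dir/people_csv")
val fromJson = spark.read.json(s"$dir/people_json")

fromCsv.printSchema()
fromJson.printSchema()
```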

Optimization is handled by the Catalyst optimizer, which uses a tree transformation framework in four phases:

1. Analyzing a logical plan to resolve references.
2. Logical plan optimization.
3. Physical planning.
4. Code generation to compile parts of the query to Java bytecode.
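These phases can be observed on any query. Here is a small sketch, with hypothetical data and column names, that prints the plans produced by Catalyst and also reads them programmatically:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("catalyst-sketch").master("local").getOrCreate()
import spark.implicits._

// Hypothetical data; the query combines a filter and a projection.
val people = Seq(("Alice", "Paris", 34), ("Bob", "Lyon", 41)).toDF("name", "city", "age")
val query = people.filter($"city" === "Paris").select($"name")

// Prints the parsed, analyzed and optimized logical plans, then the physical plan.
query.explain(true)

// The same stages are also available programmatically.
val analyzed = query.queryExecution.analyzed
val optimized = query.queryExecution.optimizedPlan
val physical = query.queryExecution.executedPlan
```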

A DataFrame can keep its data serialized in a binary format and perform many transformations directly on it. This relies on the Tungsten physical execution backend, so there is no need to use Java serialization to encode the data.

Example of a transformation: suppose we have a DataFrame of people that contains information about each of them, including their city of residence, and we want to extract only those living in Paris. For this purpose we use a filter:
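A minimal sketch of such a filter. The people and the column names (name, age, city) are assumptions made for the example:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("filter-sketch").master("local").getOrCreate()
import spark.implicits._

// Hypothetical people; the column names are assumptions for this sketch.
val people = Seq(
  ("Alice", 34, "Paris"),
  ("Bob", 41, "Lyon"),
  ("Chloe", 28, "Paris")
).toDF("name", "age", "city")

// Keep only the rows whose city column equals "Paris".
val parisians = people.filter($"city" === "Paris")

parisians.show()
```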

Result:

3. Dataset: the main difference from the DataFrame is that in a Dataset we manipulate typed objects. It is also optimized by the Catalyst optimizer. To use it, we must create an encoder.

An encoder is a Scala object that derives the schema from a case class.

Like the DataFrame, a Dataset can process data in various formats (Avro, CSV, JSON) and from different sources (HDFS, Hive tables, MySQL).

It offers compile-time type safety: a reference to a field that does not exist is rejected by the compiler instead of failing at run time.

The encoder handles the conversion between JVM objects and a tabular representation. This tabular representation is stored in Spark's internal Tungsten binary format, which allows operations to be performed on serialized data and an attribute to be accessed without deserializing the entire object.
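To make the role of the encoder more concrete, here is a short sketch that derives an encoder from a case class, prints the schema it carries, and applies a typed filter. The Person class and its fields are hypothetical:

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// Hypothetical case class used only for this sketch.
case class Person(name: String, age: Int, city: String)

val spark = SparkSession.builder().appName("dataset-sketch").master("local").getOrCreate()

// Derive an encoder, and therefore a schema, from the case class.
implicit val personEncoder: Encoder[Person] = Encoders.product[Person]
println(personEncoder.schema.treeString)

val people = spark.createDataset(Seq(Person("Alice", 34, "Paris"), Person("Bob", 41, "Lyon")))

// Typed transformation: field access is checked by the compiler.
// people.filter((p: Person) => p.town == "Paris") would not compile,
// because Person has no member named town.
val parisians = people.filter((p: Person) => p.city == "Paris")
```

In everyday code, import spark.implicits._ usually provides this encoder implicitly. It is spelled out here only to show where the schema comes from.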

Example of a transformation:

1. First, I create a case class and its encoder: