Getting started with Apache Spark III

Udbhav Pangotra · Geek Culture · Jan 6, 2022

The third installment in your Apache Spark Journey!

Photo by AltumCode on Unsplash

Now we move on to the data structures and some other more in-depth topics in Spark. There are three types of data structures in Spark that can hold data:

  1. RDD (Resilient Distributed Datasets)
  2. Dataset
  3. DataFrame

Resilient Distributed Datasets (RDD)
The fundamental (legacy) data structure of Spark, which can be described as:
- An immutable, distributed collection of elements of data
- Fault tolerant
- Partitioned across the nodes of the cluster so that it can be operated on in parallel


Creating RDDs:

  • Parallelize collections
    from pyspark import SparkContext
    sc = SparkContext.getOrCreate()
    data = [1,2,3,4]
    distData = sc.parallelize(data)
  • External Datasets
    distFile = sc.textFile("data.txt")
  • From another RDD
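For the third option, applying a transformation to an existing RDD yields a new RDD. A minimal sketch (the lambdas are purely illustrative):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# RDD from a parallelized collection
distData = sc.parallelize([1, 2, 3, 4])

# New RDDs derived from an existing RDD via transformations
squared = distData.map(lambda x: x * x)        # [1, 4, 9, 16]
evens = squared.filter(lambda x: x % 2 == 0)   # [4, 16]

print(evens.collect())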

Dataset
A strongly typed data structure in Spark SQL; it is an extension of the DataFrame API (available in the Scala and Java APIs).

DataFrame
A distributed collection of data grouped into named columns; a DataFrame can be thought of as a database table.

Creating a DataFrame:
NewDF = spark.read.parquet("file_path")
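A slightly fuller sketch, assuming a SparkSession handle named spark and made-up column names (the parquet path above is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# DataFrame from an in-memory collection (illustrative data)
employees = spark.createDataFrame(
    [("Alice", 120000), ("Bob", 95000)],
    ["name", "salary"],
)
employees.show()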

Parallelism and Partitions
Two main factors that control the parallelism in Spark are
1. Number of partitions
2. Number of executors
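A quick sketch of inspecting and adjusting partitions, reusing the sc and spark handles from the earlier snippets (the partition counts are arbitrary):

rdd = sc.parallelize(range(100), numSlices=8)   # request 8 partitions up front
print(rdd.getNumPartitions())                   # 8

df = spark.range(1000)
wider = df.repartition(16)    # full shuffle into 16 partitions
narrower = wider.coalesce(4)  # reduce partitions without a full shuffle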

ETL/ELT process

ETL (Extract, Transform and Load):
Extract: the process of extracting data from multiple sources
Transform: applying business logic to the extracted data and discarding the data that is not needed
Load: loading the data into the finalized table(s), or as files
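As a concrete illustration, a hypothetical ETL flow in PySpark (paths and column names are made up):

# Extract: read raw data from a source
orders = spark.read.option("header", True).csv("raw/orders.csv")

# Transform: apply business logic and discard what is not needed
clean_orders = (
    orders
    .filter(orders.status == "COMPLETED")
    .select("order_id", "customer_id", "amount")
)

# Load: write the finalized data out
clean_orders.write.mode("overwrite").parquet("curated/orders")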

ELT (Extract, Load and Transform):
The load and transform steps are switched, as there are cases in which we need the raw data as-is for analytics and data science purposes.
We keep the raw data and let the teams transform it as per their needs.

Data Warehouse and Data Lakes

Data Warehouse:
Built for the purpose of OLAP
Focuses on one business area
Usually contains data in a more structured form
The end usage is generating reports, and feeding downstream applications and web interfaces
The most popular databases for data warehousing are Oracle, Hive, GBQ (Google BigQuery) and Snowflake

Data Lake:
The focus is on creating centralized data availability to be used across multiple functional teams
The data is kept as-is, without many changes from the source
The end use can be for multiple consumers, like data science, data engineering, data analysts, etc.
Data cataloging can be done to tag the data for ease of usage

Operations on the data structures

We can perform the operations below on RDDs, Datasets and DataFrames:
Transformations
Actions

For instance, if we want to get the employees with a salary greater than $100k, we can express that as a transformation (sketched below).
An action is when we collect the data, or save it to a file or database; it brings the data held in the executors back to the driver. When an action is triggered, the jobs are created and run on the executors.
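Reusing the illustrative employees DataFrame from above, the salary example looks roughly like this:

# Transformation: lazily describes the filtering, nothing runs yet
high_earners = employees.filter(employees.salary > 100000)

# Action: triggers a job on the executors and brings the result to the driver
print(high_earners.collect())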

Some of the different transformations we have:
Filtering the data
Changing the datatypes
Applying calculations on the fields
Joining
Merging
Pivot
Select and many more!
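A sketch combining a few of these on made-up DataFrames:

from pyspark.sql import functions as F

departments = spark.createDataFrame(
    [(1, "Engineering"), (2, "Sales")], ["dept_id", "dept_name"]
)
staff = spark.createDataFrame(
    [("Alice", 1, 120000), ("Bob", 2, 95000)], ["name", "dept_id", "salary"]
)

result = (
    staff
    .filter(F.col("salary") > 100000)                      # filtering
    .withColumn("salary", F.col("salary").cast("double"))  # changing datatypes
    .withColumn("bonus", F.col("salary") * 0.1)            # calculations on fields
    .join(departments, "dept_id")                          # joining
    .select("name", "dept_name", "salary", "bonus")        # select
)
result.show()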

Types of Transformations

Narrow Transformations
- In this, each input partition will contribute to only one output partition
- No shuffle is required
- Can be pipelined into a single stage
- Examples: map, filter, union, etc.

Wide Transformations
- In this, input partitions will contribute to more than one output partition
- It defines a new stage
- Examples: reduceByKey, groupBy, join, etc.
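The same distinction in code, on an illustrative RDD:

rdd = sc.parallelize(range(10), 4)

# Narrow: each input partition feeds exactly one output partition,
# so these pipeline into a single stage with no shuffle
narrow = rdd.map(lambda x: x * 2).filter(lambda x: x > 5)

# Wide: regrouping by key forces a shuffle and starts a new stage
wide = narrow.map(lambda x: (x % 3, x)).reduceByKey(lambda a, b: a + b)

print(wide.collect())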

Actions

An action instructs Spark to compute a result from a series of transformations.
Once completed, the action sends the data from the executors back to the driver.
Actions can return raw values, or we can write the results to files and tables.

Some of the typical actions that we can perform are:
- Collect
- Reduce
- Count
- Take
- Top
- Aggregate
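A few of these actions on a small illustrative RDD:

nums = sc.parallelize([5, 1, 4, 2, 3])

print(nums.collect())                   # all elements back on the driver
print(nums.count())                     # 5
print(nums.take(2))                     # first 2 elements
print(nums.top(2))                      # [5, 4]
print(nums.reduce(lambda a, b: a + b))  # 15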

Code Execution in Spark

Code execution happens across the cluster. Once we create a DataFrame, a Dataset or a SQL query, Spark creates a logical plan for its execution. This is all done through an optimizer that runs in the background.


Logical plan:
A series of algebraic constructs describing what to compute; it does not define how the computation will be done.
Physical plan:
Defines how the computation will actually be executed on the cluster.
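You can inspect both plans for any DataFrame with explain (the query here is arbitrary):

df = spark.range(1000).filter("id % 2 = 0").groupBy().count()

# True prints the parsed, analyzed and optimized logical plans
# as well as the physical plan
df.explain(True)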

SQL Catalyst Optimizer

The translation of the logical plan into the physical plan is done by the Catalyst optimizer. Optimization here simply means fine-tuning a long-running application or system so that it manages resources effectively and reduces processing time.

As you may know, transformations performed directly on RDDs are not as efficient; the Spark SQL DataFrame and Dataset APIs outperform the RDD API. Spark SQL runs with an optimization engine called the Catalyst optimizer, which helps developers optimize queries built on top of both DataFrames and Datasets without making any changes to the source code.

Catalyst is a modular library within Spark SQL that supports both rule-based and cost-based optimization. Rule-based optimization proposes a set of rules to enumerate the ways in which a query can be executed, whereas cost-based optimization computes a cost for each of those candidate plans and helps the engine pick the best-suited approach to execute the SQL query.

The main focus of the Catalyst optimizer is to minimize shuffle operations and the data transfer between executors.

There are 4 main components:
- Analyzer
- Optimizer
- Physical planning
- Code generator

Spark Application

The components of a Spark application mainly consist of:

  1. Creating the Spark session and defining the application
  2. Connecting with the data sources
  3. Getting the input data in the form of RDDs and DataFrames
  4. Applying transformations and business logic
  5. Applying actions
  6. Defining the mode of running the application
  7. Submitting the application
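Pulling these steps together, a minimal sketch of such an application (paths, column names and the aggregation are all made up; the run mode is typically chosen at submission time, e.g. via spark-submit):

from pyspark.sql import SparkSession, functions as F

# 1. Create the session and define the application
spark = SparkSession.builder.appName("example-app").getOrCreate()

# 2-3. Connect to a source and get the input as a DataFrame
sales = spark.read.parquet("input/sales")

# 4. Apply transformations and business logic
summary = sales.groupBy("region").agg(F.sum("amount").alias("total"))

# 5. Apply an action (writing the result out)
summary.write.mode("overwrite").parquet("output/sales_summary")

spark.stop()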

We’ll go through some code in the next post! :)

Other articles that you might be interested in:
- Part of this series:

Getting started with Apache Spark — I | by Sam | Geek Culture | Jan, 2022 | Medium
Getting started with Apache Spark II | by Sam | Geek Culture | Jan, 2022 | Medium

Misc: Streamlit and Palmer Penguins. Binged Atypical last week on Netflix… | by Sam | Geek Culture | Medium
- Getting started with Streamlit. Use Streamlit to explain your EDA and… | by Sam | Geek Culture | Medium

Cheers and do follow for more such content! :)

You can now buy me a coffee too if you liked the content!
samunderscore12 is creating data science content! (buymeacoffee.com)
