Explanation of PySpark and Coding
Apache Spark:
Apache Spark is a powerful ETL tool used to analyse Big Data, and it is widely used by Data Engineers and Data Scientists in the industry nowadays. Spark itself is written in Scala. Spark can be used from various programming languages such as R, Python, Java, Scala and .NET, but it is most widely used from Python, Java and Scala.
So what's PySpark? It's nothing but using Spark from the Python language; that is called PySpark. Let's concentrate on Spark using Python.
It's good to use Spark from Python because Python has one of the largest developer communities and more library support than most competing languages, with more than 130 thousand packages. Using Python you can perform high-level tasks like Spark Streaming, GraphX, Spark SQL and MLlib (machine learning). Yes, you can do machine learning using Spark.
RDD:
RDD stands for Resilient Distributed Dataset. An RDD is a read-only collection of objects that is partitioned across multiple machines in a cluster. In a typical Spark program, one or more RDDs are loaded as input and, through a series of transformations, are turned into a set of target RDDs.
i. A fault-tolerant, distributed collection of objects.
ii. In Spark, every task is expressed in one of the following ways:
- Creating new RDD(s)
- Transforming existing RDD(s)
- Calling operations on RDD(s)
RDDs support only two types of operations: transformations and actions.
Transformations: create a new dataset from an existing one, e.g. map, flatMap, filter, union, sample, join, groupByKey.
Actions: return a value to the driver program after running a computation on the dataset, e.g. collect, reduce, count, save.
For more information, please follow the link here.
PySpark Coding (Hands-on):
To import the required libraries kindly use the following code.
import pyspark
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession
from pyspark.sql import functions as fs
Let’s create a spark session.
spark = SparkSession.builder.appName('Medium').getOrCreate()
spark
The above code creates a Spark session, the entry point that can monitor and control all other Spark operations in the application. You can give it any name via appName('Preferred_name').
Once the Spark session is created, you can click on the Spark UI link shown in the output, which takes you to the Spark portal where you can monitor all your operations.
The PySpark syntax is very similar to Pandas, with some unique differences. Now let's start importing data and doing some basic operations.
df=spark.read.csv('ramen_rating.csv',header=True,inferSchema=True)
df.show()
In the above code, I have imported a file called ramen_rating, which is available on Kaggle. I recommend passing header and inferSchema while importing the file: header tells Spark the first row holds the column names, and inferSchema tells it to infer each column's data type from the file. If you don't use inferSchema, Spark will pick a default data type for every column, most often string, so I suggest you use inferSchema.
Unlike Pandas, Spark will not show the data once you call the variable df, so we need to use the df.show() command to see the data stored in the variable.
This code will help you find the data type, or schema, of each column in the table:
df.printSchema()
This code is used to select a single column (or several) to display:
df.select('Brand').show()
You can use describe the same way we use it in Pandas:
df.describe().show()
Now, Let’s do some arithmetic operations.
df = df.withColumn('Stars_10', df['Stars'] + 5)
df.show()
The above code creates a new column with arithmetic. I have taken the Stars column, where the mode is 5, so I am adding 5 to create a new column, and I assign it the name Stars_10.
You can also drop a column, the same as in Pandas:
df = df.drop('Stars_10')
df = df.withColumnRenamed('Stars', 'Stars_5')
df.show()
Here I have renamed Stars to Stars_5 using the above code.
df = df.withColumn('Product (Type)',fs.concat(df.Brand,fs.lit(' ('),df.Style,fs.lit(')')))
df.show()
Initially it may seem complex or confusing compared with Pandas syntax, but it's easy once you understand and practice with Spark. Here I have concatenated the Brand and Style columns into a new column, Product (Type). Inside withColumn you give the new column name, followed by the concat expression that joins the two columns.
Temp View:
I can sense that most people have been waiting for this. In Spark, you can run SQL syntax over your data, and you can connect to data from other environments such as Hadoop, BigQuery, etc.
df.createTempView("temp")
df_sql = spark.sql("SELECT *, avg(Stars_5) over(PARTITION BY Style) as Average FROM temp order by Brand asc, Country desc")
In the above code, I have created a temporary view so I can use SQL commands, and then I have performed a window function over it.
Filter Operations:
Unlike Pandas, you can't filter the data directly by indexing with a condition. In Spark we use the filter operation to restrict the data to a range, to non-null values, or to any other condition.
df.filter("Stars_5<3 and Stars_5>1.5").show()
This command filters the data so that the Stars_5 column is less than 3 and, at the same time, greater than 1.5.
Group By:
The group by operation in Spark is the same as in Pandas. The below-mentioned code calculates the mean per value of the Style column.
df.groupby(‘Style’).mean().show()
You can also perform two different aggregation operations on two columns in a single command.
df.groupby('Brand').agg({'Stars_5':"sum","Percentage":"mean"}).show()
Here I have computed the sum of the Stars_5 column and the mean (average) of the Percentage column, grouping by the Brand column.
You can also use multiple columns in a single group by command.
I tried to cover as many basics and widely used operations as possible in this article. I have attached my PySpark work in Jupyter notebook format on GitHub, which will explain the coding part in more detail. Please find the link placed below the article.
Stay tuned for RDDs and machine learning using PySpark in the next article, Explanation of PySpark and Coding II.
To continue reading the next article, please do follow me, and clap if this was a good read. Enjoy! 💙
Thanks for reading. :)