Explanation of PySpark and Coding

Anandaram Ganpathi
Published in Analytics Vidhya · 5 min read · Sep 21, 2021

Apache Spark:

Apache Spark is a powerful ETL tool used to analyse Big Data. It is widely used by Data Engineers and Data Scientists in the industry nowadays. Spark itself is written in Scala. Spark can be used from several programming languages such as R, Python, Java, Scala and .NET, but it is most widely used with Python, Java and Scala.

So what is PySpark? It is nothing but using Spark from the Python language, which is called PySpark. Let's concentrate on Spark using Python.

It's good to use Spark with Python because Python has one of the largest developer communities and more library support than most competing languages, with more than 130 thousand packages. Using Python you can perform high-level tasks like Spark Streaming, GraphX, Spark SQL and MLlib (Machine Learning). Yes, you can do Machine Learning using Spark.

Spark Core

RDD:

RDD stands for Resilient Distributed Dataset: a read-only collection of objects that is partitioned across multiple machines in a cluster. In a typical Spark program, one or more RDDs are loaded as input and, through a series of transformations, turned into a set of target RDDs.

i. A fault-tolerant, distributed collection of objects.

ii. In Spark, every task is expressed in the following ways:

  • Creating new RDD(s)
  • Transforming existing RDD(s)
  • Calling operations on RDD(s)

RDDs support only two types of operations: transformations and actions.

Transformations: create a new dataset from an existing one, using operations such as map, flatMap, filter, union, sample, join and groupByKey.

Actions: return a value to the driver program after running a computation on the dataset, using operations such as collect, reduce, count and save.
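To make the difference concrete, here is a minimal sketch (not from the original article) that creates an RDD, applies lazy transformations, and only computes results when an action is called. It assumes a SparkSession named spark, which is created in the hands-on section below.

rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])   # create a new RDD
squared = rdd.map(lambda x: x * x)             # transformation: lazy, returns a new RDD
evens = squared.filter(lambda x: x % 2 == 0)   # transformation: still nothing is computed
print(evens.collect())                         # action: triggers the computation -> [4, 16]
print(squared.reduce(lambda a, b: a + b))      # action: sum of squares -> 55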

RDD’s Process

For more information, please follow the link here.

PySpark Coding (Hands-on):

To import the required libraries, use the following code.

import pyspark
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession
from pyspark.sql import functions as fs  # Spark SQL functions (concat, lit, ...), aliased here as fs

Let’s create a spark session.

spark = SparkSession.builder.appName('Medium').getOrCreate()
spark

The above code creates a Spark session, which is the entry point for your application: it lets you create DataFrames and monitor and control the operations you run on the cluster. You can give the session any name via appName('Preferred_name').

Spark Session

Once the Spark session is created, you can click the Spark UI link shown in the output, which takes you to a new Spark portal where you can monitor all your operations.

The PySpark syntax is quite similar to Pandas, with some unique differences. Now let's start importing data and doing some basic operations.

df=spark.read.csv('ramen_rating.csv',header=True,inferSchema=True)
df.show()

In the above code, I have imported a file called ramen_rating, which is available on Kaggle. I recommend passing header and inferSchema while importing the file: header picks up the column names and inferSchema picks the data types from the file. If you don't use inferSchema, Spark will pick a default data type for each column, most often string, so I suggest you use inferSchema.
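As a quick sketch of that difference (the df_raw and df_typed names are just for illustration, using the same ramen_rating.csv file), reading without inferSchema leaves every column as a string:

df_raw = spark.read.csv('ramen_rating.csv', header=True)
df_raw.printSchema()    # without inferSchema, every column is typed as string

df_typed = spark.read.csv('ramen_rating.csv', header=True, inferSchema=True)
df_typed.printSchema()  # with inferSchema, numeric columns such as Stars should get a numeric type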

Unlike Pandas, Spark will not display the data when you simply call the variable df, so we need to run df.show() to see the data stored in the variable.

This code will show you the data type (schema) of each column in the table:

df.printSchema()

This code is used to select one or more columns for display:

df.select('Brand').show()
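For multiple columns (a quick sketch, not from the original article; the Country column also appears later in the SQL example), you can pass several names to select:

df.select('Brand', 'Country').show()   # sketch: selecting multiple columns at once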

You can also use describe, the same as in Pandas:

df.describe().show()

Now, Let’s do some arithmetic operations.

df = df.withColumn('Stars_10', df['Stars']+5)
df.show()

The above code creates a new column with a simple arithmetic operation: I have taken the Stars column, which is rated out of 5, added 5 to it to create a new column, and named it Stars_10.

You can also drop a column, the same as in Pandas: df = df.drop('Stars_10')

df = df.withColumnRenamed('Stars', 'Stars_5')
df.show()

Here I have changed Stars to Stars_5 by using the above code.

df = df.withColumn('Product (Type)',fs.concat(df.Brand,fs.lit(' ('),df.Style,fs.lit(')')))
df.show()

At first it may seem complex or confusing compared with Pandas syntax, but it becomes easy once you understand it and practice with Spark. Here I have concatenated two columns into Product (Type) using the Brand and Style columns. Inside withColumn you give the new column name, followed by the concat expression that joins the two columns.

Temp View:

I can sense most people have been waiting for this. In Spark, you can use SQL syntax on your data, including data connected from other environments such as Hadoop or BigQuery.

df.createTempView("temp")
df_sql = spark.sql("SELECT *, avg(Stars_5) over(PARTITION BY Style) as Average FROM temp order by Brand asc, Country desc")

In the above code, I have created a temporary view so I can use SQL commands, and then performed a window function on it (the average of Stars_5 partitioned by Style).
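The same window function can also be expressed with the DataFrame API instead of SQL. This is a sketch of that alternative, not code from the original notebook:

from pyspark.sql.window import Window

# Sketch: average of Stars_5 per Style as a window, with the same ordering as the SQL above
style_window = Window.partitionBy('Style')
df_win = (df.withColumn('Average', fs.avg('Stars_5').over(style_window))
            .orderBy(fs.asc('Brand'), fs.desc('Country')))
df_win.show()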

Filter Operations:

Unlike Pandas, you can't filter the data by indexing it directly with a condition. In Spark, we use the filter option to filter the data by a range, by non-null values, or by any other condition.

df.filter("Stars_5<3 and Stars_5>1.5").show()

This command filters the data so that Stars_5 is less than 3 and at the same time greater than 1.5.
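Equivalently (a sketch of the column-expression form, not from the original article), the same condition can be written with column objects, and null checks work the same way:

df.filter((df['Stars_5'] < 3) & (df['Stars_5'] > 1.5)).show()   # same filter with column expressions
df.filter(df['Stars_5'].isNotNull()).show()                     # keep only non-null ratings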

Group By:

The group by operation in Spark is similar to Pandas. The code below calculates the mean of the numeric columns grouped by the Style column.

df.groupby('Style').mean().show()

You can also perform two different aggregations on two columns in a single command.

df.groupby('Brand').agg({'Stars_5':"sum","Percentage":"mean"}).show()

Here I have summed the Stars_5 column and calculated the mean (average) of the Percentage column, grouping by the Brand column.

You can also use multiple columns in a single group by command, as shown in the sketch below.
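For example (a sketch along the same lines, not from the original notebook), grouping by both Brand and Style before aggregating:

df.groupby('Brand', 'Style').agg({'Stars_5': 'mean'}).show()   # sketch: group by two columns at once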

I have tried to cover as many of the basics and widely used operations as possible in this article. I have attached my PySpark work as a Jupyter notebook on GitHub, which explains the coding part in more detail. Please find the link placed below the article.

Stay tuned for RDDs and Machine Learning using PySpark in the next article, Explanation of PySpark and Coding II.

To continue reading the next article, please follow me and leave a clap if this was a good read. Enjoy! 💙

Thanks for reading. :)

GitHub: https://github.com/anand-lab-172/PySpark


I’m a Big Data Engineer and Data Scientist with good programming knowledge and skills in Python, SQL, Machine Learning, Google Cloud, Tableau, EDA, Talend.