PySpark Cheat Sheet For Big Data Analytics
Michelangelo once said, “If people knew how hard I had to work to gain my mastery, it would not seem so wonderful at all.” We all know that practice makes perfect: to learn a new skill, whether it is painting, singing, or performing data analysis, you need to practice consistently and repeat the implementation of that skill.
Nowadays, PySpark is a must-have for every data scientist conducting data analysis or experiments involving big data. Learning all of PySpark’s functionality in a short amount of time can be hard, but there are essential functions that you simply must know to start your big data analysis in PySpark.
Here is a cheat sheet for the essential PySpark commands and functions.
Loading Data
The first and essential step in every analysis is to load the data. You can load data in all sorts of formats, such as CSV, JSON, Parquet, and TXT, into PySpark. For this article we have used the Stroke Prediction Dataset, publicly available on Kaggle.
CSV format
Here is how you can do it if your data is in CSV format.
Parquet format
You can use similar code to load data that is in Parquet format or in any other format.
Viewing Data
Viewing your data at intermediate steps helps you to:
- check your code at each intermediate step
- verify whether your code is working as expected
- identify where exactly the bug lies in your code
- improve your code
Display()
To view the data, or any dataframe in general, you can use the display() command. This helps you validate your data, and you can also use it to verify intermediate steps when making adjustments to your data.
The display() command will show you the first 1000 observations in your data, but if you want to see fewer observations, for example 20, you can limit the number of observations by using the limit(20) command.
Show()
Another PySpark command that allows you to view the data is show(), where inside the parentheses you can specify the number of observations you want to view, similar to the limit() functionality.
Print()
If you want to print something that has a single value, then the print() command is what you need. You can print, for instance, the number of observations in your data, using count(), or any other single value.
Selecting Data
When your data contains many data fields and you would like to keep only some of them, you can use the select() command, where you specify the names of the variables in your dataframe you would like to keep. The following code shows how to select the variables id, gender, age, and stroke.
If you wish to rename any or all of the variables while selecting them, you can use selectExpr(). The following code selects the variables id, gender, age, and stroke while renaming them to ID, Gender, Age, and Stroke.
Counting Data
count() is one of the most essential commands in PySpark, whether you are simply checking that you have the right amount of data with the correct number of observations, or calculating descriptive statistics for your dataset. For example, here we count the number of ID’s that exist in this dataset, that is, the number of rows the ID data field has.
If you want to count the unique or distinct values of a variable, you can use the distinct() command in combination with count(). For example, you can count the distinct ID’s in the dataframe, which in this case is equal to the total number of ID’s.
Unique Values
You can also use the distinct() command on its own to find the unique values a variable has. For example, here we find the unique values of the variable Stroke.
Filtering Data
To filter data for certain values you can use the filter() command. For example, you can keep only the data of people who had a stroke, that is, who have the value 1 in the dummy variable Stroke.
Then you can combine filter() with count() to, for example, count the number of people who had a stroke.
You can also apply filter() multiple times to filter your data on multiple conditions, for example, to count the number of females, or males, in this sample who had a stroke.
You can also use filter() in combination with count() to obtain percentages, such as the percentage of females or males in the sample who had a stroke.
If you want to filter your data based on multiple values of a variable, you can combine filter() with isin(). In this example we filter the data on Age and keep observations whose Age is 80, 81, or 82.
Ordering Data
To order your data based on a certain variable you can use orderBy(), where you specify the variable by which you wish to order the data. Unless specified otherwise, Spark orders the data in ascending order. For example, here we order the data based on the Age variable.
As an alternative to the orderBy() command, you can use sort() to order your data. If you want to order the data in descending order, you need to specify this by using the desc() function, which you first need to import. You can, for example, order the data by the variable Age in descending order (the data of the oldest people comes first).
Creating New Variables
To create a new variable, you can use withColumn(), which is usually used in combination with other PySpark functions. For example, we create a new variable called gender_lower by using withColumn() and lower(), which converts the original Gender variable values from capitalized to lowercase letters.
Deleting Data
If you want to delete a certain variable (column), you can do that by using the drop() command while specifying the name of the variable you would like to drop. For example, here we drop the earlier created variable gender_lower. (To delete observations, i.e. rows, use filter() instead.)
Changing Data Types
If you want to change the data type of a variable, for example from string to integer, you can use cast() to perform this transformation. There are many data types you can use for this purpose, such as IntegerType, StringType, DoubleType, LongType, etc.
Here we use a combination of withColumn(), col(), cast(), and IntegerType() from the types module, which needs to be imported, to transform the Age variable into integer type.
Conditions
In PySpark you can apply conditional operations in multiple ways. You can either write a Python function and apply it to your data using User Defined Functions (UDFs), or use the PySpark command when().otherwise(). For example, here we create a new binary gender variable that takes the value 1 if the original Gender variable is equal to Female, and the value 0 otherwise. You can also nest when().otherwise() operations when there are more than 2 possible values.
Data Aggregation
When it comes to aggregation or grouping, groupBy().agg() is what you need. This operation consists of two parts: groupBy(X) groups your data per unique value of the variable X specified as an argument, whereas inside agg() you specify the type of aggregation operation you would like to apply and the variable in your dataset to apply it to. This can be summing all values per group using the sum() function, or obtaining the average value per group using the avg() function. Additionally, you can find the maximum or minimum values of a certain variable per group using the max() or min() functions, respectively. You might also want to collect data and create a list per group, which can be done using the collect_list() function.
Summation
Here we obtain the number of strokes per gender type by grouping on the variable Gender and summing up all values in the variable Stroke, given that it is a binary variable.
Maximum
In this example we determine, per gender type, the maximum age of a person. That is, we group based on the variable Gender and then find the maximum value in the variable Age.
Minimum
Another example is obtaining the group minimum, where we determine, per gender type, the minimum age of a person. That is, we group based on the variable Gender and then find the minimum value in the variable Age.
Collecting Group Data
Here we collect the values of the variable Stroke and store them in a list per gender type. Then we use the alias() command to name this newly created variable.
Average
Another popular aggregation is obtaining the group average. In this example we compute the average stroke rate per gender type.
If you liked this article, here are some other articles you may enjoy:
Thanks for reading!
I encourage you to join Medium today to have complete access to all of the great locked content published across Medium and on my feed where I publish about various Data Science, Machine Learning, and Deep Learning topics.
Follow me on Medium to read more articles about various Data Science and Data Analytics topics. For more hands-on applications of Machine Learning and Mathematical and Statistical concepts, check out my Github account.
I welcome feedback and can be reached on LinkedIn.
Happy learning!