PySpark Cheat Sheet For Big Data Analytics
Michelangelo once said, “If people knew how hard I had to work to gain my mastery, it would not seem so wonderful at all.” We all know that practice makes perfect: to learn a new skill, whether it is painting, singing, or performing data analysis, you need to practice consistently and repeat the implementation of that skill.
Nowadays, PySpark is a must-have for every data scientist conducting data analysis or experiments involving big data. Learning all of PySpark’s functionality in a short amount of time can be hard, but there are essential functions that you simply must know to start your big data analysis in PySpark.
Here is a cheat sheet for the essential PySpark commands and functions.
Loading Data
The first and essential step in every analysis is to load the data. You can load data in all sorts of formats, such as CSV, JSON, Parquet, and TXT, into PySpark. For this article we have used the Stroke Prediction Dataset, publicly available on Kaggle.
CSV format
Here is how you can do it if your data is in CSV format.
Parquet format
You can use similar code to load data that is in Parquet format or in any other format.
Viewing Data
Viewing your data at intermediate steps helps you to:
- check your code at each intermediate step
- verify whether your code is working as expected
- identify where exactly the bug lies in your code
- improve your code
Display()
To view the data, or any dataframe in general, you can use the display() command. This helps you validate your data, and you can also use it to verify intermediate steps when making adjustments to your data.
The display() command will show you the first 1000 observations in your data, but if you want to see fewer observations, for example 20, you can limit the number of observations by using the limit(20) command.
Show()
Another PySpark command that allows you to view the data is show(), where inside the parentheses you can specify the number of observations you want to view, similar to the limit() functionality.
Print()
If you want to print something that has a single value, then the print() command is what you need. You can print, for instance, the number of observations in your data, using count(), or any other single value.
Selecting Data
When your data contains many data fields and you would like to keep only some of them, you can use the select() command, where you specify the names of the variables in your dataframe you would like to keep. The following code shows how to select the variables id, gender, age, and stroke.
If you wish to rename any or all of the variables while selecting them, you can use selectExpr(). The following code selects the variables id, gender, age, and stroke while renaming them to ID, Gender, Age, and Stroke.
Counting Data
count() is one of the most essential commands in PySpark, whether you are simply checking that you have the right amount of data with the correct number of observations, or calculating descriptive statistics for your dataset. For example, here we count the number of ID’s that exist in this dataset, that is, the number of rows the ID data field has.
If you want to count the unique or distinct values of a variable, you can use the distinct() command in combination with count(). For example, you can count the distinct ID’s in the dataframe, which in this case is equal to the total number of ID’s.
Unique Values
You can also use the distinct() command on its own to find the unique values a variable has. For example, here we find the unique values of the variable Stroke.
Filtering Data
To filter data for certain values you can use the filter() command. For example, you can keep only the data of people who had a stroke, that is, who have the value 1 in the dummy variable Stroke.
Then you can combine filter() with count() to, for example, count the number of people who had a stroke.
You can also apply filter() multiple times to filter your data on multiple conditions, for example, to count the number of females, or males, in this sample who had a stroke.
You can also use filter() in combination with count() to obtain percentages, such as the percentage of females or males in the sample who had a stroke.
If you want to filter your data based on multiple values of a variable, you can combine filter() with isin(). In this example we filter the data on Age and keep observations whose Age is 80, 81, or 82.
Ordering Data
To order your data based on a certain variable you can use orderBy(), where you specify the variable by which you wish to order the data. Unless specified otherwise, Spark orders the data in ascending order. For example, here we order the data based on the Age variable.
As an alternative to the orderBy() command, you can use sort() to order your data. If you want to order the data in descending order, you need to specify this by using the desc() function, which you first need to import. You can, for example, order the data by the variable Age in descending order (the data of the oldest people comes first).
Creating New Variables
To create a new variable, you can use withColumn(), which is usually used in combination with other PySpark functions. For example, we create a new variable called gender_lower by using withColumn() and lower(), which converts the original Gender variable values from capitalized to lowercase letters.
Deleting Data
If you want to delete a certain variable (column), you can do that by using the drop() command while specifying the name of the variable you would like to drop. For example, here we drop the earlier created variable gender_lower. (To delete observations, i.e. rows, use filter() instead.)
Changing Data Types
If you want to change the data type of a variable, for example from string to integer, you can use cast() to perform this transformation. There are many data types you can use for this purpose, such as IntegerType, StringType, DoubleType, LongType, etc.
Here we use a combination of withColumn(), col(), cast(), and IntegerType() from the types module, which needs to be imported, to transform the Age variable into integer type.
Conditions
In PySpark you can apply conditional operations in multiple ways. You can either write a Python function and apply it to your data using User Defined Functions (UDFs), or use the PySpark command when().otherwise(). For example, here we create a new binary gender variable that takes the value 1 if the original Gender variable is equal to Female, and the value 0 otherwise. You can also nest when().otherwise() operations when there are more than 2 possible values.
Data Aggregation
When it comes to aggregation or grouping, groupBy().agg() is what you need. This operation consists of two parts: groupBy(X) groups your data per unique value of the variable X specified as an argument, whereas inside agg() you specify the type of aggregation operation you would like to apply and the variable in your dataset to apply it to. This can be summing all values per group using the sum() function, or obtaining the average value per group using the avg() function. Additionally, you can find the maximum or minimum values of a certain variable per group using the max() or min() functions, respectively. You might also want to collect data and create a list per group, which can be done using the collect_list() function.
Summation
Here we obtain the number of strokes per gender type by grouping on the variable Gender and summing up all values in the variable Stroke, given that it is a binary variable.
Maximum
In this example we determine, per gender type, the maximum age of a person. That is, we group based on the variable Gender and then find the maximum value in the variable Age.
Minimum
Another example is obtaining the group minimum, where we determine, per gender type, the minimum age of a person. That is, we group based on the variable Gender and then find the minimum value in the variable Age.
Collecting Group Data
Here we collect the values of the variable Stroke and store them in a list per gender type. Then we use the alias() command to name this newly created variable.
Average
Another popular aggregation is obtaining the group average. In this example we compute the average stroke rate per gender type.
If you liked this article, here are some other articles you may enjoy:
Thanks for reading!
I encourage you to join Medium today to have complete access to all of the great locked content published across Medium and on my feed where I publish about various Data Science, Machine Learning, and Deep Learning topics.
Follow me on Medium to read more articles about various Data Science and Data Analytics topics. For more hands-on applications of Machine Learning and Mathematical and Statistical concepts, check out my Github account.
I welcome feedback and can be reached on LinkedIn.
Happy learning!