The Essential PySpark Cheat Sheet for All Data Engineers.

Revise all PySpark scripting and practice wisely in less time.

Kamireddy Mahendra
Towards AI


Completing tasks blindly is less significant than executing them with optimal efficiency and effectiveness.


I am offering my services as a freelancer at minimal rates to build my portfolio and earn the trust of diverse clients.

Especially for personal one-on-one (1:1) training through Zoom.

You can contact me via my Upwork page or hire me for any assistance you need with data analytics-related work, or to teach data-related skills (SQL, Python, Power BI, Machine Learning, and Big Data) personally as a tutor/trainer. For more details, visit my About section.

Contact me at mahendraee204@gmail.com

Here is my YouTube channel. Subscribe to get more information and catch my latest updates instantly.

The cheat sheet provided here will serve as a comprehensive resource to swiftly review the key aspects of PySpark, aiding in preparation for interviews or tackling data analysis tasks across various platforms like Databricks or Python-based coding environments.

With this tool at your disposal, you’ll be equipped to swiftly revise critical transformation techniques and data analysis methodologies integral to PySpark and related frameworks.

Let's get started.

Before initiating any transformations or data analysis tasks using PySpark, establishing a Spark session is paramount. This initial step serves as the cornerstone of code execution within the Spark framework. Importantly, it sets the foundation for subsequent operations and enables seamless interaction with Spark’s functionalities.

To begin, one must import the requisite modules necessary for the intended operations, ensuring that all essential components are readily available for utilization. By adhering to these fundamental procedures, developers and data practitioners can effectively harness the power of PySpark for streamlined data processing and analysis workflows.


from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import (col, year, month, dayofmonth, quarter, sum, avg, count,
                                   upper, expr, desc, broadcast, row_number, rank, dense_rank, lag, lead)
from pyspark.sql.window import Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("application").getOrCreate()

# Read a file in any supported format (CSV, JSON, Parquet, Avro, ...)
data = spark.read.csv("filePath", header=True, inferSchema=True)  # let Spark infer the column types

# Define a custom schema when we want to control the data types ourselves
schema = StructType([StructField("id", IntegerType(), True),
                     StructField("name", StringType(), True),
                     StructField("dept", StringType(), True)])
# An explicit schema also performs better than relying on inferSchema
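To actually apply the custom schema when reading, pass it to the reader instead of using inferSchema. A small sketch reusing the placeholder path and the schema defined above:

data = spark.read.csv("filePath", header=True, schema=schema)
data.printSchema()  # confirm the columns now use the custom types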

We've kicked off our PySpark application by creating a SparkSession, the entry point to PySpark functionality.


We’ll then proceed with performing various transformations and analyses on sample data.

data = [(1, 'mahi', 100), (2, 'mahendra', 200), (3, 'harish', 300), (4, 'desh', 400)]
schema = ['id', 'name', 'salary']

# Create a DataFrame from the sample data
df = spark.createDataFrame(data, schema)
df.head()
df.show()
display(df)  # display() is available in Databricks notebooks
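A few quick ways to inspect a freshly created DataFrame (a small addition, not in the original snippet):

df.printSchema()  # column names and data types
df.count()        # number of rows
df.columns        # list of column names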

We can calculate the cumulative sum of salaries using the window function in PySpark.

a = Window.orderBy('id')
cumulative_sum = df.withColumn("cumulative_sum", sum("salary").over(a))
ans = cumulative_sum.orderBy('id')
ans.show()
Similarly, replacing sum with avg gives a running average of the salaries:

a = Window.orderBy('id')
running_avg = df.withColumn("running_avg", avg("salary").over(a))
ans = running_avg.orderBy('id')
ans.show()
emp = [(1, 'mahi', 100, 1), (2, 'mahendra', 200, 2), (3, 'harish', 300, 3),
       (4, 'desh', 400, 4)]

schema = ['id', 'name', 'salary', 'dept_id']
# Create the employee DataFrame (note: build it from emp, not the earlier data list)
employee = spark.createDataFrame(emp, schema)
employee.show()
display(employee)

dept = [(1, 'HR'), (2, 'sales'), (3, 'DA'), (4, 'IT')]
schema = ['dept_id', 'department']
department = spark.createDataFrame(dept, schema)
display(department)

Let's join the two DataFrames on a common attribute, which in this case is dept_id. Here's an example:

df=employee.join(department, "dept_id", "inner").select('id','name','salary','department')
display(df)
df=employee.join(department, "dept_id", "right").select('name','department')
display(df)
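Other join types follow the same pattern. Here is a brief sketch (added for completeness) showing a left join, a left anti join, and a broadcast hint for joining with a small lookup table:

from pyspark.sql.functions import broadcast

employee.join(department, "dept_id", "left").show()        # keep all employees, matching departments where possible
employee.join(department, "dept_id", "left_anti").show()   # employees whose dept_id has no match (none in this sample data)
employee.join(broadcast(department), "dept_id").show()     # hint Spark to broadcast the small department table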

Let’s delve into detailing various PySpark transformations, actions, and other functionalities with executable code examples.

We’ll cover a range of operations, from basic transformations like filtering and grouping to more advanced techniques such as window functions.

Each code snippet will be accompanied by explanatory sentences to ensure a comprehensive understanding of the concepts.

Starting with data manipulation operations, we’ll explore filtering, selecting, and aggregating data using PySpark Data Frames. Subsequently, we’ll move on to transformations like joins, sorting, and window functions to manipulate and analyze data across multiple tables.

Additionally, we'll touch on PySpark's support for machine learning tasks with a brief model-training and evaluation example.
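Here is a minimal sketch using pyspark.ml; the toy data and the column names f1, f2, and label are illustrative placeholders:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Illustrative toy dataset: two numeric features and a binary label
ml_df = spark.createDataFrame(
    [(1.0, 2.0, 0), (2.0, 3.0, 0), (3.0, 1.0, 1), (4.0, 2.5, 1), (5.0, 3.5, 1), (0.5, 1.5, 0)],
    ["f1", "f2", "label"])

# Assemble the feature columns into the single vector column MLlib expects
assembled = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(ml_df)

# On a real dataset you would hold out a test split, e.g.:
# train, test = assembled.randomSplit([0.8, 0.2], seed=42)
model = LogisticRegression(featuresCol="features", labelCol="label").fit(assembled)
predictions = model.transform(assembled)
print(BinaryClassificationEvaluator(labelCol="label").evaluate(predictions))  # area under ROC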

Throughout this cheat sheet, each code snippet will serve as a practical demonstration of the corresponding concept, facilitating quick reference and comprehension.

By following along with these examples, users can gain proficiency in PySpark's capabilities and be better prepared for data engineering and data science interviews or real-world data processing tasks.

Filtering, selecting, aggregations, group by and order by conditions:

# Join any two DataFrames on a common key (orders and products here are generic examples)
df = orders.join(products, "order_id", "inner")
df.join(df2, "common_column").groupBy("some_column").count().orderBy(desc("count"))

# Group by a column and aggregate (the aggregation can be sum, avg, min, max, count, ...)
df1 = df.groupBy("cust_id").agg(F.sum("amount").alias("bill"))

df.groupBy("col1").agg(F.count("col2").alias("count"),
                       F.sum("col2").alias("sum"),
                       F.max("col2").alias("maximum"),
                       F.min("col2").alias("minimum"),
                       F.avg("col2").alias("average")).show()

# Dropping columns
df.drop("column_name1", "column_name2", "column_name3")
df.drop(col("column_name"))  # another way of dropping a column

# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("any_view_name")

# Ordering
df.orderBy(F.desc("column_name")).first()       # first row in descending order of a column, e.g. salary
df.orderBy(col("column_name").desc()).first()   # another way of returning the highest-value record
df.orderBy(col("column_name").desc()).limit(5)  # top 5 records

# Filtering rows on any column
df.filter(df.column_name == value).show()

# Selecting required columns with filters
df.select("column1", "column2", "column3").where(col("some_column") == "some_value")
df.select("column1").where(col("column1") > value).show(5)
df.sort("some_column")

# Renaming a column
df.withColumnRenamed("existing_column_name", "new_column_name")
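A DataFrame registered as a temporary view can then be queried with plain SQL. A short sketch using the placeholder view name and the cust_id/amount columns from above:

spark.sql("SELECT cust_id, SUM(amount) AS bill FROM any_view_name GROUP BY cust_id ORDER BY bill DESC").show()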

PySpark offers convenient methods to extract and manipulate date attributes within Data Frames, allowing users to derive insights at various granularities such as year, month, and day.

Additionally, these attributes can be sorted in ascending or descending order to facilitate analysis and visualization.

Extracting day, month, and year from the date column:

# Extract year, month, and day details from a date column
df.select(year("date_column").alias("year")).distinct().orderBy("year").show()
df.select(month("date_column").alias("month")).distinct().orderBy("month").show()
df.select(dayofmonth("date_column").alias("day")).distinct().orderBy("day").show()

df.withColumn("order_year", year("date_column"))
df.withColumn("order_month", month("date_column"))
df.withColumn("order_day", dayofmonth("date_column"))
df.withColumn("order_quarter", quarter("date_column"))

We can apply a condition to filter out null values from a specific column and then perform a group by operation with any specified order.

df.select("column name we want to retrieve").where(col("column name we want to retrieve").isNotNUll())\
.group by ("column name we want to retrieve").count().orderBy("count", ascending=False).show(10))
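Beyond filtering, nulls can also be dropped or replaced directly on the DataFrame; a brief sketch with a placeholder column name:

df.dropna(subset=["target_column"])     # drop rows where the column is null
df.fillna(0, subset=["target_column"])  # replace nulls in the column with a default value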

Write function:

Write a DataFrame to a file in any format, in any mode, to any location we want.

df.write.format("CSV").mode("overwrite").save("path that we want to store file")
df.write.format("CSV").mode("append").save("path that we want to store file")
df.write.format("Parquet").mode("overwrite").save("path that we want to store file")
df.write.format("parquet").mode("append").save("path that we want to store file")

Window Functions:

# Rolling (cumulative) sum over a window
wind_a = Window.partitionBy("col1").orderBy("col2").rangeBetween(Window.unboundedPreceding, 0)
df_w_column = df.withColumn("col_sum", F.sum("salary").over(wind_a))

# Row number
a = Window.orderBy("date_column")  # example: ordering by a date column; any column works
sales_data = df.withColumn("row_number", row_number().over(a))

# Rank
b = Window.partitionBy("date").orderBy("sales")
sales_data = df.withColumn("sales_rank", rank().over(b))

# Dense rank
b = Window.partitionBy("date").orderBy("sales")
sales_data = df.withColumn("sales_dense_rank", dense_rank().over(b))

# Lag
c = Window.partitionBy("Item").orderBy("date")  # example columns; choose any
sales_data = df.withColumn("pre_sales", lag(col("sales"), 1).over(c))

# Lead
d = Window.partitionBy("Item").orderBy("date")  # example columns; choose any
sales_data = df.withColumn("next_sales", lead(col("sales"), 1).over(d))

This article serves as a valuable tool for individuals gearing up for data engineering interviews, providing a concise yet comprehensive compendium of PySpark functions and formulas specifically tailored for the Databricks platform.

With its structured layout and detailed explanations accompanying each example, this resource empowers readers to efficiently review and memorize key concepts within a brief timeframe of just 10 minutes.

By mastering these fundamental functionalities, aspiring data engineers can boost their confidence and readiness to navigate diverse interview scenarios with ease, thereby maximizing their chances of success in landing coveted data engineering roles.

It would be great if I could get your support by buying me a coffee to boost my energy to write more useful content.


You can find a few more articles mentioned below that will help to prepare for the data engineer interview.

You can visit my GitHub profile to access more projects. Don't forget to follow me there to stay in touch with upcoming projects as well.

SQL Use Case — Data Engineer Interview.

The Most Commonly Asking Big Data(Apache Spark) Concepts in Data Engineer Interviews.

Python Coding Questions for Data Engineer Interview Part-I (Easy Level)

I hope you'll bring your hands together for a resounding clap to show your support and encourage me to share even more valuable content in the future.

Follow me and subscribe to catch any updates from me instantly.

Thank you :)

