Analyze Data with PySpark

Andrés Felipe Mesa David.
5 min read · Oct 1, 2022

Have you heard a lot about Spark but have no idea what it is? To keep it simple:

Spark lets you spread data and computations over clusters with multiple nodes (think of each node as a separate computer).

In this article, I plan to teach you how to analyze a sample data set. First, though, we'll need to set up our environment to work with PySpark (and if you would like to complement this tutorial, freeCodeCamp has an excellent video).

Nevertheless, installing Spark can be incredibly complicated; that's why we will use Databricks, as this tool already has everything in place for you to start working right away.

If you want to create a free account, you can use the community edition here.

Note: Make sure to click on the Community Edition

And you should get something like this.

Before going into the details, you can watch a tutorial on my YouTube channel (and why not subscribe while you're there?).

Alright! It's time to begin. But first, we need the data (which, of course, you can find on my Github). These three files are part of a pretty small database created to practice PySpark, and the ER model is the following.

Here you can get the final notebook in case you want to jump directly to the final result.

Now, you may wonder how to upload these files to Databricks. You can do that from the sidebar > Data > Create Table (in the upper right).

And then upload these three files.

We have imported all the necessary files into the FileStore (a special folder within the Databricks File System (DBFS) where you can save files and access them from your web browser). Check more details here.

The files are in /FileStore/tables/project, and you can list them with the following command.

%fs ls /FileStore/tables/project

Let's jump into analyzing the data.

Suppose that you want to answer the following questions:

  1. How much did each customer spend at the restaurant?
  2. How many days has each customer visited the restaurant?
  3. What was the first item each customer bought?

But first you need to import this data into your SparkSession (aka spark), and we are going to do that right after importing all the necessary libraries.

# Import sql types to define our schemas
from pyspark.sql.types import *
# Import all SQL functions as F to clean and transform data
import pyspark.sql.functions as F
# To do window functions
from pyspark.sql.window import Window
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

I'm going to create the sales_df for you, and I'll leave it up to you to create members_df and menu_df (remember you have the complete solution here or in the repo, with even more transformations and actions).

# Create a dictionary for the file locations
files = {
    'menu': '/FileStore/tables/project/menu.csv',
    'members': '/FileStore/tables/project/members.csv',
    'sales': '/FileStore/tables/project/sales.csv',
}

# Define the proper schema
sales_schema = StructType([
    StructField('customer_id', StringType(), False),
    StructField('order_date', DateType(), False),
    StructField('product_id', IntegerType(), False),
])

# Create the sales PySpark DataFrame
sales_df = (
    spark
    .read.format('csv')
    .options(header=True)
    .schema(sales_schema)
    .load(files['sales'])
)

# Then do a display or show on the DataFrame
display(sales_df)  # display() is from Databricks; you can use .show() in plain PySpark

Once you have all three DataFrames, let's answer the first question.

result_1_dataframe = (
    sales_df
    .join(
        menu_df,
        sales_df.product_id == menu_df.product_id,
        'inner')
    .groupBy('customer_id')
    .agg(
        F.sum('price').alias('total_spend')
    )
    .orderBy(F.col('total_spend').desc())
)
result_1_dataframe.show()

We can graph this with pandas too. How does it work? With the toPandas() method.

result_1_pandas = result_1_dataframe.toPandas()
sns.set_context('notebook')
sns.set_style('white')
fig, ax = plt.subplots()
sns.barplot(
    data=result_1_pandas,
    x='customer_id',
    y='total_spend',
    ax=ax
)
fig.suptitle('Total Spend by Each Customer', fontsize=20)
plt.xlabel('Customer', fontsize=18)
plt.ylabel('$ Total Spend', fontsize=16)
plt.show()

We could have generated the same DataFrame with Spark SQL, but first we need to create temporary views on the DataFrames with createOrReplaceTempView() and then use SQL to select and manipulate the data.

# Create the temporary views
sales_df.createOrReplaceTempView('sales')
members_df.createOrReplaceTempView('members')
menu_df.createOrReplaceTempView('menu')

# Create the SQL logic as a string
result_1_query = """
SELECT
    s.customer_id
    ,SUM(m.price) AS total_spend
FROM sales AS s
INNER JOIN menu AS m
    ON s.product_id = m.product_id
GROUP BY s.customer_id
ORDER BY total_spend DESC
"""

# Generate a PySpark DataFrame
result_1_sql = spark.sql(result_1_query)

# Show the result
result_1_sql.show()

As you just saw, we can transform a pyspark.sql.DataFrame into a pandas.core.frame.DataFrame and use Seaborn and Matplotlib to graph our data.

Also, to keep it simple, we could use Databricks' built-in visualization features instead. My aim is to hand you as many tools as possible for handling real-life scenarios.

I’ll leave the other two questions as a challenge.

Finding it too complicated? Go to the solution I already provided.

Stay tuned for more.

Follow me for more.
