Filtering Data with PySpark — A Detective’s Guide

Birappa Goudanavar
Towards Data Engineering
3 min read · Oct 19, 2023

You’re an experienced data detective setting out on a new adventure. Your job is to find valuable information hidden in a big dataset, like uncovering buried treasure, and your ability to filter data is the secret weapon that gets you there.


Think of the dataset as a giant treasure chest; your filtering skills are the key that unlocks its secrets.

Before diving into your first case, let’s ensure you’re well-equipped. Your trusty SparkSession is your gateway to the world of data. Think of it as your data magnifying glass.
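
Every example below assumes a SparkSession named spark is already available. If you’re running this as a standalone script rather than in a shell or notebook, a minimal setup looks like this (the app name is arbitrary):

from pyspark.sql import SparkSession

# Create (or reuse) the SparkSession that the examples below rely on
spark = SparkSession.builder.appName("DetectiveFiltering").getOrCreate()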

Scenario: Your first case involves sorting people by age. You need to identify all the individuals over the age of 30.

# Creating a DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35), ("David", 25)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Filtering data
filtered_df = df.filter(df["Age"] > 30)

# The evidence
filtered_df.show()
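
With the sample data above, only Charlie (age 35) appears in the output: the condition is strictly greater than 30, so Bob (exactly 30) is excluded.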

Scenario: In your next case, you have to filter data using SQL expressions. It’s like interrogating the data directly with SQL.

# Register the DataFrame as a temporary SQL table
df.createOrReplaceTempView("people")

# Filtering data using SQL
filtered_df = spark.sql("SELECT * FROM people WHERE Age > 30")

# The interrogation results
filtered_df.show()
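
If you’d rather skip the temporary view, filter also accepts the same SQL-style condition as a plain string:

# SQL expression string passed directly to the DataFrame API
filtered_df = df.filter("Age > 30")
filtered_df.show()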

Filtering data is a bit like collecting Pokémon cards — you just keep adding more conditions, trying to catch ’em all! 🃏🐉

Scenario: Now, you have a case with two conditions. You need to identify individuals over 30 years old whose names start with the letter “A”.

# Filtering based on two conditions
filtered_df = df.filter((df["Age"] > 30) & (df["Name"].startswith("A")))

# Your findings
filtered_df.show()
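
Note the parentheses around each condition: PySpark overloads Python’s bitwise & (and), | (or), and ~ (not) operators for column expressions, and they bind more tightly than comparisons, so leaving the parentheses out raises an error. With the sample data above, no row satisfies both conditions (Charlie is over 30, but his name starts with “C”), so the result is empty.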

Scenario: Imagine you work for a retail company that’s about to launch its annual Grand Sale. You’ve been tasked with analyzing sales data to determine which products are eligible for a special discount during the sale.
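
The filter below assumes a sales_data DataFrame with units_sold, original_price, and category columns; here is a small, made-up sample so the example can be run end to end:

# Hypothetical sample of the sales data (column names match the filter below)
sales_rows = [
    ("Widget", 150, 60.0, "electronics"),
    ("Gadget", 80, 120.0, "electronics"),
    ("Gizmo", 200, 45.0, "toys"),
    ("Doohickey", 300, 75.0, "excluded"),
]
sales_data = spark.createDataFrame(
    sales_rows, ["product", "units_sold", "original_price", "category"]
)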

# Apply the filters using the 'where' method
filtered_data = sales_data.where(
    (sales_data["units_sold"] >= 100) &
    (sales_data["original_price"] > 50) &
    (sales_data["category"] != "excluded")
)
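
# Products eligible for the Grand Sale discount
filtered_data.show()

A product must clear all three checks at once: at least 100 units sold, an original price above 50, and a category that isn’t excluded.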

Filtering data in PySpark using User-Defined Functions (UDFs) can be particularly powerful when your filtering logic is more complex and can’t be easily expressed with standard DataFrame operations.

Scenario: Imagine you work for a manufacturing company that produces various products. Your role is to ensure product quality and safety. You have a dataset that contains information about the products, including quality scores and safety reports.

Your task is to filter out products with subpar quality scores and safety concerns based on specific conditions:

  1. Products with a quality score less than 6 are considered subpar.
  2. Products with any unresolved safety issues should be flagged.

You decide to use a UDF to create a new column that categorizes products based on the given criteria.

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# Define a UDF to categorize products
def categorize_product(quality, safety_issues):
    if quality < 6 or safety_issues > 0:
        return "Subpar / Unsafe"
    else:
        return "Good"

# Register the UDF (it returns a string label, so the return type is StringType)
categorize_udf = udf(categorize_product, StringType())

# Load the product data into a DataFrame
# (inferSchema keeps quality_score and safety_issues numeric instead of string)
product_data = spark.read.csv("product_data.csv", header=True, inferSchema=True)

# Apply the UDF to label each product
filtered_products = product_data.withColumn(
    "product_status",
    categorize_udf(col("quality_score"), col("safety_issues"))
)

# Show the products' statuses
filtered_products.show()

The UDF lets you apply logic that would be clumsy to express with built-in column expressions, in this case the combined quality and safety rules.
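
The UDF only labels each row; to actually filter, add one more step on the new column (a small follow-up to the sketch above):

# Keep only the products that passed both the quality and safety checks
good_products = filtered_products.filter(col("product_status") == "Good")
good_products.show()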

If you run into trouble with filtering or have questions about it, don’t hesitate to leave a comment below.
