Efficient Data Processing with PySpark and SparkSQL

Ifeanyichukwu Onyechere
Published in CodeX · Feb 11, 2023

PySpark is the Python API for Apache Spark, an open-source, distributed computing framework for big data processing. One of its powerful features is the ability to run SQL-like queries on large datasets. Spark SQL is the Spark module for structured data processing that lets you query data using SQL syntax, which opens the door for anyone who already knows SQL to apply it in PySpark.

In this article, you will gain a detailed explanation of Spark SQL and how it can be used with PySpark.

PySpark and Spark SQL

PySpark provides an easy-to-use interface to Spark SQL, allowing users to perform complex data processing tasks with a few lines of code. With PySpark, users can create Spark DataFrames, which are similar to pandas DataFrames and can be queried using Spark SQL.

Spark SQL is a module in Apache Spark that provides a programming interface and SQL query engine for data stored in Spark’s supported data sources. It allows users to query data stored in Spark’s Resilient Distributed Datasets (RDDs), Apache Hive, Parquet, JSON, and JDBC data sources.

A PySpark DataFrame needs to be registered before it can be queried with Spark SQL.

When you want to create a table to query with Spark SQL, use the createOrReplaceTempView() method.

This method takes a single string parameter: the desired name of the table. It takes the DataFrame referenced by the Python variable on which the method was called and creates a Spark SQL reference (a temporary view) pointing to that DataFrame.

Once we have registered a table that points to the same DataFrame as our Python variable, we can query that table without any problems.

You can use the spark.sql() method to perform Spark SQL operations. The spark.sql() method takes an SQL query as an argument and returns a DataFrame that represents the result of the query.

Benefits of using SparkSQL

Ease of Use:

Spark SQL provides a simple interface, making it easier for data scientists and engineers to work with large datasets. Users can query data with familiar SQL syntax or through the DataFrame API.

Performance:

Spark SQL takes advantage of Spark’s powerful engine to perform data processing tasks quickly and efficiently.

Support for Different Data Sources:

Spark SQL supports a wide range of data sources, including Apache Hive, Avro, Parquet, ORC, JSON, and JDBC, making it easier to work with various types of data.
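For illustration, here is a minimal sketch that reads two of these formats and exposes them to Spark SQL. The file paths and column names (name, event_type, id, user_id) are hypothetical placeholders, not part of the original article.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataSourcesExample").getOrCreate()

# Hypothetical file paths and column names -- replace with your own data
parquet_df = spark.read.parquet("data/events.parquet")
json_df = spark.read.json("data/users.json")

# Register both DataFrames so they can be joined with plain SQL
parquet_df.createOrReplaceTempView("events")
json_df.createOrReplaceTempView("users")

spark.sql("""
    SELECT u.name, e.event_type
    FROM users u
    JOIN events e ON u.id = e.user_id
""").show()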

Assumptions

This tutorial assumes that you already have a working knowledge of PySpark and SQL, and that you have PySpark installed on your machine.

Using SparkSQL

First, let's start by creating a SparkSession, which is the entry point for PySpark:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Create (or reuse) a SparkSession -- the entry point for PySpark
spark = SparkSession.builder.appName("Implementing SparkSQL in PySpark").getOrCreate()

# Build a small DataFrame to experiment with
data = [("John", 25), ("Jane", 30), ("Jim", 35)]
df = spark.createDataFrame(data, ["Name", "Age"])

Now that you have a Spark DataFrame, you can use Spark SQL to perform some basic operations on the data.

First, register the PySpark DataFrame so that it can be queried with Spark SQL:

# Register the DataFrame
df.createOrReplaceTempView("people")

The table named 'people' is now created, and you can run your SQL queries against it.

You can now use the spark.sql() method to query the data:

# Perform a Spark SQL query on the table
result = spark.sql("SELECT Name, Age FROM people")

# Show the output
result.show()

The SQL statement returns a DataFrame, which means you can still apply your regular PySpark code to the result. The show() method displays the output.
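For example, since the result is an ordinary DataFrame, you can keep chaining regular PySpark transformations onto it. A small sketch reusing the people view from above (the F.upper step is just an illustrative transformation):

# The query result is an ordinary DataFrame, so PySpark methods still apply
adults = spark.sql("SELECT Name, Age FROM people").filter(F.col("Age") >= 30)
adults.select(F.upper(F.col("Name")).alias("name_upper"), "Age").show()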

Running queries in PySpark and SparkSQL

The goal of this section is to run queries on a dataset using Spark SQL and compare them with the equivalent PySpark queries.

The dataset used can be found here

supermarket_sales_df = spark.read.csv("supermarket_sales.csv", header=True, inferSchema=True)

# Show the datatype of each column
supermarket_sales_df.printSchema()

# Convert column names to lowercase and replace spaces with underscores
supermarket_sales_df = supermarket_sales_df.toDF(
    *[column.replace(' ', '_').lower() for column in supermarket_sales_df.columns]
)
supermarket_sales_df.columns

Register the DataFrame:

supermarket_sales_df.createOrReplaceTempView("supermarket_sales")

Now you can run SQL queries with spark.sql().

Compare SparkSQL and PySpark Queries

First, write a simple filter-and-select query in SparkSQL, then express the same logic with PySpark DataFrame methods, as shown below.
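A minimal sketch of both versions, assuming the dataset contains city and total columns (adjust the names to match your schema):

# SparkSQL version: select and filter with plain SQL
# (the column names city and total are assumed; adjust to your schema)
sql_result = spark.sql("""
    SELECT city, total
    FROM supermarket_sales
    WHERE total > 500
""")
sql_result.show()

# PySpark version: the same query expressed with DataFrame methods
pyspark_result = (supermarket_sales_df
                  .select("city", "total")
                  .filter(F.col("total") > 500))
pyspark_result.show()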

Performing a Group By query

Next, perform the same group-by aggregation in both SparkSQL and PySpark, as shown below.
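Again as a sketch with the same assumed column names, here is the group-by in both styles:

# SparkSQL version: total sales per city
# (column names are assumed; adjust to your schema)
sql_grouped = spark.sql("""
    SELECT city, SUM(total) AS total_sales
    FROM supermarket_sales
    GROUP BY city
    ORDER BY total_sales DESC
""")
sql_grouped.show()

# PySpark version: the equivalent aggregation with groupBy/agg
pyspark_grouped = (supermarket_sales_df
                   .groupBy("city")
                   .agg(F.sum("total").alias("total_sales"))
                   .orderBy(F.col("total_sales").desc()))
pyspark_grouped.show()

Both versions return the same DataFrame, so the choice comes down to whichever syntax you find more readable.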

You can run whatever SELECT query you can think of using the spark.sql() method. This makes it easy to bring your existing SQL knowledge to PySpark.

Conclusion

SparkSQL is a powerful feature in Apache Spark that enables users to perform SQL-like operations on large datasets. By combining the strengths of Spark and SQL, SparkSQL offers a powerful tool for large-scale data processing, making it a popular choice for big data applications.

In this article, we have covered the basics of Spark SQL and how it can be used in tandem with PySpark.

Now you can use whichever syntax feels more natural to you to get your job done.
