Getting started with PySpark

M Haseeb Asif
Big Data Processing
3 min read · Sep 8, 2022

Apache Spark is one of the most widely used data processing frameworks in the industry today. It offers many benefits, including unified APIs for batch and stream processing, distributed data processing, fault tolerance, and more. In addition, you can write Spark programs against several APIs, such as RDDs, DataFrames, Datasets, Spark SQL, and PySpark. Today, we will look at PySpark, how to set it up, and how to write your first program.

We can run Spark on a local machine or deploy the infrastructure in the cloud, but my favorite approach is to use Google Colab. It is a powerful managed cloud platform based on Jupyter notebooks. Google manages it, and you can run most testing use cases for free. Moreover, we don’t need to install anything locally. Today, we will install Spark in Colab and then write PySpark code to explore a sample dataset.

Installation

Once you have created a new notebook in Google Colab, you need to install Java, Spark, and the findspark library. Following are the steps to do so:

# install Java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# download Spark
!wget https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
# extract the Spark archive
!tar xf spark-3.3.0-bin-hadoop3.tgz
# set the environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.3.0-bin-hadoop3"

Once the requirements above are installed, we need to install findspark and initialize it. The findspark library locates Spark on the system and makes it importable so you can initialize the Spark session:

# findspark setup
!pip install -q findspark
import findspark
findspark.init()
# Initializing the spark context
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local") \
    .appName("Colab") \
    .config('spark.ui.port', '4050') \
    .getOrCreate()
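
If the session came up correctly, it should report the version of the Spark build downloaded above; a quick check:

# should print the Spark version, e.g. 3.3.0
print(spark.version)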

Now we have all the basics in place to start writing Spark code.

Reading the data

There are multiple ways to read data, such as connecting to Kafka or reading from cloud storage like S3. Here we will use the local file system: we will download a sample file and read it into a DataFrame so we can run different types of transformations and actions on it.

# download the sample data file for experimentation
!wget --continue https://raw.githubusercontent.com/GarvitArya/pyspark-demo/main/sample_books.json -O /tmp/sample_books.json
# read the sample file into a DataFrame
df = spark.read.json("/tmp/sample_books.json")
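
For reference, the same file can also be read through the generic DataFrameReader API, which comes in handy when switching between sources such as JSON, CSV, or Parquet:

# equivalent read of the same sample file using format/load
df = spark.read.format("json").load("/tmp/sample_books.json")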

Data analysis

Now that we have the data in a DataFrame, we can perform different operations on it.
Before doing anything with the data, it is good to understand its structure, rows, and columns. The printSchema method prints the schema of the dataset:

df.printSchema()
Sample Output:
root
|-- author: string (nullable = true)
|-- edition: string (nullable = true)
|-- price: double (nullable = true)
|-- title: string (nullable = true)
|-- year_written: long (nullable = true)

If we want to look at a sample row, the head method can be used:

df.head()
Sample Output:
Row(author='Austen, Jane', edition='Penguin', price=18.2, title='Northanger Abbey', year_written=1814)

Similarly, the show method is used to display the data. It has a parameter that lets you control the number of rows to be displayed.

df.show(4)
Sample Output:
+----------------+---------------+-----+--------------+------------+
|author |edition |price|title |year_written|
+----------------+---------------+-----+--------------+------------+
|Tolstoy, Leo |Penguin |12.7 |War and Peace |1865 |
|Tolstoy, Leo |Penguin |13.5 |Anna Karenina |1875 |
|Woolf, Virginia |Harcourt Brace |25.0 |Mrs. Dalloway |1925 |
|Dickens, Charles|Random House |5.75 |Bleak House |1870 |
+----------------+---------------+-----+--------------+------------+

Some more common methods are as follows:

df.columns
Output:
['author', 'edition', 'price', 'title', 'year_written']
df.count()
Output:
13
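
For numeric columns, the standard describe method gives a quick statistical summary; for example, for the price column:

# summary statistics (count, mean, stddev, min, max) for the price column
df.describe("price").show()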

Select & Filter

One of SQL’s basic constructs is the selection and filtering of data. PySpark offers similar semantics to select and filter columns:

# Show the title and price column only
df.select("title", "price").show()
# filter the books with price more than 20
df.filter("price > 20").show()
# get the names of the book with price more than 20
df.filter("price > 20").select("title","price").show()

You can play around with the data and experiment by running different queries. Let’s see if you can answer these questions from the data (a sketch of one possible approach follows the list):

  • Find the latest year in which a book was written
  • Count the number of books each author has written
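
If you get stuck, here is a minimal sketch using groupBy and agg; the column names (year_written, author, title) come from the schema printed earlier:

from pyspark.sql.functions import max as max_, count
# latest year in which any book was written
df.agg(max_("year_written").alias("max_year")).show()
# number of books written by each author
df.groupBy("author").agg(count("title").alias("num_books")).show()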

