PySpark with Google Colab
A Beginner’s Guide to PySpark
Apache Spark is a fast, general-purpose framework for processing large-scale data sets. It can distribute data processing tasks across multiple machines, either on its own or in combination with other distributed computing tools.
Apache Spark stands out for the following features:
- Speed — Run workloads 100x faster.
- Ease of Use — Offers APIs in several programming languages, such as Java, Scala, Python, and R.
- Generality — Combine SQL, streaming, and complex analytics.
- Runs Everywhere — Runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud.
Google Colaboratory or Colab allows anyone to write and execute arbitrary Python code through the browser and is especially well suited to machine learning, data analysis as well as educational purposes. Technically, Colab is a Jupyter notebook service hosted by Google that requires no setup to use, while providing free access to computing resources including GPUs.
Setting up Apache Spark on a local machine can be troublesome and error-prone for several reasons. As a solution, this article shows you how to use PySpark (the Python API for Apache Spark) with Google Colab, which is completely free.
Hands-On…!
Step 01: Getting started with Google Colab
You can go to Google Colab from here. The following shows the initial window you see when you open Colab. Select a new notebook to get started.
Step 02: Connecting Drive to Colab
When working with Google Colab and PySpark, the first step is to mount your Google Drive. This lets you access any directory on your Drive from inside the Colab notebook. Although this step is optional, it is helpful when you need to access files stored on your Drive.
from google.colab import drive
drive.mount('/content/drive')
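After mounting, your Drive contents appear under /content/drive. A quick way to confirm the mount worked is to list a directory; a minimal sketch, assuming the default "MyDrive" folder name that Colab uses for "My Drive":

```python
from pathlib import Path

# Default mount point for the "My Drive" folder after drive.mount()
drive_root = Path('/content/drive/MyDrive')

# List the top-level entries if the mount succeeded; before mounting
# (or outside Colab) the path simply does not exist yet
entries = [p.name for p in drive_root.iterdir()] if drive_root.exists() else []
print(entries)
```

If the list is empty even though your Drive has files, re-run the mount cell and complete the authorization prompt.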
Step 03: Setting up PySpark in Colab
Installing PySpark in Colab is much simpler than on your local machine: a single command installs it for you.
!pip install pyspark
Step 04: Initialize PySpark Session
Now everything is set for PySpark. Next, initialize a Spark session before you start coding.
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .master("local")\
    .appName("Colab")\
    .config('spark.ui.port', '4050')\
    .getOrCreate()
Finally, print the SparkSession variable as follows.
spark
If there is no error, you will see the following output.
Step 05: Loading data into PySpark
In PySpark we deal with large-scale datasets, so loading data is an important first step. The following command shows how to load data into PySpark. Here we use a simple dataset containing customer data. We pass two arguments to read.csv(): the path to our CSV file, and header=True so that the first row of the file is treated as the column header.
df = spark.read.csv('/content/Mall_Customers.csv', header=True)
Step 06: Data Exploration with PySpark DF
After loading data, we can perform several tasks related to our dataset. Let’s explore a few of them.
- Display data - The show() operator displays the dataset as follows.
df.show(10)
- Drop null values - If the dataset contains any null values, remove them.
df = df.na.drop()
- Display specific columns only
df.select("Gender","Age").show(5)
- Display count
df.count()
Pretty cool right?
Now you are familiar with the basic steps needed to start your journey with PySpark on Google Colab. Hopefully, we will meet again to explore more interesting areas of PySpark with Google Colab.
Practice with PySpark and Google Colab to make your work easier. You can also get the source code from here for better practice.
Thank you for reading this far, and I hope you learned something. If you enjoyed my article, make sure to hit the clap button.
Happy coding!