PySpark with Google Colab
A Beginner’s Guide to PySpark
Apache Spark is a fast, general-purpose framework for processing large-scale data sets. It can distribute data processing tasks across multiple machines, either on its own or in combination with other distributed computing tools.
Apache Spark stands out for the following features:
- Speed — Run workloads 100x faster.
- Ease of Use — Offers APIs in several programming languages, such as Java, Scala, Python, and R.
- Generality — Combine SQL, streaming, and complex analytics.
- Runs Everywhere — Runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud.
Google Colaboratory or Colab allows anyone to write and execute arbitrary Python code through the browser and is especially well suited to machine learning, data analysis as well as educational purposes. Technically, Colab is a Jupyter notebook service hosted by Google that requires no setup to use, while providing free access to computing resources including GPUs.
Setting up Apache Spark on a local machine can be troublesome and error-prone for several reasons. As a solution, this article shows you how to use PySpark (the Python API for Apache Spark) with Google Colab, which is completely free.
Hands-On…!
Step 01: Getting started with Google Colab
You can go to Google Colab from here. The following shows the initial window you see when you open Colab. Select a new notebook to get started.
Step 02: Connecting Drive to Colab
When working with Google Colab and PySpark, the first step is to mount your Google Drive. This lets you access any directory on your Drive from inside the Colab notebook. Although this step is optional, it is helpful when you need to access files stored on your Drive.
from google.colab import drive
drive.mount('/content/drive')
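After mounting, your Drive contents appear under /content/drive. A quick way to confirm the mount worked is to list a directory; a minimal sketch, assuming the default "MyDrive" folder name that Colab uses for "My Drive":

```python
from pathlib import Path

# Default mount point for the "My Drive" folder after drive.mount()
drive_root = Path('/content/drive/MyDrive')

# List the top-level entries if the mount succeeded; before mounting
# (or outside Colab) the path simply does not exist yet
entries = [p.name for p in drive_root.iterdir()] if drive_root.exists() else []
print(entries)
```

If the list is empty even though your Drive has files, re-run the mount cell and complete the authorization prompt.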
Step 03: Setting up PySpark in Colab
Installing PySpark in Colab is much simpler than on your local machine: a single command installs it for you.
!pip install pyspark
Step 04: Initialize PySpark Session
Now everything is set for PySpark. Next, initialize a Spark session before you start coding.
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .master("local")\
    .appName("Colab")\
    .config('spark.ui.port', '4050')\
    .getOrCreate()
Finally, print the SparkSession variable as follows.
spark
If there is no error, you will see the following output.
Step 05: Loading data into PySpark
In PySpark we deal with large-scale datasets, so loading data is an important first step. The following command shows how to load data into PySpark. Here we use a simple dataset containing customer data. We pass two arguments to read.csv(): the path to our CSV file, and header=True so that the first row of the file is treated as the column header.
df = spark.read.csv('/content/Mall_Customers.csv', header=True)
Step 06: Data Exploration with PySpark DF
After loading data, we can perform several tasks related to our dataset. Let’s explore a few of them.
- Display data - The show() operator displays the dataset as follows.
df.show(10)
- Drop null values - If the dataset contains any null values, remove them.
df = df.na.drop()
- Display specific columns only
df.select("Gender","Age").show(5)
- Display count
df.count()
Pretty cool right?
Now you are familiar with the basic steps needed to start your journey with PySpark on Google Colab. Hopefully, we will meet again to explore more interesting areas of PySpark with Google Colab.
Practice with PySpark and Google Colab to make your work easier. You can also get the source code from here for better practice.
Thank you for reading this far, and I hope you learned something. If you enjoyed my article, make sure to hit the clap button.
Happy coding!