Spark With Google Colab:

Tanveer Khan
Published in AI For Real
3 min read · Sep 5, 2020


We often want to experiment with Spark but get stuck because no Spark environment is available. In this post we will discuss how to set up a Spark environment inside Google Colab with a few lines of code, so that we can start using Spark within a few minutes.

Below are the steps for installing Spark inside Google Colab:

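1. Installing Java is a prerequisite for Spark. We need to have Java installed before setting up Spark in Colab. The -y flag answers the confirmation prompt automatically, since a Colab cell cannot respond to it interactively.

!apt-get install -y openjdk-8-jdk-headless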

2. We will see a message once Java is installed. We can then check the Java version.

!java -version
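The same check can also be done from Python, which is handy when a later cell should fail early if Java is missing. This is a minimal sketch and not part of the original notebook; the helper name `java_version` is hypothetical. Note that Java prints its version report to stderr rather than stdout.

```python
import shutil
import subprocess


def java_version():
    """Return the first line of the Java version report, or None if Java is not on PATH."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` writes its report to stderr, not stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    report = (result.stderr or result.stdout).strip()
    return report.splitlines()[0] if report else None


print(java_version() or "Java not found; run the apt-get step first")
```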

3. Now we need to download the Spark distribution from the Apache website. Other releases are listed in the same Apache archive that the command below downloads from.

!wget https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
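The archive URL follows a regular pattern, so the version numbers can be kept in one place. The sketch below assumes that other releases use the same layout as the 2.4.5 URL above; the helper name `spark_archive_url` is hypothetical, and it is worth confirming that a given Spark and Hadoop combination actually exists in the archive before downloading it.

```python
def spark_archive_url(spark_version="2.4.5", hadoop_version="2.7"):
    """Build the archive URL for a prebuilt Spark distribution."""
    # Assumed naming pattern, taken from the 2.4.5 download URL used in this post.
    name = f"spark-{spark_version}-bin-hadoop{hadoop_version}"
    return f"https://archive.apache.org/dist/spark/spark-{spark_version}/{name}.tgz"


print(spark_archive_url())
```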

4. We need to extract the downloaded Spark distribution.

!tar xvf spark-2.4.5-bin-hadoop2.7.tgz

5. We can check the extracted Spark files as follows:

!ls

!ls spark-2.4.5-bin-hadoop2.7

6. Since Spark is not on the system path, we need some way to detect where it is located. We will use the Python library findspark, which adds Spark to sys.path so that we can use PySpark.

!pip install -q findspark

7. We need to set environment variables for the JDK and Spark path.

import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"
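A wrong path here tends to surface only later as a confusing error when Spark starts, so it can help to verify the directories while setting the variables. The following is a minimal sketch; `configure_spark_env` is a hypothetical helper, and the two paths are the ones used in this post.

```python
import os


def configure_spark_env(java_home, spark_home):
    """Set JAVA_HOME and SPARK_HOME and return the paths that do not exist."""
    os.environ["JAVA_HOME"] = java_home
    os.environ["SPARK_HOME"] = spark_home
    # Report missing directories instead of failing, so the caller decides what to do.
    return [path for path in (java_home, spark_home) if not os.path.isdir(path)]


missing = configure_spark_env(
    "/usr/lib/jvm/java-8-openjdk-amd64",
    "/content/spark-2.4.5-bin-hadoop2.7",
)
if missing:
    print("These paths do not exist yet:", missing)
```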

8. We need to initialize findspark so that it can locate PySpark:

import findspark
findspark.init()

9. We can check the location of the Spark installation that findspark detected:

findspark.find()
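For intuition, what findspark does is essentially put the Python sources bundled with Spark on sys.path. The sketch below approximates that by hand. It is a simplified illustration rather than findspark's actual implementation, `add_pyspark_to_path` is a hypothetical name, and it assumes the usual layout of a Spark distribution, with PySpark under `python/` and a Py4J zip under `python/lib/`.

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home):
    """Prepend Spark's bundled Python packages to sys.path and return the added entries."""
    spark_python = os.path.join(spark_home, "python")
    # The Py4J bridge ships as a versioned zip, so match it with a glob.
    py4j_zips = glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip"))
    if not py4j_zips:
        raise FileNotFoundError(f"No py4j zip found under {spark_python}/lib")
    added = [spark_python, py4j_zips[0]]
    sys.path[:0] = added
    return added
```

Calling `add_pyspark_to_path(os.environ["SPARK_HOME"])` after the environment variables are set would make `import pyspark` resolvable in a similar way, although using findspark remains the simpler option.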

10. Now we can initialize a SparkContext, and Spark is ready to use in Colab:

import pyspark

sc = pyspark.SparkContext(appName="myAppName")

data = [1, 2, 3, 4, 5]

rdd = sc.parallelize(data)

rdd.collect()
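Calling collect() pulls the whole RDD back to the driver, which is fine for five numbers. As a slightly larger illustration, the sketch below chains two transformations before collecting. The function name `squares_of_evens` is hypothetical; it accepts any SparkContext, such as the `sc` created above.

```python
def squares_of_evens(sc, data):
    """Distribute data, keep the even numbers, square them, and collect the result."""
    rdd = sc.parallelize(data)
    # filter and map are lazy; nothing runs until collect() is called.
    return rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x).collect()
```

With the context and list from the step above, `squares_of_evens(sc, data)` should return `[4, 16]`. When finished, `sc.stop()` releases the context so that a new one can be created in the same notebook.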

Congratulations! You have set up Spark in your Colab environment, and it's ready to use. The code explained in this story is available on my GitHub page:

https://github.com/tkhan3/airo/blob/master/Spark_Setup_in_Colab.ipynb


Tanveer Khan
AI For Real

Sr. Data Scientist with strong hands-on experience in building real-world artificial intelligence solutions using NLP, computer vision, and edge devices.