Apache Spark (PySpark) in Google Colaboratory in 3 Steps

Sushant Gautam
Sep 29, 2018 · 2 min read

Apache Spark is the leading platform for large-scale SQL, batch processing, stream processing, and machine learning.

Spark can be installed locally, but there is also the option of Google Colaboratory, which runs on free hosted runtimes (the same ones that offer a free Tesla K80 GPU), where you can use Apache Spark to learn. Choosing Colab is a really easy way to get familiar with Spark without installing it locally or setting up and maintaining an EC2 instance.

There are 3 steps altogether:

  1. Install Java and Spark
  2. Set Environment Variables
  3. Start Spark Session

First, install Java and Spark and prepare to run a local Spark session by running this in a Google Colab cell:

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

This installs Apache Spark 2.4.4, Java 8, and Findspark, a library that makes it easy for Python to find Spark.

Second, set the locations where Spark and Java are installed, so that Colab knows where to find them:

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

Finally, start the Spark session and use it! Findspark uses the SPARK_HOME variable set above to locate the installation, and SparkSession.builder starts a local session:

import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

That's all there is to it: you're ready to use Spark!

df = spark.createDataFrame([{"hello": "world"} for x in range(1000)])

By the end you have your own Spark setup in Colab, for free, saved in your own Google Drive.

Let's manipulate some of the predefined Colab sample data:

file_loc = './sample_data/california_housing_train.csv'
df_spark = spark.read.csv(file_loc, inferSchema=True, header=True)
type(df_spark)  # <class 'pyspark.sql.dataframe.DataFrame'>

df_spark.printSchema()  # print the detailed schema of the data
df_spark.show()  # show the top 20 rows
# Do more operations on it.

Visit this tutorial on GitHub, or try it in Google Colab to get started.

If you like it, please share and clap; that gives me more energy to write. Stay tuned for the next post on detailed data analysis with PySpark on Colab.

