Getting Started with Spark 3.0.0 in Google Colab

MA Raza, Ph.D. · Published in Analytics Vidhya · Jun 13, 2020 · 3 min read

Cover image designed by MA Raza using www.canva.com and images from Google Images

Apache Spark is a lightning-fast cluster computing system that overcomes the limitations of the once-dominant MapReduce model for large data sets. It is the framework of choice for data scientists and machine learning engineers working on big data problems. The Spark engine is written in Scala, which is considered the language of choice for scalable computing, but Apache Spark also provides high-level APIs in Java, Scala, Python, and R. As a data scientist, pyspark is my preferred API for leveraging Spark's parallel and distributed processing. That may be a biased opinion; choose the API that best suits your application.

According to Apache Spark Introduction for Beginners:

Spark started in 2009 as a sub-project of Hadoop, created in UC Berkeley's AMPLab by Matei Zaharia. It was open-sourced in 2010 under a BSD license, donated to the Apache Software Foundation in 2013, and became a top-level Apache project in February 2014. The results since then speak for themselves.

This year, Spark celebrates its 10th anniversary as an open-source project. Over those 10 years, Spark has become the de facto choice of parallel and distributed computing framework for big data processing.

Spark 3.0.0 was released on 18 June 2020, after the release vote passed on 10 June 2020. A preview of Spark 3.0.0 had been available since late 2019.

According to the release announcement, Spark 3.0 is roughly two times faster than Spark 2.4 on the TPC-DS benchmark.

For data scientists and machine learning engineers, pyspark and MLlib are the two most important modules shipped with Apache Spark, and Spark 3.0.0 brings a significant number of new features to both. Read the release notes excerpt below to find out more.

From the release notes:

Python is now the most widely used language on Spark. PySpark has more than 5 million monthly downloads on PyPI, the Python Package Index. This release improves its functionalities and usability, including the pandas UDF API redesign with Python type hints, new pandas UDF types, and more Pythonic error handling.
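To make the pandas UDF redesign concrete, below is a minimal sketch (my addition, not from the release notes) of a Series-to-Series pandas UDF declared with Python type hints, the style introduced in Spark 3.0. It assumes pyarrow is available (Colab ships with it) and uses the SparkSession named spark that we create later in this article.

import pandas as pd
from pyspark.sql.functions import pandas_udf

# Spark 3.0 infers the UDF type (Series-to-Series) from the type hints
@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

# usage, once the SparkSession `spark` from the setup below exists
spark.range(3).select(plus_one("id")).show()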

Setting up Spark is often considered a complex and time-consuming step for researchers with limited experience of JVM-based systems. In this article, I walk you through setting up Apache Spark 3.0.0 on Google Colab for a quick start.

Install Apache Spark 3.0.0

Open a Google Colab notebook and use the commands below to install Java 8, download and extract Apache Spark 3.0.0, and install findspark. It should not take more than a few minutes, depending on your connection speed.

# Run the commands below in a Google Colab cell
# install Java 8
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# download Spark 3.0.0
!wget -q http://apache.osuosl.org/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
# extract it
!tar xf spark-3.0.0-bin-hadoop3.2.tgz
# install findspark
!pip install -q findspark
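Mirrors such as apache.osuosl.org only host recent releases, so the download link above may eventually go stale. If it does, the permanent Apache archive keeps every release; the command below (an alternative, not part of the original setup) fetches the same tarball from there.

# fallback: fetch the same release from the permanent Apache archive
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz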

Set Environment Variables

Once the above commands have been executed, it is time to add the relevant paths to the environment. You can manage multiple versions of Spark on one machine by pointing these environment variables at the version you want. Run the commands below to point to the Apache Spark 3.0.0 installation downloaded earlier.

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"

Quick Installation Test

Now it is time to test the Spark installation and check its version. Both Spark and pyspark should report version 3.0.0.

import findspark
findspark.init()  # makes pyspark importable by reading SPARK_HOME
from pyspark.sql import Row, SparkSession
# start a local SparkSession using all available cores
spark = SparkSession.builder.master("local[*]").getOrCreate()
# test Spark by building and displaying a small DataFrame
# (Row is used because inferring schemas from dicts is deprecated in 3.0)
df = spark.createDataFrame([Row(hello="world") for x in range(1000)])
df.show(3, False)
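Beyond show(), a couple of trivial jobs (my addition to the original test) confirm that execution works end to end on the local cluster.

# run trivial jobs to confirm end-to-end execution
print(df.count())                   # expected: 1000
df.groupBy("hello").count().show()  # expected: one row ("world", 1000)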

To check the pyspark version, use the commands below. It is good practice to always log library versions when running applications.

# Check the pyspark version
import pyspark
print(pyspark.__version__)
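The SparkSession reports the JVM-side Spark version directly, and it should match the pyspark package version.

# the JVM-side Spark version should match the pyspark version
print(spark.version)  # expected: 3.0.0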

Working Google Colab Notebook

I have also created a working Google Colab notebook; it can be found below.


Conclusions

In this brief article, we learned how to set up Spark 3.0.0 in Google Colab in just a few minutes.

References and Further Reading

  1. http://apache.osuosl.org/spark/spark-3.0.0-preview2/
  2. https://medium.com/@sushantgautam_930/apache-spark-in-google-collaboratory-in-3-steps-e0acbba654e6
  3. https://notebooks.gesis.org/binder/jupyter/user/databricks-koalas-kuv5qckt/notebooks/docs/source/getting_started/10min.ipynb
  4. https://towardsdatascience.com/introduction-to-apache-spark-207a479c3001
  5. https://spark.apache.org/
  6. https://medium.com/@amjadraza24/spark-ifying-pandas-databricks-koalas-with-google-colab-93028890db5
