Learning PySpark with Google Colab
TL;DR PySpark on Google Colab is an efficient way to manipulate and explore data, and a good fit for a group of AI learners.
Learning Apache Spark quickly is challenging. Discover distributed computation and machine learning with PySpark through several tutorials, up to building your own movie recommendation engine.
Links: GitHub | Tutorials quick start | Dataset
Let’s discover how to use PySpark on Google Colab with accessible tutorials.
As a teaching fellow with David Diebold for the course Systems, Paradigms, and Algorithms for Big Data in the international Master IASD (graduate degree M2) at Université Paris Dauphine, a member of PSL University, I needed to organize tutorial sessions for the students on distributed computation with Apache Spark.
Fast, flexible, and developer-friendly, this distributed data-processing framework has become one of the world's most significant. Before teaching the features provided by Spark, we had to choose which language and platform our learners would use to run the tutorials we prepared. We chose this tech stack: PySpark on Google Colab.
PySpark vs. Spark
PySpark allows interaction with Spark in Python. It offers a gentler learning curve than Spark in its original Scala. Even though it is less performant for production workloads, since it uses Py4J to interact with Spark's JVM, it gives sufficient performance (sometimes close to Spark with Java/Scala) to experiment with distributed data science and machine learning.
Python remains by far the preferred language for notebooks.
PySpark support is already excellent and keeps improving with upcoming Spark 3+ releases.
Python is easier to learn than Scala and has a mature ecosystem for applied mathematics.
For these reasons, we preferred teaching PySpark over Scala Spark. Moving from one to the other is easy; the only cost for learners is becoming familiar with the Scala/Java ecosystem for advanced use.
Google Colab vs. other notebook platforms
For coding and runtime, Google Colab provides a development environment for PySpark, like Databricks Community Edition or JupyterLab. All these solutions are free, for education and beyond. You manipulate notebooks that interact with a Spark cluster through a REPL, and they remain compatible with the Jupyter Notebook format .ipynb.
We have tried and evaluated all these solutions:
- Databricks Community Edition is an online platform with similar features, but it sometimes lacks stability.
- Running JupyterLab in a Docker container or on bare metal takes a lot of time and complications to set up a working environment, because it depends on each student’s workstation.
Here are the interesting features of the Google Colab platform:
- 💸 Free: Access to an online IDE and Runtime environment with a Google account
- 0️⃣ Zero-configuration: Upload or create a Jupyter Notebook quickly online with a web browser
- 📂 Iso-installation: All users get the same independent Python working environment on a Linux container, with the same GPU, CPU, disk, and RAM resources … very handy to reproduce issues.
- 😻 Easy handling: Intuitive web UI with auto-completion, documentation, shortcuts …
- 💌 Easy to share: Share content online directly with an audience, or export your work to other platforms thanks to the portable .ipynb format
For these reasons, we preferred using Google Colab this time. It works well.
PySpark on Google Colab
Let’s focus on the Google Colab platform with PySpark. The environment setup is light; tools are pre-installed, and resources are allocated. Every learner has the same ecosystem.
See an example of a PySpark Notebook on Google Colab in this tutorial to discover the RDD API.
You only need to follow these steps to install PySpark and build a Spark session/context:
1. Install the latest release of Spark pre-built for Hadoop, along with the related JDK (YARN and HDFS are only partially used in the local setup). Then configure the Java and Spark environment variables required by PySpark, which uses Py4J and the Spark JVM under the hood:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!curl -O https://dlcdn.apache.org/spark/spark-3.2.3/spark-3.2.3-bin-hadoop3.2.tgz
!tar xf spark-3.2.3-bin-hadoop3.2.tgz
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.3-bin-hadoop3.2"
PySpark isn’t on sys.path by default, but that doesn’t mean it can’t be used as a regular library. Install and import findspark to resolve it:
!pip install -q findspark
import findspark
findspark.init()
Or you can install PySpark directly with the following command:
!pip install pyspark==3.2.3
2. Then build your PySpark session/context:
import pyspark
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
conf = SparkConf().set('spark.ui.port', '4050')
spark = SparkSession.builder.config(conf=conf)\
.master('local[*]')\
.getOrCreate()
sc = spark.sparkContext
spark
Evaluating the spark object in a cell displays the session summary:
SparkSession — in-memory
SparkContext
Spark UI
Version: v3.2.3
Master: local[*]
AppName: pyspark-shell
Note: The Spark UI port is set to 4050 to avoid a conflict: the Spark UI and the Ngrok web interface both use port 4040 by default. The Spark UI is not directly accessible from Google Colab; see the following sections.
We use only local mode, with all cores of the container, to submit the PySpark application. That is to say, Spark runs in-process (without a resource manager), with the Spark driver and a single Spark executor in the same local JVM, to make the most of the single container allocated by Google Colab and to keep things simple. The PySpark application is therefore deployed in client mode (required by local mode and by the interactive mode of a PySpark notebook) and configured to read and write on the local file system of the container.
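You can verify this deployment from the notebook itself (a quick sketch using only the session built above; the printed values are the expected ones for this setup):
print(sc.master)                                            # local[*]: in-process cluster using all container cores
print(sc.defaultParallelism)                                # parallelism available to the single local executor
print(spark.conf.get("spark.submit.deployMode", "client"))  # client mode, required by local mode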
The limitations of PySpark with Google Colab are:
- 🔺The accessibility of the Spark UI in the web browser is challenging because the container (where the Spark cluster runs) is in a private network. A workaround is to build a secure local tunnel with remote port forwarding (with Ngrok); see the sketch after this list.
- 🔺Using multiple notebooks in the same session is painful. Distinct notebooks are isolated from each other in Google Colab, and this isolation isn’t configurable. When a notebook is opened, a new Linux container is allocated, so you need to build libraries or scripts to share data, code, or installed packages.
- 🔺The resources are limited. The free tier of Google Colab provides a single Linux container with limited resources (RAM: 12 GiB, cores: 2, disk: 100 GiB …) and a limited number of active sessions. We don’t have the opportunity to use a larger Spark cluster with many big executors to see issues at a large scale.
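Here is a minimal sketch of the Ngrok workaround mentioned for the Spark UI, assuming the pyngrok package and an Ngrok account; the auth token below is a placeholder, not a value from the tutorials:
!pip install -q pyngrok
from pyngrok import ngrok

ngrok.set_auth_token("YOUR_NGROK_AUTH_TOKEN")   # placeholder: use your own token
tunnel = ngrok.connect(4050, "http")            # forward the Spark UI port configured above
print(tunnel.public_url)                        # open this URL to reach the Spark UI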
Courses, tutorials, and a project for teaching
I co-organized 4 sessions introducing Apache Spark to a class of ~30 students of the international Master IASD (graduate degree M2), in hybrid mode (in-person and virtual); you can reuse the material freely.
Note: As the Dataset API relies on compile-time type safety, it is only supported in compiled languages (Java/Scala), not in interpreted ones (Python/R). So the Dataset API is not available in PySpark.
The tutorial goals are to:
- 💻 Discover the PySpark API: learn the partitioned data structures RDD and DataFrame, and the framework’s features for distributed computation (execution plan, tasks/stages/jobs, DAG, …) provided by PySpark through the Spark Core > RDD API and the Spark SQL > DataFrame API
- 📰 Read official resources: PyDoc, source code, documentation, and other material provided by PySpark, before searching on web search engines or other Q&A platforms.
- 📊 Have notions of orders of magnitude: knowing whether the memory and execution complexity scales to larger data
The public and accessible MovieLens dataset is explored and manipulated with exercises.
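For instance, the ratings file can be loaded and aggregated with both APIs; a minimal sketch, assuming the ml-latest-small archive has been downloaded and extracted into the Colab working directory (the path is an assumption, not necessarily the one used in the tutorials):
# DataFrame API (Spark SQL): load the ratings and compute the average rating per movie.
ratings_df = spark.read.csv("ml-latest-small/ratings.csv", header=True, inferSchema=True)
avg_df = ratings_df.groupBy("movieId").avg("rating")
avg_df.show(5)
avg_df.explain()                                  # inspect the execution plan

# RDD API (Spark Core): the same aggregation expressed with key/value pairs.
avg_rdd = (ratings_df.rdd
           .map(lambda row: (row["movieId"], (row["rating"], 1)))
           .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
           .mapValues(lambda s: s[0] / s[1]))
print(avg_rdd.take(5))
print(avg_rdd.toDebugString().decode("utf-8"))    # inspect the RDD lineage (DAG)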
Here is our agenda for the first 2 sessions on the PySpark API:
- 📚 Give lessons with slides and video projector/screen sharing (20min)
- ✉️ Send the slides and tutorials on a private Slack channel
- ✅ Send one message per question and ask students to mark resolved questions with a reaction, to follow each student’s progress, or to write their own questions inside the thread of each question
- ❔/💣 Resolve questions or blocking points with each student individually, or collectively when they are repetitive: orally in a synchronous way, and in writing asynchronously in the Slack channel
At the beginning of each following session, we sent the solutions of the previous one and gave feedback on the errors we had seen, along with tips on using PySpark on Google Colab (10 min).
At the end of these tutorials, we prepared a quick survey for students with Google Forms to improve the next iterations of this course. Here are the results, based on the answers received:
- 🙌 Answers: 22/28
- 📈 Average Ratings: 4.6/5 (min: 3/5 max: 5/5)
- NPS: 31.9% (Detractors: 13.6%, Passives: 40.9%, Promoters: 45.5%)
- 📎 Summary:
The students find the tutorials well organized, with a medium or easy level of difficulty, and helpful but long to finish. According to them, the teachers are available. They wish for more theory and examples in the slides, while finding the content well integrated into the course. They are confident enough to apply what they have learned. Furthermore, they suggest that the questions be explained, with a maximum amount of time for each, and that an interactive answer be given once that limit is exceeded.
The last 2 sessions were dedicated to a graded project in groups of 2 or 3 students, started during these sessions and finalized afterwards, with a deliverable due 2 weeks later. We only guided the students, without giving solutions. For these virtual sessions, we used Zoom with breakout rooms to split the class into groups. The students had to submit a draft at the end of each session, so we could give feedback before the final submission and grading.
The grade performance on the project is heterogeneous but without outliers, with a standard deviation (STD) of 2.3/20 and an interquartile range (IQR) of 3/20. The grades range from 10/20 to 17/20. The excellent aspects are that:
- No group scored under 10 out of 20, so all succeeded in the project to some degree.
- The general grade level is good, with an average of 13.8/20 and a median of 14.5/20.
The project is split into 4 parts containing several questions each.
During the correction of each submitted question, we evaluated the quality of the answer with a success rate; the awarded points, that is to say the score, were computed as follows:
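A plausible formulation, assuming each score is simply the question’s maximum points weighted by the success rate (this reconstruction is an assumption, not the exact grading rule):
score(question) = success_rate(question) × max_points(question)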
If I analyze the success rate by parts:
As expected, the parts are ordered by difficulty, and grade performance becomes more heterogeneous as the difficulty increases. Indeed, the average success rate decreases while the IQR and STD increase from part A to part D.
With this method, I can estimate (a posteriori) the difficulty level of each question and how discriminant it is:
The level of difficulty of a question can be interpreted from the average of its success rates:
- 0–25%: difficult
- 25–75%: medium
- 75–100%: easy
A question with an IQR or an STD of success rate above 50% can be viewed as discriminant.
For example, question CPLSI1.8 is medium and discriminant, while question CPLSI2.3 is difficult and non-discriminant.
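A minimal sketch of how this a-posteriori analysis could be computed, assuming the per-group success rates of each question are gathered in a dictionary (the IDs and values below are illustrative, not the real grading data):
import statistics

# Success rate of each group for two questions (illustrative values).
success_rates = {
    "CPLSI1.8": [0.2, 0.9, 0.5, 0.8, 0.4],
    "CPLSI2.3": [0.0, 0.1, 0.2, 0.1, 0.0],
}

for question, rates in success_rates.items():
    avg = statistics.mean(rates)
    std = statistics.pstdev(rates)
    q1, _, q3 = statistics.quantiles(rates, n=4)
    iqr = q3 - q1
    # Difficulty from the average success rate, discriminance from its dispersion.
    difficulty = "difficult" if avg < 0.25 else "medium" if avg < 0.75 else "easy"
    discriminant = std > 0.5 or iqr > 0.5
    print(question, difficulty, "discriminant" if discriminant else "non-discriminant")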
After these sessions and the correction of each project, the most common mistakes of our students were in understanding the following points:
- Laziness of the Spark framework: the exception raised is not necessarily caused by the last executed cell, since transformations are only evaluated when an action triggers them (see the sketch after this list)
- When to use the distributed structures provided by Spark (RDD and DataFrame) versus other data structures, parallelized or not
- How to use the PySpark API directly instead of re-implementing it with lower-level PySpark functions (or outside PySpark)
- How to know whether a Spark implementation is efficient and effective: estimating the complexity and resource usage, and whether data and operations are well distributed
- How to interpret an ambiguous question and take a critical look at the answer given
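A minimal sketch illustrating the first point about laziness (the data is made up for the example):
# Transformations are only recorded; the division by zero below does not fail yet.
numbers = sc.parallelize([1, 2, 0, 4])
inverted = numbers.map(lambda x: 1 / x)   # transformation: nothing is executed here

# The failure only surfaces when an action triggers the computation, so the cell
# raising the exception is not necessarily the one containing the bug.
try:
    inverted.collect()                    # action: the ZeroDivisionError appears now
except Exception as error:
    print(type(error).__name__)           # the exact wrapper type depends on the Spark version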
Go further
During these PySpark tutorials, we presented the RDD API and the DataFrame API, but there are many other interesting features to explore.
Soon, you could also use Spark Connect for the setup, to run Spark from anywhere.
References
If you want to learn Spark/PySpark, read Learning Spark, 2nd edition, and see Spark examples.
If you have questions about Google Colaboratory: go to the FAQ.
The content of these tutorials is accessible on GitHub without the solutions: go to criteo-research/master-iasd. We keep the solutions private; ask me if you want them.