A Gentle Introduction to PySpark

Learn the Analytics Engine used by Facebook, Netflix, and other Tech Giants

Greg Hogg
8 min read · Dec 28, 2021
Source: https://databricks.com/glossary/pyspark

It was May of 2018, and what happened next was terrifying.

Just moments ago, I had confidently opened up a new Notebook in Colab, excited to perform analytics using my favorite programming language.

A simple import pandas as pd and pd.read_csv(...) should’ve been all I needed to get going…

… But I couldn’t believe what I saw instead:

It happened — I had run out of memory.
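
For the curious, the doomed attempt looked something like this (the file name is just a stand-in; any CSV larger than the machine’s available RAM hits the same wall, because pandas loads the whole file into memory at once):

import pandas as pd

# pandas reads the entire CSV into memory in one go,
# so a file bigger than the available RAM crashes the session
df = pd.read_csv("huge_dataset.csv")  # hypothetical multi-gigabyte file
print(df.shape)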

Trying desperately to compose myself, I recalled a quote from one of my professors at the University of Waterloo:

“Most companies have data too large to store in 16GB of RAM. Therefore, they depend on Apache Spark to run the analytics that drive their success.

Even if they could use another tool, Spark is significantly faster at manipulating Big Data, and if you too adore Python, PySpark is the best option.”

It appeared I could put this off no longer: if I were to progress as a data scientist, I had no choice but to learn PySpark.

And so I did. Today, I share a condensed version of what I learned.

Note 1: PySpark is the Python API to…
