Writing my first PySpark Notebook (and first Medium post)

Below are the steps involved in creating my first PySpark notebook. I will try to elaborate on each step.

Connecting to my data source (s3):

Reading data files: sc.textFile("s3n://<folderpath>")

Reading pickled models: set up the dependencies, then copy the file from S3 to a local path that Python can read
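The two steps above can be sketched roughly as follows. This is only an illustration: it assumes the model was saved with Python's pickle module and that boto3 is used for the copy (the post does not say which tool did the copying). The function names, and the bucket, key, and local path arguments, are hypothetical.

```python
import pickle


def read_lines(sc, s3_uri):
    """Return an RDD with one element per line of the files at s3_uri.

    `sc` is the SparkContext that the notebook provides.
    """
    return sc.textFile(s3_uri)


def copy_model_from_s3(bucket, key, local_path):
    """Copy a pickled model from S3 to the local filesystem.

    Assumption: boto3 is installed and AWS credentials are configured.
    """
    import boto3  # imported here so the rest of the sketch works without it

    boto3.client("s3").download_file(bucket, key, local_path)
    return local_path


def load_model(local_path):
    """Unpickle the model from a local file."""
    with open(local_path, "rb") as f:
        return pickle.load(f)
```

In a notebook cell this would be used along the lines of `lines = read_lines(sc, "s3n://<folderpath>")`, followed by `model = load_model(copy_model_from_s3(bucket, key, local_path))` with your own bucket, key, and path.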

Running the model:

Broadcasting model to workers

Iterating on the data file
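A minimal sketch of broadcasting the model and iterating over the data, assuming each line of the file is a comma-separated row of numeric features and that the model exposes a scikit-learn style predict() method. The helper names are hypothetical, and the line format is a guess, so adjust parse_line to the real data.

```python
def parse_line(line):
    """Turn one text line into a list of floats (assumes comma-separated numbers)."""
    return [float(x) for x in line.split(",")]


def score_partition(lines, model):
    """Score all lines of one partition with a single predict() call."""
    rows = [parse_line(line) for line in lines]
    if not rows:
        return iter([])  # empty partitions are possible
    return iter(model.predict(rows))


def score_rdd(sc, lines_rdd, model):
    """Broadcast the model once, then score each partition on the workers."""
    bc_model = sc.broadcast(model)
    return lines_rdd.mapPartitions(
        lambda lines: score_partition(lines, bc_model.value)
    )
```

Broadcasting ships the model to each worker once instead of once per task, and mapPartitions lets predict() run on a whole batch of rows rather than one row at a time.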


The sklearn version must match between the local Python environment where the model was trained and PySpark

Dependencies (packages) to be installed in PySpark: we can run shell commands by prefixing a cell with %sh, and use that to run pip install

Packages need to be installed on every worker, not just the driver
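One way to check the points above from inside the notebook is sketched below. It assumes you recorded the sklearn version at training time. The function names and the num_tasks parameter are hypothetical. The worker check only samples whichever workers happen to run the tasks, so it is a sanity check rather than a guarantee.

```python
def versions_match(trained_version, runtime_version):
    """True when the two version strings are identical."""
    return trained_version.strip() == runtime_version.strip()


def worker_sklearn_versions(sc, num_tasks=8):
    """Collect the distinct sklearn versions seen by the tasks run on the workers."""

    def _version(_):
        import sklearn  # imported on the worker; raises if the package is missing there

        return sklearn.__version__

    return (
        sc.parallelize(range(num_tasks), num_tasks)
        .map(_version)
        .distinct()
        .collect()
    )
```

If worker_sklearn_versions(sc) returns more than one version, or a version that does not match the training environment, unpickling or predicting may fail or behave differently.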