Albert Franzi
May 16, 2019

Install PySpark + Jupyter + Spark

This article summarizes how to install PySpark + Jupyter locally so you can start working, playing, and learning with Spark. The content comes from a subsection of one of my previous articles, Empowering Spark with MlFlow.

Mar Bella Beach, Barcelona. Photo by Albert Franzi.

Setting up PyEnv

We first need to set up a Python environment for PySpark; for that we use PyEnv (ref: Python in Mac). This gives us a virtual environment where we can install all the libraries required to run it.
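
A minimal sketch of that setup on macOS with Homebrew and the pyenv-virtualenv plugin; the Python version and environment name are just examples:

```bash
# Install pyenv plus the virtualenv plugin (assumes Homebrew and that the
# pyenv shell hooks are loaded in your shell profile).
brew install pyenv pyenv-virtualenv

# Install a Python version and create a dedicated environment for Spark work.
pyenv install 3.7.3
pyenv virtualenv 3.7.3 pyspark-playground
pyenv activate pyspark-playground
```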

Install PySpark + Jupyter + Spark

Ref: Get started PySpark — Jupyter

The notebook-dir option is quite useful if we want to keep a proper structure for our notebooks. We can even use our own GitHub repo as the notebook dir to track our changes.
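
A sketch of the install and of wiring Jupyter up as the PySpark driver; the notebook directory path is just an example (it could be a cloned GitHub repo):

```bash
# Install PySpark and Jupyter inside the active environment.
pip install pyspark jupyter

# Use Jupyter as the PySpark driver and point the notebooks at a
# directory we control (for instance, a local clone of a GitHub repo).
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --notebook-dir=$HOME/spark-notebooks"
```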

Launch Jupyter from PySpark

Since we configured Jupyter as the PySpark driver, we can now launch Jupyter with a PySpark context already attached to our notebooks.
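
With those environment variables exported, launching comes down to running the pyspark command; a minimal sketch of what to expect:

```bash
# Because Jupyter is set as the driver, this opens the notebook UI instead
# of the plain PySpark shell; new notebooks should come with a SparkSession
# (`spark`) and a SparkContext (`sc`) already defined.
pyspark
```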

PySpark dummy example from the PySpark documentation: createDataFrame.

Scala Spark

Once we have our PySpark notebooks working, we can move on to setting up our Scala Spark kernel. In this case, we will install the Toree kernel into our existing Jupyter.
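
A sketch of that installation, assuming SPARK_HOME already points at your Spark distribution:

```bash
# Install Apache Toree and register its Scala kernel in Jupyter.
pip install toree
jupyter toree install --spark_home=$SPARK_HOME --interpreters=Scala

# Check that the new kernel shows up alongside the Python one.
jupyter kernelspec list
```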

You can find the list of all the available Jupyter kernels on GitHub: Jupyter-kernels.

Scala Spark Dummy Example

Spark with Hadoop 2.8.5

It’s possible that you need to use Spark with Hadoop 2.8.5 (or newer), since the current Spark binaries ship with Hadoop 2.7.3.

If that’s the case, then you will need to set up Hadoop separately and use the “Spark without Hadoop” build on top of it.
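
A sketch of that split setup, assuming the Hadoop 2.8.5 and “Spark without Hadoop” tarballs are already unpacked; the paths and versions are just examples:

```bash
# Point the shell at the standalone Hadoop installation.
export HADOOP_HOME=/opt/hadoop-2.8.5
export PATH="$HADOOP_HOME/bin:$PATH"

# Tell the Hadoop-free Spark build where to find the Hadoop jars
# (typically set in conf/spark-env.sh of the Spark distribution).
export SPARK_HOME=/opt/spark-2.4.3-bin-without-hadoop
export SPARK_DIST_CLASSPATH="$(hadoop classpath)"
```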

Besides, we had some trouble with S3, since we need to assume a role to access S3 buckets from different accounts. For this reason, we need to use the org.apache.hadoop:hadoop-aws:2.8.5 package and define the TemporaryAWSCredentialsProvider as the credentials provider.
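
A sketch of how that can look when launching PySpark; the credential values are placeholders that would come from your assume-role call (for example via AWS STS):

```bash
# Pull in hadoop-aws and use temporary (assumed-role) credentials for S3A.
pyspark \
  --packages org.apache.hadoop:hadoop-aws:2.8.5 \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider \
  --conf spark.hadoop.fs.s3a.access.key="$AWS_ACCESS_KEY_ID" \
  --conf spark.hadoop.fs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY" \
  --conf spark.hadoop.fs.s3a.session.token="$AWS_SESSION_TOKEN"
```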
