Getting Started with PySpark and Jupyter

Rajhans Jadhao
Sep 6, 2018 · 1 min read

Install Jupyter Notebook

pip install jupyter

Install PySpark

Before installing Spark, make sure Java 8 or a later version is installed (java -version will show what you have), then download the latest Spark release, choosing the package prebuilt for Hadoop.

Extract the archive and move it to the /opt folder (depending on permissions, the mv may require sudo)

$ tar -xzf spark-2.2.1-bin-hadoop2.7.tgz
$ mv spark-2.2.1-bin-hadoop2.7 /opt/spark-2.2.1

Create a symbolic link, so that a later Spark upgrade only requires re-pointing the link rather than editing your configuration

$ ln -s /opt/spark-2.2.1 /opt/spark

Configure your $PATH variable by adding the following lines to your ~/.bashrc

export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH

PySpark in Jupyter

Point PySpark at Jupyter by adding the following environment variables to your ~/.bashrc file

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
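
These two variables make the pyspark command start Jupyter as its driver instead of the plain Python shell. If you would rather keep launching notebooks with the stock jupyter notebook command, one commonly used alternative (not part of this setup, shown only as a rough sketch) is the findspark package, which locates Spark through the SPARK_HOME variable set earlier:

# Alternative approach: pip install findspark, then in a notebook cell:
import findspark
findspark.init()  # reads SPARK_HOME and puts pyspark on sys.path

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("notebook-test").getOrCreate()  # appName is arbitrary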

Restart the terminal (or run source ~/.bashrc) and launch PySpark

$ pyspark

You can now run PySpark in a Jupyter notebook! The pyspark command starts a notebook server in your browser, and any notebook you create there has Spark available.
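
To confirm everything works, try a quick smoke test in a new notebook cell. This is a minimal sketch assuming the defaults of this setup, where the pyspark launcher pre-creates the spark session and sc context for you:

# sc (SparkContext) and spark (SparkSession) are created by the
# pyspark launcher, so no SparkSession.builder boilerplate is needed
rdd = sc.parallelize(range(100))
print(rdd.sum())  # 4950

even_count = spark.range(10).filter("id % 2 = 0").count()
print(even_count)  # 5 (ids 0, 2, 4, 6, 8)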
