Installing PySpark in Jupyter on an Edge Node

Srikanth Pannala
Aug 24, 2017 · 3 min read

Before feeding a data set to an ML model, we often need to explore and pre-process it into the format we need. What better than Spark for ETL and exploratory work! But logging into the Spark console every time, with no way to plot analytical graphs, is no fun. So I ended up integrating Jupyter Notebook into my big data ecosystem. It turns out I was just 3 minutes away from running my Spark code from a Jupyter notebook.

Assuming

A Spark cluster (I have 2.2) is up and running, conda is installed, and the Spark and Hadoop (2.7) binaries are in $PATH.

On the Edge Node

Confirm that the Spark path is set: echo $SPARK_HOME should print your Spark installation directory.

Create Conda env

conda create -n pyspark3 python=3
source activate pyspark3

Install jupyter

pip install jupyter

Configure jupyter notebook

jupyter notebook --generate-config

Remote login

This generates the Jupyter notebook config file at /home/$USER/.jupyter/jupyter_notebook_config.py.

To make the notebook reachable remotely, update the attributes below:

c.NotebookApp.ip = '*'
c.NotebookApp.port = $PORT_U_LIKE  # an integer, e.g. 8888

Secure your notebook

Run the command below to set a password for your Jupyter server:

jupyter notebook password

This writes the hashed password to /home/$USER/.jupyter/jupyter_notebook_config.json. Copy the hash and add it to /home/$USER/.jupyter/jupyter_notebook_config.py with the attribute below.
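The entry looks like this (the sha1 value is a placeholder; paste your own hash from the .json file):

c.NotebookApp.password = u'sha1:<hash-copied-from-the-json-file>'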

Toree

I found the Toree PySpark kernel to be extremely unstable with Spark 2.2, so I decided not to pursue it. As an alternative I went with findspark (pip install findspark in the same conda env).

Start jupyter notebook

Run the notebook in the background. Create a simple script called 'start-notebook.sh' containing:

#!/bin/bash
# launch jupyter detached from the terminal, discarding its output
exec jupyter notebook --no-browser &> /dev/null &

Then put it on the PATH by copying start-notebook.sh to /usr/local/bin, make it executable (chmod +x /usr/local/bin/start-notebook.sh), and run start-notebook.sh.

You should now see the Jupyter notebook UI when you browse to the edge node on the port you configured.

Run spark job

Create a new notebook, then import findspark and the necessary packages.
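A minimal first cell might look like this (assuming findspark was pip-installed into the pyspark3 env; findspark.init() locates Spark via $SPARK_HOME):

import findspark
findspark.init()  # finds the Spark install pointed to by $SPARK_HOME

from pyspark.sql import SparkSession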

Create a Spark session
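A minimal sketch; the app name is arbitrary, and master 'yarn' is my assumption of a YARN-managed cluster (swap in 'local[*]' to test on the edge node alone):

spark = (SparkSession.builder
         .appName('jupyter-exploration')  # any name you like
         .master('yarn')                  # assumption: YARN-managed cluster
         .getOrCreate())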

SPARK — HDFS
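For example, reading a CSV file off the cluster into a DataFrame (the HDFS path is hypothetical; point it at a file you actually have):

df = spark.read.csv('hdfs:///tmp/sample.csv', header=True, inferSchema=True)
df.printSchema()
df.show(5)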

SPARK — HIVE
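To query Hive tables, Hive support has to be enabled when the session is built. A sketch under that assumption (the table name is hypothetical):

spark = (SparkSession.builder
         .appName('jupyter-hive')
         .enableHiveSupport()  # uses the Hive metastore configured on the edge node
         .getOrCreate())

spark.sql('SHOW DATABASES').show()
spark.sql('SELECT * FROM some_db.some_table LIMIT 5').show()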

That's it. You now have visual data exploration through Spark in a Jupyter notebook. Happy exploring!
