Configuring Jupyter for PySpark 1.5.2 with pyenv/virtualenv

Gerben Oostra
May 18, 2016 · 6 min read

To quickly explore and visualize data, I love Python in a Jupyter notebook.
It's great for data exploration, while keeping a readable trace of your attempts and deductions.

Virtual environments

If you have many different projects, two of them can easily depend on conflicting dependencies.
You might also want to explore the latest snapshot version of scikit-learn in only one project, without breaking your other projects.
Fortunately, Python supports virtual environments, so that the dependencies of one of your projects don't conflict with those of another.

One way to achieve this is to install a virtual environment in your project directory and start Jupyter from that environment.
If, however, you are running Jupyter as a service, or would like to share virtual environments among projects, a central repository of available environments is useful.

Jupyter uses kernels for different languages, which can also be used to run your code in different virtual environments. Alfredo Motta has explained nicely how to configure Jupyter for your virtual environments.

Spark

If you want to add PySpark to your notebooks, you then have to alter the kernel to obtain a SparkContext. This is described by Jacek Wasilewski.

The goal

Here we combine these two approaches, for updated Python and Spark versions, resulting in a Jupyter notebook with a kernel per virtual environment, optionally providing a SparkContext.

Step 1: Install python env

With pyenv we can manage different Python installations and activate specific ones in our shell.

First install pyenv, which on Mac OS X is done using:

:~$ brew update
:~$ brew install pyenv

Or, on Debian/Ubuntu, first install the Python build prerequisites with apt-get (pyenv itself is then installed from its GitHub repository):

:~$ sudo apt-get install python-dev
:~$ sudo apt-get install python-pip

And subsequently add the following to your .bashrc profile:

eval "$(pyenv init -)"

Step 2: Install python virtual environments

With virtualenv we can create different environments based on one python version.
This allows us to create environments holding different sets of requirements.

Virtualenv itself is a Python package, so next install it with pip:

:~$ pip install virtualenv

With virtualenv you can create the virtual environments in a specific folder, and activate them by running the bin/activate_this.py from that folder.
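For example, a running Python 2 script can activate such an environment by executing its activate_this.py (the path below is just a placeholder):

# activate a virtualenv from within a running Python 2 script
activate_this = '/path/to/your/env/bin/activate_this.py'
execfile(activate_this, dict(__file__=activate_this))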

With virtualenvwrapper you have one repository of virtual environments, and can create and activate them by name (instead of by path).

:~$ pip install virtualenvwrapper

Now virtualenvwrapper and virtualenv are installed in your global Python install.
Next we install the Python package of pyenv, together with the pyenv-virtualenv plugin that lets pyenv and virtualenv work together.

:~$ pip install --egg pyenv
:~$ pip install pyenv-virtualenv

Then add the following to your .bashrc profile:

eval "$(pyenv virtualenv-init -)"

Step 3: Create a python virtual environment

With the prerequisites installed, we will create our specific virtual environment.

To list all available python versions:

:~$ pyenv versions

Let us first install python version 2.7.10:

:~$ pyenv install 2.7.10

Using the installed python version, we can create an independent virtual environment, holding its own separate set of dependencies:

:~$ pyenv virtualenv 2.7.10 timeseries

We can see all created python versions as follows:

:~$ pyenv versions
* system (set by /Users/gerben/.pyenv/version)
2.7.10
2.7.10/envs/timeseries
timeseries

And the available virtualenvs with:

:~$ pyenv virtualenvs
2.7.10/envs/timeseries (created from /Users/gerben/.pyenv/versions/2.7.10)
timeseries (created from /Users/gerben/.pyenv/versions/2.7.10)

Let us add the bleeding-edge scikit-learn version to the virtual environment:

:~$ pyenv activate timeseries
:~$ pip install git+https://github.com/scikit-learn/scikit-learn.git#egg=SciKitEgg
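To verify that the development version is active inside the environment (it should report a .dev version; the exact number depends on the current master):

:~$ python -c "import sklearn; print(sklearn.__version__)"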

Now install the python kernel to be used by Jupyter:

:~$ pip install ipykernel

And deactivate the environment:

:~$ pyenv deactivate

Step 4: Create a Jupyter kernel

The goal is to make Jupyter aware of the virtual environments, allowing us to create notebooks in them.

First install jupyter if you haven’t got it yet in your system python environment:

:~$ pip install jupyter

Where Jupyter stores its kernels depends on your *nix variant. Let us ask Jupyter which paths it uses (this is on Mac OS X):

:~$ jupyter --paths
config:
    /Users/gerben/.jupyter
    /System/Library/Frameworks/Python.framework/Versions/2.7/etc/jupyter
    /usr/local/etc/jupyter
    /etc/jupyter
data:
    /Users/gerben/Library/Jupyter
    /System/Library/Frameworks/Python.framework/Versions/2.7/share/jupyter
    /usr/local/share/jupyter
    /usr/share/jupyter
runtime:
    /Users/gerben/Library/Jupyter/runtime

Under one of the data directories (they are scanned in listed order) we can make our extra kernel which refers to our virtual environment’s python.
We first need to find the location of our python interpreter:

:~$ pyenv activate timeseries
:~$ pyenv which python
/Users/gerben/.pyenv/versions/timeseries/bin/python
:~$ pyenv deactivate
:~$ mkdir -p /Users/gerben/Library/Jupyter/kernels/timeseries

In that directory, create a kernel.json file with the following content:

{
  "display_name": "timeseries_2.7",
  "language": "python",
  "argv": [
    "/Users/gerben/.pyenv/versions/timeseries/bin/python",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ]
}
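Alternatively, recent versions of ipykernel can generate such a kernel spec for you. With the virtualenv activated, something along these lines should work (display name chosen to match the manual spec above):

:~$ pyenv activate timeseries
:~$ python -m ipykernel install --user --name timeseries --display-name "timeseries_2.7"
:~$ pyenv deactivate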

Now you can run Jupyter and see your new kernel:

:~$ jupyter notebook
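You can also list the kernels Jupyter has found:

:~$ jupyter kernelspec list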

Step 5: Optional: Enabling matplotlib on Mac OS X

In a notebook using your new Jupyter kernel, let's add a simple Python code block:

import matplotlib.pyplot as plt

This can give you the following error:

RuntimeError: Python is not installed as a framework. The Mac OS X backend will not be able to function correctly if Python is not installed as a framework. See the Python documentation for more information on installing Python as a framework on Mac OS X. Please either reinstall Python as a framework, or try one of the other backends. If you are Working with Matplotlib in a virtual enviroment see 'Working with Matplotlib in Virtual environments' in the Matplotlib FAQ

In that case, we have to follow the referenced FAQ and create the following bash script.

Replace the PYTHON value with the specific binary on your machine, and save it as frameworkpython in your virtual environment python dir.
In my case, that's at /Users/gerben/.pyenv/versions/timeseries/bin/frameworkpython.

#!/bin/bash

# what real Python executable to use
#PYVER=2.7
#PATHTOPYTHON=/usr/local/bin/
#PYTHON=${PATHTOPYTHON}python${PYVER}
PYTHON=/usr/local/Cellar/python/2.7.11/bin/python2.7
export PYSPARK_PYTHON

# find the root of the virtualenv, it should be the parent of the dir this script is in
ENV=`$PYTHON -c "import os; print os.path.abspath(os.path.join(os.path.dirname(\"$0\"), '..'))"`

# now run Python with the virtualenv set as Python's HOME
export PYTHONHOME=$ENV
exec $PYTHON "$@"
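Don't forget to make the script executable, otherwise the kernel will fail to start:

:~$ chmod +x /Users/gerben/.pyenv/versions/timeseries/bin/frameworkpython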

Now we change the python argv to reference this frameworkpython binary, and are able to import matplotlib!
For completeness, kernel.json becomes:

{
  "display_name": "timeseries_2.7",
  "language": "python",
  "argv": [
    "/Users/gerben/.pyenv/versions/timeseries/bin/frameworkpython",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ]
}

Step 6: Using local pySpark in our kernel

Now let us extend this Jupyter kernel with a SparkContext. Without virtual environments one would start a Jupyter notebook with a loaded SparkContext sc as follows:

export SPARK_HOME=/Users/gerben/Projects/spark-1.5.2-bin-hadoop2.6
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=ipython
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
IPYTHON_OPTS="notebook" pyspark

The exports are usually stored in your ~/.bash_profile file.
This approach, however, starts Jupyter with a local Spark instance, while the goal was to have project-independent kernels.
So let us adapt our custom kernel to create the SparkContext, while the main Jupyter instance stays Spark-free.
We only have to change our kernel.json to start its Python instance through PySpark:

{
  "display_name": "pySpark timeseries local (spark 1.5.2)",
  "language": "python",
  "argv": [
    "/Users/gerben/.pyenv/versions/timeseries/bin/frameworkpython",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/Users/gerben/Projects/spark-1.5.2-bin-hadoop2.6",
    "PYTHONPATH": "/Users/gerben/Projects/spark-1.5.2-bin-hadoop2.6/python/:/Users/gerben/Projects/spark-1.5.2-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip",
    "PYTHONSTARTUP": "/Users/gerben/Projects/spark-1.5.2-bin-hadoop2.6/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master local[*] --packages com.databricks:spark-csv_2.11:1.3.0 pyspark-shell"
  }
}

This will start a separate PySpark instance for each running notebook that uses this kernel.
To connect to a standalone Spark cluster instead, sharing one cluster among all kernels, change the last line into:

"PYSPARK_SUBMIT_ARGS": "--master spark://127.0.0.1:7077 pyspark-shell --packages com.databricks:spark-csv_2.11:1.3.0"

Step 7: Development in PyCharm

If you're developing in PyCharm, you need to tell PyCharm which Python interpreter to use to actually get the installed dependencies.
Go to PyCharm -> Preferences -> Project -> Project Interpreter and browse to your newly created virtual environment.

PyCharm project interpreter settings

In your project directory, you probably want to create a requirements.txt to declare which requirements need to be available in your virtual environment.
PyCharm will offer to install them, or you can install them using pip:

:~$ pyenv activate timeseries
:~$ pip install -r requirements.txt
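A minimal requirements.txt for such a project could look like this (package choices are just an example):

numpy
pandas
matplotlib
ipykernel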

Step 8: Jupyter with upstart

Everyone in your team can configure Jupyter and their virtual environments on their development machine's host OS.
Using your Ubuntu host OS, or a Vagrant box with Ubuntu, it's easy to configure Jupyter using Upstart.
To automatically start your Jupyter, use the following Upstart script:

start on runlevel [2345]
stop on runlevel [!2345]
expect fork
respawn
exec /usr/local/bin/jupyter-notebook --ip={{ your_advertised_ip }} \
    --notebook-dir={{ root_dir_of_notebooks }} --no-browser

Save it in /etc/init/jupyter-notebook.conf, and you’re ready to go.
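You can then manage it with the usual Upstart commands:

:~$ sudo service jupyter-notebook start
:~$ sudo service jupyter-notebook status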

Conclusion

With this setup in place, it's easy to create contained environments for your projects.
By also including the Spark configuration in the kernel configuration,
you can, for example, specify different Spark clusters per project/client.

bigdatarepublic

DATA SCIENCE | BIG DATA ENGINEERING | BIG DATA ARCHITECTURES

Written by Gerben Oostra

Lead machine learning engineer at BigData Republic. https://www.linkedin.com/in/gerbenoostra https://www.bigdatarepublic.nl/
