Collaborating with Jupyter notebooks
I’m constantly searching for efficient ways to collaborate on data science projects. I tend to use Jupyter notebooks to create functional prototypes and flesh out ideas. If the consumer of my work is only interested in my methodology and findings, I either export the notebook as an HTML/PDF file or create a private/public ipynb gist and share the URL (GitHub can render notebooks).
However, most of the time the work happens in a collaborative environment where other data scientists or engineers contribute to it. In these cases it makes more sense to use version control. I’ve learned that keeping notebooks in GitHub repositories and running larger batch jobs (e.g. Spark) on a shared cluster can be an effective way to go. This way team members can agree on the version of the code they want to run with more resources, independently view the progress and intermediate outputs, and work from one set of answers in a shared environment.
You can set up a Jupyter server by following these steps:
1- Provision a Linux AWS EC2 instance
2- Download and install Anaconda for Linux. Anaconda is a stable Python distribution that comes with Jupyter, and many other packages commonly used for data science and machine learning (e.g. pandas, scikit-learn).
$ bash Anaconda2-2.5.0-Linux-x86_64.sh
3- Reload your shell configuration so that the newly installed Anaconda binaries are on your PATH
$ source ~/.bashrc
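4- Create a hashed password for your server. passwd is a Python function, so call it through the Python interpreter rather than the shell; it prompts for a password and prints the hash
$ python -c "from notebook.auth import passwd; print(passwd())"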
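To make the hash from this step less opaque, here is a minimal sketch of the algorithm:salt:digest layout seen in the example value in step 6. It assumes the sha1 scheme digests the passphrase followed by a 12-character hex salt. The names hash_password and check_password are hypothetical helpers written for illustration and are not part of Jupyter. For the real config value, use the output of passwd itself.

```python
import hashlib
import hmac
import secrets


def hash_password(passphrase, algorithm="sha1", salt_len=12):
    """Return a string of the form 'algorithm:salt:digest'.

    Assumption: the digest covers the passphrase followed by the salt.
    """
    # token_hex(n) yields 2*n hex characters, so this gives salt_len characters.
    salt = secrets.token_hex(salt_len // 2)
    digest = hashlib.new(algorithm, (passphrase + salt).encode("utf-8")).hexdigest()
    return ":".join((algorithm, salt, digest))


def check_password(hashed, passphrase):
    """Recompute the digest with the stored salt and compare in constant time."""
    algorithm, salt, digest = hashed.split(":", 2)
    candidate = hashlib.new(algorithm, (passphrase + salt).encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, digest)
```

Because the salt is random, hashing the same password twice gives two different strings; check_password still accepts both, since it reuses the salt stored inside each one.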
5- Generate a Jupyter configuration file
$ jupyter notebook --generate-config
6- Open your config file, find the line beginning with c.NotebookApp.password, remove the leading # to uncomment it, and replace the empty string with your hashed password. This step ensures that your notebooks are password protected.
$ vi ~/.jupyter/jupyter_notebook_config.py
c.NotebookApp.password = u'sha1:0a8ebe35eff6:83b739ea536d457fa1aab4476ab889ed3d836a94'
7- Create a self-signed SSL certificate for protecting your password
$ openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout mykey.key -out mycert.pem
This step generates two files (mykey.key and mycert.pem) that we’ll reference in the next step.
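Before pointing Jupyter at these files, it can help to confirm that the certificate and key actually load together. The sketch below uses Python's standard ssl module for that. The function name cert_pair_is_usable is a hypothetical helper, and the file names are the ones produced by the openssl command above, assumed to be in the current directory.

```python
import ssl


def cert_pair_is_usable(certfile, keyfile):
    """Return True if the ssl module can load the certificate with its private key."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    try:
        # Raises if a file is missing, unreadable, malformed,
        # or if the private key does not match the certificate.
        context.load_cert_chain(certfile=certfile, keyfile=keyfile)
    except (ssl.SSLError, OSError) as exc:
        print("could not load %s with %s: %s" % (certfile, keyfile, exc))
        return False
    return True
```

Calling cert_pair_is_usable("mycert.pem", "mykey.key") from the directory where you ran openssl should return True if the pair is intact. It says nothing about whether browsers will trust the certificate, which they will not, since it is self-signed.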
8- Uncomment and update the following lines in the config file:
c.NotebookApp.certfile = '/home/ubuntu/mycert.pem'
c.NotebookApp.keyfile = '/home/ubuntu/mykey.key'
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = ### SAME AS STEP 6 ###
c.NotebookApp.port = 9999
Make sure the port you specify in this step is open to inbound traffic in the associated AWS security group.
9- Run Jupyter in the background
$ jupyter notebook &> /dev/null &
10- Now navigate to the public IP address of your instance over HTTPS (https://51.XX.XX.XX:9999). You will get a warning that the SSL certificate is not trusted, which is fine since we self-signed it. Proceed anyway…
11- Congrats! You have a publicly accessible Jupyter server.
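If the page does not load, one likely culprit is the security group from step 8. A rough way to tell "port blocked" apart from "browser or certificate problem" is to try a plain TCP connection from your own machine. This is a minimal sketch; port_is_reachable is a hypothetical helper name, and you would substitute your instance's public IP and the port from your config.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A False result with the server running usually points at the security group or the c.NotebookApp.ip setting. A True result means the network path is fine and the problem is further up, for example using http:// instead of https:// in the browser.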