Collaborative research at Patient Engagement Lab

Monika Obrocka
Patient Engagement Lab
Sep 27, 2019

Good news! We are growing! Praekelt Foundation is a remote-working company with employees around the globe. This means there is no central IT department that takes care of our hardware. We are all free to use whatever OS we want, which makes remote collaboration a bit more tricky. Some of us work on Windows and some on macOS. Being a distributed team, we need to be able to re-run a data analysis or an experiment, and with different working environments we can’t just download the same csv file from the server.

Last year we expanded our Data Science and Research teams by adding 5 more members! Very soon we realised our current tools for collaboration were no longer sustainable. We were using a single Docker container running RStudio and Jupyter Notebook, with authentication over ssh (plus port forwarding) or sshtunnel (a poor person’s VPN!), and managing usernames was becoming difficult to keep up with.

Our requirements for the new setup were as follows:

  1. Accessible irrespective of OS.
  2. Easily add and remove users.
  3. Persist user files when a browser is closed or the internet connection is lost.
  4. Keep all our services in one place.
  5. Automatically pull our private GitHub repos to user space.
  6. Add proper authentication of users.
  7. Containerised environments that allow a process to be easily restarted on a different machine with minimal installation.

We started googling around, looking for the best solution to our problems and we quickly learned about Jupyter Hub. Jupyter Hub, as the documentation states, is a multi-user Hub that:

  • spawns,
  • manages, and
  • proxies

multiple instances of the single-user Jupyter Notebook server. The Jupyter notebook is one of the most popular tools among data scientists, and we are no exception. The core supported programming languages are Julia, Python, and R, with many other languages maintained by the community. We use RStudio extensively in our work, rather than the R kernel available in the notebook. Since Jupyter Hub is very customisable, we were able to serve RStudio through jupyter-rsession-proxy and Jupyter Lab through the jupyterlab-hub extension. This wasn’t part of our original requirements for the new system, but a cherry on the cake nonetheless. Our deployment closely followed the reference deployment of JupyterHub with Docker using docker-compose.
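For reference, the skeleton of such a deployment looks roughly like this (a sketch in the spirit of the jupyterhub-deploy-docker reference; service names, ports and volume names are illustrative):

```yaml
version: "3"

services:
  jupyterhub:
    build: .            # image containing jupyterhub + dockerspawner
    restart: always
    ports:
      - "443:443"
    volumes:
      # The Hub needs the Docker socket to spawn single-user containers
      - /var/run/docker.sock:/var/run/docker.sock
      # Persist the Hub database and cookie secret across restarts
      - jupyterhub-data:/data

volumes:
  jupyterhub-data:
```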

Jupyter Hub also makes user management very easy. As admin you can use the UI to add and remove users.
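Granting admin rights takes only a line in jupyterhub_config.py (the username below is illustrative):

```python
# Users in admin_users get the admin panel at /hub/admin,
# where other users can be added and removed through the UI
c.Authenticator.admin_users = {'monika'}
```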

As mentioned before, we were originally connecting to the Jupyter Notebook server via sshtunnel. Since some of our new colleagues are Windows users, and sshtunnel doesn’t support Windows, the need for a new system was urgent. With Jupyter Hub, we were able to easily set up a dedicated URL and eliminate the need for a secure tunnel. Still, we collect sensitive health data, so our data access requires users to authenticate. Jupyter Hub uses OAuthenticator for this purpose and offers several authentication services. We settled on the GitHub authenticator, as we can use the login credentials to interact with GitHub.
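Wiring up the GitHub authenticator takes a few lines in jupyterhub_config.py (the callback URL and environment variable names below are placeholders for your own GitHub OAuth app settings):

```python
import os
from oauthenticator.github import GitHubOAuthenticator

c.JupyterHub.authenticator_class = GitHubOAuthenticator
# Values from the OAuth app registered on GitHub
c.GitHubOAuthenticator.oauth_callback_url = 'https://hub.example.org/hub/oauth_callback'
c.GitHubOAuthenticator.client_id = os.environ['GITHUB_CLIENT_ID']
c.GitHubOAuthenticator.client_secret = os.environ['GITHUB_CLIENT_SECRET']
```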

So far, we have checked a few of the boxes from our requirements list:

  1. Our users can access Jupyter Hub through a dedicated URL.
  2. Jupyter Lab and RStudio can be found in one place.
  3. We authenticate users with GitHub.
  4. We containerise our environments.
  5. We add and remove users on the go.

Let us spend some time going through our containers. Our RServer Docker image is based on jupyter/r-notebook, with a few extra libraries that we use on an everyday basis and jupyter-rsession-proxy. Our Jupyter Lab image is based on jupyter/scipy-notebook. Again, we add a few extra packages, like rapidpro-python for our investigations, and enable the jupyterlab-hub extension. We are able to offer users a choice of different containers thanks to “image_whitelist”, available in DockerSpawner since release 0.11.1.
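In jupyterhub_config.py this looks roughly like the following (the image names are hypothetical):

```python
# With a dict, DockerSpawner (>= 0.11.1) shows a dropdown at spawn time
# letting each user pick which image to launch
c.DockerSpawner.image_whitelist = {
    'RStudio': 'pel/rserver:latest',
    'Jupyter Lab': 'pel/jupyterlab:latest',
}
```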

Both images are stored in PEL’s Docker Hub account, and both use a custom docker-entrypoint.sh script. This is why our images might not work for you. You can still build your own using the following code snippet for RStudio:
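A minimal sketch (the RStudio Server installation itself is distribution-specific, so it is only indicated here):

```dockerfile
FROM jupyter/r-notebook

USER root
# ... install RStudio Server here, following the instructions
#     for your base distribution on RStudio's download page ...

USER $NB_UID
# The proxy that lets JupyterHub serve RStudio sessions in the browser
RUN pip install --no-cache-dir jupyter-rsession-proxy
```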

and for Jupyter Lab:
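Again only a sketch; add whatever packages your team needs:

```dockerfile
FROM jupyter/scipy-notebook

# Extra packages we use day to day
RUN pip install --no-cache-dir rapidpro-python

# Hub-aware Jupyter Lab (extension name as of JupyterLab 1.x)
RUN jupyter labextension install @jupyterlab/hub-extension
```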

At PEL we store and manage our software in private GitHub repos. To make Jupyter Hub as user-friendly as possible, we wanted to pull specific data-analysis repositories when the user space is created. We achieve this with the help of the custom docker-entrypoint.sh script I mentioned before. To interact with GitHub seamlessly, and especially to be able to push and pull to a repo, the git configuration file has to be populated with user details: emails, names and tokens for private repos. Since we are already authenticating our users with GitHub, we just have to find a way to persist the login details in the newly created Docker container. We achieve it with the following code snippet included in jupyterhub_config.py:
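A sketch of that snippet, following the auth_state pattern from the JupyterHub documentation (the environment variable names are our own convention, and enable_auth_state additionally requires a JUPYTERHUB_CRYPT_KEY to be set for the Hub):

```python
from oauthenticator.github import GitHubOAuthenticator

class GitHubEnvAuthenticator(GitHubOAuthenticator):
    """GitHub authenticator that forwards login details to the user container."""

    async def pre_spawn_start(self, user, spawner):
        auth_state = await user.get_auth_state()
        if not auth_state:
            # auth_state is disabled or has expired
            return
        # Make the GitHub identity and OAuth token visible in the container
        spawner.environment['GITHUB_USER'] = auth_state['github_user']['login']
        spawner.environment['GITHUB_EMAIL'] = auth_state['github_user'].get('email') or ''
        spawner.environment['GITHUB_TOKEN'] = auth_state['access_token']

# Replaces a plain GitHubOAuthenticator if one was configured earlier
c.JupyterHub.authenticator_class = GitHubEnvAuthenticator
c.Authenticator.enable_auth_state = True
```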

Now, the environment variables are visible in the user container. We wrote a small bash script to populate the git configuration file:
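A sketch of that script: it assumes the GITHUB_USER, GITHUB_EMAIL and GITHUB_TOKEN variables from the Hub config above, and the credential-store helper is just one of several ways to cache the token:

```shell
#!/bin/bash
# populate_git_config: write the user's identity and token into git's config,
# so pushes and pulls to private repos work without prompting.
populate_git_config() {
    git config --global user.name  "$GITHUB_USER"
    git config --global user.email "$GITHUB_EMAIL"
    # Cache the OAuth token with git's plain-file credential store
    git config --global credential.helper store
    printf 'https://%s:%s@github.com\n' "$GITHUB_USER" "$GITHUB_TOKEN" \
        > "$HOME/.git-credentials"
}

populate_git_config
```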

For those changes to be applied in the spawned user server, you have to include the following lines in your Dockerfile:
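Assuming the entrypoint script ships as docker-entrypoint.sh alongside the Dockerfile, something like:

```dockerfile
COPY docker-entrypoint.sh /usr/local/bin/docker-entrypoint.sh
RUN chmod +x /usr/local/bin/docker-entrypoint.sh

# Run our entrypoint first, then hand over to the usual notebook start script
ENTRYPOINT ["/usr/local/bin/docker-entrypoint.sh"]
CMD ["start-notebook.sh"]
```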

Last but not least, let’s talk about persisting user data. All files created by users inside a container are stored on a writable container layer. Docker offers three ways to store files on the host machine:

  1. Volumes
  2. Bind mounts and
  3. tmpfs mounts

Volumes are the preferred way to persist data; among other things, they can be encrypted.
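With DockerSpawner, a named volume per user is a one-liner in jupyterhub_config.py (the volume naming scheme below is the common convention from the reference deployment, not necessarily ours):

```python
notebook_dir = '/home/jovyan/work'
c.DockerSpawner.notebook_dir = notebook_dir
# One named Docker volume per user; {username} is expanded by DockerSpawner
c.DockerSpawner.volumes = {'jupyterhub-user-{username}': notebook_dir}
```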

We are not done improving our setup. The next step is to move it to Kubernetes.
