Deploying a scalable, shared Data Science platform at Université Laval

Guillaume Moutier
Sep 3, 2019 · 10 min read

TLDR: just to make sure you won’t leave after the first 42 paragraphs, we’ll talk about the integration of JupyterHub, OpenShift, Ceph, Vault, Keycloak,…

What’s this?

Let’s start with some context!

Who we are: Université Laval is located in Québec City, Canada. The very first French-language university in North America (dating back to 1663), ULaval has 43,000 students and 9,300 employees, and ranks 8th among the top research universities in Canada.

The project: Valeria aims at solving different problems that researchers and students face in their data analysis:

  • Tooling: a lot of tools are available out there. Many. Too many… If you’re already a data engineer or data scientist, you know everything about them (or you try…). But let’s say you’re a biologist, a geologist or a historian, with little or no IT background, but a great idea of something you could do with data. How do you start? How do you find the right program, library, or application to use? How do you install it, with all the right prerequisites, the right versions, the compatibility lists,…?
  • Resources: now that you know what to do, the next question is “Where?”. Yes, your laptop’s fine, but if you want to scale a little bit, or try to manipulate some heavy data, acquire them 24/7, or simply store them, you may face some issues. For some science fields a USB key is OK, or even a floppy disk (yes, you’d be amazed to see that some labs look like museums). But if you work in genomics for example, where datasets come in tens or hundreds of terabytes, that’s another game.
  • Collaboration: more and more projects would like to cross-reference data, combine them from different sources, even from different fields. Or simply expand the datasets with more data shared by colleagues. Researchers need a way to make this process of discovering and sharing data easier.
  • Security: last subject, but definitely not a small one, how do we ensure that the data stays “secure”, which means maintaining its confidentiality, availability and integrity? That relies of course on a strong architecture and thorough procedures, which involve control, maintenance, audits,… Once again, definitely not your specialty when you are a geographer.

To sum it up, as one of ULaval’s Vice-Rectors said:

“We want our researchers to search, not to tinker with systems and applications.”

So Valeria was built to offer local services to collect, store, process, analyze and share data in a simple and secure way… without researchers having to bother with the technology!

Now for the tougher part…


Architecture

Note: I know there are a lot of solutions and providers on the market, and you can always say “Why did you not use this or this? My brother-in-law knows a better solution!”, but we had to follow some guidelines:
- Everything had to be on-prem: for regulations/security (some data are really sensitive, especially in medical science), efficiency (with petabytes of data you can’t have them move around too much), and cost-effectiveness (ULaval is lucky enough to have 4 datacenters on the campus, 2 of them Tier-3, so it’s still much cheaper than going to the Cloud for most workloads).
- Storage had to be separated from Compute. As Valeria is to be used by all researchers and students, it’s clearly impossible to forecast the use of resources. Therefore those two have to scale independently.
- Whenever possible, solutions have to be Open Source. It’s easier to integrate, expand upon, patch,… And in this area where things are moving really fast² (that’s not a note link, that’s an exponent!), you want to keep all your options open.
- And the most important thing, everything has to be based on open and recognized standards to make sure it can be easily integrated.

Now…

There are many elements in Valeria, some are still under development, some will come later. So I will focus here on the core of the platform:

  • The Datalake: we wanted our researchers to be able to easily share data, with security and high availability. We also wanted to make it easy to interact with storage, even over long distances (remote labs, or sensors in the field), from any type of device, platform or language. A central, S3-compatible object store seemed the right way to go, and we chose Ceph (we had been looking at it and experimenting with it for a while). A short example of what this kind of S3 access looks like follows this list.
    2.8 PB have been deployed for this project as a “starter”, spread over the 4 datacenters (don’t worry, they share the same 100 Gbps leaf-and-spine network, so latency is negligible), with an 8+4 erasure coding profile.
  • The Compute part: it was clear from the start that Jupyter notebooks were ideal to provide easy-to-use environments where researchers would be able to use different languages, libraries,… JupyterHub was also there to handle the multi-user aspect. But how to make it as efficient and resource-friendly as possible? Kubernetes to the rescue! And more specifically OpenShift, on which we were already working for more “standard” development.
    The project deployed 25 nodes, each with 256 GB of memory and two 14-core CPUs, some of them GPU-equipped (V100).
  • ID Management/Authentication: with the different tools that we would be using, we had to be able to manage OAuth/OIDC, SAML... Keycloak seemed to be a good choice, which we validated after some testing.
  • Secrets Management: S3 is great, but it uses secrets for connections to buckets. And of course we don’t want people to put those secrets directly in their code, or any permanent file that could be accessed or compromised. To solve this problem we use Vault from HashiCorp, and access those secrets at runtime only.
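As a quick illustration of what “interacting with the datalake from any device, platform or language” looks like in practice (as promised in the Datalake point above), here is a minimal boto3 sketch against an S3-compatible endpoint such as the Ceph RADOS Gateway. The endpoint URL, bucket name and credentials are placeholders, not Valeria’s actual values:

import boto3

# Placeholders: use your own endpoint and credentials
s3 = boto3.resource(
    's3',
    'us-east-1',                               # a region is needed for signing even if Ceph ignores it
    endpoint_url='https://s3.example.org',     # the Ceph RADOS Gateway exposes an S3-compatible API
    aws_access_key_id='MY_ACCESS_KEY',
    aws_secret_access_key='MY_SECRET_KEY',
)

bucket = s3.Bucket('my-research-data')

# Upload a local file, then list what the bucket contains
bucket.upload_file('samples.csv', 'raw/samples.csv')
for obj in bucket.objects.all():
    print(obj.key, obj.size)

The exact same calls work from a laptop, a field gateway or a notebook pod, which is the whole point of standardizing on S3.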

So here is what it looks like from 3,000 feet:

  • Users connect to the platform through a web portal (developed using PatternFly).
  • From there they have access to different applications (and help, documentation, onboarding,…): CKAN as a data catalog, GitLab to store and version code or programs, and JupyterHub.
  • Every access is authenticated and secured using Keycloak, and everything runs on OpenShift.
  • From JupyterHub, users can launch different “flavors” of notebooks depending on the kernel and the libraries installed (basic Python, SciPy, TensorFlow, Dask, R). And depending on their access level (researcher, student,…) they have different quotas of resources available (CPU, RAM, GPUs). A sketch of how such flavors can be configured follows this list.
  • With the help of Vault (more on that later), the notebooks are automatically spawned with the user’s uid, and injected with the user’s object storage credentials. That allows the notebook to be automatically connected to: 1. the user’s /home and /scratch folders, which are exported through NFS from a Lustre filesystem; 2. all the buckets that the user owns in the Ceph datalake.
  • Standard use of the platform is that all of a user’s programs are saved in their /home folder, temporary data can be saved in /scratch (automatically erased after 30 days), while the data itself is stored in the Datalake.
  • Users can also send and receive data to/from Compute Canada, the array of supercomputers accessible to all Canadian researchers. This is done via Globus for fast parallel transfers.
  • Last part (not implemented yet, WIP): to give more flexibility with the kernels and libraries used in the notebooks, ULaval is working to integrate Valeria with the CVMFS “repo” used at Compute Canada. That will allow for lightweight and easily customizable kernels. If you are interested in this part reach out to me and I’ll get you in touch with the people working on this.
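Speaking of notebook flavors and quotas, here is a minimal sketch of how such choices can be expressed with KubeSpawner’s profile_list. The image names and resource limits below are made up for the example, not Valeria’s actual configuration:

# Illustrative only: images and limits are placeholders
c.KubeSpawner.profile_list = [
    {
        'display_name': 'Minimal Python',
        'default': True,
        'kubespawner_override': {
            'image': 'example/minimal-notebook:latest',
            'cpu_limit': 2,
            'mem_limit': '4G',
        },
    },
    {
        'display_name': 'TensorFlow (GPU)',
        'kubespawner_override': {
            'image': 'example/tensorflow-notebook:latest',
            'cpu_limit': 8,
            'mem_limit': '32G',
            'extra_resource_limits': {'nvidia.com/gpu': '1'},
        },
    },
]

Per-user quotas (researcher vs. student) can then be handled by adjusting these overrides, or with quotas on the OpenShift side.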

Technical details, implementation

I’ll start with some good news: everything is available in the ulaval GitHub repos: https://github.com/ulaval/valeria-jupyterhub and https://github.com/ulaval/valeria-jupyter-notebooks-s3.

Now for the bad news… This platform is tailor-made for Valeria. It answers ULaval’s needs, and takes advantage of what was already implemented and available. So it’s definitely not designed as a turn-key solution. If you’re looking for that kind of thing, make sure to have a look at projects such as Kubeflow or Open Data Hub (I may be biased on this one, I know, more on that later…).

However, there may be some recipes or solutions here that you’ll find of interest, so I’ll focus on those for the rest of this article.

JupyterHub on OpenShift

First and foremost, our work was heavily based on the fantastic work done by Graham Dumpleton with Jupyter-on-OpenShift, especially the JupyterHub-Quickstart. Please have a look at that project for more details on how this part works. I won’t detail it here as Graham has already documented it quite thoroughly!

Vault, JupyterHub and Keycloak integration

This is the really tricky (or interesting) part, so bear with me...
When a user is created in Valeria, he gets a uid in the environment where he has his /home and /scratch directories (on the Lustre filesystem). When his account is created in Ceph, he also gets an access_key and a secret_key. We store those 3 elements in a Vault instance, at this path (valeria being a kv version 2 secrets engine):
valeria/data/users/{{identity.entity.id}}/

Noticed the {{identity.entity.id}} part of the path? A dynamic (templated) policy attached to it allows us to make sure that only a user authenticated with Keycloak with this specific id will be able to retrieve a secret stored in this branch. And not a know-it-all backend that would have access to all the keys, with all the security concerns that brings…
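To make the mechanism more concrete, here is a hedged sketch of what the provisioning side could look like with the hvac Python client: writing the three elements under the user’s branch and defining the templated policy. The Vault address, token, entity id and exact sub-paths (uid, s3) are placeholders for illustration, not necessarily what Valeria uses, and the policy still has to be attached to the tokens issued by the JWT/OIDC auth method (left out here):

import hvac

# Placeholders: Vault address, provisioning token and entity id are illustrative
client = hvac.Client(url='https://vault.example.org', token='PROVISIONING_TOKEN')
entity_id = '11111111-2222-3333-4444-555555555555'

# Store the uid and the S3 credentials under the user's own branch of the 'valeria' kv v2 store
client.secrets.kv.v2.create_or_update_secret(
    mount_point='valeria',
    path='users/' + entity_id + '/uid',
    secret={'uid': 1234567},
)
client.secrets.kv.v2.create_or_update_secret(
    mount_point='valeria',
    path='users/' + entity_id + '/s3',
    secret={'access_key': 'CEPH_ACCESS_KEY', 'secret_key': 'CEPH_SECRET_KEY'},
)

# Templated policy: each authenticated entity can only read its own branch
policy = '''
path "valeria/data/users/{{identity.entity.id}}/*" {
  capabilities = ["read"]
}
'''
client.sys.create_or_update_policy(name='valeria-user', policy=policy)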

So after that, when a user launches a notebook, we reuse the JWT that the user obtained from Keycloak to access JupyterHub, to authenticate against Vault this time and retrieve the uid and the S3 credentials. That involves a custom OAuthenticator class that may be too long to paste here, but here is the direct link. The uid is used to spawn the notebook (see below), and the S3 credentials are set up as env vars in the notebook for later use (see a little bit more below…).
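For the curious, here is a stripped-down sketch of the idea, written as a generic pre-spawn hook rather than the actual custom OAuthenticator class linked above. The Vault address, auth role, auth-state key and secret sub-path are assumptions for illustration:

import hvac

async def pre_spawn_hook(spawner):
    # Reuse the Keycloak JWT kept in the user's auth state (the key name is an assumption)
    auth_state = await spawner.user.get_auth_state()
    keycloak_jwt = auth_state['access_token']

    # Authenticate against Vault's JWT auth method with that same token
    vault_client = hvac.Client(url='https://vault.example.org')
    login = vault_client.auth.jwt.jwt_login(role='valeria-user', jwt=keycloak_jwt)
    vault_client.token = login['auth']['client_token']
    vault_entity_id = login['auth']['entity_id']

    # Read the S3 credentials from the user's branch and inject them as env vars
    # (the uid is retrieved the same way, as shown in the snippet further down)
    s3_secret = vault_client.secrets.kv.v2.read_secret_version(
        mount_point='valeria',
        path='users/' + vault_entity_id + '/s3',
    )
    spawner.environment.update({
        'AWS_ACCESS_KEY_ID': s3_secret['data']['data']['access_key'],
        'AWS_SECRET_ACCESS_KEY': s3_secret['data']['data']['secret_key'],
        'S3_ENDPOINT_URL': 'https://s3.example.org',
    })

c.KubeSpawner.pre_spawn_hook = pre_spawn_hook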

Of course many other things could be stored securely this way in Vault, and used only when needed by JupyterHub or the launched notebook.

Connection to S3 and NFS (simultaneously)

Here I first have to acknowledge the work behind PGContents, for the HybridContentsManager, and S3Contents, for the S3ContentsManager. They are the two parts that allow this magic to happen.

Connection to NFS is quite easy. First you have to create a service account and a custom SCC that will allow the notebook pod to mount the NFS volume (detailed in the README file on the repo, but basically it’s the Restricted policy plus the NFS volume). And then in jupyterhub_config.py:

# Setup persistent storage on NFS
import os

c.KubeSpawner.service_account = 'notebook'
nfs_server = os.environ.get('NFS_SERVER')
nfs_path = os.environ.get('NFS_PATH')
# Mount the NFS export (the Lustre-backed /home and /scratch) into the notebook pod
c.KubeSpawner.volumes = [
    {
        'name': 'home',
        'nfs': {
            'server': nfs_server,
            'path': nfs_path
        }
    }
]
c.KubeSpawner.volume_mounts = [
    {
        'name': 'home',
        'mountPath': '/users'
    }
]
# Start Notebook in user directory
c.KubeSpawner.notebook_dir = '/users'
c.KubeSpawner.default_url = '/tree/home/{username}'

It works well because we have first made sure that the notebook would run with the right uid, using these lines:

secret_version_response_uid = vault_client.secrets.kv.v2.read_secret_version(
    mount_point='valeria',
    path='users/' + vault_entity_id + '/uid',
)
spawner.uid = int(secret_version_response_uid['data']['data']['uid'])

That way, the user has access to his own home directory, but also to other directories to which he may have been granted access. From Jupyter he sees only those. From a terminal he can list all directories, but cannot enter them. That’s an interesting approach from a collaboration perspective, better suited than mounting a PVC for each user.

Connections to S3 are made on the notebook side. Remember that we injected the S3 credentials into the notebook at spawn time? That’s where they are used. The HybridContentsManager assigns a different contents manager depending on the path you request. For the standard filesystem (the one we mounted through NFS) it’s straightforward. For S3 we first find all the buckets a user has, and we configure an S3ContentsManager for each one, under the datalake_bucketname path.
So in the jupyter_notebook_config.py file we have:

#######################
# Directories mapping #
#######################
import os
import boto3
from s3contents import S3ContentsManager
from pgcontents.hybridmanager import HybridContentsManager
from notebook.services.contents.filemanager import FileContentsManager

# We use HybridContentsManager (https://github.com/quantopian/pgcontents),
# FileContentsManager for accessing local volumes,
# and S3ContentsManager (https://github.com/danielfrg/s3contents) to connect to the datalake
c.NotebookApp.contents_manager_class = HybridContentsManager

# Initialize Hybrid Contents Manager with the local filesystem
c.HybridContentsManager.manager_classes = {
    # Associate the root directory with a FileContentsManager.
    # This manager will receive all requests that don't fall under any of the
    # other managers.
    '': FileContentsManager
}

# Get S3 credentials from environment variables (injected at spawn time)
aws_access_key_id = os.environ.get("AWS_ACCESS_KEY_ID")
aws_secret_access_key = os.environ.get("AWS_SECRET_ACCESS_KEY")
endpoint_url = os.environ.get("S3_ENDPOINT_URL")

# Add datalake connection information
if aws_access_key_id:  # Make sure usable S3 information is there before configuring
    # Initialize the S3 connection (us-east-1 seems to be needed even when it is not used, in Ceph for example)
    s3 = boto3.resource('s3', 'us-east-1',
                        endpoint_url=endpoint_url,
                        aws_access_key_id=aws_access_key_id,
                        aws_secret_access_key=aws_secret_access_key,
                        use_ssl=True if 'https' in endpoint_url else False)  # Provides for test environments with no https
    # Enumerate all accessible buckets and create a folder entry in HybridContentsManager
    for bucket in s3.buckets.all():
        c.HybridContentsManager.manager_classes.update(
            {'datalake_' + bucket.name: S3ContentsManager})

# Initialize arguments for the local filesystem
c.HybridContentsManager.manager_kwargs = {
    # Args for the FileContentsManager mapped to the root directory
    '': {
        'root_dir': '/users'
    }
}

# Add datalake connection arguments
if aws_access_key_id:
    # No need to re-create the connection: s3 is still in scope from the block above
    # Enumerate all buckets and configure access to each of them
    for bucket in s3.buckets.all():
        c.HybridContentsManager.manager_kwargs.update({'datalake_' + bucket.name: {
            'access_key_id': aws_access_key_id,
            'secret_access_key': aws_secret_access_key,
            'endpoint_url': endpoint_url,
            'bucket': bucket.name
        }})

Conclusion

I will finish with something that will also serve as a disclaimer. Valeria has been developed at Université Laval by a fantastic team that I had the chance and the pleasure to lead as Director of Architecture and CTO. I recently left ULaval to join Red Hat, but I am really proud of the work we’ve done on Valeria over the past two years. I’m sure it will be a great asset for researchers and students, allowing them to embrace this amazing world of Data Science more easily.

Guillaume Moutier

Written by

Hi! I am a Senior Principal Technical Evangelist working @ Red Hat. Containers, Storage, Data Science, AI/ML, that’s what it’s all about!
