Accelerating Machine Learning Time to Market with GPU-powered Jupyter Notebooks

Romit Mehta
The PayPal Technology Blog
Dec 3, 2019
Image by Gerd Altmann from Pixabay

Data scientists using Jupyter through PayPal Notebooks

Ever since PayPal Notebooks was made generally available at PayPal, thousands of analysts, data scientists and developers have made Jupyter a critical part of their workflow.

The user segment that has been particularly active with notebooks is data scientists. PayPal’s data scientists, spread across many business units, use PayPal Notebooks to seamlessly query data across data sources — Teradata, Hadoop, Kafka, files over SFTP, etc. — and to build and train models with far more agility than ever before.

At the same time, there is a growing need for better performance when building and training machine learning models. Many models train dramatically faster on Graphics Processing Units (GPUs) than on the commodity CPUs that are common in the big data ecosystem surrounding Hadoop and Spark.

Challenges with GPUs

Despite their popularity in machine learning and artificial intelligence, on-premises GPUs at scale remain a niche capability for most of the industry, including PayPal. Many bottlenecks slow the path from researching a business problem to a deployed machine learning model.

Non-standard hardware with long procurement cycles

GPU and High-Performance Computing (HPC) servers come in a large variety of configurations and, as a result, are not yet standard SKUs that can be ordered and commissioned quickly for on-premises use. Procurement of non-standard hardware can take a long time, so data scientists’ time to market suffers severely.

Single-tenant implementations

Because each team justifies its own GPU purchases, access to each machine is restricted to the team that bought the hardware, and resources are not used to their full potential. We frequently saw under-utilization on some machines and, on others with more active users, requests for better resource scheduling.

Additionally, it becomes the data platform team’s burden to keep every one of these environments updated with the popular frameworks and packages.

Day-to-day operations riddled with friction

PayPal takes data security extremely seriously, for obvious reasons. As a result, accessing production systems requires many authentications before running any analysis.

PayPal Notebooks provides seamless access to analytical data stored anywhere within PayPal by transparently integrating the required security controls into the platform.

Without PayPal Notebooks in the GPU environment, data scientists have to run open-source Jupyter on individual machines and go through a tedious multi-step process to get basic notebook functionality. Besides being a complicated setup, it does not offer any of PayPal Notebooks’ core features — seamlessly connecting to a variety of data stores, sharing notebooks through Github, scheduling workloads with Airflow, publishing directly to Tableau, etc.

PayPal Notebooks are now available with GPUs

We have addressed these concerns by making the PayPal Notebooks platform available in our GPU environments. We leverage a few key pieces of technology to achieve this:

  • Docker: We containerized all the kernels so they can be launched independently of the notebooks host
  • Kubernetes: We use Kubernetes to orchestrate kernels across different types of hosts, which lets us move workloads seamlessly between commodity hardware, high-memory compute, and GPU machines
  • Jupyter Enterprise Gateway: This decouples the kernel from the notebook server, so the kernel can run on a remote host instead of the notebooks host (a sketch of the client-side wiring follows this list)
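To make that last piece concrete, here is a minimal sketch of how a Jupyter Notebook server (version 6 or later) can be pointed at a remote Enterprise Gateway through its standard configuration file. The hostname, port, and timeout below are placeholders, not our actual settings.

```python
# jupyter_notebook_config.py -- a minimal sketch, not PayPal's actual config.
# With these settings the notebook server delegates every kernel launch to the
# gateway, so the kernel runs wherever the gateway schedules it (for example,
# as a container on a GPU node) instead of on the notebook host itself.
c = get_config()  # provided by Jupyter when this config file is loaded

c.GatewayClient.url = "http://enterprise-gateway.example.internal:8888"  # placeholder endpoint
c.GatewayClient.request_timeout = 120.0  # allow extra time for remote kernel startup
```

Older notebook servers can achieve the same decoupling with the nb2kg extension, which is where the notebook server’s gateway support originated.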
PayPal Notebooks — simplified view of the current setup: a highly available notebooks grid connected to NAS mounts
PayPal Notebooks with GPU support: the setup adds Kubernetes and object storage

Dockerized kernels

All the current PayPal Notebooks features are packaged into a Dockerized kernel called PPMagics. Additionally, we provide the following kernels (a quick way to list them from a session is sketched after this list):

  • Python
  • miniconda
  • Spark Scala
  • PySpark
  • PySpark in YARN client mode
  • TensorFlow CPU
  • TensorFlow GPU
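Each of these is registered as a Jupyter kernelspec, so users can see what is available from any session. The snippet below simply lists the visible kernelspecs; the names in the sample output are made up for illustration, since the real names depend on how the images are registered.

```python
# List the kernelspecs visible to this Jupyter installation.
from jupyter_client.kernelspec import KernelSpecManager

specs = KernelSpecManager().get_all_specs()
for name, info in sorted(specs.items()):
    print(f"{name:22s} -> {info['spec']['display_name']}")

# Illustrative output only:
#   ppmagics             -> PPMagics
#   pyspark_yarn_client  -> PySpark (YARN client mode)
#   tensorflow_gpu       -> TensorFlow (GPU)
```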

Custom images

Besides providing standard containers for the core features, we have also created a process that allows anyone to create their own Docker container, building on any of the existing platform-provided containers. This is a huge benefit to data scientists who want to try new packages, or newer versions of existing packages, without needing the notebooks platform team’s help. At the same time, the platform team only needs to ensure compatibility and support for what is included in the base images.

Bringing custom kernel images into Kubernetes-powered PayPal Notebooks

Kubernetes

Kubernetes has seen increased adoption across the industry, and for good reason: it makes it easy to orchestrate the container lifecycle across a cluster. By setting up our Kubernetes cluster to span both CPU-only and GPU machines, we are now able to take any kernel and run it on any type of machine.

We also use Kubernetes to restrict groups of users to specific machines, so that a team can be given exclusive access to the hardware it owns.
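As a sketch of what this looks like under the hood, using the official Kubernetes Python client, a kernel pod can request a GPU from the device plugin and carry a node selector that pins it to a team’s reserved machines. The image, namespace, and labels below are illustrative, not our real values.

```python
# A minimal sketch of a kernel pod that lands on a GPU node reserved for one team.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="example-kernel-pod",                      # illustrative name
        labels={"app": "notebook-kernel"},
    ),
    spec=client.V1PodSpec(
        restart_policy="Never",
        # Only schedule on nodes labeled for this (hypothetical) team's GPU pool.
        node_selector={"hardware": "gpu", "team": "example-team"},
        containers=[
            client.V1Container(
                name="kernel",
                image="registry.example.internal/kernels/tensorflow-gpu:latest",  # placeholder image
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"},      # one GPU via the NVIDIA device plugin
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="notebooks", body=pod)
```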

Jupyter Enterprise Gateway

Jupyter Enterprise Gateway, an evolution of the nb2kg extension and Jupyter Kernel Gateway, is a key component in making GPU notebooks happen at PayPal. We leverage Jupyter Enterprise Gateway’s ability to run a kernel on a remote machine to launch each kernel as a container on the Kubernetes cluster.
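In Enterprise Gateway terms, that wiring lives in each kernel’s kernelspec: a process-proxy entry tells the gateway to start the kernel as a Kubernetes pod built from a given image. The sketch below mirrors the structure of Enterprise Gateway’s sample Kubernetes kernelspecs; the paths, image name, and launcher arguments are illustrative rather than our production configuration, and the exact launcher flags vary by gateway version.

```python
# A minimal sketch of writing a kernel.json for a Kubernetes-launched kernel.
import json
import os

kernel_spec = {
    "display_name": "Python (GPU, Kubernetes)",   # name shown in the notebook UI
    "language": "python",
    "metadata": {
        "process_proxy": {
            # Enterprise Gateway delegates the kernel lifecycle to this proxy,
            # which starts the kernel as a pod on the Kubernetes cluster.
            "class_name": "enterprise_gateway.services.processproxies.k8s.KubernetesProcessProxy",
            "config": {"image_name": "registry.example.internal/kernels/tensorflow-gpu:latest"},
        }
    },
    # The launcher script and its arguments follow Enterprise Gateway's sample
    # kernelspecs; treat them as illustrative.
    "argv": [
        "python",
        "/usr/local/share/jupyter/kernels/python_gpu_kubernetes/scripts/launch_kubernetes.py",
        "--RemoteProcessProxy.kernel-id", "{kernel_id}",
        "--RemoteProcessProxy.response-address", "{response_address}",
    ],
}

kernel_dir = "/usr/local/share/jupyter/kernels/python_gpu_kubernetes"
os.makedirs(kernel_dir, exist_ok=True)
with open(os.path.join(kernel_dir, "kernel.json"), "w") as f:
    json.dump(kernel_spec, f, indent=2)
```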

So what?

One of the biggest outcomes of this initiative is that we can bring data scientists, analysts, and developers into the world of GPUs instantly and securely, without changing the user experience. They continue to use the familiar PayPal Notebooks interface, with all the features built into the platform, and they can now run their model training on GPUs.
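From inside that familiar interface, moving to GPUs is as simple as picking the TensorFlow GPU kernel. A minimal sketch, assuming TensorFlow 2.1 or later, of the kind of sanity check and training code a data scientist might run unchanged:

```python
import tensorflow as tf

# Confirm the GPU is visible to this kernel before training.
gpus = tf.config.list_physical_devices("GPU")
print(f"GPUs visible to this kernel: {len(gpus)}")

# Training code is unchanged from a CPU kernel; TensorFlow places the work on
# the GPU automatically when one is available.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```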

The cycle time is dramatically reduced due to:

  • No kludgy management of machines needed
  • No need for data science teams to run their separate instances of Jupyter
  • Secure and direct access to all data stores
  • Work is shareable among colleagues working together on common projects
  • And, of course, access to much more powerful compute through GPUs

We are not done yet

While this is a big milestone for the data platform at PayPal, it is certainly not the end of the journey. In fact, we are just getting started with notebooks evolution at PayPal. Some of the items we are currently working on include:

  • Provide a choice of custom container sizes so each customer can choose the kind of environment they need (e.g., high memory or high CPU)
  • Support analytical model deployment
  • Build AutoML pipelines
  • Create and manage workflows, and much more!

(We plan to publish more in-depth posts on the internals of what we have enabled so far. Please stay tuned!)

We are incredibly excited to have delivered this to PayPal’s data community. If you want to learn more about the data platform, please let us know and we can talk. PayPal is always hiring, so if any of the above interests you and if you would like to be a part of this journey, please check out open positions at PayPal.

Many thanks to Vadim Kutsyy and Prabhu Kasinathan for reviewing this post and providing feedback.
