PayPal Notebooks, powered by Jupyter: Enabling the next generation of data scientists at scale

Romit Mehta
The PayPal Technology Blog
Sep 18, 2018 · 8 min read

PayPal is a data-driven company. Whether building fraud detection with advanced algorithms or leveraging clickstream analytics to customize user experiences, PayPal has relied heavily on data in all aspects of its business. Naturally, with the explosion of cost-efficient big data technologies, the appetite for data has only grown. And as our data grows, it becomes ever more challenging for our data scientists and researchers to glean actionable, near real-time insights from it.

Do PayPal’s data scientists and researchers have access to all of its data? Not easily. Most analysts proficient in running complex SQL queries on relational data warehouses built on data stores like Oracle and Teradata are not comfortable writing MapReduce or Spark code to read the behavioral data stored in Hadoop. They are also unlikely to write code to read from near real-time data pipelines, such as those built on Kafka.

This is where PayPal Notebooks, bolstered by the power of Gimel, come in. Built on the popular and versatile open-source project Jupyter, PayPal Notebooks empower PayPal’s data community with fast, simple and easy access to large volumes of data collected across a variety of data stores.

Jupyter notebooks let anyone run live code in over 40 languages and combine it with visualizations and rich, marked-up text, all in a web application. Since the output of code execution is preserved within the notebook itself, notebooks make for a collaborative environment where data scientists and data engineers can work together.

This post will walk you through how the Data Platform team has turbo-charged Jupyter at PayPal.

A brief history

Before we jump into PayPal Notebooks, it is worth looking at where we are and how we got here. PayPal’s data scientists started using Zeppelin several years ago, either as single-host deployments or on shared nodes configured for multi-user use.

The data platform team took the plunge with Jupyter in Q2 2017 and set it up as a platform. PayPal Notebooks started as an internal beta in Q3 2017 with around 50 users, who were mostly using Spark and PySpark. In February 2018, PayPal Notebooks became generally available to internal users and had around 100 users.

Today, PayPal Notebooks engages well over 1,300 users, a majority of whom use SQL (both Hive and Teradata), Spark/PySpark and Python. Recently, we enabled the R kernel, which is drawing in many more analysts and data scientists.

From Jupyter to PayPal Notebooks at PayPal

Data Platform Driving Forces

As a platform team, we are driven by platform architectural principles like security, availability, scalability, reliability and observability. Every product we build is driven by these “-ilities” and is laser-focused on improving efficiency and productivity.

Within the data platform, we target four types of customers: data engineers, data analysts, data scientists and infrastructure operators. To improve their productivity, we are constantly looking for ways to reduce the time it takes them to bring their innovations to market.

PPExtensions

With growth in the variety of data stores supported, the volume of data available and customer demand, the data platform team created a series of Jupyter extensions called PPExtensions. These extensions add key features to Jupyter that spur innovation at PayPal by abstracting complexity, simplifying data access, enabling self-service deployments and easing collaboration and knowledge reuse.

PPMagics — A Collection of Jupyter Magics

%hive and %teradata: These magics let you easily query Hive and Teradata tables without having to remember how to connect to the various Hive clusters or Teradata systems. They also provide the ability to load a CSV file directly into a Hive or Teradata table. As you will read later, these magics can also publish their output directly to Tableau.

%hive magic
%teradata magic
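As a sketch, cells using these magics might look like the following; the table names are made up, and the exact connection options (cluster, system name) are documented in the PPExtensions repo rather than shown here:

# Hypothetical tables; real cells would point at actual Hive/Teradata objects.
%hive SELECT customer_id, txn_amount FROM payments.transactions LIMIT 10
%teradata SELECT COUNT(*) FROM analytics.daily_summary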

%csv: CSV is the data file format of choice among our data scientists. The %csv magic enables SQL commands on CSV files, so you can run SQL on a file just as you would on a table. It supports all ANSI SQL functionality, so you can inspect the distribution and quality of the data in your CSV files without first loading them into a table.

%csv magic
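For example, profiling a raw file might look like this sketch (the file and column names are hypothetical, and the exact way the file is referenced may differ):

# Check the distribution of a column straight from a CSV file.
%csv SELECT country, COUNT(*) AS cnt FROM transactions.csv GROUP BY country ORDER BY cnt DESC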

%run: One of the most common requests from our customers, once they get proficient with Jupyter, is the ability to run one notebook from another, usually so they can modularize their work into purpose-specific notebooks that can be maintained separately. %run was created to do exactly that. It supports passing parameters to a notebook at run time, as well as running multiple notebooks in parallel.

%run magic with parallel execution
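A hypothetical pair of cells is sketched below; the notebook names, the key=value parameter style and the parallel-run flag are illustrative, not the exact syntax:

# Run one notebook, passing parameters at run time.
%run ./prepare_data.ipynb region=US start_date=2018-09-01

# Run several independent report notebooks in parallel (flag is illustrative).
%run -p ./report_us.ipynb ./report_eu.ipynb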

%run_pipeline: The %run_pipeline magic is similar to %run, with one key difference: it allows notebooks to share state, so data can be passed from one notebook to the next through the pipeline. Notebooks in the pipeline are run sequentially.

%run_pipeline magic with parameters
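Conceptually, a three-stage pipeline might be expressed like this (notebook names and separator syntax are illustrative):

# Extract, transform and load notebooks run sequentially, sharing state.
%run_pipeline extract.ipynb transform.ipynb load.ipynb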

%sts: This magic lets you easily connect to the Apache Spark Thrift Server and run SQL commands. Like %hive and %teradata, it has options to publish directly to Tableau.

In the hope of democratizing its data, PayPal needed to overcome several hurdles. One major hurdle arose from the fact that data at PayPal is stored in disparate data stores across multiple locations, making analysis across these federated stores a manual and painful process. Gimel fixed this by providing a single API to access any data in any data store in any location. To maximize adoption, we adopted SQL as our de facto API language. For more on Gimel, refer to our previous post.

At PayPal, by integrating Gimel with Jupyter, we provide in-notebook execution of any SQL against any data store, whether persistent (e.g. Elastic or HBase) or streaming (e.g. Kafka). This is possible through the %sts magic.

%sts magic
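For instance, querying a streaming source through Gimel could look like the sketch below; the catalog and dataset names are made up for illustration:

# One SQL statement against a Kafka-backed Gimel dataset via Spark Thrift Server.
%sts SELECT * FROM pcatalog.kafka_transactions LIMIT 10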

%presto: This magic is just like %hive or %teradata in that it enables our customers to access Presto directly without needing to remember configuration and connectivity parameters.
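Usage mirrors the other SQL magics, for example (hypothetical table):

%presto SELECT event_type, COUNT(*) FROM web.click_events GROUP BY event_type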

In addition to these magics, PPExtensions includes several other notable features.

Native publishing to Tableau

An analyst’s workflow typically involves finding the data, mashing it up into an enriched dataset, and then publishing the insights to a dashboard to share with management and other stakeholders. The %publish magic takes the result of a query, or even a plain dataframe, and publishes it directly to Tableau.

Now, analysts can fetch data, process it and publish the outcome all in one flow and within a single product without needing to switch contexts.

Publish to Tableau
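A sketch of what this looks like in a cell follows; the dataframe, project and datasource names, and the option flags, are all invented for illustration:

# Publish a result dataframe as a Tableau datasource (flags are illustrative).
%publish result_df --project Analytics --datasource daily_sales_summary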

Notebook Version Control and Collaboration/Sharing through GitHub

Out of the box, Jupyter supports notebook sharing only in a naïve way: users can export notebooks in the native notebook format or in various third-party formats such as PDF or HTML. But by doing so, the customer loses the ability to version-control the notebook and to later merge changes in a collaborative fashion.

To foster collaboration while ensuring integrity, we looked to GitHub. Today, our customers can select any notebook they’d like to share and push it to GitHub. The extension provides a link to the posted notebook which can then be shared among the team. Every time a notebook is pushed with the same name, a new version appears in GitHub.

Similarly, anyone can pull from GitHub by browsing the user directory and choosing a notebook of interest. This is a powerful way not just to maintain a clear version history of all notebooks, but also to increase knowledge sharing and code reuse.

Sharing notebooks natively through GitHub

Repeatable Analysis through Scheduled Notebooks

Another common ask from data scientists and analysts is the ability to schedule notebooks to run periodically, whether to produce daily, weekly or monthly summary reports or to run model training. Our PayPal Notebooks scheduling extension provides a native integration with Apache Airflow, a popular open-source scheduler. In addition to scheduling, there is functionality to run a notebook in the background, backed by a home-grown credential management system that supports automatic, secure credential filling.

The scheduling extension also adds a new tab to the Jupyter UI, where all the key schedule-related information is pulled in from the Apache Airflow dashboard. Clicking through on a notebook job name takes you to the Airflow dashboard for more granular information about the job.

Scheduling UI
Scheduling dashboard
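For context, each scheduled notebook corresponds to an Airflow job behind the scenes. The extension manages this for you, but a hand-rolled Airflow 1.x DAG that executes a notebook headlessly via nbconvert would look roughly like this sketch (paths and names are hypothetical):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# A daily job wrapping a notebook run; dag_id and paths are illustrative.
dag = DAG(
    dag_id="daily_summary_notebook",
    start_date=datetime(2018, 9, 1),
    schedule_interval="@daily",
)

run_notebook = BashOperator(
    task_id="run_notebook",
    # nbconvert executes all cells and writes the executed copy to a new file.
    bash_command=(
        "jupyter nbconvert --to notebook --execute "
        "--output daily_summary_out.ipynb /notebooks/daily_summary.ipynb"
    ),
    dag=dag,
)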

PPExtensions Open Sourced

At JupyterCon 2018, we announced that we have open sourced PPExtensions. You can install it with pip, and you can check out the code in our GitHub repo.

See below for more information:

pip install ppextensions

Project page: ppextensions.io

GitHub: https://github.com/paypal/ppextensions

Google Group: https://groups.google.com/d/forum/ppextensions

Slack: https://ppextensions.slack.com
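Once installed, the magics are loaded into a notebook session with %load_ext. Assuming the module layout in the GitHub repo (a ppmagics module inside the ppextensions package), that looks like:

# Module path assumed from the repo layout; check the project docs to confirm.
%load_ext ppextensions.ppmagics

After that, the magics described above (%hive, %csv, %run and the rest) are available in any cell.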

This is just the beginning: we plan to add all of our extensions to PPExtensions, and all of our future work will be released as open source as well.

We cannot wait for you to try PPExtensions. Please reach out if you have any questions about usage or contributing.

Thanks to Sid Anand for providing feedback on the post.
