Jupyter + Pachyderm — Part 1, Exploring and Understanding Historical Analyses

Daniel Whitenack
Pachyderm Community Blog
Jan 3, 2017 · 8 min read

Note: This post/example has not been updated for the latest Pachyderm versions (1.4+). Please contact us (via email support@pachyderm.io or chat on our public Slack) to learn about 1.4+ Jupyter integrations.

Jupyter (and increasingly nteract) notebooks are ubiquitous in data science. They are shared between team members, referenced in blog posts, used to generate visualizations, and used to teach various data-related concepts. No doubt, these combinations of textual notes, pictures, and live code snippets are useful. However, as a friend once expressed to me:

“in some ways, Jupyter notebooks leave out one of the best attributes of a ‘scientific’ lab notebook: a [theoretically] permanent chronological record of work — preserving that record, in logical as well as chronological order, is a big step towards making [data] science more like science.”

In other words, the multi-format, exploratory functionality of Jupyter could be that much more powerful if there were a system, with which Jupyter could be paired, that would enable Jupyter notebooks to interact with chronological records of work and/or be versioned themselves. Such a system would go a long way toward enabling true scientific collaboration in both commercial and academic settings.

… enter Pachyderm! Pachyderm, with its data versioning plus data pipelining functionality, can expand the possibilities and increase the significance of applications like Jupyter and nteract by providing:

  1. A logically and chronologically ordered record of analyses with which notebooks can interact (via Pachyderm’s data versioning and provenance functionality), and
  2. A way to version work done within notebooks themselves (i.e., to save the state of notebooks over time) along with all of the corresponding input/output data.

In addition, for those who are building a DAG of processing steps, implementing ETL pipelines, deploying machine learning models, etc., a Jupyter + Pachyderm system allows engineers/scientists to attach interactive notebooks anywhere within a data flow, with access to any input/output data. They can then utilize the exploratory data analysis and visualization capabilities of Jupyter to debug complex pipelines or easily develop additional pipeline stages.

In this post, we will explore how Jupyter + Pachyderm can be utilized to explore and understand historical data analyses, which is related to the first point, (1), mentioned above. In a follow-up post, we will explore point (2) and versioned notebooks.

An example chronologically ordered record of analyses via a Pachyderm data pipeline

In this post, we are going to imagine that we are working for a bikesharing company, like citibike. We track how many bike trips are taken each day on our service. Then we calculate our daily sales by multiplying that trip number by a trip price (e.g., $5.00). I’m sure this is NOT how citibike, or similar companies, calculate their sales, but it will give us a simple chronological data processing pipeline for this post.

Further, we are going to imagine that we are gathering some weather data for our company dashboard or some other internal service. That is, we are gathering this weather data for NYC daily, but not necessarily using it in our pipeline that calculates sales.

We will handle our data storage and processing with Pachyderm’s file system and pipelining system. The tracked counts of bike trips can be versioned using Pachyderm’s data versioning in a data repository called trips, and the daily weather data can be versioned in a data repository called weather. As we commit daily files into these data repositories, we are creating a versioned, chronologically ordered record of the trips and weather on any given day in the history of our analyses. Further, we can trigger a Pachyderm pipeline on new commits to trips, where the pipeline calculates our sales, or revenue, numbers and outputs results to another data repository called sales.

Altogether, the data repositories and processing steps look like this:

The pipeline specification defining the above processing, along with the actual program (and corresponding Docker image) used to calculate the sales, can be found here and are further explained here. The daily counts of bike trips were retrieved from citibike’s public data sets, and the weather data was gathered from the forecast.io (now Dark Sky) weather API.
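To make the processing concrete, here is a minimal sketch of what such a sales-calculating stage could look like in Python. This is not the linked program; the file layout, the "Date" and "Trips" column names, and the $5.00 trip price are assumptions taken from this example:

import os
import pandas as pd

TRIP_PRICE = 5.00  # hypothetical per-trip price from the example above

# Read every daily file committed to the trips repo (mounted at /pfs/trips).
frames = [pd.read_csv(os.path.join("/pfs/trips", f))
          for f in os.listdir("/pfs/trips")]
trips = pd.concat(frames, ignore_index=True)

# Sales = daily trip count times the per-trip price.
trips["Sales"] = trips["Trips"] * TRIP_PRICE
trips[["Date", "Sales"]].to_csv("/pfs/out/sales.csv", index=False)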

A historical analysis problem we should investigate

After committing the trip and weather data into their respective data repositories and running our Pachyderm pipeline, we can plot our sales over time. When we do this, we find the following behavior (e.g., by manually plotting a sales.csv file generated by the pipeline with pandas):
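A quick plot along the following lines surfaces the behavior. This is only a sketch; the "Date" and "Sales" column names are assumptions about the layout of sales.csv:

import pandas as pd
import matplotlib.pyplot as plt

# Plot daily sales over time from the pipeline's output file.
sales = pd.read_csv("sales.csv", parse_dates=["Date"])
sales.plot(x="Date", y="Sales", figsize=(10, 4), title="Daily sales")
plt.show()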

There were a couple of days at the end of July (July 30th and 31st) that had particularly poor sales. How do we explain this behavior? How can we look back into our historical record of analyses and explore the situation on those days? Do the poor sales reflect an error in our processing or is there some more natural explanation?

Well, we can investigate all of these questions quite elegantly by attaching a Jupyter notebook to our data repositories. This will allow us to interactively explore, visualize, and manipulate the data at any state in history and at any points in our processing DAG.

Attaching a Jupyter notebook to the DAG at a certain point in history

Specifically, we can attach a Jupyter notebook to our versioned data using a Pachyderm service. This “service” allows us to embed an application, in this case Jupyter, into Pachyderm. The embedded application then has access to particular commits of the versioned data and can be reached from outside of Pachyderm (i.e., in a browser).

We are going to attach to the sales and trips repos to try to diagnose why we are seeing low sales on July 30th and 31st. In addition, let’s attach to the weather repo, because we might suspect that the weather had something to do with the poor bike sharing sales on those days. Note, we could attach anywhere within a complex DAG using these methods, without having to have some pre-existing connection between the various pieces of the DAG to which we are attaching.

The job specification to launch the Jupyter service is as follows:

{
  "service": {
    "internal_port": 8888,
    "external_port": 30888
  },
  "transform": {
    "image": "dwhitena/pachyderm_jupyter",
    "cmd": [ "sh" ],
    "stdin": [ "/opt/conda/bin/jupyter notebook" ]
  },
  "parallelism_spec": {
    "strategy": "CONSTANT",
    "constant": 1
  },
  "inputs": [
    {
      "commit": {
        "repo": {
          "name": "trips"
        },
        "id": "master/30"
      }
    },
    {
      "commit": {
        "repo": {
          "name": "weather"
        },
        "id": "master/30"
      }
    },
    {
      "commit": {
        "repo": {
          "name": "sales"
        },
        "id": "<output-commitid>/0"
      }
    }
  ]
}

We are attaching to the trips and weather repos on the master branch at commit number 30, which is the commit corresponding to July 31st (the last of the days with unexpectedly poor sales). Also, <output-commitid> should be replaced by the output commit ID in the sales repo that corresponds to commit 30 on the input repo trips. In other words, <output-commitid> identifies the sales results that have the “provenance” of the July 31st trips data (this commit ID can be found with pachctl flush-commit). With these configurations, we are viewing a snapshot of the input/output data on our DAG, along with the weather data, at a point in time corresponding to the days of poor sales (July 30th and 31st).

Under the service field in the job specification, you can see that we will be exposing the Jupyter application on port 8888 internally and port 30888 externally. Also, we will be using a pachyderm_jupyter image that includes some useful Python packages like pandas and matplotlib (built from this Dockerfile and available here on Docker Hub).

For more on data provenance in Pachyderm, check out this article. For more on Pachyderm services, read through our docs on the subject.

Exploring and understanding historical analyses with the attached notebook

(the Jupyter notebook described below can be found here)

Now with the Jupyter service up and running, we can open up a browser and start diagnosing the sales issues. When we navigate to our exposed port in a browser, we can see the familiar Jupyter file browser, except, in this case, we can see our attached data repositories and an out directory for the job itself:

After opening a Python notebook, we can import the sales, trips, and weather data just as we would from any other file system. For the sales data (committed to the sales repo as a CSV file sales.csv), we simply use pandas to read the file from /pfs/sales:
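A minimal sketch of that read, assuming the committed file is named sales.csv and has a "Date" column (both assumptions for illustration):

import pandas as pd

# The sales repo is mounted at /pfs/sales inside the notebook container.
sales = pd.read_csv("/pfs/sales/sales.csv", parse_dates=["Date"])
sales.head()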

The trips, sales, and weather data can also be merged on the respective days. This way we can attempt to determine if weather might have played a role in the poor sales:
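Continuing from the snippet above, a sketch of that merge might look like the following. The trips.csv/weather.csv file names and the shared "Date" column are assumptions; the real repos may store their daily data under different names or formats:

# Read the versioned trips and weather data from their mounted repos.
trips = pd.read_csv("/pfs/trips/trips.csv", parse_dates=["Date"])
weather = pd.read_csv("/pfs/weather/weather.csv", parse_dates=["Date"])

# Join all three data sets on the date.
merged = (sales.merge(trips, on="Date")
               .merge(weather, on="Date"))
merged.head()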

Remember, here we are merging snapshots of the historical, versioned data on the days of interest. If we wanted to, we could combine any data repositories at any commits to experiment with different states of the data or debug analyses. This is a powerful way to blend Jupyter style exploratory analysis with a chronologically organized, versioned record of inputs to and outputs from existing analyses.

Now, returning to the problem at hand, let’s visualize the daily trips, sales and weather data:
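One way to do that (again a sketch; the "Trips", "Sales", and "precipProbability" column names are assumptions) is a small panel of time series plots:

import matplotlib.pyplot as plt

# Stack the three series on a shared date axis for easy comparison.
fig, axes = plt.subplots(3, 1, figsize=(10, 9), sharex=True)
merged.plot(x="Date", y="Trips", ax=axes[0], title="Daily trips")
merged.plot(x="Date", y="Sales", ax=axes[1], title="Daily sales")
merged.plot(x="Date", y="precipProbability", ax=axes[2],
            title="Precipitation probability")
plt.tight_layout()
plt.show()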

We can see that on the last couple of days of July (the 30th and 31st), the precipitation probability was 70%+ in NYC. These were definitely not great biking days, and the weather likely contributed to the low sales and dip in trips on those days. We can rest easy that our pipeline is exhibiting the expected behavior and pass off at least some of the blame to mother nature.

Conclusions and resources

Sweet! Data versioning plus Jupyter notebooks allowed us to quickly combine historical input/output data and gain insight into some unexpected behavior.

Generally, Pachyderm can enrich interactive Jupyter and nteract analyses by providing both a logically and chronologically ordered record of analyses with which notebooks can interact. This allows data scientists and engineers to quickly and interactively debug unexpected behavior, develop new stages for data pipelines, and more.

To learn more about the above analysis and running Jupyter in Pachyderm:
