Pachyderm 1.3: Pipeline Performance, Embedded Applications, Support for All Docker Images, and more

Published in Pachyderm Community Blog, Dec 13, 2016

Today, we’re pleased to announce Pachyderm 1.3. Install it now or migrate your existing Pachyderm deployment.

Pachyderm 1.3 significantly improves end-to-end performance of typical Pachyderm pipelines, allows pipelines to recover from a much wider range of failures, and introduces a number of significant enhancements. Some of the major improvements and new features in the 1.3 release include:

Embedded Applications: Our “service” enhancement allows you to embed applications such as Jupyter or dashboards within Pachyderm, access versioned data from within those applications, and expose them externally.
Pre-Fetched Input Data: Typical Pachyderm pipelines will see a many-fold end-to-end speedup because input data is now prefetched.
The Ability to Put Files via Object Store URLs: You can now use “put-file” with s3://, gcs://, and as:// URLs.
The Ability to Push Images when Creating or Updating a Pipeline: You can now call create-pipeline or update-pipeline with the --push-images flag to re-run your pipeline on the same data with new images.
Support for All Docker Images: It is no longer necessary to include anything Pachyderm-specific in your custom Docker images, so you can use any Docker image you like (with a couple of very small caveats discussed below).

Embedded Applications

Our “service” enhancement allows you to create a long-running application within Pachyderm that (1) has access to versioned data, and (2) is accessible from outside of Pachyderm. For example, you might want to spin up a Jupyter notebook server that will enable you to interactively manipulate and visualize versioned data. Or, you might want to push versioned data to a real-time company dashboard.

This is now possible in Pachyderm 1.3 and can be enabled via the service portion of a job specification. This service field allows you to define an internal_port and an external_port. For example:

{
    "service" : {
        "internal_port": 8888,
        "external_port": 30888
    },
    "transform": {
        "image": "pachyderm_jupyter",
        "cmd": [ "sh" ],
        "stdin": [
            "/opt/conda/bin/jupyter notebook"
        ]
    },
    "parallelism_spec": {
        "strategy": "CONSTANT",
        "constant": 1
    },
    "inputs": [
        {
            "commit": {
                "repo": {
                    "name": "foo"
                },
                "id": "master/0"
            }
        }
    ]
}

When we create a job with this specification, the job will be exposed internally (to Kubernetes) on port 8888 and externally (outside of Kubernetes) on port 30888. It will also have access to the input repo “foo.” In the case of a multi-node cluster, a provider-specific mechanism can be employed to access the service (e.g., on Amazon you can put a load balancer in front of the service and then assign that load balancer a DNS name).
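If you create several of these services, it can be handy to generate the specification rather than hand-edit JSON. The following is only a sketch under a few assumptions: the helper name make_service_job_spec is hypothetical, and the port check assumes the external port is served as a Kubernetes NodePort (default range 30000-32767), which this post does not state explicitly. The field names mirror the JSON example above.

```python
import json

# Default Kubernetes NodePort range.
# Assumption: the external port is exposed as a NodePort.
NODE_PORT_RANGE = range(30000, 32768)


def make_service_job_spec(image, cmd, internal_port, external_port, repo, commit_id):
    """Build a job specification dict with a "service" section.

    Hypothetical helper; field names mirror the JSON example in this post.
    """
    if external_port not in NODE_PORT_RANGE:
        raise ValueError(f"external_port {external_port} is outside 30000-32767")
    return {
        "service": {"internal_port": internal_port, "external_port": external_port},
        "transform": {"image": image, "cmd": ["sh"], "stdin": [cmd]},
        "parallelism_spec": {"strategy": "CONSTANT", "constant": 1},
        "inputs": [{"commit": {"repo": {"name": repo}, "id": commit_id}}],
    }


spec = make_service_job_spec(
    "pachyderm_jupyter", "/opt/conda/bin/jupyter notebook", 8888, 30888, "foo", "master/0"
)
print(json.dumps(spec, indent=4))
```

The printed JSON can be saved to a file and submitted the same way as a hand-written job specification.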

We have created a Jupyter notebook example to illustrate the power of this enhancement. In the example, we attach a Jupyter server to specific versions of repos within a DAG. We can then utilize a notebook to interactively explore the data, debug unexpected behavior, and develop new analyses.

Pre-Fetched Input Data

Pachyderm 1.3 pipelines no longer use FUSE for job execution. Rather, they download input data directly to disk and write output data directly to disk (before uploading it). This provides a significant speedup for pipelines. Using this benchmark, job execution times improved by about 4x!
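To make the new execution model concrete, here is a purely conceptual sketch of the download, process, upload flow described above. It is not Pachyderm's implementation: run_job and the three callables passed to it are hypothetical stand-ins, and the "object store" is simulated with local files.

```python
import shutil
import tempfile
from pathlib import Path


def run_job(fetch_inputs, transform, upload_outputs):
    """Run one job: download inputs to disk, transform, then upload outputs.

    All three callables are hypothetical stand-ins, not Pachyderm APIs.
    """
    workdir = Path(tempfile.mkdtemp())
    try:
        in_dir, out_dir = workdir / "in", workdir / "out"
        in_dir.mkdir()
        out_dir.mkdir()
        fetch_inputs(in_dir)            # 1. prefetch all input data to local disk
        transform(in_dir, out_dir)      # 2. user code reads and writes plain files
        return upload_outputs(out_dir)  # 3. upload only after the code has finished
    finally:
        shutil.rmtree(workdir)          # clean up the scratch directory


# Toy stand-ins so the sketch runs end to end.
def fetch(in_dir):
    (in_dir / "data.txt").write_text("a\nb\nc\n")


def count_lines(in_dir, out_dir):
    n = len((in_dir / "data.txt").read_text().splitlines())
    (out_dir / "count.txt").write_text(str(n))


def upload(out_dir):
    return {p.name: p.read_text() for p in out_dir.iterdir()}


result = run_job(fetch, count_lines, upload)
print(result)
```

The point of the design is that the user code in step 2 only ever touches ordinary local files, so no filesystem layer sits between it and the data.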

Put Files via Object Store URLs

The pachctl CLI now supports putting files into Pachyderm’s data versioning system (PFS) via object store URLs. Pachyderm 1.3 supports the use of s3://, gcs://, and as:// URLs. For example, to put a file directly from S3, you could use:

pachctl put-file <repo> <branch> -f s3://url_path

This way you don’t have to worry about downloading S3 files locally or creating some service that serves your files out of S3 via HTTP.
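If you script these uploads, a small wrapper can catch unsupported URLs before pachctl is called. This is a sketch only: put_file_command is a hypothetical helper, and the repo name and bucket path in the usage line are made up. The command shape and the three URL schemes are taken from this post.

```python
from urllib.parse import urlparse

# URL schemes this post lists as supported for object-store sources.
SUPPORTED_SCHEMES = {"s3", "gcs", "as"}


def put_file_command(repo, branch, url):
    """Return the argv list for `pachctl put-file <repo> <branch> -f <url>`."""
    scheme = urlparse(url).scheme
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported URL scheme: {scheme!r}")
    return ["pachctl", "put-file", repo, branch, "-f", url]


# Hypothetical repo and bucket names, for illustration only.
cmd = put_file_command("data", "master", "s3://my-bucket/path/to/file.csv")
print(" ".join(cmd))
```

The returned list can be handed to subprocess.run(cmd, check=True) on a machine where pachctl is installed and configured.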

Push Images when Creating or Updating a Pipeline

In many cases, especially during development, our users want to update their code (and thus their Docker image(s)) and re-run their pipeline with the new code. Pachyderm 1.3 makes this quite a bit easier.

To create or update a pipeline for which Pachyderm has already pulled images, you just need to build your new Docker image and then call “create-pipeline” or “update-pipeline” with the --push-images flag. For example:

pachctl update-pipeline -f pipeline.json --push-images

When this is called, Pachyderm will tag the newly built image, update the pipeline spec on the server with the newly tagged image name, and re-run the pipeline with the new image.
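During development, the build and update steps are usually run back to back, so it can help to script them. The sketch below only assembles the two commands. The helper name and the image name my-analysis are hypothetical, "docker build -t" is the standard Docker CLI, and the pachctl invocation is the one shown above.

```python
import shlex


def rebuild_and_update_commands(image, pipeline_spec="pipeline.json", context="."):
    """Return the two commands for a build-then-update development loop.

    Hypothetical helper: it builds argv lists and does not execute anything.
    """
    return [
        # Rebuild the image locally from the given build context.
        ["docker", "build", "-t", image, context],
        # Update the pipeline; --push-images asks Pachyderm to use the new image.
        ["pachctl", "update-pipeline", "-f", pipeline_spec, "--push-images"],
    ]


commands = rebuild_and_update_commands("my-analysis")
for argv in commands:
    print(shlex.join(argv))
```

To actually run the loop, each argv list can be passed to subprocess.run(argv, check=True) so that a failed build stops the update.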

Support for all Docker Images

We are very happy to announce that you no longer have to ensure that your custom images inherit Pachyderm’s “job-shim” functionality, or any Pachyderm-specific functionality for that matter. You can use your favorite Docker images without modification as long as they have cp and sh functionality (so basically any Linux-based, non-scratch image), and even these requirements (cp and sh) will be removed soon.

With this enhancement, a valid Docker image for use with Pachyderm can simply look like this:

FROM ubuntu

# set up pip, vim, etc.
RUN apt-get -y update
RUN apt-get install -y python-pip python-dev libev4 libev-dev gcc libxslt-dev libxml2-dev libffi-dev vim curl
RUN pip install --upgrade pip

# get numpy, scipy, and scikit-learn
RUN apt-get install -y python-numpy python-scipy
RUN pip install pandas
RUN pip install scikit-learn

# add our project
ADD . /

without any explicit indications or modifications specific to Pachyderm. You no longer need to use FROM pachyderm/job-shim:latest or explicitly include the “job-shim” binary. You can build directly from ubuntu, alpine, or your own custom data science base image.

Of course, if you already have images inheriting “job-shim” they will continue to work with Pachyderm 1.3.
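If you are unsure whether an existing image meets the cp and sh requirement mentioned above, one way to check is to start sh in the image and ask it to locate cp. This is a sketch under assumptions: the helper names are hypothetical and the check is our own suggestion, not something Pachyderm performs. It relies only on the standard "docker run --rm --entrypoint" options and the POSIX "command -v" builtin.

```python
import subprocess


def image_check_command(image):
    """Return a docker command that starts `sh` in the image and looks up `cp`."""
    return ["docker", "run", "--rm", "--entrypoint", "sh", image, "-c", "command -v cp"]


def image_is_usable(image, run=subprocess.run):
    """Return True if the image can start `sh` and `sh` can find `cp`.

    `run` is injectable so the check can be exercised without Docker installed.
    """
    result = run(image_check_command(image), capture_output=True)
    # A nonzero exit code means sh could not start or cp was not found.
    return result.returncode == 0


# Example (requires a local Docker daemon):
#     image_is_usable("alpine")
```

A scratch-based image fails this check because there is no sh to start, which matches the caveat above.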

Install Pachyderm 1.3 Today

For more details check out the changelog. To try the new release for yourself, install it now or migrate your existing Pachyderm deployment. Also be sure to:

Join our Slack team for questions, discussions, deployment help, etc.
Read our docs.
Check out example Pachyderm pipelines.
Connect with us on Twitter.

Finally, we would like to thank all of our amazing users who helped shape these enhancements, filed bug reports, and discussed Pachyderm workflows, and, of course, all the contributors who helped us realize 1.3!