Pachyderm v0.4: git and s3 integration

Joe Doliner
Pachyderm Community Blog
Feb 13, 2015

A key principle of modern open source projects is that they need to integrate smoothly into users’ established workflows. We realize that learning a new interface is a huge cognitive burden for our users, so we’re always looking to avoid imposing one.

This time last month, we shipped the first version of Pachyderm MapReduce (pmr). The MapReduce engine worked, but the interface was a significant barrier to using it effectively. We spent this month dogfooding that interface so that we could come up with a better one that felt natural. In the end we realized it had been staring us in the face all along: writing a computation pipeline is still just writing code, and the only tool we wanted for managing our code was git. So that’s the interface we went with.

In Pachyderm v0.4 jobs are launched with `git push`.

The spec

The first thing we needed to do was figure out how to represent pipelines. We knew we were going to be pushing them around with git, so we wanted to stick to a simple directory structure of text files. A first draft of the spec is available here. The tl;dr is that each repo comes preloaded with the following (there’s a sketch of the layout after the list):

  • a DAG of jobs that defines the dependencies between them
  • a Dockerfile that can be built into an image implementing the logic for each job
  • a set of scripts that can be used to install the pipeline on various engines
  • an optional sample of the data that can be used to test images locally, to further speed up development
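To make that concrete, here’s a rough sketch of what such a repo might look like; the file and directory names are illustrative, not lifted from the spec:

```
chess/
├── Dockerfile        # built into the image that implements each job's logic
├── jobs/             # the DAG of jobs and the dependencies between them
│   ├── parse
│   └── analyze
├── install           # script for installing the pipeline on a cluster
└── sample/           # optional slice of the data for local testing
```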

We’ve made the pipeline from our chess demo into a functioning example.

Installing a pipeline

We wanted to make installation only one command harder than cloning the repo. A short setup time is just as important as short dev cycles. This is what you do to install the chess repo on a running cluster:
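A minimal sketch, assuming the repo lives at github.com/pachyderm/chess and that the install script takes your cluster’s address; both are stand-ins here, so check the repo’s scripts for the real invocation:

```sh
# clone the example pipeline (URL assumed; see the Pachyderm GitHub org)
git clone https://github.com/pachyderm/chess
cd chess

# hypothetical invocation: the install script wires up a git remote
# named "pachyderm" pointing at your running cluster
./install <cluster-address>
```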

This will give you a remote called pachyderm that you can push to with `git push pachyderm master`.

What happens when you push

When you push, the post-receive hook on the server gets triggered and does four things (there’s a sketch of such a hook after the list):

  • Builds an image using the code in the repo
  • Pushes that image to a local Docker registry in the Pachyderm cluster
  • Copies the DAG of jobs into the cluster
  • Kicks off the pipeline!
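For a feel of the mechanics, here’s a minimal sketch of a hook that runs those four steps. The registry address, image name, and cluster endpoint are all illustrative placeholders, not Pachyderm’s actual code:

```sh
#!/bin/sh
# sketch only: names and endpoints below are placeholders
set -e

# check the pushed code out into a work tree the hook can build from
git --work-tree=/srv/build checkout -f master
cd /srv/build

# build an image from the repo and push it to the registry in the cluster
docker build -t localhost:5000/chess-pipeline .
docker push localhost:5000/chess-pipeline

# hand the DAG of jobs to the cluster and kick off the pipeline
# (hypothetical endpoint)
tar c jobs | curl -XPOST --data-binary @- http://<cluster>/pipelines
```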

These computations are all done in their own pfs branch, so your job is isolated from the rest of the cluster. You can share results with others by sharing the link returned by the hook.

Pachyderm MapReduce over S3

Pachyderm MapReduce jobs can now take input directly from S3! This was by far our most commonly requested feature, so we were happy to add it. Many people use S3 as their catch-all persistent storage layer, and we wanted to minimize the friction of doing computation over that data. Pachyderm and S3 now work in tandem to offer the best of both worlds! Read more here.

We’ve loaded the dataset for our chess demo into a public bucket, s3://pachyderm-data/chess. The chess repo uses this as its input, so you can run it without downloading any data. Clone it now and start analyzing games!
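The bucket is public, so if you have the AWS CLI installed you can poke at the data directly without any credentials; the only thing you’ll need to fill in yourself is a real key name:

```sh
# list the public chess dataset anonymously
aws s3 ls s3://pachyderm-data/chess/ --no-sign-request

# pull a single file down for a local look (replace <key> with a real one)
aws s3 cp s3://pachyderm-data/chess/<key> . --no-sign-request
```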

We also made a number of other improvements, including local Docker registries inside your cluster, which greatly speed up pulling job images. We use local Docker registries and GitHub integration for internal development on Pachyderm too!
