Pachyderm 1.7: Graphical pipeline builder, new structure for versioned data, official Python client, and more!
For those new to the project, Pachyderm is an open source and enterprise data science platform that enables reproducible data processing at scale. We have users implementing some pretty amazing production pipelines for AI/ML training, inference, ETL, bioinformatics, financial risk modeling, and more. For example, the US Department of Defense’s DIUx organization is using Pachyderm to power their xView satellite imagery detection challenge (see figure below), which was recently featured in Wired. Now with the release of version 1.7, we can’t wait to see what our users will build!
The major improvements in the Pachyderm 1.7 release include:
- A graphical pipeline builder — Build new data pipelines quickly and intuitively from the Pachyderm dashboard.
- A new structure for organizing versioned data — Maintain robust pipelines subscribed to changes in data and easily recover from “bad” updates to data.
- Official support for the Pachyderm Python client — Integrate Pachyderm data and pipelines into any Python-based application and manage data, pipelines, access controls, and more directly from Python.
- More granular pipeline controls — Easily control job/data timeouts and resources utilized by pipeline stages.
Graphical pipeline builder
Not everyone wants to build and manage data pipelines from the command line or via language clients (e.g., our new Python client). Sometimes data scientists/analysts need to quickly set up pipelines for experimenting with new data, trying out new models, or cleaning up new data sources, and they would prefer to do this visually.
Pachyderm’s new pipeline builder, which is part of the Pachyderm dashboard, lets data scientists quickly create and deploy data pipelines via a graphical control plane. This lets data scientists focus on data sets and associated processing, while Pachyderm handles all of the deployment and scheduling details under the hood. They can select what data they want to process and then utilize any data science language/framework to perform that processing, whether that be scikit-learn, PyTorch, ggplot, or TensorFlow.
You can find out more about the pipeline builder and other Pachyderm Enterprise features here.
New structure for organizing versioned data
There’s a lot riding on production data science pipelines, whether that’s a modeling pipeline predicting fraudulent financial transactions or a series of data aggregations that gives visibility into company sales. The triggering, updating, and management of these pipelines needs to be rock solid as data changes and code is updated.
Pachyderm 1.7 makes a number of updates to the underlying structure and organization of our versioned data, which make our data pipelines even more robust. When pipelines need to reprocess data, they will now only reprocess the most recent version of that data. In this way, pipelines become immune to previous states of the data that might have included corrupt, or otherwise bad, data. In addition, any change to input data creates an internal metadata structure relating that change to downstream collections of data that are dependent on that change. This allows Pachyderm to manage pipeline dependencies for both data and processing in a unified and resilient manner.
Official support for the Pachyderm Python client
Data scientists love Pachyderm, and data scientists love Python. So we decided it was time to officially support the Pachyderm Python client, which started as a user-contributed project (special thanks to our users kalugny and frankhinek for their contributions).
The Pachyderm Python client will now be integrated into our internal CI builds and will be kept up to date with our latest API. This will allow data scientists to more easily iterate on their pipelines and manage Pachyderm resources. For example, they can now quickly pull versioned data from Pachyderm into Jupyter notebooks for experimentation and integrate pipeline triggering and results into any Python application.
Note, Pachyderm is still completely language agnostic, and we aren’t forcing anyone to use Python. However, this will be a great boost for the many users already integrating Pachyderm with their Python applications!
Check out these docs for more information on the Python client.
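As a taste of the workflow, here is a minimal sketch of committing and reading versioned data with the Python client. The names below (`PfsClient`, `create_repo`, `put_file_bytes`, `get_file`) follow the client's README at the time of writing, and the `sales` repo and file path are made up for illustration; check the client docs for the exact API in your version.

```python
def demo():
    # Sketch only: names/signatures follow the python-pachyderm README
    # and may differ by version. Requires a running Pachyderm cluster.
    import python_pachyderm  # pip install python-pachyderm

    # PfsClient talks to pachd (localhost:30650 by default).
    pfs = python_pachyderm.PfsClient()

    # Create a versioned data repository (hypothetical name).
    pfs.create_repo('sales')

    # Commit a file; every commit is an immutable snapshot of the repo.
    with pfs.commit('sales', 'master') as commit:
        pfs.put_file_bytes(commit, '/2018/q1.csv', b'region,amount\nus,100\n')

    # Read the versioned file back, e.g. into a Jupyter notebook for
    # experimentation; get_file yields the file contents in chunks.
    return b''.join(pfs.get_file('sales/master', '/2018/q1.csv'))
```

Because every read is pinned to a commit, a notebook experiment run today against `sales/master` can be reproduced later against the exact same snapshot of the data.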
More granular pipeline controls
As we help more and more data science and engineering teams scale their data pipelines, we discover trends related to how teams want to customize their pipelines. Pachyderm 1.7 gives data scientists and engineers more granular pipeline controls based on these trends.
Pachyderm 1.7 gives data scientists/engineers more control over the resources needed for any particular pipeline stage. They can set resource “limits” for pipelines to control the amount of memory, CPU usage, and GPU usage that a pipeline is allowed to consume. They can also set resource “requests,” such that pipeline workers are scheduled on nodes that have certain resources available.
Further, Pachyderm 1.7 allows data scientists to set timeouts for processing certain jobs and data. This is super valuable for data scientists who run compute-intensive jobs like model training on expensive resources like GPUs. These data scientists can rest easy knowing that their jobs are time-boxed, and teams can leverage these timeouts to make sure that shared resources are optimally utilized.
Install Pachyderm 1.7 Today
- Join our Slack team for questions, discussions, deployment help, etc.
- Read our docs.
- Check out example Pachyderm pipelines.
- Connect with us on Twitter.
Finally, we would like to thank all of our amazing users who helped shape these enhancements, filed bug reports, and discussed Pachyderm workflows, and, of course, all the contributors who helped us realize 1.7!