Pachyderm 1.5: GPU Support, UI, Expanded Pipeline Functionality, Auto-scaling, and more.
For those new to the project, Pachyderm is an open source system for distributed data pipelining and data versioning. Pachyderm lets you create data pipelines composed of any languages/frameworks, version the data input and output of every stage of these pipelines, and track the full “provenance” of any results.
As we have gained a ton of new users this release cycle, it’s exciting to see Pachyderm powering production-scale machine learning, analytics, scientific research, distributed ETL, and much more.
Some of the major improvements in the Pachyderm 1.5 release include:
- The Pachyderm UI: The brand new Pachyderm UI gives you insight into your DAG, data repositories, jobs, and more.
- Resource Specification, Including GPU Support: You can now specify the resources needed for individual pipeline stages, including which stages should be executed on GPU nodes.
- Expanded Data Combinations: If you have multiple inputs to your pipeline, you can now combine those inputs in a variety of interesting ways.
- Auto-scaling: Pipeline workers can now be auto-scaled down when they are idle.
- Efficient Data Management: Shuffling and copying data is now much more space-efficient, and you can now garbage collect your deleted files, data, and commits.
- Enhanced Incremental Processing: A special feature called “incremental” gets you massive performance improvements for certain workloads.
The Pachyderm UI
With the Pachyderm 1.5 UI, or “dashboard,” you can:
- Explore your versioned data: interactively explore various “data repositories” that organize and manage versions of the data flowing through your pipelines.
- Visualize your DAG: automatically visualize the structure of your declared DAG pipeline and analyze it interactively.
- Track your pipelines: investigate pipeline statuses, runs, and details (e.g., Docker images and commands associated with pipelines).
The Pachyderm UI is a feature that is helping enhance Pachyderm for true enterprise usage. As such, the UI will be part of a new Pachyderm Enterprise Edition that focuses on production use cases. For more information on Pachyderm Enterprise Edition, please email us at email@example.com or chat with us on our public Slack.
Resource Specification, Including GPU Support
Pachyderm 1.5 allows you to accelerate your model training and/or better schedule compute-intensive pipelines. For example, if you were developing a machine learning pipeline, you might have a training stage, a scoring or inference stage, a visualization stage, etc. With Pachyderm 1.5, you can optionally offload the training stage of that ML pipeline to a GPU node for big performance gains.
More generally, you can specify exact CPU, GPU, and/or memory resources for any Pachyderm 1.5 pipeline. This ensures that pipelines are scheduled efficiently and with enough resources, which is particularly important as your data science/engineering organization grows and must share resources across a cluster.
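As a rough illustration, a pipeline stage that asks for specific resources might be described along the following lines. This is only a sketch in Python that builds a JSON document: the field names (resource_spec, memory, cpu, gpu), the pipeline name, the image, and the command are all assumptions made for illustration, and the input section is left out. Check the pipeline specification in the docs for the exact schema of your version.

```python
import json


def training_pipeline_spec(memory="4G", cpu=2.0, gpu=1):
    """Build a hypothetical pipeline spec that requests explicit resources.

    All keys and values here are illustrative assumptions, not a
    confirmed Pachyderm schema.
    """
    return {
        "pipeline": {"name": "train-model"},          # hypothetical name
        "transform": {
            "image": "example/train:latest",          # hypothetical image
            "cmd": ["python", "/code/train.py"],      # hypothetical command
        },
        # Assumed field: resources each worker of this stage should get.
        "resource_spec": {"memory": memory, "cpu": cpu, "gpu": gpu},
    }


if __name__ == "__main__":
    # Print the spec as JSON, e.g. to save it to a file for review.
    print(json.dumps(training_pipeline_spec(), indent=2))
```

Only the training stage requests a GPU here; scoring or visualization stages would simply set gpu to 0 or omit it, so they can be scheduled on cheaper nodes.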
Expanded Data Combinations and Management
Pachyderm 1.5 makes combining data sources easier and minimizes inefficient data transfers.
Pachyderm 1.5 allows you to combine data from various sources using the flexible and familiar primitives cross and union. For example, if you need to test ML models across a huge number of parameters, you could “cross” your training data with your parameters and distribute the testing for all combinations of those parameters. This reduces the time needed to set up distributed processing of various data sources (e.g., for parameter tuning) and lets data scientists focus their time on model development.
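To make the difference between the two primitives concrete, here is a small Python sketch of what crossing and unioning two inputs mean conceptually. It uses plain lists and the standard library rather than Pachyderm itself, and the file and parameter names are made up.

```python
from itertools import chain, product


def cross(*inputs):
    """Every combination that takes exactly one item from each input."""
    return list(product(*inputs))


def union(*inputs):
    """All items from all inputs, each handled on its own."""
    return list(chain(*inputs))


# Hypothetical inputs: two training sets and three parameter settings.
training_sets = ["train-a.csv", "train-b.csv"]
parameters = ["lr=0.01", "lr=0.1", "lr=1.0"]

pairs = cross(training_sets, parameters)    # 2 * 3 = 6 units of work
singles = union(training_sets, parameters)  # 2 + 3 = 5 units of work
```

With a cross, each unit of work sees one training set together with one parameter setting, which is the shape you want for parameter tuning. With a union, each unit of work sees a single item from either input.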
In addition, Pachyderm 1.5 takes space-efficient data management to a whole new level. For workflows that require you to shuffle data (e.g., arranging into time-windowed buckets) or copy data from one repository to another, Pachyderm 1.5 lets you perform those shuffles or copies without creating any duplicate data. This minimizes network traffic and reduces inefficient data transfers. Pachyderm 1.5 also gives you explicit control over garbage collecting deleted files, data repositories, commits, etc.
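One common way to get copies without duplicate data is content-addressed storage, where files are references to blobs keyed by their hash. The Python sketch below illustrates that general idea together with explicit garbage collection. It is not Pachyderm's implementation, and the class and method names are hypothetical.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: each distinct blob is kept once."""

    def __init__(self):
        self.blobs = {}  # content hash -> bytes
        self.repos = {}  # repo name -> {path: content hash}

    def put(self, repo, path, data):
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # store bytes only if new
        self.repos.setdefault(repo, {})[path] = digest

    def copy(self, src_repo, src_path, dst_repo, dst_path):
        # A copy (or shuffle) only adds a reference; no bytes are duplicated.
        digest = self.repos[src_repo][src_path]
        self.repos.setdefault(dst_repo, {})[dst_path] = digest

    def delete(self, repo, path):
        # Deleting removes the reference; the blob stays until collected.
        del self.repos[repo][path]

    def get(self, repo, path):
        return self.blobs[self.repos[repo][path]]

    def garbage_collect(self):
        """Remove blobs no file refers to; return how many were removed."""
        live = {d for files in self.repos.values() for d in files.values()}
        dead = [d for d in self.blobs if d not in live]
        for d in dead:
            del self.blobs[d]
        return len(dead)
```

In this model, copying a file into a second repository leaves the number of stored blobs unchanged, and space is only reclaimed when garbage collection runs after every reference has been deleted.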
Auto-scaling
Pachyderm 1.5 reduces the cost of and contention for cluster resources.
Pachyderm 1.5 adds full support for auto-scaling at the Pachyderm worker level that can complement cloud auto-scaling. Pachyderm 1.5 allows you to specify a threshold, which will let Pachyderm scale down idle workers after a certain period of time.
This scale-down of active workers can dramatically reduce the cost of resources when you are processing bursts of data and/or when you are performing large distributed batch jobs once a day, once a month, etc. You can scale up Pachyderm workers automatically when you need them and scale them down when they are idle.
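The idle-threshold behavior described above can be sketched in a few lines of Python. This is a simplified model, not Pachyderm's scheduler: the names WorkerPool, scale_down_threshold_s, and min_workers are hypothetical, and the floor of one remaining worker is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class WorkerPool:
    max_workers: int
    scale_down_threshold_s: float  # idle time allowed before scaling down
    min_workers: int = 1           # assumed floor kept while idle
    active_workers: int = 0
    last_work_time: float = 0.0

    def on_work(self, now):
        """New data arrived: scale up to full size and reset the idle clock."""
        self.active_workers = self.max_workers
        self.last_work_time = now

    def tick(self, now):
        """Periodic check: scale down once the pool has been idle long enough."""
        idle_for = now - self.last_work_time
        if idle_for >= self.scale_down_threshold_s:
            self.active_workers = min(self.active_workers, self.min_workers)
        return self.active_workers
```

For a job that runs once a day with a five-minute threshold, such a pool would hold its full set of workers only while the job runs plus five minutes, and stay at the floor for the rest of the day.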
Install Pachyderm 1.5 Today
- Join our Slack team for questions, discussions, deployment help, etc.
- Read our docs.
- Check out example Pachyderm pipelines.
- Connect with us on Twitter.
Finally, we would like to thank all of our amazing users who helped shape these enhancements, filed bug reports, and discussed Pachyderm workflows and, of course, all the contributors who helped us realize 1.5!