Pachyderm 1.5: GPU Support, UI, Expanded Pipeline Functionality, Auto-scaling, and more.

Daniel Whitenack
Pachyderm Community Blog
Jul 7, 2017

Today, we’re pleased to announce Pachyderm 1.5! Install it now or migrate your existing Pachyderm deployment.

For those new to the project, Pachyderm is an open source system for distributed data pipelining and data versioning. Pachyderm lets you create data pipelines composed of any languages/frameworks, version the data input and output of every stage of these pipelines, and track the full “provenance” of any results.

As we have gained a ton of new users this release cycle, it’s exciting to see Pachyderm powering production-scale machine learning, analytics, scientific research, distributed ETL, and much more.

Some of the major improvements in the Pachyderm 1.5 release include:

  • The Pachyderm UI — The brand new Pachyderm UI gives you insight into your DAG, data repositories, jobs, and more.
  • Resource Specification, Including GPU Support — You can now specify the resources needed for individual pipeline stages, including which pipeline stages should be executed on GPU nodes.
  • Expanded Data Combinations — If you have multiple inputs to your pipeline, you can now combine those inputs in a variety of interesting ways.
  • Auto-scaling — Pipeline workers can now be auto-scaled down when they are idle.
  • Efficient Data Management — Shuffling and copying data is now much more space-efficient, and you can now garbage collect your deleted files, data, and commits.
  • Enhanced Incremental Processing — A special feature called “incremental” gets you massive performance improvements for certain workloads.

The Pachyderm UI

With the Pachyderm 1.5 UI, or “dashboard,” you can:

  • Explore your versioned data — interactively explore various “data repositories” that organize and manage versions of the data flowing through your pipelines.
  • Visualize your DAG — automatically visualize the structure of your declared DAG pipeline and analyze it interactively.
  • Track your pipelines — investigate pipeline statuses, runs, and details (e.g., Docker images and commands associated with pipelines).

The Pachyderm UI is a feature that is helping enhance Pachyderm for true enterprise usage. As such, the UI will be part of a new Pachyderm Enterprise Edition that focuses on production use cases. For more information on Pachyderm Enterprise Edition, please email us at support@pachyderm.io or chat with us on our public Slack.

Resource Specification, Including GPU Support

Pachyderm 1.5 allows you to accelerate your model training and/or better schedule compute-intensive pipelines. For example, if you were developing a machine learning pipeline, you might have a training stage, a scoring or inference stage, a visualization stage, etc. With Pachyderm 1.5, you can optionally offload the training stage of that ML pipeline to a GPU node for big performance gains.

More generally, you can specify exact CPU, GPU, and/or memory resources for any Pachyderm 1.5 pipeline. This ensures that pipelines are scheduled efficiently and with enough resources, which is particularly important as your data science/engineering organization grows and must share resources across a cluster.
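
As a rough illustration, a pipeline stage that requests CPU, memory, and GPU resources might be described along the following lines. This is only a sketch: the `resource_spec` and `atom` field names, along with the pipeline name, image, command, and repo, are assumptions made for illustration, so check the pipeline specification docs for the exact fields in your version.

```python
import json


def make_training_pipeline(cpu, memory, gpu):
    """Build a hypothetical pipeline spec as a plain dict.

    All field names and values below are illustrative assumptions,
    not a verified copy of the Pachyderm pipeline specification.
    """
    return {
        "pipeline": {"name": "train-model"},  # hypothetical pipeline name
        "transform": {
            "image": "example/trainer:latest",  # hypothetical Docker image
            "cmd": ["python", "/code/train.py"],  # hypothetical command
        },
        # Assumed field: resources requested for each worker of this stage.
        "resource_spec": {"cpu": cpu, "memory": memory, "gpu": gpu},
        # Assumed field: a single input repo (hypothetical repo name).
        "input": {"atom": {"repo": "training-data", "glob": "/"}},
    }


# Request one CPU, 4G of memory, and one GPU for the training stage.
spec = make_training_pipeline(cpu=1, memory="4G", gpu=1)
print(json.dumps(spec, indent=2))
```

Only the training stage would carry the GPU request in a setup like this; the scoring and visualization stages could ask for CPU and memory alone.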

Expanded Data Combinations and Management

Pachyderm 1.5 makes combining data sources easier and minimizes inefficient data transfers.

Pachyderm 1.5 allows you to combine data from various sources using the flexible and familiar primitives cross and union. For example, if you need to test ML models across a huge number of parameters, you could “cross” your training data with your parameters and distribute the testing for all combinations of those parameters. This reduces the time needed to set up distributed processing of various data sources (e.g., for parameter tuning) and lets data scientists focus their time on model development.
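
To make the two primitives concrete, here is a small sketch of what cross and union mean as set operations, written in plain Python rather than against Pachyderm itself. The file names and parameter strings are made up for the example.

```python
from itertools import chain, product


def cross(*inputs):
    """Every combination that takes exactly one item from each input."""
    return list(product(*inputs))


def union(*inputs):
    """All items from all inputs, each one handled on its own."""
    return list(chain(*inputs))


# Hypothetical inputs: two training sets and three parameter settings.
training_sets = ["train_a.csv", "train_b.csv"]
parameters = ["lr=0.01", "lr=0.1", "lr=1.0"]

# Crossing yields one unit of work per (training set, parameter) pair.
combos = cross(training_sets, parameters)
# A union simply pools the items from both inputs.
pooled = union(training_sets, parameters)

print(len(combos), len(pooled))
```

Each pair produced by the cross is an independent unit of work, which is what makes it possible to distribute the testing across workers.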

In addition, Pachyderm 1.5 takes space-efficient data management to a whole new level. For workflows that require you to shuffle data (e.g., arranging into time-windowed buckets) or copy data from one repository to another, Pachyderm 1.5 lets you perform those shuffles or copies without creating any duplicate data. This minimizes network traffic and reduces inefficient data transfers. Pachyderm 1.5 also gives you explicit control over garbage collecting deleted files, data repositories, commits, etc.
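
One general way a copy or shuffle can avoid duplicating data is content addressing, where a file is just a reference to a block keyed by the hash of its contents. The toy store below illustrates that idea, plus explicit garbage collection of unreferenced blocks. It is a sketch of the general technique, not Pachyderm's actual storage layer, and the class, method, repo, and path names are all hypothetical.

```python
import hashlib


class ToyStore:
    """A minimal in-memory content-addressed store (illustrative only)."""

    def __init__(self):
        self.blocks = {}  # content hash -> bytes, stored once per content
        self.files = {}   # (repo, path) -> content hash

    def put(self, repo, path, data):
        digest = hashlib.sha256(data).hexdigest()
        self.blocks.setdefault(digest, data)  # identical content is not stored twice
        self.files[(repo, path)] = digest

    def copy(self, src_repo, src_path, dst_repo, dst_path):
        # Only the reference is copied; no bytes are duplicated.
        self.files[(dst_repo, dst_path)] = self.files[(src_repo, src_path)]

    def get(self, repo, path):
        return self.blocks[self.files[(repo, path)]]

    def delete(self, repo, path):
        # Deleting removes the reference; the block stays until collected.
        del self.files[(repo, path)]

    def garbage_collect(self):
        """Drop blocks that no file references; return how many were removed."""
        live = set(self.files.values())
        dead = [digest for digest in self.blocks if digest not in live]
        for digest in dead:
            del self.blocks[digest]
        return len(dead)


store = ToyStore()
store.put("raw", "/2017-07-07/events.json", b'{"event": "click"}')
# "Shuffle" the file into a time-windowed bucket in another repo.
store.copy("raw", "/2017-07-07/events.json", "windowed", "/week-27/events.json")
blocks_after_copy = len(store.blocks)

store.delete("raw", "/2017-07-07/events.json")
removed_while_referenced = store.garbage_collect()
store.delete("windowed", "/week-27/events.json")
removed_when_unreferenced = store.garbage_collect()
print(blocks_after_copy, removed_while_referenced, removed_when_unreferenced)
```

In this sketch the block survives garbage collection for as long as any file still points at it, and is reclaimed only once the last reference is deleted.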

Auto-scaling

Pachyderm 1.5 reduces the cost of and contention for cluster resources.

Pachyderm 1.5 adds full support for auto-scaling at the Pachyderm worker level that can complement cloud auto-scaling. Pachyderm 1.5 allows you to specify a threshold, which will let Pachyderm scale down idle workers after a certain period of time.

This scale down of active workers can dramatically reduce the cost of resources when you are processing bursts of data and/or when you are performing large distributed batch jobs once a day, once a month, etc. You can scale up Pachyderm workers automatically when you need them and scale them down when they are idle.
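
The threshold logic can be pictured with a short sketch. This is not Pachyderm's scheduler: the function name and its parameters are hypothetical, and the number of workers left running when a pipeline is idle is passed in as an assumption rather than taken from the release.

```python
from datetime import datetime, timedelta


def workers_after_idle_check(current_workers, last_job_time, now, threshold, idle_workers):
    """Return the worker count after applying a scale-down threshold.

    If no job has been seen for at least `threshold`, scale down to
    `idle_workers` (an assumed value); otherwise keep the current count.
    """
    if current_workers <= idle_workers:
        return current_workers  # already at or below the idle level
    if now - last_job_time >= threshold:
        return idle_workers
    return current_workers


now = datetime(2017, 7, 7, 12, 0)
threshold = timedelta(minutes=30)

# A job finished 5 minutes ago: the pipeline keeps its workers.
busy = workers_after_idle_check(8, now - timedelta(minutes=5), now, threshold, idle_workers=1)
# The last job was 2 hours ago: the pipeline scales down.
idle = workers_after_idle_check(8, now - timedelta(hours=2), now, threshold, idle_workers=1)
print(busy, idle)
```

For a batch job that runs once a day, a check like this would leave the extra workers released for most of the day.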

Install Pachyderm 1.5 Today

For more details, check out the changelog. To try the new release for yourself, install it now or migrate your existing Pachyderm deployment.

Finally, we would like to thank all of our amazing users who helped shape these enhancements, filed bug reports, and discussed Pachyderm workflows, and, of course, all the contributors who helped us realize 1.5!
