Pachyderm 1.6: Periodic Job Execution, Access Control, Advanced Statistics, Extended UI, and more.

Daniel Whitenack
Oct 4, 2017

Today, we’re pleased to announce Pachyderm 1.6! Install it now or migrate your existing Pachyderm deployment.

For those new to the project, Pachyderm is an open source and enterprise data science platform that enables reproducible data processing at scale. Pachyderm lets you create data pipelines in any language or framework, version the data inputs and outputs of every stage of those pipelines, and track the full provenance of any result.

Running TensorFlow in Pachyderm for object detection

Pachyderm 1.6 ushers in important security and management features for large-scale enterprise usage, along with features that demonstrate a continued commitment to a robust and flexible open source core.

Some of the major improvements in the Pachyderm 1.6 release include:

  • Periodic Job Execution — Trigger pipelines periodically using built-in cron for web scraping, data queries, and more.
  • Access Controls — Maintain compliance and/or manage data for large data science teams.
  • Advanced Statistics — Quickly track down the root cause of pipeline failures and analyze where you are spending compute or I/O time.
  • Extended UI/Dashboard functionality — Get to the information you need quickly and easily with extended dashboard functionality.

Periodic Job Execution

Pachyderm 1.6 makes periodic execution a breeze for things like web scraping, database queries, and more! You can now create a cron input to a pipeline, and that input will trigger the pipeline periodically. This functionality lets you avoid manual interactions or time-consuming scripting. You just specify a time in a common format, and Pachyderm takes care of the rest.
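As an illustrative sketch, a pipeline spec with a cron input looks roughly like the following. The pipeline name, image, and command here are hypothetical (not from the release itself); the `spec` field takes a standard cron expression:

```json
{
  "pipeline": {
    "name": "mongo-query"
  },
  "transform": {
    "image": "example/mongo-query:latest",
    "cmd": ["/bin/sh", "-c", "python /query.py > /pfs/out/results.json"]
  },
  "input": {
    "cron": {
      "name": "tick",
      "spec": "0 * * * *"
    }
  }
}
```

Each time the schedule fires, the cron input produces a new commit, which triggers a new job; the hypothetical spec above would run the query once an hour.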

If you are running this type of workflow, make sure you check out our new example that illustrates pain-free periodic execution. In the example, we utilize a cron input to periodically query a MongoDB database and analyze the results.

Access Controls

Access controls are built into Pachyderm 1.6+. When using the Enterprise Edition, members of your data science team will be able to interact with Pachyderm as unique users, and cluster admins will be able to restrict access to data on a per user basis.
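The exact commands are best taken from the Enterprise docs, but as a rough sketch of what administering access looks like from `pachctl` (the user and repo names below are made up, and the subcommand syntax is an assumption that may differ between versions):

```shell
# Activate access controls on the cluster (Enterprise Edition)
pachctl auth activate

# Grant a teammate read-only access to a repo
# (scopes include none, reader, writer, and owner)
pachctl auth set alice reader training-data
```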

This feature is an absolute must for any enterprise that is performing data processing on a large scale. You might need to comply with HIPAA or other official regulations, or you may just need to ensure that team members are not accidentally modifying production data. Regardless, Pachyderm 1.6 gives you the tools to easily and securely manage your data and data analytics at scale.

Check out these docs for more information on Pachyderm Enterprise Edition’s access control functionality.

Advanced Statistics

When you are running hundreds or thousands of jobs every day on constantly changing data, you need to be able to:

  • Quickly and easily pinpoint the reason why a job is failing, and
  • Analyze where you are spending the most effort in terms of compute, I/O, and total processing time.

Pachyderm 1.6’s advanced enterprise statistics give you this visibility into your jobs, pipelines, and data. You can now understand, at an extremely granular level, why your jobs are failing and/or where you are spending the most time in compute or I/O. This lets you deal with any issues quickly and optimize your workloads, no matter how big your data gets or how complicated your workflows become.
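Concretely, stats collection is switched on per pipeline. A minimal sketch of a spec with stats enabled (the pipeline name, image, and input repo are placeholders, assuming the 1.6-style `atom` input and `enable_stats` fields):

```json
{
  "pipeline": {
    "name": "my-pipeline"
  },
  "transform": {
    "image": "example/processor:latest",
    "cmd": ["/bin/process.sh"]
  },
  "input": {
    "atom": {
      "repo": "raw-data",
      "glob": "/*"
    }
  },
  "enable_stats": true
}
```

With stats enabled, per-datum timing, state, and logs for each job become available through the dashboard and `pachctl`.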

Check out these docs for more information on Pachyderm Enterprise Edition’s advanced statistics functionality.

Extended UI Functionality

Along with the backend changes that are enabling important enterprise features, Pachyderm 1.6 also includes upgrades to the Pachyderm dashboard. These upgrades give data analysts and cluster admins a bird’s eye view of what is going on in their Pachyderm cluster, while also allowing them to easily dive into details about job performance and data management.

For example, the Pachyderm 1.6 dashboard lets you access advanced statistics on a per-job and per-data-unit (or datum) level.

You can even analyze why Pachyderm wasn’t able to process a certain piece of data, quickly retrieve related logs, and view the files that were given to a particular worker.

To find out more about the Pachyderm dashboard and related functionality, take a look at our Enterprise docs.

Install Pachyderm 1.6 Today

For more details, check out the changelog. To try the new release for yourself, install it now or migrate your existing Pachyderm deployment.

Finally, we would like to thank all of our amazing users who helped shape these enhancements, filed bug reports, and discussed Pachyderm workflows and, of course, all the contributors who helped us realize 1.6!

Pachyderm Data

Elephantine Analytics

Thanks to Joey Zwicker


