Pachyderm 1.4: Performance Boosts, Simplified Data Partitioning, Deployment Flexibility, and more.
For those new to the project, Pachyderm is an open source, distributed data processing framework built on containers. It powers language-agnostic, reproducible data pipelines for machine learning, ETL, event processing, data munging, and much more.
Pachyderm 1.4 significantly improves the performance of metadata operations (some with up to 100x gains), greatly simplifies and expands data partitioning capabilities, and provides more deployment flexibility. Some of the major improvements and new features in the 1.4 release include:
- Performance boosts — Performance for common metadata operations has been vastly improved.
- Simplified/enhanced data partitioning — We now have a much simpler and more powerful way to specify how input data should be partitioned.
- “Custom” deploy options — You can now easily back Pachyderm by any S3-compatible object store (e.g., Minio) using “custom” deploy commands.
- Updated deployment dependencies — etcd, which was previously only used for consensus, is taking over for RethinkDB as our metadata store.
Performance boosts
Pachyderm workflows often involve many metadata operations: listing files or repositories, creating repositories, creating pipelines, and so on. In Pachyderm 1.4, the performance of these common metadata operations has been vastly improved, especially for repos with a deep commit history. In Pachyderm 1.3, operations such as listing files were O(n), where n is the number of commits; in Pachyderm 1.4, they are O(1). We have seen this result in up to 100x speedups for metadata operations.
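The asymptotic difference can be sketched with a toy model (purely illustrative; these are not Pachyderm's actual data structures): walking a commit chain to resolve a file is linear in history length, while a flat index keyed by path answers in constant time.

```python
# Toy illustration of the O(n)-vs-O(1) metadata lookup difference.
# This is NOT Pachyderm's real implementation, just the asymptotic idea.

def list_file_walk(commits, path):
    """O(n): walk the commit chain newest-to-oldest until the file is found."""
    for commit in reversed(commits):      # each commit is a dict of path -> content
        if path in commit:
            return commit[path]
    return None

def build_index(commits):
    """Flatten the history once into a path -> latest-content index."""
    index = {}
    for commit in commits:
        index.update(commit)
    return index

def list_file_indexed(index, path):
    """O(1): a single hash lookup, independent of history depth."""
    return index.get(path)

commits = [{"/a.txt": "v1"}, {"/b.txt": "v1"}, {"/a.txt": "v2"}]
index = build_index(commits)
assert list_file_walk(commits, "/a.txt") == list_file_indexed(index, "/a.txt") == "v2"
```

With a deep history, the walk repeats work on every call, while the indexed lookup stays constant no matter how many commits exist.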
Simplified/enhanced data partitioning
We now have a much simpler and more powerful way to specify how input data should be partitioned for distributed pipelines. In Pachyderm 1.4, you specify how data is partitioned in your pipelines via a simple “glob” pattern. Pachyderm uses the glob pattern to determine how many individual pieces of data, or “datums,” are in an input data set. Pachyderm will then split processing across these datums. This leads to much more granular computations, which reduces memory/disk pressure and makes checkpointing more fine grained. This model also aids in deduplication features, because the system is smart enough to never reprocess the same datum with the same code.
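As a sketch of what this looks like in practice (field names are from memory and may differ slightly in your version; the pipeline spec docs are authoritative), a pipeline that treats each top-level file in its input repo as a separate datum might look like:

```json
{
  "pipeline": { "name": "wordcount" },
  "transform": {
    "image": "ubuntu:16.04",
    "cmd": [ "bash", "-c", "wc -w /pfs/data/* > /pfs/out/counts" ]
  },
  "inputs": [
    { "repo": { "name": "data" }, "glob": "/*" }
  ]
}
```

Here the glob `/*` tells Pachyderm that each top-level file in the `data` repo is an independent datum, so the transform can be fanned out across workers one file at a time.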
However, those are just some of the great enhancements that have been enabled by the 1.4 processing model. It also enables users to:
- Correctly process modifications and deletions of data (along with additions). Fixing incorrect or corrupt input data is now just a matter of making a new commit that corrects it; in 1.3, it was much harder to recover once you’d committed bad data.
- Easily update datasets, such as training sets for machine learning models, in non-additive ways, and Pachyderm will keep your pipelines in sync with the updated data.
- Partition data by non-top-level files.
- Process only certain subdirectories within an input repo.
- Provide a pipeline with multiple inputs from the same repo.
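A small sketch makes the datum-splitting idea concrete (a toy splitter over hypothetical repo paths, not Pachyderm's matcher; Python's `PurePosixPath.match` is used here because, like Pachyderm globs, its `*` does not cross `/`):

```python
from pathlib import PurePosixPath

# Hypothetical contents of an input repo (paths are illustrative).
files = ["/train/a.csv", "/train/b.csv", "/test/a.csv", "/README.md"]

def datums(files, glob):
    """Toy datum splitter: each path matching the glob becomes one datum."""
    return [f for f in files if PurePosixPath(f).match(glob)]

datums(files, "/*")        # -> ["/README.md"]  (top-level files only)
datums(files, "/train/*")  # -> ["/train/a.csv", "/train/b.csv"]
```

The second call shows how a glob like `/train/*` partitions by non-top-level files and restricts processing to a single subdirectory, as in the list above.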
“Custom” deploy options
Pachyderm runs on top of Kubernetes and an object store of your choice. However, prior to 1.4 it wasn’t straightforward to use any object store other than those provided by the major cloud providers.
We are very happy to announce a collaboration with Minio that has greatly increased our deploy flexibility. The team at Minio contributed a new object store client, available in Pachyderm 1.4, that allows Pachyderm to be easily backed by any S3-compatible object store, such as Minio, Swift, or Ceph.
Moreover, we have added “custom” deploy commands to help you deploy Pachyderm with your own combination of Kubernetes and an S3-compatible object store.
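As a rough sketch (the flag names and argument order here are assumptions from memory, so treat `pachctl deploy custom --help` as authoritative), a deploy backed by an S3-compatible store such as Minio might look like:

```shell
# Illustrative only -- check `pachctl deploy custom --help` for exact arguments.
pachctl deploy custom --persistent-disk google --object-store s3 \
  <persistent-disk-name> <disk-size-in-GB> \
  <bucket-name> <access-key-id> <secret-access-key> <s3-endpoint>
```

The key point is that any endpoint speaking the S3 API can fill the object-store slot; nothing about the cluster setup is specific to a cloud provider.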
Updated deployment dependencies
Pachyderm 1.4 users will note that we have one fewer dependency in our deploy: we have removed RethinkDB. As a result, users will see faster deploys and will be able to take advantage of the new processing model discussed in the “Simplified/enhanced data partitioning” section above.
In 1.4, we moved the functionality previously provided by RethinkDB to etcd, which we were already using for consensus. We have also become much smarter about how we leverage the object store and how we structure metadata. These changes make Pachyderm metadata interactions a perfect use case for a key-value store like etcd, and let us take advantage of etcd’s true watch semantics and multi-document transactions.
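Watch semantics are the interesting piece here. As a conceptual sketch (a toy in-memory store, not etcd's actual API), a watch is just a callback fired on every write under a key prefix, which is how a component can react to metadata changes without polling:

```python
# Toy key-value store with prefix watches -- a conceptual sketch of why a KV
# store with watch semantics suits pipeline metadata (not etcd's real API).
from collections import defaultdict

class WatchableKV:
    def __init__(self):
        self.data = {}
        self.watchers = defaultdict(list)   # prefix -> list of callbacks

    def watch(self, prefix, callback):
        """Invoke callback(key, value) whenever a key under prefix changes."""
        self.watchers[prefix].append(callback)

    def put(self, key, value):
        self.data[key] = value
        for prefix, callbacks in self.watchers.items():
            if key.startswith(prefix):
                for cb in callbacks:
                    cb(key, value)

events = []
kv = WatchableKV()
kv.watch("/pipelines/", lambda k, v: events.append((k, v)))
kv.put("/pipelines/wordcount", "running")   # watcher fires
kv.put("/repos/data", "created")            # different prefix, no event
```

In the real system, etcd additionally provides multi-key transactions, so a group of related metadata writes can land atomically and watchers see a consistent view.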
Install Pachyderm 1.4 Today
- Join our Slack team for questions, discussions, deployment help, nerdy jokes, etc.
- Read our docs.
- Check out example Pachyderm pipelines.
- Connect with us on Twitter.
Finally, we would like to thank all of our amazing users who helped shape these enhancements, filed bug reports, and discussed Pachyderm workflows, and, of course, all the contributors, including the team at Minio, who helped us realize 1.4!