Pachyderm 1.2: Stability, Performance and Usability Improvements
Pachyderm 1.2 is out today and with it a host of new features. Our biggest goal with this release was to stabilize our existing code base so that it can keep up with the production use we’re seeing. This required an architectural shift for our filesystem, PFS. Our pipelining system, PPS, remains largely unchanged. We spent the rest of our development cycle extending our API to address some use cases that it wasn’t able to handle before and benchmarking our system to find and remove performance bottlenecks. Finally, with 1.2 we’re introducing our developer docs portal.
Ready to take Pachyderm 1.2 for a spin? Head on over to the Getting Started guide.
We gave PFS, the version-controlled filesystem that underpins Pachyderm, a major overhaul in 1.2. In 1.1, PFS nodes held system state and could encounter issues when nodes died or containers got rescheduled. Stateful distributed systems are tricky to get right, especially in containerized environments where a process can get rescheduled onto a different machine. As of 1.2, PFS is completely stateless: all the state that we previously stored locally in PFS nodes is now stored in a database (RethinkDB), which we deploy as part of our Kubernetes manifest.
PFS also got a few new features as part of this overhaul. Pachyderm now gives commits semantically meaningful names: master/0, master/1 and so on. This makes the commit lineage much more intuitive. Users can now also merge commits. This enables a number of usage patterns that weren’t possible before and allows PPS (which runs on top of PFS) to intelligently roll back failed jobs and retry them.
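To make the naming scheme concrete, here is a minimal sketch of branch-scoped sequential commit names. It is a toy model, not Pachyderm's implementation: the CommitNamer class and the parent_of helper are hypothetical names, and the sketch assumes each branch is a simple linear sequence (it ignores merges).

```python
from collections import defaultdict


class CommitNamer:
    """Toy model of branch-scoped commit names such as master/0, master/1."""

    def __init__(self):
        # Next sequence number to hand out, tracked per branch.
        self._next = defaultdict(int)

    def new_commit(self, branch):
        """Return the next commit name on `branch`, e.g. "master/0"."""
        seq = self._next[branch]
        self._next[branch] += 1
        return f"{branch}/{seq}"


def parent_of(commit_name):
    """Return the previous commit on the same branch, or None for the first.

    Assumes a linear branch; a merged commit would need extra bookkeeping.
    """
    branch, seq = commit_name.rsplit("/", 1)
    return f"{branch}/{int(seq) - 1}" if int(seq) > 0 else None
```

With names like these, a commit's position in its branch can be read from the name alone, which is the sense in which the lineage becomes easier to follow.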
Finally, we added a benchmarking suite in this release and started systematically testing Pachyderm’s scalability. Pachyderm now caches some of its costlier requests, particularly those that would have to go to object storage or disk to get data. (Shout out to Brad Fitzpatrick for GroupCache, which made adding a caching layer incredibly painless.) Expect 1.3 to involve a lot more of this type of testing and many more improvements to Pachyderm’s scalability.
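The general idea of a read-through cache can be sketched in a few lines. This is only an illustration in Python using the standard library's lru_cache; it is not GroupCache (a Go library) and not Pachyderm's code. The in-memory _OBJECT_STORE dictionary and the get_block function are hypothetical stand-ins for a slow object storage read.

```python
from functools import lru_cache

# Stand-in for object storage; a real backend would be a network or disk read.
_OBJECT_STORE = {"blocks/a": b"hello", "blocks/b": b"world"}


def _fetch_from_object_store(key):
    """Hypothetical slow path: look the key up in the backing store."""
    return _OBJECT_STORE[key]


@lru_cache(maxsize=1024)
def get_block(key):
    """Read-through cache: only the first request per key reaches the backend."""
    return _fetch_from_object_store(key)
```

Calling get_block("blocks/a") twice reaches the backing store once; get_block.cache_info() reports the hit and miss counts, which is a convenient thing to watch while benchmarking.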
One of the biggest complaints we heard about 1.1 was that it was hard to iterate on pipelines. To update a pipeline, our users had to delete it, delete its data and then redeploy it. This could take a while and slowed down iteration. Users also often wanted to update a pipeline without deleting the existing output, as outside clients might have references to it which would become defunct if it was removed. To deal with this, we added UpdatePipeline, which allows you to modify an existing pipeline. The pipeline’s output is “archived,” which means that it remains readable by explicit reference but isn’t shown in ListCommit or FlushCommit. Users of those commands will always see the most up-to-date version of the data.
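The archiving behavior can be pictured with a small toy model. This is a sketch of the semantics described above, not the actual PFS data model: the Repo class and its method names are hypothetical, loosely echoing ListCommit, and the commit names in the usage lines are made up for illustration.

```python
class Repo:
    """Toy output repo: archived commits stay readable but are not listed."""

    def __init__(self):
        self._commits = {}      # commit name -> data
        self._archived = set()  # names hidden from listings

    def add_commit(self, name, data):
        self._commits[name] = data

    def archive_all(self):
        """Hide every existing commit, as a pipeline update does in this model."""
        self._archived.update(self._commits)

    def list_commits(self):
        """Return only non-archived commit names, in insertion order."""
        return [c for c in self._commits if c not in self._archived]

    def inspect_commit(self, name):
        """Explicit references still resolve, archived or not."""
        return self._commits[name]


repo = Repo()
repo.add_commit("old-output", "written by the original pipeline")
repo.archive_all()  # the pipeline is updated here
repo.add_commit("new-output", "written by the updated pipeline")
```

After these calls, repo.list_commits() contains only "new-output", while repo.inspect_commit("old-output") still returns the old data, so outside clients holding a reference to it keep working.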
Deploying Pachyderm normally involves 3 steps:
- Install pachctl, the Pachyderm CLI, locally
- Deploy pachd, the Pachyderm daemon, to your Kubernetes cluster
- Get pachctl and pachd talking to each other
This was harder than it needed to be in 1.1 and tripped a number of users up. People wound up with mismatched versions of pachctl and pachd, which led to compatibility issues. Furthermore, getting pachctl talking to pachd proved difficult, since different users needed to follow different processes. In 1.2, pachctl does most of the heavy lifting for you. As long as kubectl is working on your system, the steps now look like this:
- Install pachctl
- Run pachctl deploy, which automatically deploys a matching version of pachd
- Run pachctl port-forward, which automatically forwards your local port to the Pachyderm cluster
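If you want to script the steps above, a minimal Python sketch might look like the following. It assumes pachctl is already installed and that kubectl is configured for your cluster. The two commands are the ones named in this post with no extra flags; the helper names (missing_tools, deploy_and_connect) are hypothetical, and the sketch assumes port-forward keeps running in the foreground, so it is started without waiting for it to exit.

```python
import shutil
import subprocess

# Commands named in the post; any extra flags your setup needs are omitted.
DEPLOY = ["pachctl", "deploy"]
PORT_FORWARD = ["pachctl", "port-forward"]


def missing_tools(tools=("kubectl", "pachctl")):
    """Return the required executables that are not on PATH."""
    return [t for t in tools if shutil.which(t) is None]


def deploy_and_connect():
    """Deploy pachd, then start port forwarding as a background process."""
    missing = missing_tools()
    if missing:
        raise RuntimeError("not found on PATH: " + ", ".join(missing))
    subprocess.run(DEPLOY, check=True)  # raises if the deploy command fails
    # Assumption: port-forward is long-running, so do not wait for it to exit.
    return subprocess.Popen(PORT_FORWARD)
```

Calling deploy_and_connect() returns the port-forward process handle; call terminate() on it when you are done with the cluster.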