Pachyderm v0.7: Replication, Automatic Failover, and Tests

aka: RAFT <— see what I did there? ☺

Published in

Pachyderm Community Blog

2 min readMay 18, 2015

In distributed systems, data redundancy and failure tolerance are really hard problems to solve correctly. As recently as 3 years ago, every company or project would have to implement replication and failover themselves. But these days, we’ve finally reached a point where there are enough amazing off-the-shelf tools, such as CoreOS, Kubernetes, and Mesos, that solve these problems right out of the box.

At Pachyderm, we chose to use CoreOS’s distributed systems primitives to build out basic replication and automatic failover in less than a month of development. Deployment on Kubernetes or Mesos will eventually be available too.

Replication

Pachyderm achieves its fault tolerance and availability by leveraging the scheduler beneath it. Data is replicated by scheduling multiple copies of the shard service for that piece of data. For example, to store three copies of all your data, you’d simply schedule three copies of every shard service. Once scheduled, each service will automatically discover its peers in the cluster using etcd and get caught up to date on the shard it’s supposed to be replicating. Keep in mind that data in pfs is replicated based on commits, so data that hasn’t been committed yet will only have one copy stored. To make scheduling shards easy, we’ve added in a new deploy utility for setting up pfs clusters. Here’s how it works.

Automatic Failover

In order to have automatic failover, Pachyderm assumes the existence of a consensus protocol service running in the cluster. Right now, etcd is the only supported service, but we’ll be adding others soon. In the event of a netsplit, whichever side has a quorum of etcd nodes will continue to be the working cluster. Using etcd’s locking protocol, Pachyderm automatically reelects master nodes in the case of a machine failures. You can read more about etcd on CoreOS’s site.

Testing Suite

v0.7 also ships with a much more rigorous testing suite for both the file system and analytics engine. If you’re building on top of Pachyderm, you can run the tests with: scripts/launch_test

Thank you to Robert Winslow and Brendan Ashworth who both made significant contributions to this release.

We’re Hiring!

Pachyderm is just founders right now and we’re looking for our first hire. If you like ambitious distributed systems problems and think there should be a better alternative to Hadoop, please reach out. Our codebase is written in Go, but Go experience isn’t required. Email jobs@pachyderm.io.

You can also read more about our vision here.