Pachyderm v0.5 release: Single Node Mode

Joey Zwicker
Pachyderm Community Blog
2 min read · Mar 16, 2015

When building any new open source tool, the most important thing is getting users to that magical “wow” moment as quickly as possible. Even if the product has some rough edges (in our case many), you can still create a great user experience by making those first few moments nice and smooth.

For a distributed analytics product like Pachyderm, large-scale deployment is always going to require some effort, but wouldn’t it be great if you could try out the product locally before dealing with a distributed deployment?

Introducing Pachyderm Single Node Mode (internally referred to as “the more tightly pached derm”)!

Pachyderm v0.5 includes a script that lets you run pfs locally on a single machine. With a single command, you're ready to start snapshotting your data and writing MapReduce jobs.

Obviously, running Pachyderm locally totally defeats the purpose of a distributed analytics framework, but it’s a great way to do some weekend computation over smaller data sets or to test jobs locally before you push them off to the cluster. The only dependencies are Docker and btrfs-tools.
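You can verify those two dependencies up front. Here's a small sketch; the `apt-get` package names are Debian/Ubuntu-style assumptions, so adjust for your distro:

```shell
#!/bin/sh
# Check that the two dependencies for single node mode are installed.
# Package name hints are Debian/Ubuntu-style assumptions.
check_dep() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1 (on Debian/Ubuntu, try: sudo apt-get install $2)"
  fi
}

check_dep docker docker.io        # the Docker client/daemon
check_dep mkfs.btrfs btrfs-tools  # ships in the btrfs-tools package
```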

Getting started

We’ve completely streamlined the process for you to start analyzing data in minutes. We’ve also included all the chess data we used in our recent post and made it publicly available in S3 for you to enjoy.

Run a local Pachyderm instance with sample data

Step 1: Launch a local pfs shard

Download and run the Pachyderm launch script to get a local instance running.

Step 2: Clone the chess pipeline

Clone the chess git repo we’ve provided. You can check out the full map code on GitHub.

Step 3: Install the pipeline locally and run it

Run the local install script to start the pipeline. It should take around 6 minutes.

You can view the running processes using htop:

htop shows the Stockfish processes running. Stockfish is the open source chess engine used for analyzing the games
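If you don't have htop installed, plain `ps` can filter for the engine processes just as well:

```shell
#!/bin/sh
# List any running Stockfish processes without htop. The [s] in the
# pattern keeps the grep process itself out of the match.
list_engine_procs() {
  ps aux | grep -i '[s]tockfish' || echo "no stockfish processes running"
}

list_engine_procs
```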

Each file in the results is a game, with each line corresponding to a move's worth of analysis.

Next steps:

Now that you have the data mapped, do whatever analysis you want!

  • You can store your own chess games in pgn format, change the input in the job descriptor, and see what the chess engine thinks of your play!
  • To analyze more than the default limit of 5 chess games, change the input directory to s3://pachyderm-data/chess, write a reduce job, and come up with your own insights about professional chess!
  • Spin up a full-blown cluster and crank through every chess game in history or your own data set!
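On the reduce-job idea above: a reducer is just a program that aggregates the mapped output. As a toy sketch (the stdin-fed interface and the sample lines below are assumptions for illustration, not Pachyderm's actual job contract), counting total analyzed moves could look like:

```shell
#!/bin/sh
# Toy reducer sketch: total the lines across mapped game output fed on
# stdin -- one line per analyzed move, so this counts moves analyzed.
count_moves() {
  wc -l | tr -d '[:space:]'
}

# Hypothetical sample input standing in for real mapped output:
printf 'move 1 eval +0.3\nmove 2 eval -0.1\nmove 3 eval +0.2\n' | count_moves
echo
```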

Other happenings at Pachyderm:

We’ve also been rapidly pushing the bounds of scalability with Pachyderm, doing some data analysis of our own. We’ve fixed a ton of bugs in the MapReduce pipeline and added optional arguments to the job descriptors, including `limit` and `parallel`. A followup to our chess demo is also in the works. ☺
