Let’s build a modern Hadoop

Not just another Hadoop rant — I promise

If you’ve been around the big data block, you’ve probably felt the pain of Hadoop, but we all still use it because we tell ourselves, “that’s just the way infrastructure software is.” However, in the past decade, infrastructure tools ranging from NoSQL databases, to distributed deployment, to cloud computing have all advanced by orders of magnitude. Why have large-scale data analytics tools lagged behind? What makes projects like Redis, Docker and CoreOS feel modern and awesome while Hadoop feels ancient?

Hadoop has an irreparably fractured ecosystem.

Modern open source projects espouse the Unix philosophy of “Do one thing. Do it really well. Work together with everything around you.” Every single one of the projects mentioned above has had a clear creator behind it from day one, cultivating a healthy ecosystem and giving the project direction and purpose. In a flourishing ecosystem, everything integrates together smoothly to offer a cohesive and flexible stack to developers.

Hadoop never had any of this. It was released into a landscape with no cluster management tools and no single entity guiding it’s direction. Every major Hadoop user had to build the missing pieces internally. Some were contributed back to the ecosystem, but many weren’t. Facebook, probably the biggest Hadoop deployment in the world, forked Hadoop six years ago and have kept it closed source.

This is not how modern open source is supposed to work. I think it’s time to create a modern Hadoop and that’s exactly what we’re trying to do at Pachyderm. Pachyderm is a completely new storage and analytics engine built on top of modern tools. The biggest benefit of starting from scratch is that we get to leverage amazing advances in open source infrastructure, such as Docker and Kubernetes.

This is why we can build something an order of magnitude better than Hadoop. Pachyderm can focus on just the analytics platform and use powerful off-the-shelf tools for everything else. When Hadoop was at this stage, they had to build everything themselves, but we don’t. The rest of this essay is our blueprint for a modern data analytics stack. Pachyderm is still really young and open source projects need healthy discussion to continue improving. Please share your opinions and help us build Pachyderm!


Blue outline denotes components built by Pachyderm.

NOTE: The Hadoop ecosystem has been around for 10 years and is very mature. It will be a while before Pachyderm has analogs for everything in the ecosystem (e.g. Hive, Pig). This comparison will be restricted to just the distributed file system, analytics engine, and directly related components that are present in both systems.


Distributed Processing

In Hadoop, MapReduce jobs are specified as Java classes. That’s fine for Java experts, but isn’t for everyone. There are a number of different solutions available that allow the use of other languages, such as Hadoop streaming, but in general, if you’re using Hadoop extensively, you’re going to be doing work in Java (or Scala).

Job pipelines are also a constant challenge with distributed processing. While Hadoop MapReduce shows actively running jobs, it doesn’t natively have any notion of a job pipeline (DAG). There are lots of job-scheduling tools that have tried to solve this problem to varying degrees of success (e.g. Chronos, Oozie, Luigi, Airflow), but ultimately, companies wind up using a mishmash of these and home-brewed solutions. The complexity of mixing custom code with outside tools becomes a constant headache.

Contrast this with Pachyderm Pipelines. To process data in Pachyderm, you simply create a containerized program which reads and writes to the local filesystem. You can use any tools you want because it’s all just going in a container. Pachyderm will inject data into your container by way of a FUSE volume and then automatically replicate the container, showing each one a different chunk of data. With this technique, Pachyderm can scale any code you write to process massive data sets in parallel. No more dealing with Java or JVM-based abstraction layers, just write your data processing logic in your favorite language with any of your favorite libraries. If it fits in a Docker container, you can use it for data analysis.

Pachyderm also creates a DAG for all the jobs in the system and their dependencies and it automatically schedules the pipeline such that each job isn’t run until it’s dependencies have completed. Everything in Pachyderm “speaks in diffs” so it knows exactly which data has changed and which subsets of the pipeline need to be rerun.


Job platform

Comparing Docker to the JVM is a bit of a stretch. We’ve categorized them as the “job platform” because they define the input format for jobs.

The JVM is the backbone of the Hadoop ecosystem. If you want to build anything in Hadoop, you need to either write it in Java or use a special-purpose tool that creates an abstraction layer between the JVM and another language. Hive, which is a SQL-like interface to HDFS, is by far the most popular and well-supported. There are also third-party libraries for common use cases such as image processing, but they are often far less standardized and poorly maintained. If you’re trying to do something more esoteric, such as analyzing chess games, you’re generally out of luck or need hack together a few different systems.

Docker, on the other hand, completely abstracts away any language constraints or library dependencies. Instead of needing a JVM-specific tool, you can use any libraries and just wrap them in a Docker container. For example, you can `npm install opencv` and Pachyderm will let you do computer vision on petabytes of data! Tools can be written in any language so it’s ridiculously easy to integrate open source technology advances into the Pachyderm stack.

Finally, Pachyderm analysis pipelines are both portable and shareable. Since everything is bundled in a container, it’s guaranteed to run in a predictable way across different clusters or datasets. Just as anyone can pull the Redis container from DockerHub and immediately be productive, imagine being able to download an NLP container — just put text in the top and get sentiment analysis out the bottom. It just works out of the box on any infrastructure! That’s what we’re creating with Pachyderm pipelines.


Distributed Storage

HDFS is one of the most stable and robust elements of the Hadoop ecosystem. It’s great for storing massive data sets in a distributed fashion, but it lacks one major feature — collaboration. Large-scale data analysis and pipelining is a naturally collaborative effort, but HDFS was never designed to be used concurrently by a company’s worth of people. Rather, it entails a great deal of jerry-rigging to keep users from stepping on each other’s toes. It’s unfortunately quite common for a job to break or change because someone else alters the pipeline upstream. Every company solves this internally in different ways, sometimes with solutions as rudimentary as giving each user their own copy of the data, which requires a ton of extra storage.

The Pachyderm File System (pfs) is a distributed file system that draws inspiration from git, the de facto tool for code collaboration. In a nutshell, pfs gives you complete version control over all your data. The entire file system is commit-based, meaning you have a complete history of every previous state of your data. Also like git, Pachyderm offers ridiculously cheap branching so that each user can have his/her own completely independent view of the data without using additional storage. Users can develop analytics pipelines or manipulate files in their branch without any worry of messing things up for another user.

Pfs stores all your data in generic object storage (S3, GCS, Ceph, etc). You don’t have to trust your data in some new technology. Instead, you get all the redundancy and persistence guarantees you’re used to, but with Pachyderm’s advanced data management features (see Unix philosophy above).

Version control for data is also very synergistic with our pipelining system. Pachyderm understands how your data changes and thus, as new data is ingested, it can run your workload on only the diff of the data rather than the whole thing. Not only is cluster performance dramatically improved, but there’s also no difference in Pachyderm between a batched job and a streaming job, the same code and infrastructure will work for both!


Cluster Management

The clustering layer is comprised of tools that let you manage the machines used for data storage and processing.

In Hadoop, the two main tools that address this are: YARN, which handles job scheduling and cluster resource management; and Zookeeper, which provides highly-reliable configuration synchronization. At Hadoop’s conception, there weren’t any other good tools available to solve these problems, so YARN and Zookeeper became strongly coupled to the Hadoop ecosystem. While this contributed to Hadoop’s early success, it now presents a significant obstacle to adopting new advances. This lack of modularity is, unequivocally, one of Hadoop’s biggest weaknesses.

Pachyderm ascribes to Docker’s philosophy of “batteries included, but removable.” We focus on doing one thing really well — analyzing large data sets — and use off-the-shelf components for everything else. We chose Kubernetes for our cluster management and Docker as our containerization format, but these are interchangeable with variety of other options.

In the Pachyderm stack, clustering is handled by Kubernetes and the CoreOS tool Etcd, which serve similar purposes to YARN and Zookeeper, respectively. Kubernetes is a scheduler that figures out where to run services based on resource availability. Etcd is a fault-tolerant datastore that stores configuration information and dictates machine behavior during a net split. If a machine dies, Etcd registers that information and tells Kubernetes to reallocate all processes that were running on that machine. Other cluster management tools, such as Mesos, can be used instead of CoreOS and Kubernetes, but they aren’t officially supported yet.

Using off-the-shelf solutions has two big advantages. First, it saves us having to write our own versions of these tools and gives us a clean abstraction layer. Second, Etcd and Kubernetes are themselves designed to be modular, so it’s easy to support a variety of other deployment methods as well.


Operating system

Both stacks run on Linux. There’s nothing too interesting to say here.


Conclusion

Scalable, distributed, data analytics tools are a fundamental piece of software infrastructure. Modern web companies are collecting ever-increasing amounts of data and making more data-oriented decisions. The world needs a modern open source solution in this space and Hadoop’s archaicness is becoming a burden on an increasingly data-driven world. It’s time for our data analytics tools to catch up to the future.

If you’d like to get in touch, you can email me at joey@pachyderm.io. Or find us on GitHub: github.com/pachyderm/pachyderm

Thanks to Joe Doliner and Gabe Dillon for reading drafts of this.