A new startup named Pachyderm aims to make big data analysis simpler by providing a MapReduce alternative to the de facto standard Hadoop stack. In a blog post titled “Let’s build a modern Hadoop”, co-founder Joey Zwicker explains the motivation behind offering an alternative to Hadoop:
Hadoop has an irreparably fractured ecosystem.
Modern open source projects espouse the Unix philosophy of “Do one thing. Do it really well. Work together with everything around you.” Every single one of the projects mentioned above has had a clear creator behind it from day one, cultivating a healthy ecosystem and giving the project direction and purpose. In a flourishing ecosystem, everything integrates together smoothly to offer a cohesive and flexible stack to developers.

Hadoop never had any of this. It was released into a landscape with no cluster management tools and no single entity guiding its direction. Every major Hadoop user had to build the missing pieces internally. Some were contributed back to the ecosystem, but many weren’t. Facebook, probably the biggest Hadoop deployment in the world, forked Hadoop six years ago and has kept it closed source.
Pachyderm uses Docker, CoreOS's Fleet and Etcd, and Linux as its primary building blocks. The following diagram shows how its stack compares with Hadoop's.
Job Platform
Unlike Hadoop, which uses the JVM as its data processing engine, Pachyderm uses Docker containers. You give Pachyderm a Docker image and it will automatically distribute it throughout the cluster next to your data. According to Joey Zwicker, using Docker opens up interesting use cases, such as computer vision, that would be a challenge for Hadoop.
Docker, on the other hand, completely abstracts away any language constraints or library dependencies. Instead of needing a JVM-specific tool, you can use any libraries and just wrap them in a Docker container. For example, you can `npm install opencv` and suddenly you’re performing computer vision on petabytes of data! Tools and libraries can be written in any language so it’s ridiculously easy to integrate advances in technology into the Pachyderm stack.
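Pachyderm's job interface is not spelled out in the post, so the snippet below is only a sketch of the kind of analysis code one might wrap in such a container: a face-counting step using OpenCV's Python bindings, with a hypothetical `/in` input path standing in for whatever wiring the job framework actually provides.

```python
# Sketch of an analysis step that could be packaged in a Docker image.
# The /in input convention is a hypothetical placeholder, not Pachyderm's API.
import glob

import cv2  # OpenCV Python bindings (e.g. pip install opencv-python)

# Stock frontal-face detector shipped with OpenCV.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def count_faces(path):
    """Return the number of faces detected in a single image file."""
    image = cv2.imread(path)
    if image is None:  # unreadable or non-image file
        return 0
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return len(faces)

if __name__ == "__main__":
    for path in sorted(glob.glob("/in/*.jpg")):
        print(path, count_faces(path))
```

Because the unit of deployment is a container image rather than a JVM artifact, the same pattern works for any runtime and libraries the image happens to contain.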
Distributed Storage
Pachyderm replaces HDFS with its own distributed file system called Pachyderm File System (pfs). It draws its inspiration from git.
The entire file system is commit-based, meaning you have a complete history of every previous state of your data. Also like git, Pachyderm offers ridiculously cheap branching so that each user can have his/her own completely independent copy of the data without using additional storage. Users can develop MapReduce jobs or manipulate files in their branch without any worry of messing things up for another user.
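The post does not show the pfs API itself, but as co-founder Joey Zwicker notes in the comments below, pfs gets its git-like semantics from btrfs copy-on-write snapshots. A minimal sketch of that underlying mechanism, assuming a btrfs mount at the hypothetical path `/pfs`, root privileges, and the standard btrfs-progs tools, might look like this:

```python
# Sketch of git-like "commits" and "branches" built on btrfs snapshots.
# /pfs is a hypothetical btrfs mount point; requires btrfs-progs and root.
import subprocess

def btrfs(*args):
    subprocess.run(["btrfs", *args], check=True)

# A writable subvolume holds the live data (think of it as "master").
btrfs("subvolume", "create", "/pfs/master")

with open("/pfs/master/logs.txt", "w") as f:
    f.write("first batch of records\n")

# A read-only snapshot acts as an immutable commit; copy-on-write means it
# shares blocks with its source, so it is cheap regardless of data size.
btrfs("subvolume", "snapshot", "-r", "/pfs/master", "/pfs/commit-1")

# A writable snapshot of a commit gives an independent branch: edits here
# never affect master or the commit it was branched from.
btrfs("subvolume", "snapshot", "/pfs/commit-1", "/pfs/alice-branch")
```

Because snapshots share blocks with their source until either side changes, both the "commit" and the "branch" cost almost nothing in extra storage, which is what makes per-user branches of a large dataset feasible.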
Cluster Management
Hadoop uses YARN and ZooKeeper for job scheduling and cluster resource management. Pachyderm uses the CoreOS tools Fleet and Etcd to perform these functions instead.
Fleet is a scheduler that figures out where to run services based on resource availability. Etcd is a fault-tolerant datastore that stores configuration information and dictates machine behavior during a network partition. If a machine dies, Etcd registers that information and tells Fleet to reallocate all processes that were running on that machine. Other cluster management tools, such as Mesos and Kubernetes, can be used instead of the CoreOS tools, but they aren't officially supported yet.
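Etcd exposes its key-value store over plain HTTP, so any process in the cluster can publish or read state with ordinary requests. As a rough illustration (the `/machines` key layout and TTL value are made-up assumptions for this sketch, not anything Pachyderm or Fleet actually uses), a node could register itself like this:

```python
# Toy example of writing and reading cluster state via etcd's v2 HTTP keys API.
# The /machines/... key layout is an illustrative assumption.
import requests

ETCD = "http://127.0.0.1:2379/v2/keys"

# Register a machine with a 60-second TTL; if the machine dies and stops
# refreshing the key, the key simply expires.
requests.put(f"{ETCD}/machines/node-1",
             data={"value": "10.0.0.1", "ttl": 60}).raise_for_status()

# Any other node can read the current membership back.
resp = requests.get(f"{ETCD}/machines", params={"recursive": "true"})
for node in resp.json()["node"].get("nodes", []):
    print(node["key"], "->", node.get("value"))
```

An expiring key is the kind of signal that lets the rest of the cluster notice a dead machine and reallocate its work, mirroring the failure-handling behavior described above.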
Reactions to the post on Hacker News were mixed:
"The JVM runs every language I care about, in a mature system with a simple, stable interface. It has its problems (startup time), but for the kind of jobs you use Hadoop for it's rarely an issue.
Switching to Docker feels like a real step backwards..." lmm
"This sounds promising.
I'd love to see a Haskell-centric (performance and robust code) but ultimately language-agnostic alternative to the JVM-heavy stack that's currently in vogue..." michaelochurch
"I agree with your findings on Hadoop. It is a very complicated ecosystem. The core is tied to JVM and abstractions of map-reduce are heavy and difficult to use. There are multiple domain specific languages/api to do the same thing : pig, cascading. Yet if you pick one you cant use other. Hadoop feels like tower of babel.
But I am not in the favour of creating another stack as Hadoop replacement. Spark fills the role of distributed computation framework very well..." bipin_nag
Pachyderm is currently in alpha. You can view or contribute to the project on GitHub.
Community comments
How would a DFS based on GIT scale to petabytes?
by peter lin,
There are some well-known systems with petabytes of data in HDFS. I seriously doubt a Git-like distributed file system would scale as well. There's simply too much overhead in keeping full version history in a file system. I can see it working for small files of less than 10 MB, but not for files that are over 100 MB or even 10 GB.
Re: How would a DFS based on GIT scale to petabytes?
by Joey Zwicker,
Pfs gets its git-like semantics from btrfs, the same copy-on-write file system that powers Docker. It's not actually built on Git. Commits and branches in pfs are extremely cheap, and in practice most companies add files regularly but delete or update content much less frequently. For this use case, pfs has almost no overhead. If deletes or updates are more common, the only overhead is the past versions of the files you're storing for a rich history. Btrfs includes garbage collection where you can specify the density of past commits you want to keep based on your needs. For example, you could store a commit every minute for the last week and then only one commit per day for data older than that. We haven't fully exposed this feature yet in pfs, but we're working on it.
Re: How would a DFS based on GIT scale to petabytes?
by peter lin,
In a situation where everything is write-once and never updated, I can see it working. But for any kind of data that has hundreds or thousands of versions, the storage overhead will be difficult to scale. Take Facebook's 100-petabyte data warehouse: if storing version history adds 50 or 100x more data, that's not trivial. I can see it working for specific cases where the number of versions is less than 100 and the cost of storing deltas between versions is less than 10x the disk space. Even then, it would only work for small files below 100 MB. When I think about version control, users often want to make arbitrary comparisons between any two versions. With delta files, it's easy to show the delta between adjacent versions like 4 and 5. What if you want to compare multiple versions of large files? That calculation could be extremely time-consuming and could also hurt I/O.