A new startup named Pachyderm aims to make big data analysis simpler by providing a MapReduce alternative to the de facto standard Hadoop stack. In a blog post titled “Let’s build a modern Hadoop”, co-founder Joey Zwicker explains the motivation behind offering an alternative to Hadoop:
Hadoop has an irreparably fractured ecosystem.
Modern open source projects espouse the Unix philosophy of “Do one thing. Do it really well. Work together with everything around you.” Every single one of the projects mentioned above has had a clear creator behind it from day one, cultivating a healthy ecosystem and giving the project direction and purpose. In a flourishing ecosystem, everything integrates together smoothly to offer a cohesive and flexible stack to developers.

Hadoop never had any of this. It was released into a landscape with no cluster management tools and no single entity guiding its direction. Every major Hadoop user had to build the missing pieces internally. Some were contributed back to the ecosystem, but many weren’t. Facebook, probably the biggest Hadoop deployment in the world, forked Hadoop six years ago and has kept it closed source.
Pachyderm uses Docker, CoreOS's Fleet and Etcd, and Linux as its primary building blocks. The following diagram shows how its stack compares with Hadoop's.
Job Platform
Unlike Hadoop, which uses the JVM as its data processing engine, Pachyderm uses Docker containers. You give Pachyderm a Docker image and it will automatically distribute it throughout the cluster next to your data. According to Joey Zwicker, using Docker opens up interesting use cases, such as computer vision, that would be a challenge for Hadoop.
Docker, on the other hand, completely abstracts away any language constraints or library dependencies. Instead of needing a JVM-specific tool, you can use any libraries and just wrap them in a Docker container. For example, you can `npm install opencv` and suddenly you’re performing computer vision on petabytes of data! Tools and libraries can be written in any language so it’s ridiculously easy to integrate advances in technology into the Pachyderm stack.
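Pachyderm's job interface is not spelled out in the post, so the snippet below is only a sketch of the kind of analysis code one might wrap in such a container: a face-counting step using OpenCV's Python bindings, with a hypothetical `/in` input path standing in for whatever wiring the job framework actually provides.

```python
# Sketch of an analysis step that could be packaged in a Docker image.
# The /in input convention is a hypothetical placeholder, not Pachyderm's API.
import glob

import cv2  # OpenCV Python bindings (e.g. pip install opencv-python)

# Stock frontal-face detector shipped with OpenCV.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def count_faces(path):
    """Return the number of faces detected in a single image file."""
    image = cv2.imread(path)
    if image is None:  # unreadable or non-image file
        return 0
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return len(faces)

if __name__ == "__main__":
    for path in sorted(glob.glob("/in/*.jpg")):
        print(path, count_faces(path))
```

Because the unit of deployment is a container image rather than a JVM artifact, the same pattern works for any runtime and libraries the image happens to contain.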
Distributed Storage
Pachyderm replaces HDFS with its own distributed file system called Pachyderm File System (pfs). It draws its inspiration from git.
The entire file system is commit-based, meaning you have a complete history of every previous state of your data. Also like git, Pachyderm offers ridiculously cheap branching so that each user can have his/her own completely independent copy of the data without using additional storage. Users can develop MapReduce jobs or manipulate files in their branch without any worry of messing things up for another user.
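The post does not show the pfs API itself, but as co-founder Joey Zwicker notes in the comments below, pfs gets its git-like semantics from btrfs copy-on-write snapshots. A minimal sketch of that underlying mechanism, assuming a btrfs mount at the hypothetical path `/pfs`, root privileges, and the standard btrfs-progs tools, might look like this:

```python
# Sketch of git-like "commits" and "branches" built on btrfs snapshots.
# /pfs is a hypothetical btrfs mount point; requires btrfs-progs and root.
import subprocess

def btrfs(*args):
    subprocess.run(["btrfs", *args], check=True)

# A writable subvolume holds the live data (think of it as "master").
btrfs("subvolume", "create", "/pfs/master")

with open("/pfs/master/logs.txt", "w") as f:
    f.write("first batch of records\n")

# A read-only snapshot acts as an immutable commit; copy-on-write means it
# shares blocks with its source, so it is cheap regardless of data size.
btrfs("subvolume", "snapshot", "-r", "/pfs/master", "/pfs/commit-1")

# A writable snapshot of a commit gives an independent branch: edits here
# never affect master or the commit it was branched from.
btrfs("subvolume", "snapshot", "/pfs/commit-1", "/pfs/alice-branch")
```

Because snapshots share blocks with their source until either side changes, both the "commit" and the "branch" cost almost nothing in extra storage, which is what makes per-user branches of a large dataset feasible.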
Cluster Management
Hadoop uses YARN and ZooKeeper for job scheduling and cluster resource management. Pachyderm uses the CoreOS tools Fleet and Etcd to perform these functions instead.
Fleet is a scheduler that figures out where to run services based on resource availability. Etcd is a fault-tolerant datastore that stores configuration information and dictates machine behavior during a network partition. If a machine dies, Etcd registers that information and tells Fleet to reallocate all processes that were running on that machine. Other cluster management tools, such as Mesos and Kubernetes, can be used instead of the CoreOS tools, but they aren't officially supported yet.
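Etcd exposes its key-value store over plain HTTP, so any process in the cluster can publish or read state with ordinary requests. As a rough illustration (the `/machines` key layout and TTL value are made-up assumptions for this sketch, not anything Pachyderm or Fleet actually uses), a node could register itself like this:

```python
# Toy example of writing and reading cluster state via etcd's v2 HTTP keys API.
# The /machines/... key layout is an illustrative assumption.
import requests

ETCD = "http://127.0.0.1:2379/v2/keys"

# Register a machine with a 60-second TTL; if the machine dies and stops
# refreshing the key, the key simply expires.
requests.put(f"{ETCD}/machines/node-1",
             data={"value": "10.0.0.1", "ttl": 60}).raise_for_status()

# Any other node can read the current membership back.
resp = requests.get(f"{ETCD}/machines", params={"recursive": "true"})
for node in resp.json()["node"].get("nodes", []):
    print(node["key"], "->", node.get("value"))
```

An expiring key is the kind of signal that lets the rest of the cluster notice a dead machine and reallocate its work, mirroring the failure-handling behavior described above.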
Reactions to the post on Hacker News were mixed:
"The JVM runs every language I care about, in a mature system with a simple, stable interface. It has its problems (startup time), but for the kind of jobs you use Hadoop for it's rarely an issue.
Switching to Docker feels like a real step backwards..." lmm
"This sounds promising.
I'd love to see a Haskell-centric (performance and robust code) but ultimately language-agnostic alternative to the JVM-heavy stack that's currently in vogue..." michaelochurch
"I agree with your findings on Hadoop. It is a very complicated ecosystem. The core is tied to JVM and abstractions of map-reduce are heavy and difficult to use. There are multiple domain specific languages/api to do the same thing : pig, cascading. Yet if you pick one you cant use other. Hadoop feels like tower of babel.
But I am not in the favour of creating another stack as Hadoop replacement. Spark fills the role of distributed computation framework very well..." bipin_nag
Pachyderm is currently in alpha. You can view or contribute to the project on GitHub.
Community comments
How would a DFS based on GIT scale to petabytes?
by peter lin,
There are some well-known systems with petabytes of data in HDFS. I seriously doubt a Git-like distributed file system would scale as well. There's simply too much overhead in keeping full version history in a file system. I can see it working for small files of less than 10 MB, but not for files that are over 100 MB or even 10 GB.
Re: How would a DFS based on GIT scale to petabytes?
by Joey Zwicker,
Pfs gets its git-like semantics from btrfs, the same copy-on-write file system that powers Docker. It's not actually built on Git. Commits and branches in pfs are extremely cheap, and in practice most companies add files regularly but delete or update content much less frequently. For this use case, pfs has almost no overhead. If deletes or updates are more common, the only overhead is the past versions of the files you're storing for a rich history. Btrfs includes garbage collection where you can specify the density of past commits you want to keep based on your needs. For example, you could store a commit every minute for the last week and then only one commit per day for data older than that. We haven't fully exposed this feature yet in pfs, but we're working on it.
Re: How would a DFS based on GIT scale to petabytes?
by peter lin,
In a situation where everything is write-once and never updated, I can see it working. But for any kind of data that has hundreds or thousands of versions, the storage overhead will be difficult to scale. Take Facebook's 100-petabyte data warehouse: if storing version history adds 50 or 100x more data, that's not trivial. I can see it working for specific cases where the number of versions is less than 100 and the cost of storing deltas between versions is less than 10x the disk space. Even then, it would only work for small files below 100 MB. When I think about version control, users often want to make arbitrary comparisons between any two versions. With delta files, it's easy to show the delta between adjacent versions like 4 and 5. What if you want to compare multiple versions of large files? That calculation could be extremely time-consuming and could also hurt I/O.