The competition between two popular open source BigTable implementations, HBase and Cassandra, is accelerating with a new offering from DataStax: Brisk, a Hadoop implementation based on Cassandra. According to Ben Werther, DataStax vice president of products:
The idea is to offer a single platform that provides both a low-latency database for "realtime" web-scale applications and the sort of heavy data analysis you get with Hadoop. One thing we're hearing from [enterprises] is that they need the complete Big Data picture, from realtime low-latency applications through to tools that analyze data - and the ability to use those tools to actually feed data back into applications.
Tim Estes, CEO of Digital Reasoning, explains further:
By marrying the power of Cassandra - including its simplicity, scalability and speedy reads/writes - to Hadoop, DataStax has created a powerful system that speeds up the time between data creation and analysis. We can count on some of Cassandra's unique capabilities to aid projects that have multiple datacenter locations and large and complex bulk ingest demands. We've been thrilled to work with the DataStax team to push its capabilities into some of the most demanding customers - particularly in the Defense and Intelligence Community.
While the original creator of Cassandra, Facebook, seems to be moving away from it to HBase for its social mail product, primarily because of HBase's strong consistency features, DataStax has gone the opposite way, pairing Cassandra with Hadoop. According to Werther:
HBase is less mature than Cassandra, and it's built on HDFS, which has scalability and reliability challenges... Cassandra can serve all of the functions of that lower level part of the Hadoop stack, but at the same time give you low-latency realtime application capabilities in that same infrastructure. What's more, Cassandra is designed in such a way that you can have part of your Brisk infrastructure focus on analytics while another handles low-latency applications. You can use it as a realtime infrastructure as you write queries in Hive, and as you write things back with Hive, they're immediately available to the application.
Brisk includes both Hadoop MapReduce and Hive, letting you run number-crunching jobs across clusters of commodity hardware. But it swaps out Hadoop's HDFS file system in favor of a compatible storage layer powered by Cassandra. At the same time, you can use Cassandra as it was intended: as a database for realtime applications. That said, Brisk does not eliminate all of Hadoop's single points of failure. According to the developer documentation, a Hadoop/Cassandra cluster configuration still requires:
one server in the cluster [that] should be dedicated to the following Hadoop components:
- JobTracker
- datanode
- namenode
This dedicated server is required because Hadoop uses HDFS to store JAR dependencies for your job, static data, and other required information. In the overall context of your cluster, this is a very small amount of data, but it is critical to running a MapReduce job.
At the moment, Brisk is little more than talk. The platform has not been used on production systems, and it has not even been open-sourced. But one way or another, it's a head-turning proposition.