HBase Leads Discuss Hadoop, BigTable and Distributed Databases
1. How would you describe HBase to someone first hearing about it?
HBase is an open-source, distributed, column-oriented store modeled after the Google paper, "Bigtable: A Distributed Storage System for Structured Data" by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase is a subproject of Hadoop with its home at Apache.
The HBase project is for those whose yearly Oracle license fees approach the GNP of a small country or whose MySQL install is starting to buckle because tables have a few BLOB columns and the row count is heading north of a couple of million rows. Anyone who has mountains of structured or semi-structured data and who is currently up against the limits with their RDBMS should come take a look at HBase. Better, consider weighing in on the project. We're not about to hit our humble goal -- Very Large Tables of versioned cells, billions of rows * millions of columns hosted atop clusters of 'commodity' servers -- any time soon without the backing of a broad group of users, supporters, and contributors.
2. Why did the team start the project?
Powerset, where Jim and Stack work, needed a Bigtable-like data store to hold its webtable, a wide table of web documents and their attributes keyed by URL. Rapleaf, Bryan's employer, joined in on the project when it became apparent that they would need a Bigtable-like storage system to host a large table of profiles, as well as a host of other types of data.
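To picture a wide table of versioned cells keyed by URL, here is a minimal in-memory sketch of the Bigtable-style layout: row key, then column, then timestamp, then value, with row keys kept sorted so ranges can be scanned. This is not the HBase client API. The class `ToyTable`, its method names, the column names, and the example URLs are all made up for illustration.

```python
import bisect


class ToyTable:
    """Toy model of a sparse, versioned table (hypothetical, not HBase code).

    Layout: row key -> column name -> {timestamp: value}.
    """

    def __init__(self):
        self._rows = {}          # row key -> {column -> {timestamp -> value}}
        self._sorted_keys = []   # row keys in sorted order, used by scan()

    def put(self, row, column, value, timestamp):
        """Store one version of one cell; absent cells take no space."""
        if row not in self._rows:
            bisect.insort(self._sorted_keys, row)
            self._rows[row] = {}
        self._rows[row].setdefault(column, {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp`.

        With timestamp=None, return the newest version overall.
        Returns None when no matching version exists.
        """
        versions = self._rows.get(row, {}).get(column, {})
        candidates = [ts for ts in versions
                      if timestamp is None or ts <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]

    def scan(self, start_row, stop_row):
        """Yield (row, columns) for start_row <= row < stop_row, in key order."""
        lo = bisect.bisect_left(self._sorted_keys, start_row)
        hi = bisect.bisect_left(self._sorted_keys, stop_row)
        for key in self._sorted_keys[lo:hi]:
            yield key, self._rows[key]


# Illustrative data only: two rows keyed by URL, one cell with two versions.
webtable = ToyTable()
webtable.put("http://example.com/", "contents:", "<html>v1</html>", timestamp=1)
webtable.put("http://example.com/", "contents:", "<html>v2</html>", timestamp=2)
webtable.put("http://example.org/", "anchor:example.com", "Example", timestamp=1)

latest = webtable.get("http://example.com/", "contents:")
older = webtable.get("http://example.com/", "contents:", timestamp=1)
```

Each row stores only the columns it actually has, which is what makes millions of possible columns affordable in a model like this. A real system also has to split the sorted key space across servers, which this sketch ignores.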
3. How does it compare to Hypertable?
Clearly both projects set out to solve generally the same problem - open-source Bigtable. Hypertable is C++ while HBase is Java. HBase has a head-start in terms of how long we've been in open development, as well as the number of committers and outside contributions.
The choice of Java allows us to integrate more tightly with Hadoop than Hypertable can - when we use HDFS, we don't need another process started to act as broker between the Java and C++ worlds nor do we have to cross the JNI "great divide". Also, because we use Java, we had a leg-up because a good part of our core types and functionality had already been written and debugged by an active community of "Smart Folks" over on the Hadoop Core project.
The Hypertable project has a singular focus on "performance" and feels strongly that only C++ can deliver in this regard. Interestingly, as we understand it, most of the Hadoop development is being done by a team at Yahoo that used to work in C++ and that reportedly balked at a Java MapReduce framework for many of the same reasons cited by Hypertable. It appears the Hadoop team have gotten over that particular concern; where Java falls short, in performance or otherwise, they make the appropriate redress and move on. For example, Hadoop/HBase use native libraries for compression because here Java performs poorly.
HBase needs to do a bunch of work around performance for sure -- the above-cited core types and the RPC transport need to be recast to better suit HBase use patterns -- but currently our focus is elsewhere. We're trying to follow the path taken by the Hadoop project, concentrating on robustness, scaling, correctness, and community-building first. Later, we'll get to speeding it all up. When the time comes, we'll be sure to post the invites far and wide to the Hypertable vs. HBase Drag Race Smackdown.
Sporting rivalry aside, the Hypertable fellas are our compañeros. We talk on a fairly regular basis and in general are about helping each other out.
4. What are your thoughts on Google App Engine Exposing BigTable?
It's very interesting to see Google following Amazon's lead in this regard, especially because Google's systems are the "reference" implementations of all the concepts both Hadoop and Amazon are working on. However, as a lot of people have noted since the announcement of App Engine, there's a big difference between owning your infrastructure and renting it. It's probably a very good thing for you when you are small, but as soon as you reach a surprisingly low threshold, you're better off hosting it yourself.
Likewise, there's the problem of lock-in: try moving your app out of App Engine once it actually does get popular, even if it makes economic sense to have your own hardware. You won't have all the software pieces that your system was built on. In a lot of ways, it seems like a step backwards from the advantages of LAMP.
That said, an implementation of the Google App Engine DataStore API that went against HBase and that parsed GQL, etc., is a contribution we wouldn't say no to.
5. The M/R paradigm applies well to batch processing of data. How does Hadoop apply in a more transaction/single request based paradigm?
MapReduce (both Google's and Hadoop's) is ideal for processing huge amounts of data with sizes that would not fit in a traditional database. Neither is appropriate for transaction/single request processing. While HBase uses HDFS from Hadoop Core, it doesn't use MapReduce in its common operations.
However, HBase does support efficient random accesses, so it can be used for some of the transactional elements of your business. You will take a raw performance hit over something like MySQL, but you get the benefit of very good scaling characteristics as your transactional throughput grows. And you can have your cake and eat it too: HBase has gotten some nice contributions from the folks at IBM Research that make it easy to use HBase as a MapReduce source and destination, so your HBase-based data can participate in MapReduce batch processing operations also.
6. What has been the best thing you've found working with Hadoop?
Being a Hadoop subproject is like being hooked to a twin turbo. The biggest boost comes from having ready access to the Hadoop core developers. Also, being part of the Hadoop community has attracted users to HBase. We get to take advantage of a huge amount of work that's already been done in Hadoop - many bits of HBase are reused from Hadoop. We've also been exposed to input and review from the Hadoop community at large, which is an enormous benefit.
The second boost comes from our being part of Apache. The Apache meritocracy has a bunch of already-developed processes and infrastructure that we can exploit and that we don't have to develop ourselves.
7. The worst?
We only see upside (Smile). If we must say something....
In a lot of ways, Hadoop's development of HDFS and MapReduce have been one and the same, so sometimes it's hard to get the core developers to understand the differences in our uses of HDFS; for example, MapReduce doesn't normally do random reads as HBase must.
And there is the lack of an append operation in HDFS (see HADOOP-1700). Without it, HBase can lose data on server crash. It's looking like we will get this feature in Hadoop 0.18.0.
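Why a missing append matters can be shown with a deliberately simplified model. It assumes a server records each edit in a log file, and that the file system only makes a file's contents durable once the file is closed or explicitly synced. The class `ToyLogFile` and the function `replay_after_crash` are hypothetical names for this sketch, not HDFS or HBase code.

```python
class ToyLogFile:
    """Hypothetical log file whose contents become durable only on sync/close."""

    def __init__(self):
        self.durable = []    # edits that would survive a crash
        self._pending = []   # edits written but not yet durable

    def write(self, edit):
        """Record an edit; it is not yet safe against a crash."""
        self._pending.append(edit)

    def sync(self):
        """Stand-in for what an append/sync-style operation would provide."""
        self.durable.extend(self._pending)
        self._pending.clear()

    def close(self):
        """Closing the file makes everything written so far durable."""
        self.sync()

    def crash(self):
        """Simulate a server crash: pending edits are lost and returned."""
        lost = list(self._pending)
        self._pending.clear()
        return lost


def replay_after_crash(edits, sync_each_edit):
    """Write edits, crash before closing, and return what can be recovered."""
    log = ToyLogFile()
    for edit in edits:
        log.write(edit)
        if sync_each_edit:
            log.sync()
    log.crash()
    return log.durable


edits = ["put row1", "put row2", "delete row1"]
without_sync = replay_after_crash(edits, sync_each_edit=False)
with_sync = replay_after_crash(edits, sync_each_edit=True)
```

In this model, the run without a sync recovers nothing written since the log was opened, while the run that syncs after every edit recovers all three. Real systems trade off how often to sync against write throughput, which the sketch leaves out.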
8. What companies are using HBase?
Powerset and Rapleaf are at the forefront. Companies we know that are actively using HBase loaded with sizeable datasets include WorldLingo and Wikia. Many others are taking their first steps into using HBase. If anyone else is interested in using HBase, let us know!
9. What does the future hold for HBase?
In the near future, we're about stabilizing our 0.1 branch. We'll release 0.1.2 in the next week or so. We see a stable offering as a key means of developing a user base and a set of contributors. Otherwise, in our next significant release in May, 0.2, you'll see big improvements in robustness, a bunch of better cluster self-management features like region rebalancing, and an improved client API.