InfoQ

HBase Leads Discuss Hadoop, BigTable and Distributed Databases

Posted by Scott Delap on Apr 28, 2008

Sections: Architecture & Design, Development, Operations & Infrastructure
Topics: Data Access, Java, Cloud Computing, Database Design
Tags: Hadoop, Google

Google's recent introduction of Google App Engine, which includes access to BigTable, has renewed interest in alternative database technologies. A few weeks back, InfoQ interviewed Doug Judd, a founder of the Hypertable project, which is inspired by Google's BigTable database. This week InfoQ has the pleasure of presenting an interview with HBase leads Jim Kellerman, Michael Stack, and Bryan Duxbury. HBase is an open-source, distributed, column-oriented store also modeled after BigTable.

1. How would you describe HBase to someone first hearing about it?

HBase is an open-source, distributed, column-oriented store modeled after the Google paper, "Bigtable: A Distributed Storage System for Structured Data" by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase is a subproject of Hadoop with its home at Apache.

The HBase project is for those whose yearly Oracle license fees approach the GNP of a small country or whose MySQL install is starting to buckle because tables have a few BLOB columns and the row count is heading north of a couple of million rows. Anyone who has mountains of structured or semi-structured data and who is currently up against the limits of their RDBMS should come take a look at HBase. Better, consider weighing in on the project. We're not about to hit our humble goal -- Very Large Tables of versioned cells, billions of rows * millions of columns hosted atop clusters of 'commodity' servers -- any time soon without the backing of a broad group of users, supporters, and contributors.
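
To make "versioned cells" concrete, here is a minimal sketch of the data model described in the Bigtable paper: a sparse, sorted map from (row key, column key, timestamp) to a value. The class `ToyTable`, its method names, and the sample column names are hypothetical and chosen for illustration only; this is single-process Python, not the HBase client API.

```python
from collections import defaultdict


class ToyTable:
    """Toy in-memory model of a table of versioned cells (not the HBase API)."""

    def __init__(self):
        # row key -> column key -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        """Store one version of a cell."""
        self._rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest version at or before `timestamp` (newest overall if None)."""
        versions = self._rows.get(row, {}).get(column, {})
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]

    def scan(self):
        """Yield (row, column, timestamp, value) in row order, newest version first."""
        for row in sorted(self._rows):
            for column in sorted(self._rows[row]):
                versions = self._rows[row][column]
                for ts in sorted(versions, reverse=True):
                    yield row, column, ts, versions[ts]


# Hypothetical sample data: rows are sparse, so they need not share columns.
table = ToyTable()
table.put("http://example.com/", "contents", "<html>v1</html>", timestamp=1)
table.put("http://example.com/", "contents", "<html>v2</html>", timestamp=2)
table.put("http://example.org/", "language", "en", timestamp=1)
```

Reading `table.get("http://example.com/", "contents")` returns the newest version, while passing `timestamp=1` returns the older one. A real store would also bound how many versions it keeps.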

2. Why did the team start the project?

Powerset, where Jim and Stack work, needed a Bigtable-like data store to hold its webtable, a wide table of web documents and their attributes keyed by URL. Rapleaf, Bryan's employer, joined in on the project when it became apparent that they would need a Bigtable-like storage system to host a large table of profiles, as well as a host of other types of data.

3. How does it compare to Hypertable?

Clearly both projects set out to solve generally the same problem - open-source Bigtable. Hypertable is C++ while HBase is Java. HBase has a head-start in terms of how long we've been in open development, as well as the number of committers and outside contributions.

The choice of Java allows us to integrate more tightly with Hadoop than Hypertable can - when we use HDFS, we don't need another process started to act as broker between the Java and C++ worlds nor do we have to cross the JNI "great divide". Also, because we use Java, we had a leg-up because a good part of our core types and functionality had already been written and debugged by an active community of "Smart Folks" over on the Hadoop Core project.

The Hypertable project has a singular focus on "performance" and feels strongly that only C++ can deliver in this regard. Interestingly, as we understand it, most of the Hadoop development is being done by a team at Yahoo that used to work in C++ and that reportedly balked at a Java MapReduce framework for many of the same reasons cited by Hypertable. It appears the Hadoop team have gotten over that particular concern; where Java suffers, in performance or otherwise, they make the appropriate redress and move on. For example, Hadoop/HBase use native libraries for compression because Java performs poorly here.

HBase needs to do a bunch of work around performance for sure -- the above cited core types and the RPC transport need to be recast to better suit HBase use patterns -- but currently our focus is elsewhere. We're trying to follow the path taken by the Hadoop project concentrating on robustness, scaling, correctness, and community-building first. Later, we'll get to speeding it all up. When the time comes, we'll be sure to post the invites far and wide to the Hypertable vs. HBase Drag Race Smackdown.

Sporting rivalry aside, the Hypertable fellas are our compañeros. We talk on a fairly regular basis and in general are about helping each other out.

4. What are your thoughts on Google App Engine Exposing BigTable?

It's very interesting to see Google following Amazon's lead in this regard, especially because Google's systems are the "reference" implementations of all the concepts both Hadoop and Amazon are working on. However, as a lot of people have noted since the announcement of App Engine, there's a big difference between owning your infrastructure and renting it. It's probably a very good thing for you when you are small, but as soon as you reach a surprisingly low threshold, you're better off hosting it yourself.

Likewise, there's the problem of lock-in: try moving your app out of App Engine once it actually does get popular, even if it makes economic sense to have your own hardware. You won't have all the software pieces that your system was built on. In a lot of ways, it seems like a step backwards from the advantages of LAMP.

That said, an implementation of the Google App Engine DataStore API that went against HBase and that parsed GQL, etc., is a contribution we wouldn't say no to.

5. The M/R paradigm applies well to batch processing of data. How does Hadoop apply in a more transaction/single request based paradigm?

MapReduce (both Google's and Hadoop's) is ideal for processing huge amounts of data with sizes that would not fit in a traditional database. Neither is appropriate for transaction/single request processing. While HBase uses HDFS from Hadoop Core, it doesn't use MapReduce in its common operations.

However, HBase does support efficient random accesses, so it can be used for some of the transactional elements of your business. You will take a raw performance hit over something like MySQL, but you get the benefit of very good scaling characteristics as your transactional throughput grows. But you can eat your cake too in that HBase has gotten some nice contributions from the folks at IBM Research that make it easy to use HBase as a MapReduce source and destination, so your HBase-based data can participate in MapReduce batch processing operations also.
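
The contrast between random access and batch processing can be sketched with a toy sorted store: a point lookup touches a single key, while a MapReduce-style job scans every row and may write its results back. Everything here (`SortedStore`, `map_reduce`, the sample profile rows) is a hypothetical illustration in plain Python under those assumptions, not HBase's or Hadoop's API.

```python
import bisect
from collections import defaultdict


class SortedStore:
    """Toy key-sorted store: cheap point reads, plus full scans for batch jobs."""

    def __init__(self):
        self._keys = []    # row keys, kept sorted
        self._values = {}  # row key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        """Random access: binary search over the sorted keys."""
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[key]
        return None

    def scan(self):
        """Batch access: every row, in key order."""
        for key in self._keys:
            yield key, self._values[key]


def map_reduce(rows, mapper, reducer):
    """Single-process stand-in for a MapReduce job over (key, value) rows."""
    groups = defaultdict(list)
    for key, value in rows:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return {k: reducer(k, vs) for k, vs in groups.items()}


store = SortedStore()
store.put("profile:bob", {"country": "US"})
store.put("profile:alice", {"country": "DE"})
store.put("profile:carol", {"country": "US"})

# Store as source: count profiles per country.
counts = map_reduce(
    store.scan(),
    mapper=lambda key, profile: [(profile["country"], 1)],
    reducer=lambda country, ones: sum(ones),
)

# Store as destination: write the aggregated results back as new rows.
for country, n in counts.items():
    store.put("count:" + country, n)
```

The same table serves both styles: `store.get("profile:alice")` is a single lookup, and the batch job reads and writes through the same `scan` and `put` calls.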

6. What has been the best thing you've found working with Hadoop?

Being a Hadoop subproject is like being hooked to a twin turbo. The biggest boost comes from having ready access to the Hadoop core developers. Also, being part of the Hadoop community has attracted users to HBase. We get to take advantage of a huge amount of work that's already been done in Hadoop - many bits of HBase are reused from Hadoop. We've also been exposed to input and review from the Hadoop community at large, which is an enormous benefit.

The second boost comes from our being part of Apache. The Apache meritocracy has a bunch of already-developed processes and infrastructure that we can exploit and that we don't have to develop ourselves.

7. The worst?

We only see upside (Smile). If we must say something....

In a lot of ways, Hadoop's development of HDFS and MapReduce has been one and the same, so sometimes it's hard to get the core developers to understand the differences in our uses of HDFS; for example, MapReduce doesn't normally do random reads as HBase must.

And there is the lack of an append operation in HDFS (see HADOOP-1700). Without it, HBase can lose data on server crash. It's looking like we will get this feature in Hadoop 0.18.0.
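
Why a missing append hurts can be shown with a deliberately simplified model. It assumes that edits written to an open log file only become durable once that file is closed, so anything logged since the last close is lost if the server dies; with a working append/sync, each edit would be durable as soon as it is logged. `ToyLog` and its methods are hypothetical names and do not mirror HDFS or HBase internals.

```python
class ToyLog:
    """Toy edit log; `sync_on_write` stands in for a working append/sync."""

    def __init__(self, sync_on_write):
        self.sync_on_write = sync_on_write
        self.durable = []  # edits that would survive a crash
        self.pending = []  # edits written to the still-open file

    def write(self, edit):
        """Log one edit; it is only durable right away if syncing is available."""
        self.pending.append(edit)
        if self.sync_on_write:
            self._flush()

    def close_file(self):
        """Closing (rolling) the log file makes its contents durable."""
        self._flush()

    def crash(self):
        """Server dies: return the edits that existed only in the open file."""
        lost = self.pending
        self.pending = []
        return lost

    def _flush(self):
        self.durable.extend(self.pending)
        self.pending = []


# Without append/sync: the edit logged after the last close is lost.
no_append = ToyLog(sync_on_write=False)
no_append.write("put row1")
no_append.close_file()
no_append.write("put row2")
lost_without_append = no_append.crash()

# With append/sync: nothing is left pending when the crash happens.
with_append = ToyLog(sync_on_write=True)
with_append.write("put row1")
with_append.write("put row2")
lost_with_append = with_append.crash()
```

In this model the only way to shrink the loss window without append is to close log files more often, which is the trade-off the missing feature forces.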

8. What companies are using HBase?

Powerset and Rapleaf are at the forefront. Companies we know that are actively using HBase loaded with sizeable datasets include WorldLingo and Wikia. Many others are taking their first steps into using HBase. If anyone else is interested in using HBase, let us know!

9. What does the future hold for HBase?

In the near future, we're about stabilizing our 0.1 branch. We'll release 0.1.2 in the next week or so. We see a stable offering as a key means of developing a user base and a set of contributors. Otherwise, in our next significant release in May, 0.2, you'll see big improvements in robustness, a bunch of better cluster self-management features like region rebalancing, and an improved client API.
