Apache HBase on Amazon EMR
This week Amazon released HBase on Amazon Elastic MapReduce (EMR). As Jeff Barr explains:
AWS has already given you a lot of storage and processing options to choose from, and today we are adding a really important one. You can now use Apache HBase to store and process extremely large amounts of data (think billions of rows and millions of columns per row) on AWS.
Amazon’s decision to add HBase support to EMR rests on the following list of important features, cited by Barr:
- Strictly consistent reads and writes.
- High write throughput.
- Automatic sharding of tables.
- Efficient storage of sparse data.
- Low-latency data access via in-memory operations.
- Direct input and output to Hadoop jobs.
- Integration with Apache Hive for SQL-like queries over HBase tables, joins, and JDBC support.
The version of HBase available on EMR is 0.92. According to Barr, the use cases driving adoption of HBase on EMR include:
- Support for Reference Data for Hadoop Analytics - because HBase provides rapid access to stored data, it is a great way to store reference data that can be used by Hadoop jobs on a single Hadoop cluster or across multiple clusters.
- Alternative Data Storage Option for Data Ingestion and Batch Analytics - due to its high write throughput and efficient storage of sparse data, HBase can handle real-time ingestion of large data volumes. Combined with support for sequential reads and highly optimized scans, HBase provides a powerful tool for "close to real time" analytics.
- Implementation of High-Frequency Counters and Summary Data - built-in support for strictly consistent reads and writes makes HBase an ideal platform for storing counters and summary data. MapReduce jobs can be used to calculate complex aggregations such as min, max, sum, average, and group-by, and the results of these jobs can be piped back into HBase.
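The aggregation pattern in the last use case can be illustrated with a small, self-contained sketch. It is plain Python rather than a real Hadoop job, and the record layout (page name, latency in milliseconds) and the function names `map_phase` and `reduce_phase` are hypothetical, chosen only for illustration. The map step emits key/value pairs, and the reduce step groups them by key and computes min, max, sum, count, and average.

```python
from collections import defaultdict


def map_phase(records):
    """Emit (key, value) pairs; the key is the group-by column."""
    for page, latency_ms in records:
        yield page, latency_ms


def reduce_phase(pairs):
    """Group values by key and compute summary statistics per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    summary = {}
    for key, values in groups.items():
        total = sum(values)
        summary[key] = {
            "min": min(values),
            "max": max(values),
            "sum": total,
            "count": len(values),
            "avg": total / len(values),
        }
    return summary


# Hypothetical input: (page, latency in ms) events.
records = [
    ("home", 120),
    ("home", 80),
    ("search", 200),
    ("home", 100),
    ("search", 300),
]

summary = reduce_phase(map_phase(records))
for page, stats in sorted(summary.items()):
    print(page, stats)
```

In a real deployment each entry of `summary` would correspond to one summary row written back to an HBase table keyed by the group-by value; that write step is omitted here because it depends on the client library in use.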
According to the documentation, EMR makes launching HBase on a cluster quite simple: the user only needs to request HBase support in the job flow, and everything is installed automatically. As with any other EMR job flow, the user can still specify the cluster size and the machine type to be used.
At first glance, EMR and HBase seem poorly suited to each other: an EMR cluster is typically created to execute a given job and then terminated, while HBase typically serves as permanent data storage. AWS supports the combination by implementing backup and restore for HBase, with both full and incremental backups of HBase data to S3. This functionality enables HBase data to survive the termination of an EMR cluster without any data loss. The other option, described in the Hive example, is to have a dedicated HBase cluster that can be accessed by other EMR clusters. As an additional safeguard, Amazon EMR launches HBase clusters with termination protection turned on. This prevents the cluster from being terminated inadvertently or in the case of an error.
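The launch settings discussed above (HBase support, cluster size, machine type, a long-running cluster, and termination protection) can be sketched as a request for the EMR API. This is a minimal sketch that assumes the present-day boto3 `run_job_flow` call, which postdates the tooling available when HBase 0.92 was released on EMR. The cluster name, release label, instance type, and default role names are assumptions, not values from the announcement, and the helper names `build_hbase_cluster_request` and `launch` are hypothetical.

```python
def build_hbase_cluster_request(name, instance_type, instance_count,
                                release_label="emr-5.36.0"):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    The release label and the role names below are assumptions; adjust
    them to whatever is valid in your account and region.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # Ask EMR to install HBase on the cluster.
        "Applications": [{"Name": "HBase"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # Keep the cluster running after its steps finish, since
            # HBase is meant to act as long-lived storage.
            "KeepJobFlowAliveWhenNoSteps": True,
            # Guard against accidental termination.
            "TerminationProtected": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(request):
    """Submit the request; needs boto3 and AWS credentials, and incurs cost."""
    import boto3  # imported lazily so building the request has no dependencies

    response = boto3.client("emr").run_job_flow(**request)
    return response["JobFlowId"]


request = build_hbase_cluster_request("hbase-demo", "m5.xlarge", 3)
print(request["Applications"], request["Instances"]["InstanceCount"])
```

Separating the request builder from `launch` keeps the configuration inspectable without touching AWS; calling `launch(request)` is what would actually start the cluster.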
Everyone who has ever used HBase knows that one of the most difficult problems in using it is cluster configuration. Although the default configuration provided by Amazon is a good starting point for HBase usage, Amazon also gives users the ability to optimize this configuration for their specific usage.
All in all, HBase seems like a great addition to EMR, where it can be used efficiently and cost-effectively to extend EMR’s capabilities. It can also serve as an easy entry point to HBase, leveraging the simplicity of EMR installation and usage.