Apache Hadoop 1.0.0 Supports Kerberos Authentication, Apache HBase and RESTful API to HDFS
- Security (strong authentication via Kerberos authentication protocol)
- Support for Apache HBase (sync and flush support for transaction logging). This allows the HDFS client to accept new writes even while an hflush/sync is in progress.
- WebHDFS, which provides a RESTful API to the Hadoop Distributed File System (HDFS). This feature provides webhdfs as a complete FileSystem implementation for accessing HDFS over HTTP. The previous hftp feature was a read-only FileSystem and did not provide write access.
- Performance-enhanced access to local files for HBase
The new release also includes other performance enhancements, bug fixes, and features.
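To make the WebHDFS item concrete, here is a minimal sketch of how a client might build WebHDFS request URLs. It assumes the documented `/webhdfs/v1/<path>?op=...` URL layout and the usual NameNode HTTP port of 50070; the helper name `webhdfs_request`, the host name, and the file path are hypothetical and used only for illustration. The sketch only constructs the request and does not contact a cluster.

```python
from urllib.parse import quote, urlencode

# HTTP method used by each WebHDFS operation (a small subset, for illustration).
WEBHDFS_METHODS = {
    "OPEN": "GET",
    "LISTSTATUS": "GET",
    "GETFILESTATUS": "GET",
    "MKDIRS": "PUT",
    "CREATE": "PUT",
    "DELETE": "DELETE",
}


def webhdfs_request(host, port, path, op, **params):
    """Return (http_method, url) for a WebHDFS operation on an absolute HDFS path.

    Hypothetical helper: it builds the request but does not send it.
    """
    op = op.upper()
    if op not in WEBHDFS_METHODS:
        raise ValueError(f"unsupported operation: {op}")
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    # The operation always goes in the 'op' query parameter; extras follow it.
    query = urlencode({"op": op, **params})
    url = f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"
    return WEBHDFS_METHODS[op], url


if __name__ == "__main__":
    # Reading a file is a GET; creating one is a PUT, which hftp could not do.
    print(webhdfs_request("namenode.example.com", 50070, "/user/alice/data.txt", "open"))
    print(webhdfs_request("namenode.example.com", 50070, "/user/alice/new.txt",
                          "create", overwrite="true"))
```

The resulting method and URL can be passed to any HTTP client. On a Kerberos-secured cluster the client would also need to authenticate (for example via SPNEGO), which this sketch leaves out.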
InfoQ caught up with Arun Murthy, VP of the Apache Hadoop project, about the features in the 1.0.0 release and what features will be included in the next release.
InfoQ: Apache Hadoop 1.0.0 was released after six years of development work. Why did it take so long for the first release?
Arun Murthy: Apache Hadoop is already used in production environments at several large enterprises such as Yahoo and Facebook. The 1.0.0 moniker is more of a statement from the Apache Hadoop community that the release is indeed a mature one and is something the community is confident of supporting in a compatible manner for the foreseeable future for a wide variety of use cases in various enterprises. This should increase the confidence of end users and enterprises and aid further adoption of Apache Hadoop.
InfoQ: What type of security features does this release support, in terms of authentication, access control and data encryption?
Arun: 1.0.0 supports strong, end-to-end Kerberos-based authentication for both HDFS (filesystem for storage) and MapReduce (data processing). Kerberos is by far the most popular network authentication protocol used in the enterprise.
It also provides strong access control at all levels for applications and data. For example, one can ensure that only a certain individual (or set of users) can view running applications, see application logs, etc.
InfoQ: Can you discuss the performance enhancements made in the new release?
Arun: There are several enhancements. A prime example is the local-read optimizations we have done for applications like Apache HBase, which provide a significant boost (2x in certain cases).
InfoQ: What are some new features you are planning on releasing in the next version of Hadoop?
Arun: The next major release of Apache Hadoop is currently in the alpha stage and is expected to be released by the middle of 2012. Some major highlights are:
- High Availability for HDFS (filesystem) - Solving the SPOF issue for the filesystem.
- HDFS Federation to scale the FS namesystem by at least 4x-5x, allowing for significantly larger clusters (both in terms of nodes in the cluster and number of files in the namesystem).
- NextGen MapReduce (aka YARN) to turn Hadoop from just supporting MapReduce applications into a general-purpose, distributed computation fabric where multiple paradigms such as MapReduce, Message Passing Interface (MPI), iterative programming, etc. can be supported within the same Hadoop cluster simultaneously. This also allows Hadoop to support much larger clusters (6,000-10,000 nodes) and support High Availability for the compute fabric.
Arun also said that they feel the next version of Apache Hadoop significantly improves Hadoop with many enterprise-grade features such as High Availability, and that it allows Hadoop to be used in an even wider variety of use cases in the enterprise through NextGen MapReduce (aka YARN).