SpringSource has released Spring for Apache Hadoop 1.0. Spring for Apache Hadoop allows developers to write Hadoop applications under the Spring Framework. It also enables easy integration with Spring Batch and Spring Integration. Spring for Apache Hadoop is a subproject of the Spring Data umbrella project, and is released under the open source Apache 2.0 license.
Hadoop applications are generally a collection of command-line utilities, scripts, and code. Spring for Apache Hadoop provides a consistent programming and declarative configuration model for developing Hadoop applications. Hadoop applications can now be implemented using the Spring programming model (Dependency Injection, POJOs, Helper Templates) and run as standard Java applications instead of command-line utilities. Spring for Apache Hadoop supports reading from and writing to HDFS; running MapReduce, Streaming, or Cascading jobs; and interacting with HBase, Hive, and Pig.
The key features of Spring for Apache Hadoop include:
- Declarative configuration to create, configure, and parameterize Hadoop connectivity and MapReduce, Streaming, Hive, Pig, and Cascading jobs. There are "runner" classes that execute the different Hadoop interaction types, namely JobRunner, ToolRunner, JarRunner, HiveRunner, PigRunner, CascadeRunner and HdfsScriptRunner.
- Comprehensive HDFS data access support using any JVM based scripting language, such as Groovy, JRuby, Jython and Rhino.
- Template classes for Pig and Hive, named PigTemplate and HiveTemplate. These helper classes provide exception translation, resource management, and lightweight object mapping features.
- Declarative configuration for HBase, and the introduction of HBaseTemplate for DAO support.
- Declarative and programmatic support for Hadoop Tools, including File System Shell (FsShell) and Distributed Copy (DistCp).
- Security support. Spring for Apache Hadoop is aware of the security constraints of the running Hadoop environment so moving from a local development environment to a fully Kerberos-secured Hadoop cluster is transparent.
- Spring Batch support. With Spring Batch, multiple steps can be coordinated in a stateful manner and administered using a REST API. For example, Spring Batch's ability to manage the processing of large files can be used to import and export files to and from HDFS.
- Spring Integration support. Spring Integration enables the processing of event streams that can be transformed or filtered before being written to HDFS or other storage.
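As an illustration of the Hadoop Tools support above, the reference manual notes that `<hdp:script>` blocks expose implicit variables such as `fsh` (an FsShell instance). A minimal sketch of a cleanup script, assuming FsShell's `test` and `rmr` helpers (the `/output` path is illustrative):

```xml
<!-- remove a stale output directory before a job run;
     'fsh' is the implicit FsShell variable exposed to hdp:script blocks -->
<hdp:script id="cleanup-script" language="groovy" run-at-startup="true">
    if (fsh.test("/output")) {
        fsh.rmr("/output")
    }
</hdp:script>
```

A script like this can then be referenced from a job runner's pre-action, keeping the housekeeping logic in configuration rather than shell scripts.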
Here are sample configuration and code snippets, mostly taken from the Spring for Hadoop blog or reference manual.
MapReduce
```xml
<!-- use the default configuration -->
<hdp:configuration />

<!-- create the job -->
<hdp:job id="word-count"
    input-path="/input/" output-path="/output/"
    mapper="org.apache.hadoop.examples.WordCount.TokenizerMapper"
    reducer="org.apache.hadoop.examples.WordCount.IntSumReducer" />

<!-- run the job -->
<hdp:job-runner id="word-count-runner"
    pre-action="cleanup-script" post-action="export-results"
    job="word-count" run-at-startup="true" />
```
HDFS
```xml
<!-- copy a file using Rhino -->
<hdp:script id="inlined-js" language="javascript" run-at-startup="true">
    importPackage(java.util)

    name = UUID.randomUUID().toString()
    scriptName = "src/main/resources/hadoop.properties"
    // fs - FileSystem instance based on 'hadoopConfiguration' bean
    fs.copyFromLocalFile(scriptName, name)
</hdp:script>
```
HBase
```xml
<!-- use the default HBase configuration -->
<hdp:hbase-configuration />

<!-- wire the HBase configuration into the template -->
<bean id="hbaseTemplate" class="org.springframework.data.hadoop.hbase.HbaseTemplate"
    p:configuration-ref="hbaseConfiguration" />
```
```java
// read each row from "HBaseTable" (Java)
List<String> rows = template.find("HBaseTable", "HBaseColumn", new RowMapper<String>() {
    @Override
    public String mapRow(Result result, int rowNum) throws Exception {
        return result.toString();
    }
});
```
Hive
```xml
<!-- configure the data source -->
<bean id="hive-driver" class="org.apache.hadoop.hive.jdbc.HiveDriver" />
<bean id="hive-ds" class="org.springframework.jdbc.datasource.SimpleDriverDataSource"
    c:driver-ref="hive-driver" c:url="${hive.url}" />

<!-- configure a standard JdbcTemplate declaration -->
<bean id="hiveTemplate" class="org.springframework.jdbc.core.JdbcTemplate"
    c:data-source-ref="hive-ds" />
```
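Besides the plain JDBC route shown above, the HiveTemplate mentioned earlier can be wired through the Hive client factory. A hedged sketch based on the `hdp` namespace; the host and port values are illustrative:

```xml
<!-- connect to a Hive server and expose a HiveTemplate for callers -->
<hdp:hive-client-factory host="hive-host" port="10000" />
<hdp:hive-template id="hiveTemplate" />
```

Callers can then run HQL through the template, gaining the exception translation and resource management noted in the feature list instead of managing Hive connections by hand.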
Pig
```xml
<!-- run an external Pig script -->
<hdp:pig-runner id="pigRunner" run-at-startup="true">
    <hdp:script location="pig-scripts/script.pig" />
</hdp:pig-runner>
```
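The PigTemplate mentioned earlier can be declared in a similar fashion to its Hive counterpart. A hedged sketch, assuming the `pig-factory` and `pig-template` namespace elements:

```xml
<!-- create a PigServer factory and a PigTemplate that uses it -->
<hdp:pig-factory />
<hdp:pig-template id="pigTemplate" />
```

The template offers the same exception translation and resource management for Pig Latin execution that the feature list describes for Hive.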
To get started, you can download Spring for Apache Hadoop, or use the org.springframework.data:spring-data-hadoop:1.0.0.RELEASE Maven artifact. The WordCount example for Spring for Hadoop is also available. There is also the Introducing Spring Hadoop webinar on YouTube.
Spring for Apache Hadoop requires JDK 6.0 or above, Spring Framework 3.0 or above (3.2 recommended), and Apache Hadoop 0.20.2 or above (1.0.4 recommended). Hadoop YARN, NextGen, or 2.x is NOT supported at this time. Any Apache Hadoop 1.0.x distribution should be supported, including vanilla Apache Hadoop, Cloudera CDH3 and CDH4, and Greenplum HD.
For in-depth information, you can read the Spring for Apache Hadoop Reference Manual and Javadoc. The Spring for Apache Hadoop source code and examples are hosted on GitHub.