Version 2 of the open source web-search framework Apache Nutch supports large-scale crawling, a link-graph database and HTML parsing. InfoQ spoke with Julien Nioche, VP of the Apache Nutch project, about the framework's new features and its future roadmap.
In his new article, Jonathan Natkins explains how to use components of the Apache Hadoop ecosystem, including Flume, Hive and Oozie, to implement a typical data management system. He also gives a practical example of such an architecture, used to measure Twitter users’ influence.
The new Apache HCatalog provides a metadata and table management system for the Hadoop ecosystem, simplifying data interoperability between different data processing tools.
This article contains an interview with Dipti Borkar, Director of Product Management at Couchbase, on the challenges, benefits and the process of migrating from RDBMS to NoSQL.
In this article, authors Arun Viswanathan and Shruthi Kumar discuss how to implement common aggregation functions on a MongoDB document database using its MapReduce functionality.
Approaches to integrating data are changing with the emergence of cloud computing.
In this article, Boris Lublinsky shows how to extend an HBase-based Lucene implementation with geospatial search support.
Using a custom Hadoop OutputFormat makes it possible to produce output data in the form most appropriate for other applications.
The InputFormat class provides a powerful mechanism for tighter control of map execution in MapReduce jobs. In this article, the authors show how to leverage this mechanism to solve specific problems.
In this article, the authors show how to extend Oozie by introducing custom actions specific to a given company or line of business.
A complete Oozie example demonstrating language features and their real-world usage.