Julien Nioche on Apache Nutch 2 Features and Product Roadmap
Version 2.1 of the open source web-search framework Apache Nutch, released three weeks ago, adds improved properties for better Solr configuration, upgrades various Gora dependencies, and introduces the option to build indexes in ElasticSearch. Nutch can run on a single machine, but it can also serve as a large-scale crawl platform running in a Hadoop cluster.
Version 2.0 of the framework, which was released in July after two years of development, builds on a storage abstraction provided by the Apache Gora framework. The open source Gora framework provides an in-memory data model and persistence for big data. It supports persisting to column stores, key-value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support. Gora graduated to an Apache Top Level Project earlier this year.
Nutch 2 works with big data technologies such as the distributed key/value store Apache Accumulo, the data serialization system Apache Avro, the column family data store Apache Cassandra, the distributed big data store Apache HBase, and the Hadoop Distributed File System (HDFS).
InfoQ spoke with Julien Nioche, VP of Apache Nutch and Director of DigitalPebble Ltd. He will also be speaking about large-scale crawling with the Nutch framework at the Apache Conference Europe next week.
InfoQ: Where does the Apache Nutch framework fit in the NoSQL databases and Big Data space?
Julien: Nutch definitely bears the 'BigData' label. For one thing, it gave birth to what became Apache Hadoop, which is the de facto framework for large-scale processing. Nutch has been designed for large-scale crawling of the web. Some of our users have clusters of hundreds of servers running Nutch and holding billions of pages.
As for its relation to NoSQL, that's exactly what Nutch 2 is about. Whereas the 1.x branch relies on the Hadoop data structures, which are great for batch processing, version 2 relies on Apache GORA to provide a unified front end over various NoSQL datastores.
InfoQ: The Apache Gora framework came out of the Nutch project. Can you discuss how Gora can help application developers as an ORM framework for NoSQL databases?
Julien: I like to think of GORA as a form of 'JDBC for NoSQL databases', as it provides an abstraction over the storage and allows developers to write code which is independent of any specific API. Part of the GORA API is also about providing a MapReduce API over the various backends and a serialization mechanism based on Apache AVRO. Of course it also does the basic atomic GET-PUT-DELETE operations.
Apache GORA is now at version 0.2.1 and supports datastores such as HBase, Cassandra and Accumulo, but it also has a SQL module! This means that you can run MapReduce over, say, a MySQL database, which some Nutch 2 users do. What we actually see with Nutch 2 is that people prefer different storage backends, which is why GORA is very useful to us.
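To make the 'JDBC for NoSQL' idea concrete, here is a minimal Python sketch of a storage-neutral interface with GET, PUT and DELETE operations. It is an illustration only and not Gora's API (Gora is a Java library): the names `DataStore`, `InMemoryStore` and `record_fetch` are hypothetical, and the in-memory backend stands in for a real store such as HBase or Cassandra.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class DataStore(ABC):
    """Hypothetical storage-neutral interface (not Gora's real API)."""

    @abstractmethod
    def get(self, key: str) -> Optional[dict]:
        """Return the record stored under key, or None if absent."""

    @abstractmethod
    def put(self, key: str, value: dict) -> None:
        """Insert or overwrite the record stored under key."""

    @abstractmethod
    def delete(self, key: str) -> bool:
        """Remove the record; return True if something was removed."""


class InMemoryStore(DataStore):
    """Toy backend standing in for a real datastore."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}

    def get(self, key: str) -> Optional[dict]:
        return self._rows.get(key)

    def put(self, key: str, value: dict) -> None:
        self._rows[key] = dict(value)  # store a copy of the record

    def delete(self, key: str) -> bool:
        return self._rows.pop(key, None) is not None


def record_fetch(store: DataStore, url: str, status: int) -> None:
    """Crawler-side code that depends only on the DataStore interface."""
    page = store.get(url) or {"url": url, "fetches": 0}
    page["fetches"] += 1
    page["status"] = status
    store.put(url, page)
```

Because `record_fetch` only talks to the abstract interface, swapping the backend would mean writing another `DataStore` subclass and leaving the crawler-side code untouched, which is the kind of neutrality described above.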
InfoQ: The latest version also has HTML parsing support handled by the Apache Tika framework. Can you elaborate on how this feature works?
Julien: Apache Tika is an open source library implemented in Java which extracts text and metadata from a variety of formats (HTML, PDF, Word, etc.) and can also be used for language and mime-type identification. It is actually a wrapper around existing third-party parsers such as PDFBox, and it provides a unified API to use these wrappers. Tika was already used in the Nutch 1.x branch alongside our legacy Nutch parsers, so it is not really a novelty in Nutch 2.0. Interestingly, Apache Tika is another project born out of Nutch, just like Hadoop and GORA.
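The 'unified API over wrapped parsers' pattern described above can be sketched in a few lines of Python. This is not Tika's API (Tika is a Java library); `extract_text`, `PARSERS` and the two toy parsers are hypothetical names, and only Python's standard `html.parser` module is used so the example stays self-contained.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List


class _TextCollector(HTMLParser):
    """Collects the non-blank text nodes of an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.chunks.append(data.strip())


def parse_html(raw: bytes) -> str:
    """Wrapper around the standard-library HTML parser."""
    collector = _TextCollector()
    collector.feed(raw.decode("utf-8", errors="replace"))
    collector.close()
    return " ".join(collector.chunks)


def parse_plain(raw: bytes) -> str:
    """Trivial parser for plain text."""
    return raw.decode("utf-8", errors="replace").strip()


# Registry mapping a mime type to the parser that handles it (hypothetical).
PARSERS: Dict[str, Callable[[bytes], str]] = {
    "text/html": parse_html,
    "text/plain": parse_plain,
}


def extract_text(raw: bytes, mime_type: str) -> str:
    """Single entry point: callers never pick a parser themselves."""
    try:
        parser = PARSERS[mime_type]
    except KeyError:
        raise ValueError(f"no parser registered for {mime_type}") from None
    return parser(raw)
```

Supporting another format in this sketch would mean adding one entry to `PARSERS`; code that calls `extract_text` would not change.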
InfoQ: What is the roadmap of the Nutch project in terms of upcoming releases and features?
Julien: The releases do not follow a strict schedule. Basically we release when we think that a substantial amount of work has been done, which itself depends on how many contributions we get, how quickly users adopt the tool, etc. Nutch 1.x and 2.x will certainly coexist for some time until 2.x is completely mature, and their releases will probably not happen at the same time. Lately we've had on average two releases per year, but as 2.x is gaining traction we'll probably release new versions more often than that.
As for the features, the most important one will be the upgrade to SOLR 4 and its cloud functionalities. We'll also probably see more delegation of functionalities to third-party projects such as Crawler Commons so that other projects can reuse and improve the code. We've also discussed making the indexing backend pluggable: at the moment we support only SOLR (and ElasticSearch for 2.x) but we want developers to be able to write new indexing backends using the plugin mechanism without having to piggyback the code. Delegating our page rank mechanism to a graph library like Apache Giraph would probably save us quite a bit of code and be more efficient. I expect that most of the effort will be focused on consolidating the code for 2.x though.
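As a rough illustration of what a pluggable indexing backend could look like, the Python sketch below registers backends by name and lets the indexing step pick one at run time. It does not reflect Nutch's actual plugin mechanism; `IndexBackend`, `register_backend`, `MemoryBackend` and `index_documents` are hypothetical names, and the in-memory backend stands in for a real search server.

```python
from typing import Callable, Dict, List


class IndexBackend:
    """Hypothetical interface each indexing plugin would implement."""

    def write(self, doc: dict) -> None:
        raise NotImplementedError

    def commit(self) -> int:
        """Flush pending documents; return how many were committed."""
        raise NotImplementedError


# Registry of backend factories, keyed by a short name (an assumption).
_BACKENDS: Dict[str, Callable[[], IndexBackend]] = {}


def register_backend(name: str):
    """Class decorator that adds a backend to the registry."""
    def decorator(factory):
        _BACKENDS[name] = factory
        return factory
    return decorator


@register_backend("memory")
class MemoryBackend(IndexBackend):
    """Toy backend that keeps documents in Python lists."""

    def __init__(self) -> None:
        self.pending: List[dict] = []
        self.committed: List[dict] = []

    def write(self, doc: dict) -> None:
        self.pending.append(doc)

    def commit(self) -> int:
        count = len(self.pending)
        self.committed.extend(self.pending)
        self.pending.clear()
        return count


def index_documents(backend_name: str, docs: List[dict]) -> IndexBackend:
    """Indexing step that knows only the backend's name and interface."""
    backend = _BACKENDS[backend_name]()
    for doc in docs:
        backend.write(doc)
    backend.commit()
    return backend
```

In a design along these lines, adding a new search engine would mean shipping one more registered class, with no change to `index_documents`.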
He also talked about the project's tenth anniversary:
Julien: Apache Nutch has recently turned 10 which is quite old for a piece of software. The reason why it still exists I think is that it is good at what it does and does not try to reinvent the wheel. Interestingly Nutch now benefits from the progress made by projects which originated from it like Hadoop or Tika and I hope that the same will be true about GORA. Nutch 2 is quite an exciting development and we are seeing quite a few new users embrace it. There are also new contributors and committers joining all the time which is the sign of a healthy project.
The Apache Nutch team also announced the release of Apache Nutch v1.5.1 in July. This is a maintenance release of the 1.5.x mainstream version of the Nutch framework. Please see the list of changes made in this version for a full breakdown. The search framework can be downloaded from the website. For Nutch documentation and tutorials, check out the project wiki page.
About the Interviewee
Julien Nioche is the founder of DigitalPebble Ltd, a consultancy based in Bristol, UK, specialising in Open Source Solutions for Text Engineering. Julien's expertise covers Information Retrieval, Text Analysis, Information Extraction, NLP and Machine Learning. He is also the VP for Apache Nutch, a committer on Apache Tika and Apache Gora and a contributor to several other open source projects.