Julien Nioche on StormCrawler, Open-Source Crawler Pipelines Backed by Apache Storm
Julien Nioche, director of DigitalPebble, PMC member and committer of the Apache Nutch web crawler project, talks about StormCrawler, a collection of reusable components to build distributed web crawlers based on the streaming framework Apache Storm.
InfoQ interviewed Nioche, the project's main contributor, to find out more about StormCrawler and how it compares to other technologies in the same space.
InfoQ: What stages of a crawling / scraping pipeline can benefit from StormCrawler?
Julien Nioche: StormCrawler provides code and resources for implementing all the core stages of a crawling pipeline, i.e. scheduling, fetching, parsing, and indexing. It comes with modules for commonly used projects such as Apache Solr, Elasticsearch, MySQL and Apache Tika, and has a range of extensible features such as data extraction with XPath, sitemap processing, URL filtering and language identification.
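To make the URL-filtering stage concrete, here is a minimal sketch in plain Java of how a chain of filters might accept, rewrite, or discard candidate URLs before they are scheduled for fetching. The interface and helper names below are hypothetical illustrations, not StormCrawler's actual API:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch of URL filtering as crawlers like StormCrawler apply it:
// each candidate URL passes through a chain of filters before being scheduled.
// Interface and class names here are hypothetical, not StormCrawler's API.
public class UrlFilterChain {

    interface UrlFilter {
        /** Returns the (possibly rewritten) URL, or null to discard it. */
        String filter(String url);
    }

    // Discard URLs matching a blacklist pattern (e.g. binary file extensions).
    static UrlFilter regexBlacklist(Pattern pattern) {
        return url -> pattern.matcher(url).find() ? null : url;
    }

    // Keep only URLs belonging to an allowed host.
    static UrlFilter hostFilter(String allowedHost) {
        return url -> {
            try {
                String host = new URI(url).getHost();
                return allowedHost.equals(host) ? url : null;
            } catch (URISyntaxException e) {
                return null; // malformed URLs are discarded
            }
        };
    }

    private final List<UrlFilter> filters;

    UrlFilterChain(List<UrlFilter> filters) {
        this.filters = filters;
    }

    /** Applies every filter in order; a null from any filter discards the URL. */
    String apply(String url) {
        for (UrlFilter f : filters) {
            url = f.filter(url);
            if (url == null) return null;
        }
        return url;
    }

    public static void main(String[] args) {
        UrlFilterChain chain = new UrlFilterChain(List.of(
                hostFilter("example.com"),
                regexBlacklist(Pattern.compile("\\.(jpg|pdf)$"))));
        System.out.println(chain.apply("https://example.com/page.html")); // kept
        System.out.println(chain.apply("https://example.com/doc.pdf"));   // discarded (null)
        System.out.println(chain.apply("https://other.org/page.html"));   // discarded (null)
    }
}
```

A chain like this is typically driven by configuration, so filters can be added or reordered without touching the crawl code.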
InfoQ: How does StormCrawler compare to Apache Nutch?

Nioche: StormCrawler came about as a result of my experience with Apache Nutch and owes a lot to it, both for some of the concepts (e.g. the design of the FetcherBolt, URL and parse filters) and for the initial implementation. StormCrawler implements Nutch's functionalities and, like Nutch 2.x, can use various storage backends (HBase, Cassandra, etc.).
The main difference between StormCrawler and Nutch is that the latter is based on (and also gave birth to) Apache Hadoop and as such is batch-driven. URL fetching, content parsing and indexing are done as separate steps. As a result, network usage is high during fetching while CPU and disk usage stay relatively low, and the opposite holds during parsing or indexing.
By contrast, StormCrawler is based on the stream processing framework Apache Storm, and all operations can happen at the same time: URLs are fetched, parsed, and indexed constantly, which makes the whole crawling process more efficient and avoids the long-tail workloads common in batch-oriented approaches. Unlike with Nutch, the content does not necessarily have to be persisted to disk (but it can be if necessary). StormCrawler also makes it easier to implement a wider range of use cases, such as when low latency is needed or when URLs arrive as a stream (e.g. user-generated events such as page visits).
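The contrast with batch processing can be illustrated with a toy, Storm-free pipeline in plain Java, where fetch, parse, and index stages run concurrently and hand items to each other through queues rather than waiting for a whole batch to finish. All class and method names are illustrative; a real Storm topology would express the same structure as spouts and bolts:

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

// Toy pipeline illustrating the streaming model: fetcher, parser, and indexer
// run as concurrent stages connected by queues, like bolts in a Storm topology,
// so no stage sits idle waiting for the previous one to finish a batch.
public class StreamingPipeline {
    private static final String POISON = "__END__"; // sentinel to shut stages down

    public static List<String> run(List<String> urls) throws InterruptedException {
        BlockingQueue<String> fetched = new LinkedBlockingQueue<>();
        BlockingQueue<String> parsed = new LinkedBlockingQueue<>();
        List<String> indexed = new CopyOnWriteArrayList<>();

        Thread fetcher = new Thread(() -> {
            for (String url : urls) fetched.add("content(" + url + ")");
            fetched.add(POISON);
        });
        Thread parser = new Thread(() -> {
            try {
                String item;
                while (!(item = fetched.take()).equals(POISON))
                    parsed.add("parsed(" + item + ")");
                parsed.add(POISON);
            } catch (InterruptedException ignored) { }
        });
        Thread indexer = new Thread(() -> {
            try {
                String item;
                while (!(item = parsed.take()).equals(POISON))
                    indexed.add(item);
            } catch (InterruptedException ignored) { }
        });

        fetcher.start(); parser.start(); indexer.start();
        fetcher.join(); parser.join(); indexer.join();
        return indexed;
    }

    public static void main(String[] args) throws InterruptedException {
        // → [parsed(content(a.html)), parsed(content(b.html))]
        System.out.println(run(List.of("a.html", "b.html")));
    }
}
```

In a batch design, the parse stage could not start until every URL had been fetched; here, a document can be parsed and indexed while later fetches are still in flight.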
InfoQ: How does StormCrawler compare to Scrapy?

Nioche: Comparing the two, StormCrawler runs in a distributed, scalable environment, while Scrapy is single-process, although there are projects like Frontera that add distributed crawling on top of it.
StormCrawler delegates the distribution and reliability (plus other functionalities such as the UI, the metrics framework and logs) to Apache Storm.
Both Scrapy and StormCrawler aim to be user-friendly and are good solutions for data scraping.
In a nutshell, I'd say that StormCrawler is a combination of Nutch's scalability and Scrapy's user-friendliness.
InfoQ: Why did you choose Apache Storm rather than, say, Spark Streaming?

Nioche: Apache Storm's design and concepts are simple and efficient, and Spark didn't exist at the time. Spark Streaming processes data in micro-batches, and its declarative style wasn't best suited to my needs.
One of the main challenges in crawling is politeness, defined by how frequently the crawler hits a given web server. Unlike in most streaming applications, the objective is not just to push as many messages through as quickly as possible but to behave politely while optimising throughput. Finer control is required, and Apache Storm's mechanisms serve this purpose.
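To illustrate the kind of control this requires, here is a minimal per-host politeness scheduler sketched in plain Java: before fetching a URL, the caller asks how long it must wait so that requests to the same host stay at least a fixed delay apart. This is a hypothetical toy, not StormCrawler's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of per-host politeness control: requests to the same host are spaced
// at least `delayMillis` apart, while different hosts can be hit in parallel.
// Illustrative only — not StormCrawler's actual scheduling mechanism.
public class PolitenessScheduler {
    private final long delayMillis;
    // next time (ms) each host may be fetched again
    private final Map<String, Long> nextAllowed = new HashMap<>();

    public PolitenessScheduler(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /**
     * Returns how long (in ms) the caller must wait before hitting this host,
     * and reserves the following fetch slot for the host.
     */
    public synchronized long waitTimeFor(String host, long nowMillis) {
        long allowed = nextAllowed.getOrDefault(host, 0L);
        long wait = Math.max(0, allowed - nowMillis);
        nextAllowed.put(host, Math.max(allowed, nowMillis) + delayMillis);
        return wait;
    }

    public static void main(String[] args) {
        PolitenessScheduler s = new PolitenessScheduler(1000);
        System.out.println(s.waitTimeFor("example.com", 0)); // 0: first hit, no wait
        System.out.println(s.waitTimeFor("example.com", 0)); // 1000: same host, must wait
        System.out.println(s.waitTimeFor("other.org", 0));   // 0: different host
    }
}
```

A fetcher built around such a map keeps per-host queues so that waiting on one slow or heavily crawled host never blocks fetches to the others, which is the balance between politeness and throughput described above.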
InfoQ: What is on the roadmap for upcoming releases of StormCrawler?
Nioche: StormCrawler's development is driven by the community. The latest stable release, 1.2, is based on Storm's 1.x version. The next release will include a language identification module and possibly a port to the brand new Elasticsearch 5. The main functionality to come at some point is a Selenium-based protocol implementation, which will be useful for AJAX-based sites.