
80legs Is a Web Crawling Service

by Abel Avram on Dec 30, 2009. Estimated reading time: 1 minute

80legs uses Plura’s grid of over 50,000 computers to crawl over 2 billion pages a day. Shion Deysarkar, 80legs CEO, says that their crawling services are generally requested by smaller search engines which cannot afford their own large-capacity grid, companies performing market research, organizations monitoring copyright infringement activities, and ad companies spying on what their competitors are doing.

The service can be accessed on demand by setting up a job and executing it. As with any crawling process, the job needs a seed list, which can be supplied as a text file of up to 1 GB in size. The other job parameters, illustrated in the sketch after this list, are:

  • Outgoing links – specifies which of the links found on a crawled page should be followed
  • Depth level – how many links away from a seed a URL may be
  • Crawling type – whether multiple depth levels are crawled at the same time or only one depth at a time
  • Number of URLs – the maximum number of URLs to crawl
  • MIME types – the page types to crawl
  • Analysis options – keyword matching, regular expressions, or running custom code
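
To make these options concrete, here is a minimal sketch of a job description in Java, assuming the parameters above map onto simple fields. The class, field, and file names are illustrative only; the article does not describe 80legs’ actual configuration format.

    // Hypothetical illustration of the job parameters listed above.
    // The real 80legs job format is not documented in the article; all names are made up.
    import java.util.List;

    public class CrawlJobExample {

        enum CrawlingType { ALL_DEPTHS_AT_ONCE, ONE_DEPTH_AT_A_TIME }

        static class CrawlJob {
            String seedListFile;         // text file with seed URLs, up to 1 GB
            boolean followOutgoingLinks; // whether links found on crawled pages are followed
            int depthLevel;              // how many links away from a seed to go
            CrawlingType crawlingType;   // crawl depths concurrently or one at a time
            long maxUrls;                // maximum number of URLs to crawl
            List<String> mimeTypes;      // page types to fetch
            String analysisJar;          // optional custom analysis application (Java JAR)
        }

        public static void main(String[] args) {
            CrawlJob job = new CrawlJob();
            job.seedListFile = "seeds.txt";
            job.followOutgoingLinks = true;
            job.depthLevel = 2;
            job.crawlingType = CrawlingType.ONE_DEPTH_AT_A_TIME;
            job.maxUrls = 100_000;                  // free-plan ceiling
            job.mimeTypes = List.of("text/html");
            job.analysisJar = "keyword-matcher.jar";
            System.out.println("Job configured for up to " + job.maxUrls + " URLs");
        }
    }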

When a job runs, the crawler reads web pages, starting with the seeds and following the outgoing-link options, and analyzes the content of each page. Simple analysis is available by specifying keywords to match or by selecting information with regular expressions, while more complex analysis can be performed on the data with a custom application or a pre-built 80legs application. The analysis application needs to be written in Java. 80legs plans to open an application store where developers can sell their applications at a price of their choosing and collect all of the revenue, and the company has launched a contest to attract developers.
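
The article does not describe the interface such a Java application must implement, so the sketch below only illustrates the kind of per-page logic involved, assuming the crawler hands the application a URL and the page content: keyword matching plus a regular expression that extracts e-mail addresses. The method signature is hypothetical.

    // Sketch of the per-page analysis a custom application might perform.
    // The analyze() signature is hypothetical; only the matching logic is generic.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PageAnalyzerExample {

        private static final Pattern EMAIL =
                Pattern.compile("[\\w.+-]+@[\\w-]+\\.[\\w.]+");

        /** Returns keyword hits and e-mail addresses found in a crawled page. */
        static List<String> analyze(String url, String pageContent, List<String> keywords) {
            List<String> results = new ArrayList<>();
            for (String keyword : keywords) {
                if (pageContent.toLowerCase().contains(keyword.toLowerCase())) {
                    results.add(url + " contains keyword: " + keyword);
                }
            }
            Matcher m = EMAIL.matcher(pageContent);
            while (m.find()) {
                results.add(url + " contains address: " + m.group());
            }
            return results;
        }

        public static void main(String[] args) {
            List<String> hits = analyze("http://example.com",
                    "Contact sales@example.com about our crawling service.",
                    List.of("crawling", "pricing"));
            hits.forEach(System.out::println);
        }
    }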

Paid subscriptions offer access to a Python API for interacting with the crawling engine, and a Perl API is planned. Free subscribers can create and control their jobs through the 80legs Portal.

There is a free plan with some limitations: 1 job at a time, up to 100,000 pages of at most 100 KB each, a 10 MB analysis application (Java JAR), no API, and 1 hit per second on the domain searched. There are two paid subscriptions, the top one offering 5 concurrent, repeatable jobs with 10 million pages per job, 10 MB per page, a 10 MB JAR, and 10 hits/sec/domain, for $2 per million pages crawled plus 3 cents per CPU-hour utilized.
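
At those rates, a 10-million-page job on the top plan would cost $20 in crawling fees (10 × $2 per million pages) plus 3 cents for each CPU-hour of analysis consumed.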

The crawling is done via Java applets embedded on webpages (by Michael Stillwell)

This is mentioned on the "Plura grid" page, but it's probably worth stating directly: the distributed crawler discussed here is in fact powered by Java applets embedded in webpages.

Contest expired (by Christian Poecher)

Too bad the contest has already finished.
