Home


matteoredaelli edited this page Dec 27, 2010 · 25 revisions

The crawler and workers

Features

Database

URL data is saved to a NoSQL database that supports map/reduce queries; NoSQL databases are typically cheaper to run and easier to scale out than relational databases.
Supported backends: Apache CouchDB and Riak.
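
To make the map/reduce idea concrete, here is a minimal in-memory sketch that counts stored URLs per domain. It does not use the CouchDB or Riak APIs, and the document fields (`url`, `http_status`) are assumptions chosen for illustration, not ebot's actual schema.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical URL documents; the field names are illustrative only.
docs = [
    {"url": "http://www.example.com/", "http_status": 200},
    {"url": "http://www.example.com/about", "http_status": 200},
    {"url": "http://blog.example.org/post/1", "http_status": 404},
]


def map_doc(doc):
    """Map step: emit one (domain, 1) pair per URL document."""
    yield urlsplit(doc["url"]).hostname, 1


def reduce_counts(values):
    """Reduce step: sum all values emitted for a single key."""
    return sum(values)


def map_reduce(documents):
    """Group the mapped pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    return {key: reduce_counts(values) for key, values in grouped.items()}


urls_per_domain = map_reduce(docs)
```

In a real backend the map and reduce functions would run inside the database, close to the data, rather than in the client as they do here.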

Message Queue

Processed and to-be-processed URLs are sent to AMQP queues, which provide persistence and load balancing across crawlers.
To-be-processed URLs are distributed across queues of different priority: you can run more crawlers for the highest-priority queues, and you can supply your own module/function to decide the priority of each URL.
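
The priority routing described above could look roughly like the following sketch. The rule table, the queue name prefix `urls.new`, and the function names are all hypothetical, and the actual AMQP publishing step is left out; the sketch only shows how a user-supplied function might map a URL to a priority queue.

```python
import re

# Hypothetical rules as (pattern, priority); a lower number means higher priority.
PRIORITY_RULES = [
    (re.compile(r"^https?://([^/]+\.)?example\.com(/|$)"), 0),
    (re.compile(r"\.(pdf|zip)$", re.IGNORECASE), 2),
]
DEFAULT_PRIORITY = 1


def url_priority(url):
    """Return the priority of the first matching rule, else the default."""
    for pattern, priority in PRIORITY_RULES:
        if pattern.search(url):
            return priority
    return DEFAULT_PRIORITY


def queue_name(url, prefix="urls.new"):
    """Build the (hypothetical) name of the queue a URL should be sent to."""
    return f"{prefix}.{url_priority(url)}"


def route(urls):
    """Group URLs by target queue; a real system would publish them instead."""
    queues = {}
    for url in urls:
        queues.setdefault(queue_name(url), []).append(url)
    return queues
```

With a layout like this, running more consumers on the `.0` queue than on the others is what gives high-priority URLs faster turnaround.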

Crawlers

Many crawlers can run concurrently, including on remote machines.
URLs/domains can be filtered using regular expressions and custom functions.
URLs can be normalized/rewritten using many options (max_depth, remove_queries, string replacements, custom functions, …).
Custom/external body analyzers are supported: either inside the ebot system (see ebot.hrl and ebot_plugin_body_analyzer_sample.erl) or, better (asynchronous and possibly remote), outside it by developing a custom queue consumer.
URL referrals can be saved to the database: only external and/or same domain and/or same main domain.
Many other options are available: see the files ebot.app and sys.config.
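
As an illustration of the normalization and filtering options listed above, here is a minimal sketch. The option keys mirror the names mentioned (max_depth, remove_queries, string replacements), but the dictionary layout, the `accept`/`reject` keys, and the reading of max_depth as "truncate the path to N segments" are assumptions, not ebot's actual configuration format or behaviour.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical options; see ebot.app and sys.config for the real ones.
OPTIONS = {
    "max_depth": 2,                             # keep at most N path segments (assumed meaning)
    "remove_queries": True,                     # drop the ?query part
    "replacements": [(r"/index\.html$", "/")],  # regex string replacements
    "accept": [r"^https?://"],                  # URL must match one of these
    "reject": [r"\.(jpg|png|gif)$"],            # and none of these
}


def normalize(url, opts=OPTIONS):
    """Lower-case scheme and host, truncate the path, drop query and fragment."""
    parts = urlsplit(url)
    path = parts.path or "/"
    segments = [s for s in path.split("/") if s]
    if len(segments) > opts["max_depth"]:
        path = "/" + "/".join(segments[: opts["max_depth"]])
    query = "" if opts["remove_queries"] else parts.query
    url = urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))
    for pattern, replacement in opts["replacements"]:
        url = re.sub(pattern, replacement, url)
    return url


def is_allowed(url, opts=OPTIONS):
    """Accept a URL only if it matches an accept rule and no reject rule."""
    accepted = any(re.search(p, url) for p in opts["accept"])
    rejected = any(re.search(p, url, re.IGNORECASE) for p in opts["reject"])
    return accepted and not rejected
```

Normalizing before filtering and storing means that trivially different spellings of the same address collapse into a single database key, which keeps the crawler from fetching the same page twice.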

Statistics

ebot statistics are saved to Round Robin Databases (using rrdtool)

Web Services

Web REST interfaces for:
managing start/stop of crawlers
submitting urls to crawlers (sync or async)
showing ebot statistics

Licence
