
The crawler and workers

Features

Database

  • URL data are saved to a NoSQL database that supports map/reduce queries: NoSQL databases are cheaper and more scalable than relational databases!
  • supported backends: Apache CouchDB and Riak
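
As an illustration of the kind of map/reduce query this enables, here is a minimal map function in the form accepted by CouchDB's native Erlang query server (which binds Emit/2 for you); the <<"ebot_domain">> field name is an assumption for the sketch, not necessarily the field ebot actually stores:

    %% Emits one row per URL document, keyed by its domain.
    %% Emit/2 is provided by CouchDB's Erlang query server.
    fun({Doc}) ->
        case proplists:get_value(<<"ebot_domain">>, Doc) of
            undefined -> ok;              % skip documents without a domain
            Domain    -> Emit(Domain, 1)  % key = domain, value = 1
        end
    end.

Pairing such a map with the built-in _count reduce function would yield the number of crawled URLs per domain.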

Message Queue

  • Processed and to-be-processed URLs are sent to AMQP queues (providing persistence and load balancing across crawlers)
  • To-be-processed URLs are distributed to different priority queues: you can run more crawlers for the highest-priority queues, and you can use your own module/function to decide the priority of URLs (see the sketch below)
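
ebot's exact callback contract for custom priorities is not reproduced here, so the following is only a sketch of what such a function could look like; the module name, the signature, and the convention that lower numbers mean higher-priority queues are all assumptions:

    -module(my_url_priority).
    -export([priority/1]).

    %% Hypothetical priority callback: maps a URL to a queue index.
    %% Here, lower numbers stand for higher-priority AMQP queues.
    priority(Url) when is_binary(Url) ->
        case re:run(Url, <<"\\.(pdf|zip|iso)$">>, [{capture, none}]) of
            match ->
                2;  % bulky downloads go to a low-priority queue
            nomatch ->
                case re:run(Url, <<"^https?://[^/]+/?$">>, [{capture, none}]) of
                    match   -> 0;  % home pages first
                    nomatch -> 1   % everything else in between
                end
        end.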

Crawlers

  • many crawlers can run concurrently, including on remote nodes
  • URLs/domains can be filtered using regular expressions and custom functions
  • URLs can be normalized/rewritten using many options (max_depth, remove_queries, string replacements, custom functions, …)
  • custom body analyzers are supported: either inside the ebot system (see ebot.hrl and ebot_html_body_analyzer_sample.erl) or outside it, asynchronously and possibly remotely, by developing a custom MQ consumer (a sketch follows this list)
  • URL referrals can be saved to the database: external only, and/or same domain, and/or same main domain
  • many other options: see the files ebot.app and sys.config
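
The analyzer interface is defined in ebot.hrl and demonstrated by ebot_html_body_analyzer_sample.erl; since that interface is not reproduced here, the following internal analyzer is only a sketch with an assumed function name, arguments, and return value:

    -module(my_body_analyzer).
    -export([analyze_url_body/2]).

    %% Hypothetical body analyzer: records whether a page body
    %% mentions "erlang". The returned key/value pairs are assumed
    %% to be merged into the URL's document in the database.
    analyze_url_body(Url, Body) when is_binary(Url), is_binary(Body) ->
        Found = case binary:match(Body, <<"erlang">>) of
                    nomatch -> false;
                    _Match  -> true
                end,
        [{<<"mentions_erlang">>, Found}].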

Statistics

ebot statistics are saved to Round Robin Databases (using rrdtool)
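
For instance, a counter such as the number of visited URLs could be pushed into an RRD as below; the file name, data source, and 300-second step are assumptions for illustration, not ebot's actual statistics layout:

    -module(ebot_rrd_sketch).
    -export([update_rrd/1]).

    %% Push one sample into a round-robin database by shelling out to
    %% rrdtool. The RRD must exist already, e.g. created with:
    %%   rrdtool create ebot_urls.rrd --step 300 \
    %%     DS:urls:GAUGE:600:0:U RRA:AVERAGE:0.5:1:288
    update_rrd(Count) when is_integer(Count) ->
        Cmd = io_lib:format("rrdtool update ebot_urls.rrd N:~B", [Count]),
        os:cmd(lists:flatten(Cmd)).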

Web Services

  • a REST web interface for:
      • managing start/stop of crawlers
      • submitting URLs to crawlers (sync or async)
      • showing ebot statistics
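
ebot's actual routes are not listed on this page, so the host, port, and path in this sketch are placeholders; it only shows the shape of submitting a URL from another Erlang node with OTP's httpc client:

    -module(ebot_rest_sketch).
    -export([submit_url/1]).

    %% Hypothetical REST call: submit an (already percent-encoded) URL
    %% to a running ebot node. Replace host, port, and path with the
    %% endpoints exposed by your ebot web service configuration.
    submit_url(Url) when is_list(Url) ->
        inets:start(),
        Endpoint = "http://localhost:8011/crawl?url=" ++ Url,
        httpc:request(get, {Endpoint, []}, [], []).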

Licence

GPL V3+
