
The crawler and workers

Features

Database

  • URL data are saved to a NoSQL database that supports map/reduce queries: NoSQL databases are cheaper and more scalable than relational databases!
  • supported backends: Apache CouchDB and Riak
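
As an illustration of the kind of map/reduce query this enables, here is a minimal map function in the form accepted by CouchDB's native Erlang query server (which binds Emit/2 for you); the <<"ebot_domain">> field name is an assumption for the sketch, not necessarily the field ebot actually stores:

    %% Emits one row per URL document, keyed by its domain.
    %% Emit/2 is provided by CouchDB's Erlang query server.
    fun({Doc}) ->
        case proplists:get_value(<<"ebot_domain">>, Doc) of
            undefined -> ok;              % skip documents without a domain
            Domain    -> Emit(Domain, 1)  % key = domain, value = 1
        end
    end.

Pairing such a map with the built-in _count reduce function would yield the number of crawled URLs per domain.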

Message Queue

  • Processed and to-be-processed URLs are sent to AMQP queues (providing persistence and load balancing across crawlers)
  • To-be-processed URLs are distributed to different priority queues: you can run more crawlers for the highest-priority queues, and you can use your own module/function to decide the priority of URLs (see the sketch below)
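
ebot's exact callback contract for custom priorities is not reproduced here, so the following is only a sketch of what such a function could look like; the module name, the signature, and the convention that lower numbers mean higher-priority queues are all assumptions:

    -module(my_url_priority).
    -export([priority/1]).

    %% Hypothetical priority callback: maps a URL to a queue index.
    %% Here, lower numbers stand for higher-priority AMQP queues.
    priority(Url) when is_binary(Url) ->
        case re:run(Url, <<"\\.(pdf|zip|iso)$">>, [{capture, none}]) of
            match ->
                2;  % bulky downloads go to a low-priority queue
            nomatch ->
                case re:run(Url, <<"^https?://[^/]+/?$">>, [{capture, none}]) of
                    match   -> 0;  % home pages first
                    nomatch -> 1   % everything else in between
                end
        end.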

Crawlers

  • many crawlers can run concurrently, including on remote nodes
  • URLs/domains can be filtered using regular expressions and custom functions
  • URLs can be normalized/rewritten using many options (max_depth, remove_queries, string replacements, custom functions, …)
  • custom body analyzers are supported: either inside the ebot system (see ebot.hrl and ebot_html_body_analyzer_sample.erl) or outside it, asynchronously and possibly remotely, by developing a custom MQ consumer (a sketch follows this list)
  • URL referrals can be saved to the database: external only, and/or same domain, and/or same main domain
  • many other options: see the files ebot.app and sys.config
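
The analyzer interface is defined in ebot.hrl and demonstrated by ebot_html_body_analyzer_sample.erl; since that interface is not reproduced here, the following internal analyzer is only a sketch with an assumed function name, arguments, and return value:

    -module(my_body_analyzer).
    -export([analyze_url_body/2]).

    %% Hypothetical body analyzer: records whether a page body
    %% mentions "erlang". The returned key/value pairs are assumed
    %% to be merged into the URL's document in the database.
    analyze_url_body(Url, Body) when is_binary(Url), is_binary(Body) ->
        Found = case binary:match(Body, <<"erlang">>) of
                    nomatch -> false;
                    _Match  -> true
                end,
        [{<<"mentions_erlang">>, Found}].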

Statistics

ebot statistics are saved to Round Robin Databases (using rrdtool)
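
For instance, a counter such as the number of visited URLs could be pushed into an RRD as below; the file name, data source, and 300-second step are assumptions for illustration, not ebot's actual statistics layout:

    -module(ebot_rrd_sketch).
    -export([update_rrd/1]).

    %% Push one sample into a round-robin database by shelling out to
    %% rrdtool. The RRD must exist already, e.g. created with:
    %%   rrdtool create ebot_urls.rrd --step 300 \
    %%     DS:urls:GAUGE:600:0:U RRA:AVERAGE:0.5:1:288
    update_rrd(Count) when is_integer(Count) ->
        Cmd = io_lib:format("rrdtool update ebot_urls.rrd N:~B", [Count]),
        os:cmd(lists:flatten(Cmd)).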

Web Services

  • a REST web interface for:
      • managing start/stop of crawlers
      • submitting URLs to crawlers (sync or async)
      • showing ebot statistics
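
ebot's actual routes are not listed on this page, so the host, port, and path in this sketch are placeholders; it only shows the shape of submitting a URL from another Erlang node with OTP's httpc client:

    -module(ebot_rest_sketch).
    -export([submit_url/1]).

    %% Hypothetical REST call: submit an (already percent-encoded) URL
    %% to a running ebot node. Replace host, port, and path with the
    %% endpoints exposed by your ebot web service configuration.
    submit_url(Url) when is_list(Url) ->
        inets:start(),
        Endpoint = "http://localhost:8011/crawl?url=" ++ Url,
        httpc:request(get, {Endpoint, []}, [], []).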

Licence

GPL V3+
