Pioneer is a simple async HTTP crawler built on em-synchrony, but it crawls by your rules: you specify the URLs to crawl, or how to find them.
```
gem install pioneer
```
Basically, to create your own crawler you need to define two methods: `locations` and `processing`. The `locations` method should return any Enumerable, in the simplest case an Array. The `processing` method accepts a `Request` object and should do something with it: save the response to a file, find new URLs to crawl, and so on.
Let's download a few web pages to files (Ryan, I am so sorry!):
```ruby
require 'pioneer'

class Crawler < Pioneer::Base
  def locations
    ["http://railscasts.com/episodes/355-hacking-with-arel",
     "http://railscasts.com/episodes/354-squeel"]
  end

  def processing(req)
    filename = req.url.split("/").last + ".html"
    File.open(filename, "w+") do |f|
      f << req.response.response
    end
  end
end

Crawler.new.start
```
OK, we got it: two files, `354-squeel.html` and `355-hacking-with-arel.html`.
There are some standard methods you can redefine:

- `locations`
- `processing(req)`
- `if_request_error(req)`
- `if_response_error(req)`
- `if_status_not_200(req)`
- `if_status_XXX(req)` (where XXX is any status you want, for example `if_status_300` or `if_status_301`)
And a few helpers for the `Request` object:

- `retry(count)`
- `skip`

You can call `req.retry` or `req.skip` in any of those `if_xxx` methods:
```ruby
class Crawler < Pioneer::Base
  # ...

  def if_request_error(req)
    req.retry
  end

  def if_status_not_200(req)
    req.skip
  end
end
```
You can specify the number of retries in the `retry` method: `req.retry(10)`. This means that after 10 retries the crawler will skip the request.
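For example, a crawler that retries flaky requests a limited number of times might look like this (a minimal sketch; the URL list and the class name are placeholders):

```ruby
class ResilientCrawler < Pioneer::Base
  def locations
    ["http://example.com"]
  end

  def processing(req)
    puts req.response.response_header.status
  end

  # Retry a failed request up to 10 times; after that the
  # crawler skips it.
  def if_request_error(req)
    req.retry(10)
  end
end
```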
If a request error occurs while crawling (a network connection error or something else), the crawler will raise an error, or will call your `if_request_error` method if it is defined. There you can retry the request or skip it.
If the server can't handle our request, we get a response error. The crawler will raise an exception, or will call `if_response_error` if it is defined. Again, you can retry the request or skip it.
Basically we want status 200 from each request, but we may get a redirect status, a page not found, or anything else. You can handle each case in the corresponding callback:
```ruby
require 'pioneer'

class Crawler < Pioneer::Base
  def locations
    ["http://www.google.com", "http://www.apple.com/not_found"]
  end

  def processing(req)
    # Derive a safe filename: the raw URL contains slashes,
    # so it can't be used as a file name directly.
    filename = req.url.gsub(%r{https?://}, "").tr("/", "_") + ".html"
    File.open(filename, "w+") do |f|
      f << req.response.response
    end
  end

  def if_request_error(req)
    puts "Request error: #{req.error}"
  end

  def if_response_error(req)
    puts "Response error: #{req.response.error}"
  end

  # Fires only for responses with status 203; neither URL above returns
  # 203 (Google answers 302, Apple 404), so both fall through to the
  # default error log shown below.
  def if_status_203(req)
    puts "He is trying to redirect me"
  end
end

Crawler.new.start
#=> I, [2012-06-02T00:53:55.876818 #5099] INFO -- : going to http://www.google.com
#=> I, [2012-06-02T00:53:55.884415 #5099] INFO -- : going to http://www.apple.com/not_found
#=> E, [2012-06-02T00:53:55.959504 #5099] ERROR -- : This http://www.google.com returns this http status: 302
#=> E, [2012-06-02T00:53:56.360271 #5099] ERROR -- : This http://www.apple.com/not_found returns this http status: 404
```
What is `req`? `req.response` is an em-http-request response object ;)
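So inside `processing` you can use the usual em-http-request accessors. A minimal sketch (the header name is just an example):

```ruby
def processing(req)
  http = req.response                         # em-http-request object
  puts http.response_header.status            # integer HTTP status, e.g. 200
  puts http.response_header['CONTENT_TYPE']   # a response header
  puts http.response.bytesize                 # the response body itself
end
```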
You can override all methods on the fly:

```ruby
crawler = Pioneer::Crawler.new # basic, simple crawler
crawler.locations  = [url1, url2]
crawler.processing = proc { |req| req.response.response_header.status }
crawler.if_status_404 { |req| "Oups" }
# ...
```
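Putting the pieces together, a complete on-the-fly crawler could look like this (a sketch, assuming the setters behave as shown above; the URL is a placeholder, and `start` works the same as in the class-based examples):

```ruby
require 'pioneer'

crawler = Pioneer::Crawler.new
crawler.locations  = ["http://example.com"]
crawler.processing = proc { |req| puts req.response.response_header.status }
crawler.if_status_404 { |req| req.skip }
crawler.start
```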
You can pass options to the crawler:

```ruby
class Crawler < Pioneer::Base
  def locations
    # ...
  end

  def processing(req)
    # ...
  end
end

crawler = Crawler.new( concurrency: 100, sleep: 0.01, redirects: 2 ... )
```
- `name`: crawler name (used in logs)
- `concurrency`: concurrency level, i.e. how many parallel requests will be handled (default: 10)
- `sleep`: how long the crawler should wait between requests (default: 0)
- `log_enabled`: logging is enabled by default
- `log_level`: log level, `Logger::DEBUG` by default
- `random_header`: the crawler can be a copycat of a real browser; turn on `random_header` to send a random browser header (false by default)
- `header`: you can pass your own headers as a hash (cookies, for example)
- `redirects`: how many redirects the crawler may follow (0 by default)
- `headers`: you can specify your own headers callback, e.g. to handle redirects manually (see the em-http-request headers callback)
- `request_opts`: em-http-request options
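For example, a crawler that mimics a browser and follows a couple of redirects might be configured like this (a sketch; the option values and the cookie string are illustrative):

```ruby
crawler = Crawler.new(
  name:          "my_crawler",                  # shows up in log lines
  concurrency:   50,                            # up to 50 parallel requests
  sleep:         0.1,                           # pause between requests, in seconds
  redirects:     2,                             # follow up to 2 redirects
  random_header: true,                          # send a random browser header
  header:        { "cookie" => "session=abc" }  # your own headers as a hash
)
crawler.start
```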
.. to be continued