A Web Archive WARC I/O module for Scrapy
$ pip install scrapy-warcio
- Create a project and spider:
https://docs.scrapy.org/en/latest/intro/tutorial.html
$ scrapy startproject <project>
$ cd <project>
$ scrapy genspider <spider> example.com
- Copy scrapy_warcio's distributed settings.yml and edit it with your configuration settings:
---
warc_spec: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/
max_warc_size: 10000000000 # 10GB
collection: ~ # collection name
description: ~ # collection description
operator: ~ # operator email address
robots: ~ # robots policy (obey or ignore)
user_agent: ~ # your user-agent
warc_prefix: ~ # WARC filename prefix
warc_dest: ~ # WARC files destination
...
- Export SCRAPY_WARCIO_SETTINGS so that it points at your settings file:
$ export SCRAPY_WARCIO_SETTINGS=/path/to/settings.yml
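Before starting a crawl, it can help to confirm that the variable is set and that the template placeholders have been filled in. The following is a minimal sketch of such a pre-flight check, not part of scrapy_warcio: the helper names `unset_keys` and `check_settings` are hypothetical, the scan is a naive line-based read rather than a real YAML parser, and it assumes every key that ships as `~` in the template should be given a value.

```python
import os
from pathlib import Path

# Keys that ship as "~" (YAML null) in the settings template.
# Assumption: all of them should be filled in before crawling.
REQUIRED_KEYS = ("collection", "description", "operator", "robots",
                 "user_agent", "warc_prefix", "warc_dest")


def unset_keys(text):
    """Return the required keys that are missing or still set to "~".

    Naive scan of flat "key: value # comment" lines; not a YAML parser.
    """
    values = {}
    for line in text.splitlines():
        line = line.split(" #", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue  # skips the "---" and "..." document markers
        key, _, value = line.partition(":")
        values[key.strip()] = value.strip()
    return [k for k in REQUIRED_KEYS if values.get(k, "~") in ("", "~")]


def check_settings(environ=os.environ):
    """Hypothetical helper: locate the settings file and list unset keys."""
    path = environ.get("SCRAPY_WARCIO_SETTINGS")
    if not path:
        raise RuntimeError("SCRAPY_WARCIO_SETTINGS is not set")
    return unset_keys(Path(path).read_text(encoding="utf-8"))
```

Calling `check_settings()` returns an empty list when nothing is left unset, so it could be run once from a shell or a test before `scrapy crawl`.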
- Add WarcioDownloaderMiddleware (distributed as middlewares.py) to your <project>/<project>/middlewares.py:
import scrapy_warcio


class WarcioDownloaderMiddleware:

    def __init__(self):
        self.warcio = scrapy_warcio.ScrapyWarcIo()

    def process_request(self, request, spider):
        request.meta['WARC-Date'] = scrapy_warcio.warc_date()
        return None

    def process_response(self, request, response, spider):
        self.warcio.write(response, request)
        return response
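The middleware stamps each request with a WARC-Date before it is downloaded. The WARC 1.0 specification defines WARC-Date as a UTC timestamp of the form YYYY-MM-DDThh:mm:ssZ. As an illustration of that format only, here is a small standard-library sketch; `example_warc_date` is a hypothetical name, and `scrapy_warcio.warc_date()` may differ in detail.

```python
from datetime import datetime, timezone


def example_warc_date(moment=None):
    """Format a datetime as a WARC 1.0 style UTC timestamp.

    Illustrative only; this is not scrapy_warcio's implementation.
    A naive datetime is interpreted as local time by astimezone().
    """
    if moment is None:
        moment = datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

For instance, `example_warc_date(datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc))` yields `2020-01-02T03:04:05Z`.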
- Enable WarcioDownloaderMiddleware in <project>/<project>/settings.py:
DOWNLOADER_MIDDLEWARES = {
    '<project>.middlewares.WarcioDownloaderMiddleware': 543,
}
- Validate your WARCs with internetarchive/warctools:
$ warcvalid WARC.warc.gz
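If warctools is not at hand, a quick sanity check can at least catch empty or mislabelled files. The sketch below uses only the standard library and tests whether a file, gzip-compressed or not, begins with a WARC version line. `looks_like_warc` is a hypothetical helper, and it is not a substitute for warcvalid, which checks the records themselves.

```python
import gzip


def looks_like_warc(path):
    """Return True if the file at path starts with a WARC version line.

    Handles gzip-compressed (.warc.gz) and plain (.warc) files alike.
    """
    with open(path, "rb") as raw:
        magic = raw.read(2)
    # The gzip magic number is 0x1f 0x8b; anything else is read as-is.
    opener = gzip.open if magic == b"\x1f\x8b" else open
    with opener(path, "rb") as stream:
        return stream.read(5) == b"WARC/"
```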
- Upload your WARC(s) to your favorite web archive!
$ pydoc scrapy_warcio
or
>>> help(scrapy_warcio)
Making this a Scrapy extension may make it more useful:
https://docs.scrapy.org/en/latest/topics/extensions.html
@internetarchive