Scrapy + Frontera (A crawling project) - translated from the original Japanese version. (Thanks RY-2718)
- Scrapy
- Frontera
- Apache Kafka
- Apache HBase
- Twisted Python
- docs/twisted-change.md (modify Twisted according to the instructions in this file)
Do not run the following unless you have isolated the environment with virtualenv or similar. Install the dependencies for distributed / kafka / hbase, then uninstall frontera and reinstall it from the master branch:
$ pip install scrapy colorlog msgpack-python frontera[distributed,kafka,hbase]
$ pip uninstall frontera
$ pip install git+https://github.com/scrapinghub/frontera.git
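If you want to confirm which Frontera build is actually importable after the reinstall, a quick optional check (not part of the original instructions):

# verify_install.py -- optional sanity check, not required by the project.
import frontera
print(frontera.__file__)  # should point at the package installed from git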
Settings for Scrapy's behavior are in '/crawler/settings.py'. Frontera's settings are in '/frontier/common.py' and '/frontier/*_settings.py'. Logging settings are in 'logging.conf'.
The minimum items to set are listed below; a sketch of a complete settings module follows the list.
BUCKET_NAME # S3 bucket name
SPIDER_FEED_PARTITIONS # number of spiders (Scrapy)
SPIDER_LOG_PARTITIONS # number of workers (Frontera)
KAFKA_LOCATION # Kafka location: e.g., 'localhost:9092'
# Settings related to Kafka:
# Any names work as long as they match between Scrapy and Frontera, but it is reasonable to change the default names slightly.
SPIDER_LOG_DBW_GROUP
SPIDER_LOG_SW_GROUP
SCORING_LOG_DBW_GROUP
SPIDER_FEED_GROUP
SPIDER_LOG_TOPIC
SPIDER_FEED_TOPIC
SCORING_LOG_TOPIC
HBASE_THRIFT_HOST = 'localhost' # HBase location
HBASE_THRIFT_PORT = 9090 # Port where HBase's Thrift server listens; default is 9090
HBASE_METADATA_TABLE = 'metadata' # Table name used by Frontera; created automatically if it does not exist
HBASE_QUEUE_TABLE = 'queue' # Table name used by Frontera; created automatically if it does not exist
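For illustration, a Frontera settings module covering the items above might look like the following sketch. Every concrete value is an example or assumption; the topic names simply match the kafka-topics.sh commands further below.

# frontier/common.py -- a sketch only; adjust all values to your environment.
BUCKET_NAME = 'my-crawl-bucket'      # assumed S3 bucket name
SPIDER_FEED_PARTITIONS = 2           # number of Scrapy spiders
SPIDER_LOG_PARTITIONS = 1            # number of Frontera workers
KAFKA_LOCATION = 'localhost:9092'
# Consumer groups, renamed slightly from the defaults as suggested above.
SPIDER_LOG_DBW_GROUP = 'dbw-spider-log'
SPIDER_LOG_SW_GROUP = 'sw-spider-log'
SCORING_LOG_DBW_GROUP = 'dbw-scoring-log'
SPIDER_FEED_GROUP = 'fetchers-spider-feed'
# Topic names matching the kafka-topics.sh commands below.
SPIDER_LOG_TOPIC = 'frontier-done'
SPIDER_FEED_TOPIC = 'frontier-todo'
SCORING_LOG_TOPIC = 'frontier-score'
HBASE_THRIFT_HOST = 'localhost'
HBASE_THRIFT_PORT = 9090
HBASE_METADATA_TABLE = 'metadata'
HBASE_QUEUE_TABLE = 'queue'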
Install Kafka and create the topics (the names must match SPIDER_LOG_TOPIC, SPIDER_FEED_TOPIC, and SCORING_LOG_TOPIC above, and the partition counts should likewise agree with the *_PARTITIONS settings).
Example commands are shown below. For details, refer to the Kafka documentation.
$ /path/to/kafka/bin/kafka-topics.sh --create --topic frontier-done --replication-factor 1 --partitions 1 --zookeeper localhost:2181
$ /path/to/kafka/bin/kafka-topics.sh --create --topic frontier-score --replication-factor 1 --partitions 1 --zookeeper localhost:2181
$ /path/to/kafka/bin/kafka-topics.sh --create --topic frontier-todo --replication-factor 1 --partitions 2 --zookeeper localhost:2181
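To double-check that the topics exist, a short sketch using kafka-python (which the frontera[kafka] extra should already have pulled in) can list them; the bootstrap address is an assumption matching KAFKA_LOCATION above.

# check_topics.py -- optional sanity check.
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers='localhost:9092')
print(consumer.topics())  # expect frontier-todo, frontier-done, frontier-score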
Also, install HBase and create a namespace called 'crawler'.
An example command is shown below. For details, refer to the HBase documentation.
$ hbase shell
> create_namespace 'crawler'
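To verify that HBase's Thrift server is reachable from Python (Frontera's HBase backend talks to it through happybase), a minimal check might look like this; host and port must match HBASE_THRIFT_HOST/HBASE_THRIFT_PORT above.

# check_hbase.py -- optional sanity check; assumes the Thrift server is started.
import happybase

connection = happybase.Connection(host='localhost', port=9090)
print(connection.tables())  # after the first run, expect the metadata/queue tables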
It is assumed that Kafka + ZooKeeper are running.
Launch two terminals and start each Frontera worker. The run_*.sh scripts restart the worker every time it terminates.
$ cd /path/to/project/root
$ bash scripts/run_db.sh
$ cd /path/to/project/root
$ bash scripts/run_strategy.sh
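For reference, each run_*.sh presumably wraps a restart loop around a Frontera worker process; a Python sketch of the same idea (the config module name is a placeholder, not the project's actual file):

# restart_loop.py -- sketch of the restart-on-exit behavior described above.
import subprocess

while True:
    subprocess.call(['python', '-m', 'frontera.worker.db',
                     '--config', 'frontier.db_settings'])  # placeholder module
    # The worker exited (crash or clean stop); loop around and start it again.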
To shut down, terminate Frontera as follows; the script below stops Frontera's restart loop.
$ cd /path/to/project/root
$ bash scripts/kill_frontera_loop.sh
It is assumed that the Frontera workers are running.
Create partition_id.txt in the project root as follows and run 'scripts/init.sh'.
The number written to the file is the ID of this Scrapy instance as managed by Frontera; in this example, the ID is 0.
$ cd /path/to/project/root
$ echo 0 > partition_id.txt
$ bash scripts/init.sh
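Frontera identifies each spider's share of the feed through its SPIDER_PARTITION_ID setting; one plausible way for the project to wire partition_id.txt into it (the wiring itself is an assumption) is:

# In a Scrapy/Frontera settings module -- a sketch, not the project's code.
from pathlib import Path

# Each Scrapy instance consumes the spider-feed partition with this ID.
SPIDER_PARTITION_ID = int(Path('partition_id.txt').read_text().strip())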
Launch as many terminals as there are Scrapy instances and start Scrapy in each. As with Frontera, the shell script restarts Scrapy every time it terminates.
$ cd /path/to/project/root
$ bash scripts/loop_scrapy.sh
Scrapy's logs are written to scrapy_log/scrapy.log. Since the file may be rotated by Python's logging module, use tail -F when monitoring.
$ tail -F ~/workspace/frontera7/japanese_company_spider[0,1]/scrapy_log/scrapy.log
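The rotation is also why tail -F (which re-opens the file after it is replaced) is preferred over tail -f. logging.conf presumably configures a rotating handler roughly like this sketch; the size and count are assumptions.

# Sketch of a size-based rotating handler; the project's logging.conf may differ.
import logging.handlers

handler = logging.handlers.RotatingFileHandler(
    'scrapy_log/scrapy.log',
    maxBytes=10 * 1024 * 1024,  # rotate after ~10 MB (assumed)
    backupCount=5,              # keep five rotated files (assumed)
)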