Scrapy + Frontera (A crawling project) - translated from the original Japanese version. (Thanks RY-2718)
- Scrapy
- Frontera
- Apache Kafka
- Apache HBase
- Twisted Python
- docs/twisted-change.md (modify Twisted according to the instructions in this file)
Do not run the following unless you have isolated the environment with virtualenv or similar. Install the dependencies for distributed / kafka / hbase, then uninstall frontera and reinstall it from the master branch:
$ pip install scrapy colorlog msgpack-python frontera[distributed,kafka,hbase]
$ pip uninstall frontera
$ pip install git+https://github.com/scrapinghub/frontera.git
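If you want to confirm which Frontera build is actually importable after the reinstall, a quick optional check (not part of the original instructions):

# verify_install.py -- optional sanity check, not required by the project.
import frontera
print(frontera.__file__)  # should point at the package installed from git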
Settings for Scrapy's behavior are in '/crawler/settings.py'. Frontera's settings are in '/frontier/common.py' and '/frontier/*_settings.py'. Logging settings are in 'logging.conf'.
The minimum items to set are listed below; a sketch of a complete settings module follows the list.
BUCKET_NAME # S3 bucket name
SPIDER_FEED_PARTITIONS # number of spiders (Scrapy)
SPIDER_LOG_PARTITIONS # number of workers (Frontera)
KAFKA_LOCATION # Kafka location: e.g., 'localhost:9092'
# Settings related to Kafka:
# Any names work as long as they match between Scrapy and Frontera, but it is reasonable to change the default names slightly.
SPIDER_LOG_DBW_GROUP
SPIDER_LOG_SW_GROUP
SCORING_LOG_DBW_GROUP
SPIDER_FEED_GROUP
SPIDER_LOG_TOPIC
SPIDER_FEED_TOPIC
SCORING_LOG_TOPIC
HBASE_THRIFT_HOST = 'localhost' # HBase location
HBASE_THRIFT_PORT = 9090 # Port where HBase's Thrift server listens; default is 9090
HBASE_METADATA_TABLE = 'metadata' # Table name used by Frontera; created automatically if it does not exist
HBASE_QUEUE_TABLE = 'queue' # Table name used by Frontera; created automatically if it does not exist
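For illustration, a Frontera settings module covering the items above might look like the following sketch. Every concrete value is an example or assumption; the topic names simply match the kafka-topics.sh commands further below.

# frontier/common.py -- a sketch only; adjust all values to your environment.
BUCKET_NAME = 'my-crawl-bucket'      # assumed S3 bucket name
SPIDER_FEED_PARTITIONS = 2           # number of Scrapy spiders
SPIDER_LOG_PARTITIONS = 1            # number of Frontera workers
KAFKA_LOCATION = 'localhost:9092'
# Consumer groups, renamed slightly from the defaults as suggested above.
SPIDER_LOG_DBW_GROUP = 'dbw-spider-log'
SPIDER_LOG_SW_GROUP = 'sw-spider-log'
SCORING_LOG_DBW_GROUP = 'dbw-scoring-log'
SPIDER_FEED_GROUP = 'fetchers-spider-feed'
# Topic names matching the kafka-topics.sh commands below.
SPIDER_LOG_TOPIC = 'frontier-done'
SPIDER_FEED_TOPIC = 'frontier-todo'
SCORING_LOG_TOPIC = 'frontier-score'
HBASE_THRIFT_HOST = 'localhost'
HBASE_THRIFT_PORT = 9090
HBASE_METADATA_TABLE = 'metadata'
HBASE_QUEUE_TABLE = 'queue'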
Install Kafka and create the topics (the names must match SPIDER_LOG_TOPIC, SPIDER_FEED_TOPIC, and SCORING_LOG_TOPIC above, and the partition counts should likewise agree with the *_PARTITIONS settings).
Example commands are shown below. For details, refer to the Kafka documentation.
$ /path/to/kafka/bin/kafka-topics.sh --create --topic frontier-done --replication-factor 1 --partitions 1 --zookeeper localhost:2181
$ /path/to/kafka/bin/kafka-topics.sh --create --topic frontier-score --replication-factor 1 --partitions 1 --zookeeper localhost:2181
$ /path/to/kafka/bin/kafka-topics.sh --create --topic frontier-todo --replication-factor 1 --partitions 2 --zookeeper localhost:2181
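To double-check that the topics exist, a short sketch using kafka-python (which the frontera[kafka] extra should already have pulled in) can list them; the bootstrap address is an assumption matching KAFKA_LOCATION above.

# check_topics.py -- optional sanity check.
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers='localhost:9092')
print(consumer.topics())  # expect frontier-todo, frontier-done, frontier-score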
Also, install HBase and create a namespace called 'crawler'.
An example command is shown below. For details, refer to the HBase documentation.
$ hbase shell
> create_namespace 'crawler'
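To verify that HBase's Thrift server is reachable from Python (Frontera's HBase backend talks to it through happybase), a minimal check might look like this; host and port must match HBASE_THRIFT_HOST/HBASE_THRIFT_PORT above.

# check_hbase.py -- optional sanity check; assumes the Thrift server is started.
import happybase

connection = happybase.Connection(host='localhost', port=9090)
print(connection.tables())  # after the first run, expect the metadata/queue tables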
It is assumed that Kafka + ZooKeeper are running.
Launch two terminals and start each Frontera worker. The run_*.sh scripts restart the worker every time it terminates.
$ cd /path/to/project/root
$ bash scripts/run_db.sh
$ cd /path/to/project/root
$ bash scripts/run_strategy.sh
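For reference, each run_*.sh presumably wraps a restart loop around a Frontera worker process; a Python sketch of the same idea (the config module name is a placeholder, not the project's actual file):

# restart_loop.py -- sketch of the restart-on-exit behavior described above.
import subprocess

while True:
    subprocess.call(['python', '-m', 'frontera.worker.db',
                     '--config', 'frontier.db_settings'])  # placeholder module
    # The worker exited (crash or clean stop); loop around and start it again.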
To shut down, terminate Frontera as follows; the script below stops Frontera's restart loop.
$ cd /path/to/project/root
$ bash scripts/kill_frontera_loop.sh
It is assumed that the Frontera workers are running.
Create partition_id.txt in the project root as follows and run 'scripts/init.sh'.
The number written to the file is the ID of this Scrapy instance as managed by Frontera; in this example, the ID is 0.
$ cd /path/to/project/root
$ echo 0 > partition_id.txt
$ bash scripts/init.sh
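Frontera identifies each spider's share of the feed through its SPIDER_PARTITION_ID setting; one plausible way for the project to wire partition_id.txt into it (the wiring itself is an assumption) is:

# In a Scrapy/Frontera settings module -- a sketch, not the project's code.
from pathlib import Path

# Each Scrapy instance consumes the spider-feed partition with this ID.
SPIDER_PARTITION_ID = int(Path('partition_id.txt').read_text().strip())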
Launch as many terminals as there are Scrapy instances and start Scrapy in each. As with Frontera, the shell script restarts Scrapy every time it terminates.
$ cd /path/to/project/root
$ bash scripts/loop_scrapy.sh
Scrapy's logs are written to scrapy_log/scrapy.log. Since the file may be rotated by Python's logging module, use tail -F when monitoring.
$ tail -F ~/workspace/frontera7/japanese_company_spider[0,1]/scrapy_log/scrapy.log
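The rotation is also why tail -F (which re-opens the file after it is replaced) is preferred over tail -f. logging.conf presumably configures a rotating handler roughly like this sketch; the size and count are assumptions.

# Sketch of a size-based rotating handler; the project's logging.conf may differ.
import logging.handlers

handler = logging.handlers.RotatingFileHandler(
    'scrapy_log/scrapy.log',
    maxBytes=10 * 1024 * 1024,  # rotate after ~10 MB (assumed)
    backupCount=5,              # keep five rotated files (assumed)
)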