This project crawls post content and comments from the PTT website and asynchronously applies the neural CKIP Chinese NLP tools to the scraped data.
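As a rough illustration of the NLP half, here is a minimal sketch of CKIP processing using the open-source `ckip-transformers` package. This is not this project's actual pipeline code; the model choice and the sample sentence are assumptions for illustration only:

```python
# Minimal sketch of CKIP neural NLP on scraped text, via ckip-transformers.
# NOT this project's pipeline code; model names and sample text are assumed.
from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger

texts = ["這個專案會非同步地爬取 PTT 的文章和留言"]  # e.g. one scraped post

ws_driver = CkipWordSegmenter(model="bert-base")  # word segmentation model
pos_driver = CkipPosTagger(model="bert-base")     # part-of-speech model

ws = ws_driver(texts)   # list of token lists, one per input sentence
pos = pos_driver(ws)    # POS tag per token, aligned with `ws`

print(list(zip(ws[0], pos[0])))  # (token, tag) pairs
```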
- Python version

  `python >= 3.10`
- Clone repository

  ```bash
  git clone git@github.com:Taiwan-Social-Media-Corpus/async-scraptt.git
  ```
- Install requirements

  - pip

    ```bash
    cd async-scraptt && pip install -r requirements.txt
    ```

  - poetry

    ```bash
    cd async-scraptt && poetry shell && poetry install
    ```
- Commands

  ```bash
  scrapy crawl ptt -a boards=BOARDS [-a fetch_all=BOOLEAN]
                   [-a index_from=NUMBER -a index_to=NUMBER]
                   [-a scrape_from=YEAR] [-a data_dir=PATH]
  ```

  Spider arguments (passed with `-a`):

  - `boards=BOARDS`: PTT board name (e.g. Soft_Job)
  - `index_from=NUMBER`, `index_to=NUMBER`: HTML index range of a PTT board
  - `fetch_all=BOOLEAN`: scrape all posts if true
  - `scrape_from=YEAR`: scrape all posts from a given year
  - `data_dir=PATH`: output file path (default: ./data)

  If you prefer launching the spider from Python instead of the shell, see the sketch after the examples below.
- Crawl all the posts of a board:

  ```bash
  scrapy crawl ptt -a boards=Soft_Job -a fetch_all=True
  ```
- Crawl all the posts of a board from a year in the past:

  ```bash
  scrapy crawl ptt -a boards=Soft_Job -a scrape_from=2022
  ```
- Crawl the posts of a board based on HTML indexes:

  ```bash
  scrapy crawl ptt -a boards=Soft_Job -a index_from=1722 -a index_to=1723
  ```

  Please make sure `index_to` is greater than or equal to `index_from`.
- Crawl the posts of multiple boards. For example:

  ```bash
  scrapy crawl ptt -a boards=Soft_Job,Gossiping -a index_from=1722 -a index_to=1723
  ```

  Note: the comma-separated `boards` argument cannot contain spaces. It cannot be `boards=Soft_Job, Baseball` or `boards=["Soft_Job", "Baseball"]`.
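The `-a` options above can also be supplied from Python through Scrapy's standard `CrawlerProcess`, which passes them to the spider as keyword arguments. A minimal sketch, assuming it is run from the project root so Scrapy can find the project settings and the `ptt` spider:

```python
# Minimal sketch: launch the crawler from Python instead of the CLI.
# Assumes this script runs from the project root so Scrapy can locate
# the project settings and the registered `ptt` spider.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
# keyword arguments map to the `-a` options shown above (passed as strings)
process.crawl("ptt", boards="Soft_Job", index_from="1722", index_to="1723")
process.start()  # blocks until the crawl finishes
```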
A Docker setup is provided for the crawler. To run the crawler, edit the `command` in the `docker-compose.yml` file:
```yaml
version: "3"
services:
  scraptt:
    build: .
    # define your crawler here!
    command: bash -c "poetry run scrapy crawl ptt -a boards=Soft_Job -a index_from=1500 -a index_to=1500"
    volumes:
      - "./data:/app/data"
```
Feel free to change the command as long as it follows the command format stated above. Now start the crawler:

```bash
docker compose up
```
If you have any suggestions or questions, please do not hesitate to email us at shukai@gmail.com or lixingyang.dev@gmail.com.