A simple program for crawling Burmese Wikipedia with the MediaWiki API. It starts by querying pages from "က" and crawls sequentially until the specified size is reached or there are no more pages to crawl.
[TOC]
## Installation

Install the requirements and you are good to go.

```bash
pip install -r requirements.txt
```
## How it works

- The program first queries the MediaWiki API for page titles in batches (500 titles per batch, the maximum allowed by MediaWiki); see the first sketch below.
- It then uses these titles to request the HTML of each page and collects the text from that page's content area.
- The text is segmented at sentence level and filtered with a regex so that only Burmese characters (Unicode U+1000 to U+1100) are kept; see the second sketch below.
- Each batch is written to its own text file, named by a batch index that starts from 0.
- The program automatically resumes from the last batch it saved before stopping, using `meta.json`; see the third sketch below.
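A minimal sketch of the title-batching step, assuming the standard `list=allpages` endpoint of the Burmese Wikipedia API and the `requests` library; the function name and parameters here mirror the MediaWiki documentation, not necessarily the project's actual code.

```python
import requests

API_URL = "https://my.wikipedia.org/w/api.php"  # Burmese Wikipedia API endpoint

def fetch_title_batch(apfrom="\u1000", session=None):
    """Fetch one batch of up to 500 page titles, starting from `apfrom` ("က" by default)."""
    session = session or requests.Session()
    params = {
        "action": "query",
        "list": "allpages",
        "apfrom": apfrom,   # start listing from this title
        "aplimit": 500,     # maximum batch size allowed by MediaWiki
        "format": "json",
    }
    data = session.get(API_URL, params=params).json()
    titles = [page["title"] for page in data["query"]["allpages"]]
    # "continue" carries the cursor for the next batch; it is absent when no pages remain.
    next_from = data.get("continue", {}).get("apcontinue")
    return titles, next_from
```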
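Fetching a page and keeping only Burmese text might look roughly like this; `BeautifulSoup`, the `mw-content-text` container id, and splitting on the Burmese sentence mark "။" are assumptions, while the character range matches the U+1000 to U+1100 filter described above.

```python
import re
import requests
from bs4 import BeautifulSoup

BURMESE_RE = re.compile(r"[\u1000-\u1100]+")  # keep only Burmese characters

def page_to_sentences(title, session=None):
    """Download one article page and return sentence-level Burmese text."""
    session = session or requests.Session()
    url = "https://my.wikipedia.org/wiki/" + requests.utils.quote(title)
    html = session.get(url).text
    # On Wikipedia pages the article body sits inside the div with id "mw-content-text".
    content = BeautifulSoup(html, "html.parser").find(id="mw-content-text")
    text = content.get_text(" ") if content else ""
    sentences = []
    for sentence in text.split("\u104b"):  # "။" (U+104B) ends a Burmese sentence
        burmese_only = " ".join(BURMESE_RE.findall(sentence))
        if burmese_only:
            sentences.append(burmese_only + "\u104b")
    return sentences
```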
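Per-batch storage and resumption could be wired together as sketched here; the batch file name and the `meta.json` keys (`batch_index`, `next_title`) are illustrative assumptions, not the exact format the project writes.

```python
import json
import os

def load_meta(output_dir="results"):
    """Return (batch_index, next_title) from meta.json, or defaults for a fresh run."""
    path = os.path.join(output_dir, "meta.json")
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            meta = json.load(f)
        return meta["batch_index"], meta["next_title"]
    return 0, "\u1000"  # fresh run: batch 0, starting from "က"

def save_batch(sentences, batch_index, next_title, output_dir="results"):
    """Write one text file per batch (index starts from 0) and update meta.json."""
    os.makedirs(output_dir, exist_ok=True)
    with open(os.path.join(output_dir, f"{batch_index}.txt"), "w", encoding="utf-8") as f:
        f.write("\n".join(sentences))
    with open(os.path.join(output_dir, "meta.json"), "w", encoding="utf-8") as f:
        json.dump({"batch_index": batch_index + 1, "next_title": next_title}, f)
```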
## Usage

```
python extract.py -h
usage: extract.py [-h] [-l LOG_DIR] [--max_size MAX_SIZE]
                  [--output_dir OUTPUT_DIR]

Web Crawler for Burmese wiki.

optional arguments:
  -h, --help            show this help message and exit
  -l LOG_DIR, --log_dir LOG_DIR
                        Specify logs directory for errors (default: logs)
  --max_size MAX_SIZE   Specify max size (in MB) to crawl wiki. (default: 1000)
  --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                        Output directory for storing corpus (default: results)
```
## TODO

- Remove the max_size limit.
- Better filtering of Burmese characters.
- Optimize corpus storing.