A simple program for crawling Burmese Wikipedia with the MediaWiki API. It starts by querying pages from "က" and crawls sequentially until the specified size is reached or there are no more pages to crawl.
[TOC]
## Installation

Install the requirements and you are good to go.

```bash
pip install -r requirements.txt
```
## How it works

- The program first queries the MediaWiki API for page titles in batches (500 titles per batch, the maximum allowed by MediaWiki); see the first sketch below.
- It then uses these titles to request the HTML of each page and collects the text from that page's content area.
- The text is segmented at sentence level and filtered with a regex so that only Burmese characters (Unicode U+1000 to U+1100) are kept; see the second sketch below.
- Each batch is written to its own text file, named by a batch index that starts from 0.
- The program automatically resumes from the last batch it saved before stopping, using `meta.json`; see the third sketch below.
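A minimal sketch of the title-batching step, assuming the standard `list=allpages` endpoint of the Burmese Wikipedia API and the `requests` library; the function name and parameters here mirror the MediaWiki documentation, not necessarily the project's actual code.

```python
import requests

API_URL = "https://my.wikipedia.org/w/api.php"  # Burmese Wikipedia API endpoint

def fetch_title_batch(apfrom="\u1000", session=None):
    """Fetch one batch of up to 500 page titles, starting from `apfrom` ("က" by default)."""
    session = session or requests.Session()
    params = {
        "action": "query",
        "list": "allpages",
        "apfrom": apfrom,   # start listing from this title
        "aplimit": 500,     # maximum batch size allowed by MediaWiki
        "format": "json",
    }
    data = session.get(API_URL, params=params).json()
    titles = [page["title"] for page in data["query"]["allpages"]]
    # "continue" carries the cursor for the next batch; it is absent when no pages remain.
    next_from = data.get("continue", {}).get("apcontinue")
    return titles, next_from
```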
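Fetching a page and keeping only Burmese text might look roughly like this; `BeautifulSoup`, the `mw-content-text` container id, and splitting on the Burmese sentence mark "။" are assumptions, while the character range matches the U+1000 to U+1100 filter described above.

```python
import re
import requests
from bs4 import BeautifulSoup

BURMESE_RE = re.compile(r"[\u1000-\u1100]+")  # keep only Burmese characters

def page_to_sentences(title, session=None):
    """Download one article page and return sentence-level Burmese text."""
    session = session or requests.Session()
    url = "https://my.wikipedia.org/wiki/" + requests.utils.quote(title)
    html = session.get(url).text
    # On Wikipedia pages the article body sits inside the div with id "mw-content-text".
    content = BeautifulSoup(html, "html.parser").find(id="mw-content-text")
    text = content.get_text(" ") if content else ""
    sentences = []
    for sentence in text.split("\u104b"):  # "။" (U+104B) ends a Burmese sentence
        burmese_only = " ".join(BURMESE_RE.findall(sentence))
        if burmese_only:
            sentences.append(burmese_only + "\u104b")
    return sentences
```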
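Per-batch storage and resumption could be wired together as sketched here; the batch file name and the `meta.json` keys (`batch_index`, `next_title`) are illustrative assumptions, not the exact format the project writes.

```python
import json
import os

def load_meta(output_dir="results"):
    """Return (batch_index, next_title) from meta.json, or defaults for a fresh run."""
    path = os.path.join(output_dir, "meta.json")
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            meta = json.load(f)
        return meta["batch_index"], meta["next_title"]
    return 0, "\u1000"  # fresh run: batch 0, starting from "က"

def save_batch(sentences, batch_index, next_title, output_dir="results"):
    """Write one text file per batch (index starts from 0) and update meta.json."""
    os.makedirs(output_dir, exist_ok=True)
    with open(os.path.join(output_dir, f"{batch_index}.txt"), "w", encoding="utf-8") as f:
        f.write("\n".join(sentences))
    with open(os.path.join(output_dir, "meta.json"), "w", encoding="utf-8") as f:
        json.dump({"batch_index": batch_index + 1, "next_title": next_title}, f)
```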
## Usage

```
python extract.py -h
usage: extract.py [-h] [-l LOG_DIR] [--max_size MAX_SIZE]
                  [--output_dir OUTPUT_DIR]

Web Crawler for Burmese wiki.

optional arguments:
  -h, --help            show this help message and exit
  -l LOG_DIR, --log_dir LOG_DIR
                        Specify logs directory for errors (default: logs)
  --max_size MAX_SIZE   Specify max size (in MB) to crawl wiki. (default: 1000)
  --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                        Output directory for storing corpus (default: results)
```
## TODO

- Remove the max_size limit.
- Better filtering of Burmese characters.
- Optimize corpus storing.