CORA/wikipedia_preprocess at main · AkariAsai/CORA

Files in this directory:

README.md
build_db.py
build_dpr_w100_data.py
create_w100_data_japanese.py
create_w100_data_khmer.py
create_w100_data_thai.py
doc_db.py
utils.py

Wikipedia preprocessing code

This directory contains the code to preprocess Wikipedias.
The pipeline has three steps. First, download the Wikipedia dumps (1. Download Wikipedia dumps). Next, preprocess the data and store it in a SQLite DB file (2. Store data into database). Finally, create a context file by splitting each article into 100-token chunks and writing them to a TSV file (3. Create a DPR context file).

1. Download Wikipedia dumps

First, download a Wikipedia dump from the Wikimedia website. The website keeps only the most recent dumps, so if you need a dump from a specific date, check the archive.

For example, all of the dump files for Japanese Wikipedia 20190201 can be seen and downloaded here. The file jawiki-20190201-pages-articles-multistream.xml.bz2 contains the article text.
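The file name above follows the pattern also used in the WikiExtractor command in this README (xxwiki-20190201-...): a language code, "wiki", a YYYYMMDD date, and a fixed suffix. A minimal sketch that builds the name is shown here; the helper `dump_filename` is hypothetical and is not part of this directory.

```python
def dump_filename(lang: str, date: str) -> str:
    """Return the article dump file name for a language code and a YYYYMMDD date.

    Hypothetical helper: it only formats the name and does not download anything.
    """
    if len(date) != 8 or not date.isdigit():
        raise ValueError(f"expected a YYYYMMDD date, got {date!r}")
    return f"{lang}wiki-{date}-pages-articles-multistream.xml.bz2"


if __name__ == "__main__":
    print(dump_filename("ja", "20190201"))
```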

Run Wikiextractor to extract plain text

We usually run Wikiextractor to preprocess and extract plain text data from the Wikipedia dump.

git clone https://github.com/attardi/wikiextractor.git
cd wikiextractor
python WikiExtractor.py /path/to/your/xxwiki-20190201-pages-articles-multistream.xml.bz2 --filter_disambig_pages --json -o /path/to/output/directory -s

You can add the -c (--compress) option to compress the output files with bzip2.
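To inspect the extracted data before loading it into a database, something like the following sketch may help. It assumes that the --json option writes one JSON object per line, that the output files are named wiki_NN inside subdirectories of the output directory, and that files written with -c carry a .bz2 suffix. The function name `iter_articles` and the field names used in the usage note are assumptions, so check them against your own output.

```python
import bz2
import json
from pathlib import Path
from typing import Dict, Iterator


def iter_articles(output_dir: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per article from a WikiExtractor --json output directory."""
    for path in sorted(Path(output_dir).rglob("wiki_*")):
        if not path.is_file():
            continue
        # Assumption: compressed output files end in .bz2, the rest are plain text.
        opener = bz2.open if path.suffix == ".bz2" else open
        with opener(path, "rt", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines between records
                    yield json.loads(line)
```

For example, `for article in iter_articles("/path/to/output/directory"): print(article.get("title"))` would list the extracted titles, if each record has a "title" field.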

2. Store data into database

You can store the processed text data in a SQLite database.

python build_db.py /path/to/preprocessed/data/dir /path/to/db/file.db
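As a rough illustration of what storing articles in SQLite involves, here is a minimal sketch using Python's standard sqlite3 module. The table name `documents`, its two columns, and the function `store_documents` are assumptions made for this example; they are not confirmed to match what build_db.py actually creates.

```python
import sqlite3
from typing import Iterable, Tuple


def store_documents(db_path: str, docs: Iterable[Tuple[str, str]]) -> int:
    """Insert (id, text) pairs into a `documents` table and return the row count.

    The schema is an assumption for illustration only.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS documents (id TEXT PRIMARY KEY, text TEXT)"
        )
        # INSERT OR REPLACE keeps the last text seen for a repeated id.
        conn.executemany("INSERT OR REPLACE INTO documents VALUES (?, ?)", docs)
        conn.commit()
        (count,) = conn.execute("SELECT COUNT(*) FROM documents").fetchone()
        return count
    finally:
        conn.close()
```

Keying rows by a unique id means a rerun over the same input overwrites rows instead of duplicating them.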

3. Create a DPR context file

DPR does not use the original paragraphs or articles as is; it first splits each article into 100-token chunks. Run the command below to generate a TSV file in which each line contains one 100-token Wikipedia passage.

python build_dpr_w100_data.py --db_path /path/to/db/file.db --tsv_path /path/to/output/file.tsv
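The chunking step can be sketched as follows for languages that separate words with whitespace. This is an illustration under stated assumptions, not the logic of build_dpr_w100_data.py: the function names are hypothetical, the column order (id, text, title) follows the layout of DPR's published passage files, and the last chunk of an article is simply kept shorter than 100 tokens.

```python
import csv
from typing import Iterable, List, Tuple


def split_into_chunks(text: str, chunk_size: int = 100) -> List[str]:
    """Split whitespace-tokenized text into chunks of at most chunk_size tokens."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


def write_tsv(tsv_path: str, articles: Iterable[Tuple[str, str]],
              chunk_size: int = 100) -> int:
    """Write (title, text) articles as passages and return the passage count."""
    passage_id = 0
    with open(tsv_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["id", "text", "title"])  # assumed header row
        for title, text in articles:
            for chunk in split_into_chunks(text, chunk_size):
                passage_id += 1
                writer.writerow([passage_id, chunk, title])
    return passage_id
```

With this sketch, an article of 250 tokens yields three passages of 100, 100, and 50 tokens.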

Japanese and Thai do not use white space to separate words. For those languages, you need to run the special scripts below, which tokenize the input text and then generate 100-token document chunks as for the other languages.

For Japanese: create_w100_data_japanese.py
For Thai: create_w100_data_thai.py
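The idea behind these scripts can be sketched as a chunker that takes the tokenizer as a parameter. Everything named here is hypothetical: `chunk_tokens` and `char_tokenize` are not functions from this directory, and `char_tokenize` is only a character-level stand-in. A real run needs a proper word segmenter for the language; this README does not say which one the scripts use.

```python
from typing import Callable, List


def chunk_tokens(text: str, tokenize: Callable[[str], List[str]],
                 chunk_size: int = 100, joiner: str = "") -> List[str]:
    """Tokenize text with the given function and group tokens into chunks.

    Tokens are joined with `joiner`, which defaults to the empty string
    because these languages are written without spaces between words.
    """
    tokens = tokenize(text)
    return [
        joiner.join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


def char_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: one token per non-space character (not a real segmenter)."""
    return [ch for ch in text if not ch.isspace()]


if __name__ == "__main__":
    # Small chunk size so the grouping is visible on a short sentence.
    print(chunk_tokens("東京は日本の首都です", char_tokenize, chunk_size=4))
```

Passing the tokenizer as an argument keeps the chunking logic identical across languages, so only the segmentation step changes.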

References

List of Wikipedias: you can check the statistics of each Wikipedia in the Details table section.
Wikimedia Archive
