nadirkhanlou/fawiki-preprocessor

fawiki-preprocessor

A preprocessor for Persian Wikipedia articles

This script extracts plain text from Persian Wikipedia articles that have already been extracted with WikiExtractor.

Usage

To use the pre-processor, first check that you have the required libraries installed.

Then, use the guide below to pre-process your data. Note that the input file must be in the format of WikiExtractor's output.
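As a rough illustration of the expected input, WikiExtractor's default text output wraps each article in a `<doc ...>` / `</doc>` pair, with the title given as an attribute on the opening tag. The sketch below reads such a file article by article. The function name `iter_articles` and the regular expression are hypothetical and written only for this example; this is not necessarily how `preprocess.py` parses its input.

```python
import re

# Assumption: each article starts with a line like
#   <doc id="..." url="..." title="...">
# and ends with a line containing </doc>, as in WikiExtractor's default output.
DOC_OPEN = re.compile(r'<doc\b[^>]*\btitle="([^"]*)"[^>]*>')


def iter_articles(lines):
    """Yield (title, text) pairs from an iterable of WikiExtractor output lines."""
    title, body = None, []
    for line in lines:
        line = line.rstrip("\n")
        match = DOC_OPEN.match(line)
        if match:
            # Start of a new article: remember its title and reset the body.
            title, body = match.group(1), []
        elif line.startswith("</doc>"):
            # End of the current article: emit it if one was open.
            if title is not None:
                yield title, "\n".join(body).strip()
            title = None
        elif title is not None:
            body.append(line)
```

With a real file, this could be called as `iter_articles(open(path, encoding="utf-8"))`, where `path` points to one of the files WikiExtractor produced.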

usage: preprocess.py [-h] [-i INPUT] [-o OUTPUT] [-n]

Required Parameters:
-i, --input  <PATH>  Path to the input file
-o, --output <PATH>  Path to the output file

Optional Parameters:
-h, --help               Help
-n, --normalize-output   Normalize the output text (default False)
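
For readers who want to see how the options above fit together, here is a minimal `argparse` sketch that accepts the same flags. It is reconstructed from the usage text only; the function name `build_parser` is hypothetical, and the actual `preprocess.py` may define its arguments differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (a sketch, not the original code)."""
    # add_help=False so that -h/--help can be listed under the optional group.
    parser = argparse.ArgumentParser(prog="preprocess.py", add_help=False)

    required = parser.add_argument_group("Required Parameters")
    required.add_argument("-i", "--input", metavar="<PATH>", required=True,
                          help="Path to the input file")
    required.add_argument("-o", "--output", metavar="<PATH>", required=True,
                          help="Path to the output file")

    optional = parser.add_argument_group("Optional Parameters")
    optional.add_argument("-h", "--help", action="help", help="Help")
    # store_true gives the documented default of False.
    optional.add_argument("-n", "--normalize-output", action="store_true",
                          help="Normalize the output text (default False)")
    return parser
```

Calling `build_parser().parse_args(["-i", "in.txt", "-o", "out.txt", "-n"])` would yield a namespace with `input`, `output`, and `normalize_output` attributes; the file names here are placeholders.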

Corpora

Two corpora (one normalized and one non-normalized) made from the 2021-01-20 Wikipedia dumps will be uploaded soon.
