A preprocessor for Persian Wikipedia articles
This script extracts plaintext from Persian Wikipedia articles that were dumped with WikiExtractor.
To use the pre-processor, first check that you have the required libraries installed (a typical install command is shown after this list):
- Hazm
- BS4
- NLTK
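If any of these are missing, they can usually be installed from PyPI (assuming the standard package names hazm, beautifulsoup4, and nltk):

    pip install hazm beautifulsoup4 nltk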
Then, use the guide below to pre-process your data; an example invocation is shown after the parameter list. Note that the input file must be in WikiExtractor's output format.
usage: preprocess.py [-h] [-i INPUT] [-o OUTPUT] [-n]
Required Parameters:
-i, --input <PATH> Path to the input file
-o, --output <PATH> Path to the output file
Optional Parameters:
-h, --help Help
-n, --normalize-output Normalize the output text (default False)
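For example, to pre-process an extracted dump and normalize the result (the file names below are only placeholders):

    python preprocess.py -i extracted_wiki.txt -o corpus.txt -n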
Two corpora (one normalized and one non-normalized) made from the 2021-01-20 Wikipedia dumps will be uploaded soon.
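For readers who want to reproduce something similar before the corpora are available, the sketch below shows roughly what such a pipeline can look like with BS4 and Hazm. It is only an illustration, not the script's actual code: the function name extract_plaintext and the choice of the html.parser backend are assumptions.

    # Sketch: pull plaintext out of WikiExtractor output and optionally
    # normalize it with Hazm. Illustrative only, not preprocess.py itself.
    from bs4 import BeautifulSoup
    from hazm import Normalizer

    def extract_plaintext(raw_text, normalize=False):
        # WikiExtractor wraps each article in a <doc ...> ... </doc> element.
        soup = BeautifulSoup(raw_text, "html.parser")
        normalizer = Normalizer() if normalize else None
        articles = []
        for doc in soup.find_all("doc"):
            text = doc.get_text().strip()
            if normalizer is not None:
                # Hazm's Normalizer fixes spacing and unifies Arabic/Persian
                # character variants, among other orthographic clean-ups.
                text = normalizer.normalize(text)
            articles.append(text)
        return articles

The normalized corpus corresponds to running the pipeline with normalization enabled (the -n flag); the non-normalized corpus corresponds to leaving it off.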