preprocessing-crf-ner

Description

This project assesses how preprocessing tasks affect named entity recognition (NER) performance on Indonesian text. It compares the tasks across feature sets of varying dimensionality and examines possible interactions among them.

Preprocessing Procedures

Contraction Expansion
Lowercase Conversion
Stemming
Number to Words Conversion
Hyphen and Comma Splitting
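
As a rough illustration of two of the procedures above (lowercase conversion and hyphen and comma splitting), here is a minimal sketch. The function names `lowercase_tokens` and `split_hyphen_comma` are hypothetical and are not taken from this repository; the actual implementation may differ.

```python
import re


def lowercase_tokens(tokens):
    """Return a copy of the token list with every token lowercased."""
    return [t.lower() for t in tokens]


def split_hyphen_comma(tokens):
    """Split tokens on hyphens and commas, keeping the delimiters as tokens."""
    out = []
    for t in tokens:
        # A capturing group makes re.split keep the delimiters;
        # empty strings produced at token edges are dropped.
        out.extend(piece for piece in re.split(r"([-,])", t) if piece)
    return out


tokens = ["Anak-anak", "di", "Padang,", "Sumatera", "Barat"]
print(split_hyphen_comma(lowercase_tokens(tokens)))
```

Note that splitting changes the number of tokens, so in an NER setting the entity labels would have to be realigned with the new tokens; that step is not shown here.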

Feature Extraction

The word
The length of the word or number of characters
Prefixes and suffixes of the word of varying lengths
The word in lowercase
Stemmed version of the word, obtained by stripping trailing vowels and the letters g, y, and n from the end of the word while leaving a stem of at least two characters
Whether the word is a punctuation mark
Whether the word is a digit
Features mentioned above for the previous word, the following word, and the words two places before and after
Word POS tag
Whether the word is at the beginning of the sentence (BOS), at the end of the sentence (EOS), or neither
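
The feature list above can be sketched as a per-token feature dictionary, the input format commonly used with sklearn-crfsuite. This is an illustrative sketch, not the repository's code: the names `stem`, `token_features`, `word2features`, and `sent2features`, the feature keys, the affix lengths (1 to 3), lowercasing before stemming, and the POS tags in the example are all assumptions.

```python
import re
import string


def stem(word):
    """Strip trailing vowels and g, y, n, leaving a stem of at least 2 characters."""
    return re.sub(r"(.{2,}?)([aeiougyn]+$)", r"\1", word)


def token_features(word, postag, prefix=""):
    """Features of a single token; `prefix` marks context tokens, e.g. '-1:'."""
    feats = {
        prefix + "word": word,
        prefix + "len": len(word),
        prefix + "lower": word.lower(),
        prefix + "stem": stem(word.lower()),  # lowercasing first is an assumption
        prefix + "is_punct": bool(word) and all(c in string.punctuation for c in word),
        prefix + "is_digit": word.isdigit(),
        prefix + "postag": postag,
    }
    # Prefixes and suffixes of varying lengths (1 to 3 is an assumed range).
    for n in (1, 2, 3):
        feats[f"{prefix}prefix{n}"] = word[:n]
        feats[f"{prefix}suffix{n}"] = word[-n:]
    return feats


def word2features(sent, i):
    """Features for token i of `sent`, a list of (word, postag) pairs."""
    word, postag = sent[i]
    feats = token_features(word, postag)
    # Sentence position; a one-token sentence is reported as BOS here.
    if i == 0:
        feats["position"] = "BOS"
    elif i == len(sent) - 1:
        feats["position"] = "EOS"
    else:
        feats["position"] = "MID"
    # Same features for the words one and two places before and after.
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(sent):
            feats.update(token_features(*sent[j], prefix=f"{offset:+d}:"))
    return feats


def sent2features(sent):
    """Feature dictionaries for every token in the sentence."""
    return [word2features(sent, i) for i in range(len(sent))]


sentence = [("Gubernur", "NN"), ("Sumatera", "NNP"), ("Barat", "NNP")]
features = sent2features(sentence)
print(features[0]["position"], features[0]["+1:word"])
```

Each sentence becomes a list of dictionaries, one per token, which could then be passed to a CRF trainer together with the matching label sequences.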

Requirements

Both Linux and Windows are supported. Linux is recommended for performance and compatibility reasons.
64-bit Python 3.7 installation.
I recommend sklearn-crfsuite 0.3.6, which I used for all experiments.
Download singgalang.tsv and store it in the data directory.
Download all_indo_man_tag_corpus_model.crf.tagger and store it in the pre-trained-model directory.

Usage

python main.py
