This work extensively assesses how preprocessing tasks affect named entity recognition (NER) performance on Indonesian text across various feature dimensions, including possible interactions among these tasks.
The preprocessing tasks assessed are:
- Contractions Expansion
- Lowercase Conversion
- Stemming
- Number to Words Conversion
- Hyphen and Comma Splitting
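
As a rough illustration of how some of these tasks could be toggled independently (which is what makes it possible to study their interactions), here is a minimal sketch. It is not the code used in this repository. The `CONTRACTIONS` mapping, the function names, and the choice to drop the hyphen and comma separators are all assumptions made for the example. Stemming and number-to-words conversion are left out because they need language-specific resources.

```python
import re

# Hypothetical mapping of common Indonesian short forms to their full forms.
CONTRACTIONS = {"yg": "yang", "tdk": "tidak", "dgn": "dengan"}


def expand_contractions(tokens):
    """Replace known short forms; unknown tokens pass through unchanged."""
    return [CONTRACTIONS.get(tok.lower(), tok) for tok in tokens]


def split_hyphen_comma(tokens):
    """Split tokens on hyphens and commas (assumption: separators are dropped)."""
    out = []
    for tok in tokens:
        parts = [p for p in re.split(r"[-,]", tok) if p]
        # Keep tokens that consist only of separators, e.g. a lone ",".
        out.extend(parts or [tok])
    return out


def preprocess(tokens, expand=True, lowercase=True, split=True):
    """Apply the selected preprocessing tasks in a fixed order."""
    if expand:
        tokens = expand_contractions(tokens)
    if lowercase:
        tokens = [tok.lower() for tok in tokens]
    if split:
        tokens = split_hyphen_comma(tokens)
    return tokens


print(preprocess(["Anak-anak", "yg", "tdk", "datang"]))
# ['anak', 'anak', 'yang', 'tidak', 'datang']
```

Each keyword argument switches one task on or off, so a call such as `preprocess(tokens, lowercase=False)` gives one of the combinations an experiment might compare.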
The features used for each word are:
- The word
- The length of the word or number of characters
- Prefixes and suffixes of the word of varying lengths
- The word in lowercase
- A stemmed version of the word, produced by deleting all vowels, together with g, y, and n, from the end of the word while leaving a stem at least 2 characters long
- Whether the word is a punctuation mark
- Whether the word is a digit
- Features mentioned above for the previous word, the following word, and the words two places before and after
- Word POS tag
- Whether the word is at the beginning of the sentence (BOS), at the end of the sentence (EOS), or neither
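
The feature list above can be sketched as a feature function in the dictionary style that CRF toolkits commonly accept. This is one possible reading of the list, not the repository's implementation. The function names, the affix lengths (1 to 3), the feature key names, and the sample POS tags are assumptions. The stemmer follows the rule as stated: it strips trailing vowels, g, y, and n while keeping at least 2 characters.

```python
import string


def simple_stem(word):
    """Strip trailing vowels, g, y, and n, keeping at least 2 characters."""
    stem = word
    while len(stem) > 2 and stem[-1].lower() in "aeiougyn":
        stem = stem[:-1]
    return stem


def token_features(word):
    """Features that depend only on the surface form of a single word."""
    feats = {
        "word": word,
        "length": len(word),
        "lower": word.lower(),
        "stem": simple_stem(word),
        "is_punct": bool(word) and all(ch in string.punctuation for ch in word),
        "is_digit": word.isdigit(),
    }
    # Affix lengths 1 to 3 are an assumption; the list only says "varying".
    for n in (1, 2, 3):
        if len(word) >= n:
            feats[f"prefix{n}"] = word[:n]
            feats[f"suffix{n}"] = word[-n:]
    return feats


def word2features(sent, i):
    """Build the feature dict for token i of sent, a list of (word, pos) pairs."""
    features = {}
    # Window covers the word itself and up to two words on either side.
    for offset in (-2, -1, 0, 1, 2):
        j = i + offset
        if 0 <= j < len(sent):
            for name, value in token_features(sent[j][0]).items():
                features[f"{offset:+d}:{name}"] = value
    features["pos"] = sent[i][1]
    features["BOS"] = i == 0
    features["EOS"] = i == len(sent) - 1
    return features


# Illustrative sentence; the POS tags are placeholders.
sent = [("Presiden", "NNP"), ("Joko", "NNP"), ("Widodo", "NNP"), ("tiba", "VB"), (".", "Z")]
print(word2features(sent, 0)["+0:stem"])  # Presid
```

Keys are prefixed with the window offset (`-2` to `+2`), so the same surface feature for a neighbouring word stays distinct from the feature of the current word.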
Requirements:
- Both Linux and Windows are supported. Linux is recommended for performance and compatibility reasons.
- 64-bit Python 3.7 installation.
- I recommend sklearn-crfsuite 0.3.6, which I used for all experiments.
- Download singgalang.tsv and store it in the `data` directory.
- Download all_indo_man_tag_corpus_model.crf.tagger and store it in the `pre-trained-model` directory.
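
Before running, it may help to confirm that both downloads are in place. The following is a small optional check, not part of the repository. The helper name `missing_files` is hypothetical, and it assumes the `data` and `pre-trained-model` directories sit in the directory you run from.

```python
from pathlib import Path

# File locations as described in the setup steps above.
REQUIRED_FILES = [
    Path("data") / "singgalang.tsv",
    Path("pre-trained-model") / "all_indo_man_tag_corpus_model.crf.tagger",
]


def missing_files(root="."):
    """Return the required files that are not present under root."""
    return [p for p in REQUIRED_FILES if not (Path(root) / p).is_file()]


if __name__ == "__main__":
    for path in missing_files():
        print(f"Missing: {path}")
```
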
python main.py