This short script performs grammar-dependent morphological processing of raw text data. The data can be either a large text corpus used for computing word embeddings or a smaller labeled dataset used for training a neural network on a given downstream task (e.g. named entity recognition). Running this script before any training improves the quality of the original resources, ultimately leading to better final performance.
The pre-trained word embeddings produced with this morphological processing are provided (under the CC-BY-4.0 license) at the following link.
NOTE: The results of this script, i.e. (1) word embeddings and (2) labeled datasets, can be used to train the NER Tagger in order to reproduce and evaluate the performance boost. Further details can be found in the reference below. Please cite the reference if you use these resources in your work.
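As a rough illustration of how the same processing can be applied to both kinds of input, here is a minimal sketch. It assumes the morphological processing amounts to a token-level normalization such as lemmatization, which is not stated explicitly here. The lookup table `TOY_LEMMAS` and the functions `toy_lemmatize`, `process_corpus_line` and `process_labeled_sentence` are hypothetical names invented for this example and are not part of the script. With spaCy, the lemma of a token would typically be read from `token.lemma_` of a loaded language model instead of a toy table.

```python
from typing import Callable, List, Tuple

# Hypothetical toy lookup standing in for a real morphological analyzer
# (e.g. the lemmatizer of a loaded spaCy model).
TOY_LEMMAS = {"houses": "house", "went": "go", "cities": "city"}


def toy_lemmatize(token: str) -> str:
    """Return the lemma from the toy table, or the token unchanged."""
    return TOY_LEMMAS.get(token.lower(), token)


def process_corpus_line(line: str,
                        lemmatize: Callable[[str], str] = toy_lemmatize) -> str:
    """Normalize one whitespace-tokenized line of a raw text corpus."""
    return " ".join(lemmatize(tok) for tok in line.split())


def process_labeled_sentence(
    pairs: List[Tuple[str, str]],
    lemmatize: Callable[[str], str] = toy_lemmatize,
) -> List[Tuple[str, str]]:
    """Normalize the tokens of a labeled sentence, leaving labels untouched.

    Each token is mapped to exactly one output token, so the
    token/label alignment of the dataset is preserved.
    """
    return [(lemmatize(tok), label) for tok, label in pairs]


if __name__ == "__main__":
    print(process_corpus_line("She went to two cities"))
    print(process_labeled_sentence([("Berlin", "B-LOC"), ("houses", "O")]))
```

Applying the same normalization function to the embedding corpus and to the labeled dataset keeps the vocabulary of both resources consistent, which is the point of running the processing before any training.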
- spaCy
- Python 3
Sajawel Ahmed and Alexander Mehler, "Resource-Size matters: Improving Neural Named Entity Recognition with Optimized Large Corpora" in Proceedings of the 17th IEEE International Conference on Machine Learning and Applications (ICMLA), 2018. [PDF]
@InProceedings{Ahmed:Mehler:2018,
author = {Sajawel Ahmed and Alexander Mehler},
title = {{Resource-Size matters: Improving Neural Named Entity Recognition with Optimized Large Corpora}},
booktitle = {Proceedings of the 17th IEEE International Conference on Machine Learning and Applications (ICMLA)},
location = {Orlando, Florida, USA},
pdf = {https://arxiv.org/pdf/1807.10675.pdf},
year = 2018
}