German Morphological Processing for Word Embeddings & Named Entity Recognition

This short script performs grammar-dependent morphological processing of raw text data. The data can be either a large text corpus used for computing word embeddings or a smaller labeled dataset used for training a neural network on a given downstream task (e.g. named entity recognition). Running this script before any training improves the quality of the original resources, ultimately leading to better final performance.
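As a rough illustration of this kind of preprocessing, the sketch below replaces every token of a raw corpus with its lemma before the corpus is passed on to embedding training. It is not the code of `morphProcessing.py`: it assumes that the processing amounts to lemmatization, that the corpus has one sentence per line with whitespace-separated tokens, and it uses a hypothetical toy lexicon (`TOY_LEMMAS`, `toy_lemmatize`) in place of a real German lemmatizer.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical toy lexicon, for illustration only.
# A real run would use a proper German lemmatizer instead.
TOY_LEMMAS = {
    "Kinder": "Kind",
    "spielten": "spielen",
    "den": "der",
    "Gärten": "Garten",
    "Häuser": "Haus",
}


def toy_lemmatize(token: str) -> str:
    """Return the lemma from the toy lexicon, or the token unchanged."""
    return TOY_LEMMAS.get(token, token)


def lemmatize_corpus(
    lines: Iterable[str],
    lemmatize: Callable[[str], str] = toy_lemmatize,
) -> Iterator[str]:
    """Yield each non-empty line with every token replaced by its lemma."""
    for line in lines:
        tokens = line.split()
        if tokens:  # skip empty lines
            yield " ".join(lemmatize(tok) for tok in tokens)


corpus = ["Die Kinder spielten in den Gärten", "", "Häuser"]
processed = list(lemmatize_corpus(corpus))
print(processed)
```

The lemmatizer is passed in as a plain function so that the toy lexicon can be swapped out, for example for a function that runs a spaCy German pipeline over the text and reads each token's `lemma_` attribute.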

The pre-trained word embeddings produced with this morphological processing are provided (under the CC-BY-4.0 license) at the following link.

NOTE: The results of this script, i.e. (1) word embeddings and (2) labeled datasets, can be used to train the NER Tagger in order to reproduce and evaluate the performance boost. Further details can be found in the reference below. Please cite the reference if you use this script or its results in your work.

Requirements

spaCy
Python 3

Data

Unlabeled text corpora

Labeled datasets for German named entity recognition
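For labeled data, the tokens have to be processed without disturbing their labels. The following is a minimal sketch under stated assumptions, not the repository's actual code: it assumes a token-per-line layout with the token in the first column, tab-separated label columns, and blank lines between sentences. The function name `lemmatize_labeled` and the toy lemmatizer are hypothetical.

```python
from typing import Callable, Iterable, List


def lemmatize_labeled(
    lines: Iterable[str],
    lemmatize: Callable[[str], str],
    sep: str = "\t",
) -> List[str]:
    """Lemmatize the first column of each line and keep all other columns."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            out.append("")  # keep sentence boundaries
            continue
        columns = line.split(sep)
        columns[0] = lemmatize(columns[0])  # only the token column changes
        out.append(sep.join(columns))
    return out


# Hypothetical toy lemmatizer, for illustration only.
toy = {"besuchte": "besuchen", "Universitäten": "Universität"}

sample = [
    "Goethe\tB-PER",
    "besuchte\tO",
    "Universitäten\tO",
    "",
    "Frankfurt\tB-LOC",
]
result = lemmatize_labeled(sample, lambda tok: toy.get(tok, tok))
print("\n".join(result))
```

Because only the first column is rewritten, the number of lines and the label sequence stay identical to the input, so the processed file can be used with the same training setup as the original.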

Cite

Sajawel Ahmed and Alexander Mehler, "Resource-Size matters: Improving Neural Named Entity Recognition with Optimized Large Corpora" in Proceedings of the 17th IEEE International Conference on Machine Learning and Applications (ICMLA), 2018. [PDF]

BibTeX

@InProceedings{Ahmed:Mehler:2018,
author		= {Sajawel Ahmed and Alexander Mehler},
title		= {{Resource-Size matters: Improving Neural Named Entity Recognition with Optimized Large Corpora}},
booktitle	= {Proceedings of the 17th IEEE International Conference on Machine Learning and Applications (ICMLA)},
location	= {Orlando, Florida, USA},
pdf		= {https://arxiv.org/pdf/1807.10675.pdf},
year		= 2018
}
