NeMig is a bilingual news collection with accompanying knowledge graphs on the topic of migration. The news corpora in German and English were collected from online media outlets in Germany and the US, respectively. NeMig contains rich textual and metadata information, sentiment and political orientation annotations, as well as named entities extracted from the articles' content and metadata and linked to Wikidata. The corresponding knowledge graphs (NeMigKG), one built from each corpus, are expanded with up to two-hop Wikidata neighbors of the initial set of linked entities.
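The neighbor expansion described above can be pictured as a breadth-first search that stops after a fixed number of hops. The following is a minimal sketch, not the project's implementation: the function name `expand_k_hop`, the toy adjacency dictionary, and the entity identifiers are all hypothetical, and a real run would query Wikidata rather than an in-memory dictionary.

```python
from collections import deque


def expand_k_hop(seed_entities, neighbors, k_hop=2):
    """Return every entity within k_hop edges of any seed (seeds included)."""
    seen = set(seed_entities)
    frontier = deque((entity, 0) for entity in seed_entities)
    while frontier:
        entity, depth = frontier.popleft()
        if depth == k_hop:
            continue  # hop limit reached, do not expand this entity further
        for neighbor in neighbors.get(entity, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen


# Toy adjacency list standing in for Wikidata (hypothetical identifiers).
toy_wikidata = {
    "Q_article_entity": ["Q_country"],
    "Q_country": ["Q_continent"],
    "Q_continent": ["Q_planet"],
}

one_hop = expand_k_hop({"Q_article_entity"}, toy_wikidata, k_hop=1)
two_hop = expand_k_hop({"Q_article_entity"}, toy_wikidata, k_hop=2)
```

With `k_hop=2` the toy example reaches `Q_continent` but not `Q_planet`, which sits three hops away from the seed.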
NeMigKG comes in four flavors, for both the German and the English corpora:
- Base NeMigKG: contains literals and entities from the corresponding annotated news corpus;
- Entities NeMigKG: derived from the Base NeMigKG by removing all literal nodes, it contains only resource nodes;
- Enriched Entities NeMigKG: derived from the Entities NeMigKG by enriching it with up to two-hop neighbors from Wikidata, it contains only resource nodes and Wikidata triples;
- Complete NeMigKG: the combination of the Base and Enriched Entities NeMigKG, it contains both literals and resources.
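The relationship between the four flavors can be expressed as set operations over triples. The sketch below is illustrative only: the `Literal` class, the `ex:` names, and the sample triples are assumptions made for the example and do not reflect the vocabulary actually used in NeMigKG.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """Marks a triple object as a literal value rather than a resource."""
    value: str


def is_literal_triple(triple):
    """True if the object position of the triple holds a literal."""
    return isinstance(triple[2], Literal)


# Base: literals and entities from the annotated corpus (made-up triples).
base = {
    ("ex:article1", "ex:title", Literal("Example headline")),
    ("ex:article1", "ex:mentions", "wd:Q183"),
}

# Triples that an expansion step might fetch from Wikidata (made-up sample).
wikidata_neighbors = {
    ("wd:Q183", "wdt:P30", "wd:Q46"),
}

# Entities: Base without any triple whose object is a literal.
entities = {t for t in base if not is_literal_triple(t)}

# Enriched Entities: Entities plus the Wikidata neighbor triples.
enriched_entities = entities | wikidata_neighbors

# Complete: union of Base and Enriched Entities.
complete = base | enriched_entities
```

Only `base` and `complete` contain the literal triple; `entities` and `enriched_entities` hold resource nodes exclusively.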
The directory structure of the project looks like this:
├── configs <- Hydra configuration files
│ ├── dataset <- Dataset configs
│ ├── entity_filtering <- Entity filtering configs
│ ├── experiment <- Experiment configs
│ ├── hydra <- Hydra configs
│ ├── kg_construction <- Knowledge graph construction configs
│ ├── kg_serialization <- Knowledge graph serialization configs
│ ├── named_entity_linking <- Named entity linking configs
│ ├── named_entity_recognition <- Named entity recognition configs
│ ├── sentiment_classification <- Sentiment classification configs
│ │
│ └── pipeline.yaml <- Main config for the pipeline
│
├── data <- Project data
│
├── logs <- Logs generated by hydra loggers
│
├── notebooks <- Jupyter notebooks
│
├── scripts <- Shell scripts
│
├── src <- Source code
│ ├── dataset <- Dataset creation and processing
│ ├── entity_filtering <- Entity filtering model
│ ├── kg_construction <- Knowledge graph construction model
│ ├── kg_serialization <- Knowledge graph serialization model
│ ├── named_entity_linking <- Named entity linking model
│ ├── named_entity_recognition <- Named entity recognition model
│ ├── sentiment_classification <- Sentiment classification model
│ ├── utils <- Utility scripts
│ │
│ └── pipeline.py <- Run pipeline
│
├── .gitignore <- List of files ignored by git
├── requirements.txt <- File for installing python dependencies
└── README.md
Install dependencies
# clone project
git clone https://github.com/andreeaiana/nemig
cd nemig
# [OPTIONAL] create conda environment
conda create -n nemig_env python=3.9
conda activate nemig_env
# install requirements
pip install -r requirements.txt
Download the mGENRE model as described in the mGENRE instructions; it is needed to run the entity linking model.
Run the pipeline with a chosen experiment configuration from configs/experiment/:
python main.py experiment=experiment_name.yaml
You can override any parameter from the command line like this:
python src/main.py language='de' kg_construction.k_hop=1
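To make the dotted override syntax concrete, the sketch below shows how keys such as `kg_construction.k_hop=1` map onto a nested configuration. Hydra performs this step itself, so this is not Hydra's implementation; `apply_overrides` and the default values are hypothetical and exist only for illustration.

```python
import ast
import copy


def apply_overrides(config, overrides):
    """Return a copy of config with 'a.b=value' style overrides applied."""
    result = copy.deepcopy(config)
    for override in overrides:
        dotted_key, raw_value = override.split("=", 1)
        try:
            value = ast.literal_eval(raw_value)  # numbers, quoted strings
        except (ValueError, SyntaxError):
            value = raw_value  # anything else is kept as a plain string
        *parents, leaf = dotted_key.split(".")
        node = result
        for key in parents:
            node = node.setdefault(key, {})  # walk or create nested levels
        node[leaf] = value
    return result


# Hypothetical defaults; the real values live in the YAML files under configs/.
defaults = {"language": "en", "kg_construction": {"k_hop": 2}}

# The shell strips the quotes from language='de', so the program sees language=de.
config = apply_overrides(defaults, ["language=de", "kg_construction.k_hop=1"])
```

The dotted key selects the `k_hop` entry inside the `kg_construction` group, while `language` is a top-level entry; `defaults` itself is left unchanged.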
Run the Subtopic Modelling notebook to extract sub-topics from the data and integrate the results into the pipeline.
The chosen version of NeMigKG will be constructed and cached in the cache folder. NeMigKG is serialized in N-Triples format, and the resulting files are placed in the kg folder.
A sample of the annotated news corpora used to construct the knowledge graphs is available in the cache folder. Due to copyright policies, this sample does not contain the body of the articles. A full version of the news corpus is available upon request.
The anonymized user data for each dataset is available in the user data folder.
NeMigKG is hosted on Zenodo. All files are gzipped and in N-Triples format.
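Gzipped N-Triples files can be streamed line by line without decompressing them to disk first. The following is a minimal sketch using only the Python standard library; the function name `count_triples`, the file name `sample.nt.gz`, and the sample triples are hypothetical. It only inspects the first character of each object, so a full RDF parser such as rdflib is the more robust choice for real processing.

```python
import gzip
import os
import tempfile


def count_triples(path):
    """Count resource-object and literal-object triples in a gzipped N-Triples file."""
    counts = {"resource": 0, "literal": 0}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            # Subject and predicate contain no whitespace in N-Triples,
            # so the object is whatever follows the second separator.
            _subject, _predicate, obj = line.split(None, 2)
            kind = "literal" if obj.startswith('"') else "resource"
            counts[kind] += 1
    return counts


# Build a tiny gzipped sample so the sketch runs without the real data.
sample = (
    '<http://example.org/a> <http://example.org/p> <http://example.org/b> .\n'
    '<http://example.org/a> <http://example.org/title> "Example"@en .\n'
)
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.nt.gz")
    with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
        handle.write(sample)
    counts = count_triples(sample_path)
```

Run against a file from one of the resource-only flavors, the `literal` count would be expected to be zero.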
A sample of the triple files can be found in the kg folder. Due to copyright policies, these samples do not contain the body of the news articles.
The code is licensed under the MIT License. The data files are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
If you use the dataset, please cite:
@dataset{iana_andreea_2022_7908392,
author = {Iana, Andreea and
Alam, Mehwish and
Grote, Alexander and
Nikolajevic, Nevena and
Ludwig, Katharina and
Müller, Philipp and
Weinhardt, Christof and
Paulheim, Heiko},
title = {{NeMig - A Bilingual News Collection and Knowledge
Graph about Migration}},
month = dec,
year = 2022,
publisher = {Zenodo},
version = {v1.0.1},
doi = {10.5281/zenodo.7442424},
url = {https://doi.org/10.5281/zenodo.7442424}
}