topic-modelling-lda

A setup for topic modelling with LDA on a German dataset.

Dataset

The data that was used for the topic modelling task: https://github.com/tblock/10kGNAD

The dataset contains 10k news articles from an Austrian online news platform. Each article is assigned to one specific category (out of 9). Since I am using an unsupervised topic moelling method here, I will only be using the article texts.

Before pre-processing the data I removed the classification tags and proceeded with the texts only.

Pre-processing

In the pre-processing step I am removing noise first, before I normalize the text by applying lemmatization, transforming to lower cases and removing stopwords. After tokenizing the text, one can choose to create either a bag of words or a tf-idf-model for further processing. In my experiments I consitently ended up with higher coherence values using the bag of words method.

Model

In this setup I am using the LDA model only. To find the optimal model, I loop over a range of possible topic numbers and choose the one with the highest coherence score.

Results for the best setup

The coherence score for the best model is 0.54757210059955.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
img		img
README.md		README.md
model_training.py		model_training.py
pre_processing.py		pre_processing.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

topic-modelling-lda

Dataset

Pre-processing

Model

Results for the best setup

About

Releases

Packages

Languages

agademic/topic-modelling-lda

Folders and files

Latest commit

History

Repository files navigation

topic-modelling-lda

Dataset

Pre-processing

Model

Results for the best setup

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages