There are several ways to implement LDA topic modeling; in this case the gensim package is used alongside spaCy and some data-cleaning packages.
Documentation for the main packages is available at the following links:
- Gensim: https://radimrehurek.com/gensim/
- spaCy: https://spacy.io/api/doc
- nltk: https://www.nltk.org/
Gensim is a fast and reliable package for topic modeling, but its results may be less accurate than desired when bigrams and trigrams are not captured, so gensim.models.Phrases is used to build bigram and trigram models.
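The snippets below operate on `data_words`, a list of token lists. As a minimal sketch of how it might be built (the `tweets` list here is a hypothetical placeholder, not part of the original project):

```python
import gensim
from gensim.utils import simple_preprocess

# Hypothetical sample input; in the project this would be the collected tweets.
tweets = ["Loving the new phone, battery life is great!",
          "Traffic was terrible on the way to work this morning"]

# simple_preprocess lowercases, tokenizes, and drops very short/long tokens
data_words = [simple_preprocess(str(tweet), deacc=True) for tweet in tweets]
```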
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
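As a quick illustration of what the phrase models do (the tokens here are hypothetical; actual joins depend on the corpus and on the min_count/threshold settings):

```python
# If "new" and "york" co-occurred often enough during training, the bigram
# model emits them as a single token; otherwise the tokens pass through unchanged.
print(bigram_mod[['new', 'york', 'is', 'big']])
# e.g. ['new_york', 'is', 'big']
```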
# !python3 -m spacy download en_core_web_sm  # run once in a terminal (the bare 'en' shortcut is deprecated)
import spacy
from nltk.corpus import stopwords  # nltk is listed above as a dependency
# nltk.download('stopwords')  # run once if the stopword list is missing
stop_words = stopwords.words('english')  # assumed: nltk's English stopword list

def process_words(texts, stop_words=stop_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """Remove stopwords, form bigrams and trigrams, and lemmatize."""
    # remove stopwords
    texts = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]
    # bigram_mod joins frequent word pairs; trigram_mod then joins frequent bigram+word pairs
    texts = [trigram_mod[bigram_mod[doc]] for doc in texts]
    texts_out = []
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    # remove stopwords once more after lemmatization
    texts_out = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts_out]
    return texts_out
data_ready = process_words(data_words) # processed Text Data!
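A quick sanity check on the processed corpus (the output depends entirely on the input tweets):

```python
# The first processed tweet: lemmatized, stopword-free tokens, possibly
# including joined bigrams/trigrams such as "battery_life".
print(data_ready[0][:10])
```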
One of the important goals of this project was obtaining the topics for each individual tweet, so the document-term matrix is passed back through the trained model to return all topics associated with a single tweet (see the sketch after the code below). Choosing an optimal number of topics is another important concern; here gensim.models.coherencemodel.CoherenceModel is used inside a loop to find the best number of topics.
from gensim import corpora

LDA = gensim.models.ldamodel.LdaModel
# build the dictionary from the processed texts so its ids match the bag-of-words corpus
dict_ = corpora.Dictionary(data_ready)
doc_term_matrix = [dict_.doc2bow(doc) for doc in data_ready]

coherence = []
for k in range(2, 6):
    ldaModel = LDA(doc_term_matrix, num_topics=k, id2word=dict_, chunksize=1000,
                   passes=1, random_state=0, update_every=1, eval_every=None)
    cm = gensim.models.coherencemodel.CoherenceModel(model=ldaModel, texts=data_ready,
                                                     dictionary=dict_, coherence='c_v')
    coherence.append((k, cm.get_coherence()))

x_val = [x[0] for x in coherence]  # candidate numbers of topics
y_val = [x[1] for x in coherence]  # their coherence scores
maxCoh = max(y_val)
NumOfTopics = 0
for i in range(len(y_val)):
    if y_val[i] == maxCoh:
        NumOfTopics = x_val[i]  # x_val holds k itself, which starts at 2
print(NumOfTopics)
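With the number of topics selected, the per-tweet topic distributions mentioned above can be read off a final model. A minimal sketch (the variable names below are illustrative):

```python
# Train a final model with the chosen number of topics, then query the
# topic mixture of each tweet's bag-of-words representation.
final_model = LDA(doc_term_matrix, num_topics=NumOfTopics, id2word=dict_,
                  chunksize=1000, passes=1, random_state=0)

for i, bow in enumerate(doc_term_matrix[:3]):  # first three tweets
    # (topic_id, probability) pairs for this tweet
    print(i, final_model.get_document_topics(bow, minimum_probability=0.0))
```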
For a more detailed explanation, visit this page.
Alpha and beta are two parameters that have not been discussed so far. As explained in this topic, a low alpha value places more weight on having each document composed of only a few dominant topics (whereas a high value will return many more relatively dominant topics). Similarly, a low beta value places more weight on having each topic composed of only a few dominant words.
For more information on alpha and beta, visit this page.
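Note that gensim's LdaModel exposes the beta parameter under the name eta. A minimal sketch of setting both explicitly (the values are illustrative, not tuned for this data):

```python
# Low alpha: each tweet is dominated by only a few topics.
# Low eta (gensim's name for beta): each topic is dominated by only a few words.
lda_sparse = LDA(doc_term_matrix, num_topics=NumOfTopics, id2word=dict_,
                 alpha=0.01, eta=0.01, random_state=0)
```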
Python 3.6+ is required, along with the packages linked above: gensim, spaCy, and nltk.
Latent Dirichlet allocation is described in Blei et al. (2003) and Pritchard et al. (2000).
- Documentation: https://radimrehurek.com/gensim/
- Source code: https://github.com/Mohi294/LDA-topic-modeling/
- Issue tracker: https://github.com/Mohi294/LDA-topic-modeling/issues
Other implementations of LDA include:
- scikit-learn's LatentDirichletAllocation (uses online variational inference)
- lda
Gensim is licensed under the OSI-approved GNU LGPLv2.1 license.