lda.Rmd

---
title: "Topic Modeling with R"
author: "Wouter van Atteveldt"
date: "June 3, 2016"
output: pdf_document
---

```{r, include=FALSE}
library(knitr)
opts_chunk$set(fig.path = "figures_lda/")
head = function(...) knitr::kable(utils::head(...))
```

Topic modelling techniques such as Latent Dirichlet Allocation (LDA) can be a usefull tool for social scientists to analyze large amounts of natural language data. Algorithms for LDA are available in R, for instance in the `topicmodels` package. In this howto we demonstrate several function in the `corpustools` package that facilitate the use of LDA using the `topicmodels` package.

As a starting point we use a Document Term Matrix (dtm) in the `DocumentTermMatrix` format offered in the `tm` package. Note that we also offer a howto for creating the dtm. 

```{r, warning=F, message=F}
library(corpustools)
data(sotu) # state of the union speeches by Barack Obama and George H. Bush.
head(sotu.tokens)
dtm = dtm.create(documents=sotu.tokens$aid, terms=sotu.tokens$lemma, filter=sotu.tokens$pos1 %in%  c('N','M','A'))
dtm
```

Not all terms are equally informative of the underlying semantic structures of texts, and some terms are rather useless for this purpose. For interpretation and computational purposes it is worthwhile to delete some of the less usefull words from the dtm before fitting the LDA model.  As seen from the red message lines above, the dtm.create automatically uses some filtering of terms, but it can be good to customize this for your research.

Now we are ready to fit the model! We made a wrapper called `lda.fit` for the `LDA` function in the `topicmodels` package. This wrapper doesn't do anything interesting, except for deleting empty columns/rows from the dtm, which can occur after filtering out words. 

The main input for `topmod.lda.fit` is:
- the document term matrix
- K: the number of topics (this has to be defined a priori)
- Optionally, it can be usefull to increase the number of iterations. This takes more time, but increases performance.

```{r}
set.seed(12345)
m = lda.fit(dtm, K=20, num.iterations=100)
terms(m, 10)[,1:5] # show first 5 topics, with ten top words per topic
```

We now have a fitted lda model. The `terms` function shows the most prominent words for each topic (we only selected the first 5 topics for convenience). 

Visualizing LDA models
===

The easiest way to visualize an LDA model is using the LDAvis package:

```{r, eval=F}
library(LDAvis)
serVis(ldavis_json(m, dtm))
```

(run interactively to see results)

Another way to visualize LDA results is to plot word use and metadata about the indivual topics:
E.g., how much attention they get over time, how much they are used by different sources (e.g., people, newspapers, organizations).
To do so, we first need to match the metadata to the documents in the model to make sure they are in the same order:

```{r}
head(sotu.meta)
meta = sotu.meta[match(m@documents, sotu.meta$id),]
```

We can now do some plotting. First, we can make a wordcloud for a more fancy (and actually quite informative and intuitive) representation of the top words of a topic.

```{r, message=F, warning=FALSE, fig.width=6, fig.height=6}
lda.plot.wordcloud(m, topic_nr=1)
```
```{r, eval=F}
lda.plot.wordcloud(m, topic_nr=2)
```

With `lda.plot.time` and `lda.plot.category`, we can plot the salience of the topic over time and for a given categorical variable.

```{r, eval=F}
lda.plot.time(m, 1, meta$date, date_interval='year')
lda.plot.category(m, 1, meta$headline)
```

It can be usefull to print all this information together. That is what the following function does.

```{r, message=F, warning=FALSE, fig.width=10, fig.height=10}
lda.plot.topic(m, 1, meta$date, meta$headline, date_interval='year')
```
```{r, eval=F}
lda.plot.topic(m, 2, meta$date, meta$headline, date_interval='year')
```

For further substantive analysis, we can also create a data frame containing the topic proportion for each document:

```{r}
docs = topics.per.document(m)
docs = merge(meta, docs)
head(docs)
```

Calculating perplexity
===

Although in the end the best guide to determining the amount and interpretation of topics is expert judgment,
it can be useful to plot the 'perplexity' (model error) of various settings for K (and alpha).

The first step is to create separate data for fitting the model and validating it to avoid overfitting,
basically using a split-half technique.
Although it is easy to subset the dtm based on a sample of row names (ids),
care must be taken that there are no zero-only rows and columns, and that the vocabulary (colnames) of the validation dtm
matches that of the dtm used to fit the model: 

```{r}
ids = rownames(dtm)
fit_ids = sample(ids, length(ids) / 2)
dtm.subset <- function(dtm, rows, cols=colnames(dtm)) {
  dtm = dtm[rownames(dtm) %in% rows, colnames(dtm) %in% cols]
  dtm = dtm[row_sums(dtm) > 0, col_sums(dtm) > 0]
  weightTf(dtm)
}
dtm_fit = dtm.subset(dtm, fit_ids)
dtm_validate = dtm.subset(dtm, setdiff(ids,  fit_ids), colnames(dtm_fit))
```

Now, we can calculate the perplexity by fitting a number of models for each K,
adding each to a data frame of perplexity scores:

```{r}
perplex = NULL
for (k in seq(10, 50, by=10)) {
  for (i in 1:2) {
    m = lda.fit(dtm_fit, K=k, alpha=.1)
    p = perplexity(m, dtm_validate)
    perplex = rbind(perplex, c(k=k, i=i, p=p))
  }
}
head(perplex)
```


Now we can plot the average perplexity per K in a scree plot:

```{r}
perplex = as.data.frame(perplex)
p = aggregate(perplex["p"], perplex["k"], mean)
library(ggplot2)
ggplot(p, aes(x=k, y=p, )) + geom_line()  +geom_point()
```

This suggests that it could be interesting to inspect the region around k=10 better, as perplexity actually increased going to k=20.
Perplexity is also still decreasing at k=50, but there seems to be an elbow point at k=40.
Note that normally you should use more iterations per k and also test intermediate k values,
increasing computational complexity especially for very large datasets.