---
title: "Corpus analysis: the document-term matrix"
output: html_document
---

Corpus analysis: the document-term matrix
=========================================

(C) 2015 Wouter van Atteveldt, license: [CC-BY-SA]

The most important object in frequency-based text analysis is the document-term matrix. This matrix contains the documents in the rows and the terms (words) in the columns, and each cell holds the frequency of that term in that document.
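To make this structure concrete, here is a minimal hand-built sketch (the toy documents, terms, and the dtm_toy name are made up for illustration; only base R is used):

docs = c("doc1", "doc2")
words = c("chicken", "bird", "eat")
# rows are documents, columns are terms, cells are term frequencies
dtm_toy = matrix(c(1, 1, 0,
                   0, 1, 1),
                 nrow=2, byrow=TRUE, dimnames=list(Docs=docs, Terms=words))
dtm_toy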

In R, these matrices are provided by the tm (text mining) package. Although this package provides many functions for loading and manipulating these matrices, using them directly is relatively complicated.

Fortunately, the RTextTools package provides an easy function to create a document-term matrix from a data frame. To create a document-term matrix from the 'text' column of a simple data frame, use the create_matrix function (with removeStopwords=F to make sure all words are kept):

library(RTextTools)
input = data.frame(text=c("Chickens are birds", "The bird eats"))
m = create_matrix(input$text, removeStopwords=F)

We can inspect the resulting matrix m using the regular R functions to get e.g. the type of object and the dimensionality:

class(m)
## [1] "DocumentTermMatrix"    "simple_triplet_matrix"
dim(m)
## [1] 2 6
m
## <<DocumentTermMatrix (documents: 2, terms: 6)>>
## Non-/sparse entries: 6/6
## Sparsity           : 50%
## Maximal term length: 8
## Weighting          : term frequency (tf)

So, m is a DocumentTermMatrix, which is derived from a simple_triplet_matrix as provided by the slam package. Internally, document-term matrices are stored as a sparse matrix: with real data we can easily have hundreds of thousands of rows and columns, while the vast majority of cells will be zero (most words don't occur in most documents). Storing this as a regular matrix would waste a lot of memory. In a sparse matrix, only the non-zero entries are stored, as 'simple triplets' of (document, term, frequency).
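As a quick sketch of what these triplets look like, the i, j, and v components of the sparse matrix can be inspected directly (this peeks at slam's internal representation, so treat it as an illustration rather than a stable API):

# i = document index, j = term index, v = frequency of that term in that document
data.frame(doc=m$i, term=dimnames(m)$Terms[m$j], freq=m$v)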

As seen in the output of dim, our matrix has only 2 rows (documents) and 6 columns (unique words). Since this is a fairly small matrix, we can visualize it using as.matrix, which converts the 'sparse' matrix into a regular matrix:

as.matrix(m)
##     Terms
## Docs are bird birds chickens eats the
##    1   1    0     1        1    0   0
##    2   0    1     0        0    1   1

Stemming and stop word removal
------------------------------

So, we can see that each word is kept as is. We can reduce the size of the matrix by dropping stop words and by stemming, i.e. reducing a word like 'chickens' to its base form or stem 'chicken' (see the create_matrix documentation for the full range of options):

m = create_matrix(input$text, removeStopwords=T, stemWords=T, language='english')
dim(m)
## [1] 2 3
as.matrix(m)
##     Terms
## Docs bird chicken eat
##    1    1       1   0
##    2    1       0   1

As you can see, the stop words ('the' and 'are') are removed, while the singular and plural forms of 'bird' are joined into a single stem.

In RTextTools, the language for stemming and stop words can be given as a parameter, and the default is English. Note that stemming works relatively well for English, but is less useful for more highly inflected languages such as Dutch or German. An easy way to see the effects of the preprocessing is by looking at the colSums of a matrix, which gives the total frequency of each term:

colSums(as.matrix(m))
##    bird chicken     eat 
##       2       1       1

For more richly inflected languages like Dutch, the result is less promising:

text = c("De kip eet", "De kippen hebben gegeten")
m = create_matrix(text, removeStopwords=T, stemWords=T, language="dutch")
colSums(as.matrix(m))
##   eet geget   kip  kipp 
##     1     1     1     1

As you can see, de and hebben are correctly recognized as stop words, but gegeten (eaten) and kippen (chickens) have a different stem than eet (eat) and kip (chicken). German gets similarly bad results.

Getting a document-term matrix from AmCAT
-----------------------------------------

AmCAT can automatically lemmatize text. Before we can use it, we need to connect with a valid username and password:

library(amcatr)
conn = amcat.connect("http://preview.amcat.nl")

Now, we can use the amcat.gettokens function to tokenize and lemmatize a piece of text:

sentence = "Chickens are birds. The bird eats"
t = amcat.gettokens(conn, sentence=as.character(sentence), module="corenlp_lemmatize")
## GET http://preview.amcat.nl/api/v4/tokens/?module=corenlp_lemmatize&page_size=1&format=csv&sentence=Chickens%20are%20birds.%20The%20bird%20eats
t
##       word sentence pos   lemma offset aid id pos1
## 1 Chickens        1 NNS chicken      0  NA  1    N
## 2      are        1 VBP      be      9  NA  2    V
## 3    birds        1 NNS    bird     13  NA  3    N
## 4        .        1   .       .     18  NA  4    .
## 5      The        2  DT     the     20  NA  5    D
## 6     bird        2  NN    bird     24  NA  6    N
## 7     eats        2 VBZ     eat     29  NA  7    V

As you can see, this provides real-time lemmatization and Part-of-Speech tagging using the Stanford CoreNLP toolkit: 'are' is recognized as V(erb) and has lemma 'be'. To create a term-document matrix from a list of tokens, we can use the dtm.create function. Since the token list is a regular R data frame, we can use normal selection to e.g. select only the verbs and nouns:

library(corpustools)
dtm = dtm.create(documents=t$sentence, terms=t$lemma, filter=t$pos1 %in% c('V', 'N'), minfreq=0)
as.matrix(dtm)
##     Terms
## Docs chicken bird eat
##    1       1    1   0
##    2       0    1   1

Loading and analysing a larger dataset from AmCAT
-------------------------------------------------

Normally, rather than ask for a single ad hoc text to be parsed, we would upload a selection of articles to AmCAT, after which we can call the analysis for all text at once. This can be done from R using the amcat.upload.articles function, which we now demonstrate with a single article but which can also be used to upload many articles at once:

articles = data.frame(text = "John is a great fan of chickens, and so is Mary", date="2001-01-01", headline="test")

aset = amcat.upload.articles(conn, project = 1, articleset="Test Florence", medium="test", 
                             text=articles$text, date=articles$date, headline=articles$headline)
## Created articleset 18317: Test Florence in project 1
## Uploading 1 articles to set 18317

And we can then lemmatize this article and download the results directly to R using amcat.gettokens:

amcat.gettokens(conn, project=1, articleset = aset, module = "corenlp_lemmatize")
## GET http://preview.amcat.nl/api/v4/projects/1/articlesets/18317/tokens/?page=1&module=corenlp_lemmatize&page_size=1&format=csv
## GET http://preview.amcat.nl/api/v4/projects/1/articlesets/18317/tokens/?page=2&module=corenlp_lemmatize&page_size=1&format=csv
##        word sentence pos   lemma offset       aid id pos1
## 1      John        1 NNP    John      0 114440106  1    M
## 2        is        1 VBZ      be      5 114440106  2    V
## 3         a        1  DT       a      8 114440106  3    D
## 4     great        1  JJ   great     10 114440106  4    A
## 5       fan        1  NN     fan     16 114440106  5    N
## 6        of        1  IN      of     20 114440106  6    P
## 7  chickens        1 NNS chicken     23 114440106  7    N
## 8         ,        1   ,       ,     31 114440106  8    .
## 9       and        1  CC     and     33 114440106  9    C
## 10       so        1  RB      so     37 114440106 10    B
## 11       is        1 VBZ      be     40 114440106 11    V
## 12     Mary        1 NNP    Mary     43 114440106 12    M

And we can see that e.g. for "is" the lemma "be" is given. Note that the words are not in order, and the two occurrences of "is" are automatically summed. This can be switched off by giving drop=NULL as extra argument.
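As a sketch, the same call with that extra argument would look like this (the drop argument is taken from the note above):

t = amcat.gettokens(conn, project=1, articleset=aset, module="corenlp_lemmatize", drop=NULL)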

For a more serious application, we will use an existing article set: set 16017 in project 559, which contains the State of the Union speeches by Bush and Obama (each document is a single paragraph). The analysed tokens for this set can be downloaded with the following command:

sotu.tokens = amcat.gettokens(conn, project=559, articleset = 16017, module = "corenlp_lemmatize", page_size = 100)

This data is also available directly from the semnet package:

data(sotu)
nrow(sotu.tokens)
## [1] 91473
head(sotu.tokens, n=20)
##          word sentence  pos      lemma offset       aid id pos1 freq
## 1          It        1  PRP         it      0 111541965  1    O    1
## 2          is        1  VBZ         be      3 111541965  2    V    1
## 3         our        1 PRP$         we      6 111541965  3    O    1
## 4  unfinished        1   JJ unfinished     10 111541965  4    A    1
## 5        task        1   NN       task     21 111541965  5    N    1
## 6          to        1   TO         to     26 111541965  6    ?    1
## 7     restore        1   VB    restore     29 111541965  7    V    1
## 8         the        1   DT        the     37 111541965  8    D    1
## 9       basic        1   JJ      basic     41 111541965  9    A    1
## 10    bargain        1   NN    bargain     47 111541965 10    N    1
## 11       that        1  WDT       that     55 111541965 11    D    1
## 12      built        1  VBD      build     60 111541965 12    V    1
## 13       this        1   DT       this     66 111541965 13    D    1
## 14    country        1   NN    country     71 111541965 14    N    1
## 15          :        1    :          :     78 111541965 15    .    1
## 16        the        1   DT        the     80 111541965 16    D    1
## 17       idea        1   NN       idea     84 111541965 17    N    1
## 18       that        1   IN       that     89 111541965 18    P    1
## 19         if        1   IN         if     94 111541965 19    P    1
## 20        you        1  PRP        you     97 111541965 20    O    1

As you can see, the result is similar to the ad-hoc lemmatized tokens, but now we have around 90 thousand tokens rather than a handful. We can create a document-term matrix using the same commands as above, restricting ourselves to nouns, names, and adjectives:

t = sotu.tokens[sotu.tokens$pos1 %in% c("N", 'M', 'A'), ]
dtm = dtm.create(documents=t$aid, terms=t$lemma)
dtm
## <<DocumentTermMatrix (documents: 1090, terms: 1038)>>
## Non-/sparse entries: 20113/1111307
## Sparsity           : 98%
## Maximal term length: 14
## Weighting          : term frequency (tf)

So, we now have a "sparse" matrix of around 1,100 documents by around 1,000 terms. Sparse here means that only the non-zero entries are kept in memory, because otherwise we would have to keep all of the more than one million cells in memory (and this is a relatively small data set). Thus, it might not be a good idea to use functions like as.matrix or colSums on such a matrix, since these functions convert the sparse matrix into a regular matrix. The next section investigates a number of useful functions to deal with (sparse) document-term matrices.
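As a small sketch of working within the sparse representation, the slam package (also used in the next section) provides row_sums and col_sums; for example, the number of selected tokens per document can be computed without converting to a dense matrix:

library(slam)
# total term frequency per document, computed on the sparse matrix
doc_lengths = row_sums(dtm)
head(doc_lengths)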

Corpus analysis: word frequency
-------------------------------

What are the most frequent words in the corpus? As shown above, we could use the built-in colSums function, but this requires first casting the sparse matrix to a regular matrix, which we want to avoid (even our relatively small dataset already has more than a million cells!). However, we can use the col_sums function from the slam package, which provides the same functionality for sparse matrices:

library(slam)
freq = col_sums(dtm)
# sort the list by reverse frequency using built-in order function:
freq = freq[order(-freq)]
head(freq, n=10)
##  America     year   people      new      job     more american  country 
##      409      385      327      259      256      255      239      228 
##    world      tax 
##      198      181

As can be seen, the most frequent terms are America and recurring issues like jobs and taxes. It can be useful to compute different metrics per term, such as term frequency, document frequency (in how many documents it occurs), and tf.idf (term frequency * inverse document frequency, which weights down terms that occur in many documents). The function term.statistics from the corpustools package provides this functionality:

terms = term.statistics(dtm)
terms = terms[order(-terms$termfreq), ]
head(terms, 10)
##              term characters number nonalpha termfreq docfreq reldocfreq
## America   America          7  FALSE    FALSE      409     346    0.31743
## year         year          4  FALSE    FALSE      385     286    0.26239
## people     people          6  FALSE    FALSE      327     277    0.25413
## new           new          3  FALSE    FALSE      259     206    0.18899
## job           job          3  FALSE    FALSE      256     190    0.17431
## more         more          4  FALSE    FALSE      255     198    0.18165
## american american          8  FALSE    FALSE      239     210    0.19266
## country   country          7  FALSE    FALSE      228     202    0.18532
## world       world          5  FALSE    FALSE      198     156    0.14312
## tax           tax          3  FALSE    FALSE      181     102    0.09358
##           tfidf
## America  0.1042
## year     0.1181
## people   0.1233
## new      0.1314
## job      0.1692
## more     0.1331
## american 0.1444
## country  0.1329
## world    0.1712
## tax      0.2604

As you can see, for each word the total frequency and the relative document frequency are listed, as well as some basic information on the number of characters and the occurrence of numerals or non-alphanumeric characters. This allows us to create a 'common sense' filter to reduce the number of terms, for example removing all words containing a number or punctuation mark, and all short (characters<=2), infrequent (termfreq<25), and overly frequent (reldocfreq>=.25) words:

subset = terms[!terms$number & !terms$nonalpha & terms$characters>2 & terms$termfreq>=25 & terms$reldocfreq<.25, ]
nrow(subset)
## [1] 239
head(subset, n=10)
##                term characters number nonalpha termfreq docfreq reldocfreq
## new             new          3  FALSE    FALSE      259     206    0.18899
## job             job          3  FALSE    FALSE      256     190    0.17431
## more           more          4  FALSE    FALSE      255     198    0.18165
## american   american          8  FALSE    FALSE      239     210    0.19266
## country     country          7  FALSE    FALSE      228     202    0.18532
## world         world          5  FALSE    FALSE      198     156    0.14312
## tax             tax          3  FALSE    FALSE      181     102    0.09358
## Americans Americans          9  FALSE    FALSE      179     158    0.14495
## nation       nation          6  FALSE    FALSE      171     150    0.13761
## Congress   Congress          8  FALSE    FALSE      168     149    0.13670
##            tfidf
## new       0.1314
## job       0.1692
## more      0.1331
## american  0.1444
## country   0.1329
## world     0.1712
## tax       0.2604
## Americans 0.1609
## nation    0.1578
## Congress  0.1523

This seems to be a more useful set of words. We now have 239 terms left of the original 1038. To create a new document-term matrix with only these terms, we can use the dtm.filter function:

dtm_filtered = dtm.filter(dtm, terms=subset$term)
dim(dtm_filtered)
## [1] 1086  239

Which yields a much more manageable dtm. As a bonus, we can use the dtm.wordcloud function in corpustools (which is a thin wrapper around the wordcloud package) to visualize the top words as a word cloud:

dtm.wordcloud(dtm_filtered)

[word cloud of the most frequent terms in dtm_filtered]

Comparing corpora
-----------------

Another useful thing we can do is comparing two corpora: which words or names are mentioned more in e.g. Bush's speeches than in Obama's?

To do this, we split the dtm into separate dtms for Bush and Obama. For this, we select document ids using the headline column in the metadata from sotu.meta, and then use the dtm.filter function:

head(sotu.meta)
##          id   medium     headline       date
## 1 111541965 Speeches Barack Obama 2013-02-12
## 2 111541995 Speeches Barack Obama 2013-02-12
## 3 111542001 Speeches Barack Obama 2013-02-12
## 4 111542006 Speeches Barack Obama 2013-02-12
## 5 111542013 Speeches Barack Obama 2013-02-12
## 6 111542018 Speeches Barack Obama 2013-02-12
obama.docs = sotu.meta$id[sotu.meta$headline == "Barack Obama"]
dtm.obama = dtm.filter(dtm, documents=obama.docs)
bush.docs = sotu.meta$id[sotu.meta$headline == "George W. Bush"]
dtm.bush = dtm.filter(dtm, documents=bush.docs)

So how can we check which words are more frequent in Bush's speeches than in Obama's? The function corpora.compare provides this functionality, given two document-term matrices:

cmp = corpora.compare(dtm.obama, dtm.bush)
cmp = cmp[order(cmp$over), ]
head(cmp)
##          term termfreq.x termfreq.y relfreq.x relfreq.y   over   chi
## 939    terror          1         55 8.932e-05  0.004611 0.1942 48.87
## 941 terrorist         13        103 1.161e-03  0.008634 0.2243 64.63
## 389   freedom          8         79 7.145e-04  0.006623 0.2249 53.79
## 507     iraqi          3         49 2.680e-04  0.004108 0.2482 37.95
## 311     enemy          4         52 3.573e-04  0.004359 0.2533 38.29
## 506      Iraq         15         94 1.340e-03  0.007880 0.2635 52.66

For each term, this data frame contains the frequency in the 'x' and 'y' corpora (here, Obama and Bush). It also gives the relative frequency in these corpora (normalizing for total corpus size), the overrepresentation in the 'x' corpus ('over'), and the chi-squared value for that overrepresentation. So, Bush used the word terrorist 103 times, while Obama used it only 13 times, and in relative terms Bush used it about four times as often, which is highly significant.
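As a small usage sketch, the same data frame can also be sorted on the chi-squared value to list the most distinctive terms regardless of which corpus overuses them:

# terms with the largest chi-squared value, i.e. the most distinctive terms overall
head(cmp[order(-cmp$chi), ])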

Which words did Obama use most compared to Bush?

cmp = cmp[order(cmp$over, decreasing=T), ]
head(cmp)
##          term termfreq.x termfreq.y relfreq.x relfreq.y  over   chi
## 175   company         54          6  0.004823 5.030e-04 3.874 41.65
## 522       kid         31          0  0.002769 0.000e+00 3.769 33.07
## 72       bank         29          0  0.002590 0.000e+00 3.590 30.94
## 484  industry         32          1  0.002858 8.383e-05 3.560 31.20
## 368 financial         33          2  0.002947 1.677e-04 3.381 29.53
## 166   college         55          9  0.004912 7.545e-04 3.370 36.18

So, while Bush talks about terror, freedom, and Iraq, Obama talks more about companies, banks, and colleges.

Let's make a word cloud of Obama's words, with size indicating the chi-squared overrepresentation:

obama = cmp[cmp$over > 1,]
dtm.wordcloud(terms = obama$term, freqs = obama$chi)

[word cloud of Obama's overrepresented terms]

And Bush:

bush = cmp[cmp$over < 1,]
dtm.wordcloud(terms = bush$term, freqs = bush$chi)

[word cloud of Bush's overrepresented terms]

Note that the warnings given by these commands are relatively harmless: they mean that some words are skipped because no good place could be found for them in the word cloud.
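If the warnings clutter the output, one option is to wrap the call in base R's suppressWarnings, as in this sketch:

# silence the word-placement warnings from the wordcloud package
suppressWarnings(dtm.wordcloud(terms = bush$term, freqs = bush$chi))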