---
title: "Lexical Sentiment Analysis"
author: "Wouter van Atteveldt"
date: "June 3, 2016"
output: pdf_document
---
```{r, echo=FALSE}
head = function(...) knitr::kable(utils::head(...))
```
Dictionaries of positive and negative terms can be used for sentiment analysis,
on the assumption that a document with many positive terms expresses a more positive sentiment.
Lexicon and data
===
For this exercise, we will use the Pittsburgh sentiment dictionary and the Amazon automotive reviews
as described in the 'Getting Sentiment Resources' hand-out.
These files can be directly downloaded from [http://rawgit.com/vanatteveldt/learningr/master/data/reviews.rds](http://rawgit.com/vanatteveldt/learningr/master/data/reviews.rds) (reviews) and [http://rawgit.com/vanatteveldt/learningr/master/data/lexicon.rds](http://rawgit.com/vanatteveldt/learningr/master/data/lexicon.rds) (lexicon).
```{r}
reviews = readRDS("data/reviews.rds")
lexicon = readRDS("data/lexicon.rds")
```
Applying Sentiment Dictionary to DTM
===
You can directly apply a dictionary to the document term matrix
by summing the columns that match each category.
First, we create the document term matrix consisting of all reviews:
```{r, message=FALSE}
library(RTextTools)
# Combine the summary and the full review text of each review into one document
dtm = create_matrix(reviews[c("summary", "reviewText")], language="english", stemWords=TRUE)
```
And we select the words that are in the negative or positive category:
```{r}
pos_words = lexicon$word1[lexicon$priorpolarity == "positive"]
neg_words = lexicon$word1[lexicon$priorpolarity == "negative"]
```
Now, we use these words to subset the DTM, and use `row_sums` to sum the counts of all words in the same category for each document:
```{r}
library(slam)
reviews$npos = row_sums(dtm[, colnames(dtm) %in% pos_words])
reviews$nneg = row_sums(dtm[, colnames(dtm) %in% neg_words])
```
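To make the column-summing step concrete, here is a minimal sketch on invented toy data. The toy lexicon only borrows the column names `word1` and `priorpolarity` from the real lexicon; all `toy_` names and values are made up for illustration, and a dense base R matrix with `rowSums` stands in for the sparse DTM and `row_sums`.

```{r}
# Toy lexicon (invented) with the same two columns used above
toy_lexicon = data.frame(word1 = c("good", "great", "bad", "awful"),
                         priorpolarity = c("positive", "positive", "negative", "negative"),
                         stringsAsFactors = FALSE)
toy_pos = toy_lexicon$word1[toy_lexicon$priorpolarity == "positive"]
toy_neg = toy_lexicon$word1[toy_lexicon$priorpolarity == "negative"]

# Toy document-term matrix (invented): rows are documents, columns are terms
toy_dtm = matrix(c(2, 0, 1, 0,
                   0, 1, 0, 3,
                   0, 0, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2", "doc3"),
                                 c("good", "bad", "great", "car")))

# Keep only the columns in each category and sum them per document;
# drop = FALSE keeps a matrix even if only one column matches
toy_npos = rowSums(toy_dtm[, colnames(toy_dtm) %in% toy_pos, drop = FALSE])
toy_nneg = rowSums(toy_dtm[, colnames(toy_dtm) %in% toy_neg, drop = FALSE])
toy_npos
toy_nneg
```

Matching with `%in%` is exact, so a stemmed column such as `happi` would not match a lexicon entry `happy`.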
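Finally, we can calculate a sentiment score, for example the number of positive words minus the number of negative words, normalized by the total number of sentiment words. Documents without any sentiment words yield 0/0 (`NaN`), which we set to 0 (neutral):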
```{r}
reviews$sent = (reviews$npos - reviews$nneg) / (reviews$npos + reviews$nneg)
reviews$sent[is.na(reviews$sent)] = 0
```
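As a small worked example of this formula, the sketch below wraps the same two steps in a hypothetical helper `toy_score` (not part of the original analysis) and applies it to invented counts.

```{r}
# Hypothetical helper: (positive - negative) / (positive + negative),
# with 0/0 (NaN) replaced by a neutral score of 0
toy_score = function(npos, nneg) {
  s = (npos - nneg) / (npos + nneg)
  s[is.na(s)] = 0  # is.na() is also TRUE for NaN
  s
}

# Invented counts: mostly positive, only negative, balanced, no sentiment words
toy_score(npos = c(3, 0, 1, 0), nneg = c(1, 2, 1, 0))
```

The score always lies between -1 (only negative words) and 1 (only positive words), regardless of document length.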
Validating sentiment
===
The best way to validate dictionary results is to compare them with manual coding.
In this case, the star rating in `overall` serves as the manual code, so we can compute the average calculated sentiment per rating:
```{r}
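# Sanity check: both vectors should have one entry per review
cat(length(reviews$sent), length(reviews$overall), "\n")
# Mean dictionary sentiment score for each rating
tapply(reviews$sent, reviews$overall, mean, na.rm=TRUE)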
```
So, the higher the sentiment score, the higher the manually coded sentiment. The correlation between the two is low, though:
```{r}
cor.test(reviews$sent, reviews$overall)
```
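Since the rating is an ordinal scale, a rank correlation is one possible complement to the Pearson correlation that `cor.test` computes by default. The sketch below is only an illustration on invented toy vectors (`toy_scores` and `toy_ratings` are not taken from the review data).

```{r}
# Toy data (invented): dictionary scores and the matching ratings
toy_scores  = c(-1.0, -0.2, 0.0, 0.3, 0.5, 1.0)
toy_ratings = c(1, 2, 3, 3, 4, 5)

# Pearson correlation, the default method of cor() and cor.test()
toy_pearson = cor(toy_scores, toy_ratings, method = "pearson")

# Spearman correlation only uses the rank order of the values;
# exact = FALSE avoids the warning about exact p-values with ties
toy_spearman = cor.test(toy_scores, toy_ratings, method = "spearman", exact = FALSE)

toy_pearson
toy_spearman$estimate
```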
An alternative is to do linear discriminant analysis with a dichotomous dependent variable,
treating only the 5-star ratings as positive:
```{r}
reviews$positive = as.factor(ifelse(reviews$overall == 5, "pos", "neg"))
# CV=TRUE returns leave-one-out cross-validated predictions in m$class
m = MASS::lda(positive ~ sent, data=reviews, CV=TRUE)
```
And compute the classification accuracy:
```{r}
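# Confusion table: actual class in rows, predicted class in columns
# (named tab to avoid masking the base function t())
tab = table(reviews$positive, m$class)
sum(diag(tab)) / sum(tab)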
```
This is not great, considering that there are only two answer categories.
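One way to put an accuracy figure into perspective is to compare it with the share of the largest class, which is the accuracy of always predicting that class. A minimal sketch on invented labels (all `toy_` names are hypothetical):

```{r}
# Toy labels (invented): actual and predicted class for six documents
toy_levels = c("neg", "pos")
toy_actual = factor(c("pos", "pos", "pos", "neg", "neg", "pos"), levels = toy_levels)
toy_pred   = factor(c("pos", "pos", "neg", "pos", "neg", "pos"), levels = toy_levels)

# Accuracy: share of documents on the diagonal of the confusion table
toy_tab = table(toy_actual, toy_pred)
toy_acc = sum(diag(toy_tab)) / sum(toy_tab)

# Baseline: accuracy of always predicting the most frequent actual class
toy_baseline = max(table(toy_actual)) / length(toy_actual)

c(accuracy = toy_acc, baseline = toy_baseline)
```

In this toy case the classifier is no better than the baseline, even though it gets two thirds of the documents right.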
Applying sentiment to token lists
===
We can also apply sentiment to a token list, for example the State of the Union speeches.
```{r, message=FALSE}
library(corpustools)
data(sotu)
sotu.tokens$sent = 0
sotu.tokens$sent[sotu.tokens$word %in% pos_words] = 1
sotu.tokens$sent[sotu.tokens$word %in% neg_words] = -1
head(sotu.tokens)
```
And compute the mean sentiment per article:
```{r}
sent = aggregate(sotu.tokens["sent"], sotu.tokens["aid"], mean)
sent = merge(sent, sotu.meta, by.x="aid", by.y="id")
head(sent)
```
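The same two steps can be traced by hand on a small invented token list. The sketch below assumes nothing beyond base R; the column names `aid` and `word` mirror the ones used above, while the words and the `toy_` objects are made up.

```{r}
# Toy token list (invented): one row per token, aid identifies the document
toy_tokens = data.frame(aid  = c(1, 1, 1, 2, 2),
                        word = c("good", "bad", "car", "great", "road"),
                        stringsAsFactors = FALSE)

# Score each token: 1 for positive, -1 for negative, 0 otherwise
toy_tokens$sent = 0
toy_tokens$sent[toy_tokens$word %in% c("good", "great")] = 1
toy_tokens$sent[toy_tokens$word %in% c("bad", "awful")] = -1

# Mean token sentiment per document
toy_sent = aggregate(toy_tokens["sent"], toy_tokens["aid"], mean)
toy_sent
```

Because neutral tokens count as 0 in the mean, this score is normalized by document length rather than by the number of sentiment words.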