---
engine: knitr
---
# Dictionary-Based Word Counts {#sec-word-counting}
```{r setup}
#| echo: false
#| include: false
source("_common.R")
library(quanteda)
hippocorpus_corp <- read_csv("data/hippocorpus-u20220112/hcV3-stories.csv") |>
select(AssignmentId, story, memType, summary, WorkerId,
annotatorGender, openness, timeSinceEvent) |>
corpus(docid_field = "AssignmentId",
text_field = "story")
hippocorpus_tokens <- hippocorpus_corp |> tokens(remove_punct = TRUE)
hippocorpus_dfm <- hippocorpus_tokens |>
dfm() |>
dfm_remove("~")
```
In @sec-tokenization, we transformed our corpus into a DFM with counts of each word in each document. But not all words are created equal; some words are much more psychologically interesting than others. The simplest way to count relevant words while ignoring others is by using a **dictionary**.
This chapter introduces the basics of dictionary-based methodology. @sec-dla and @sec-word-counting-improvements will build on this chapter, exploring more advanced ways to use token counting for measurement.
## Dictionaries {#sec-quanteda-dictionaries}
A dictionary is a list of words (or other tokens) associated with a given psychological or other construct. For example, a dictionary for depression might include words like "sleepy" and "down." We can use the dictionary to count construct-related words in each text---texts that use more construct-related words are then assumed to be more construct-related overall.
Let's give a more concrete example: Recall that in the _Hippocorpus_ data, the `memType` variable indicates whether the participant was told to tell a story that happened to them recently ("recalled"), a story that they had already told a few months earlier ("retold"), or an entirely fictional story ("imagined").
@sap_etal_2022 hypothesized that true autobiographical stories would include more surprising events than imagined stories. To test this hypothesis, we could use a dictionary of surprise-related words. Where could we find such a dictionary? Perhaps we could try making one up?
```{r}
surprise_dict <- dictionary(
list(
surprise = c("surprise", "wow", "suddenly", "bang")
)
)
surprise_dict
```
Generating a sentiment dictionary is not easy. Luckily, other researchers have done the work for us: The NRC Word-Emotion Association Lexicon [@mohammad_turney_2010; @mohammad_turney_2013], included in the `quanteda.sentiment` package, has a list of 534 surprise words.
```{r}
surprise_dict <- quanteda.sentiment::data_dictionary_NRC["surprise"]
surprise_dict
```
The NRC Word-Emotion Association Lexicon is a **crowdsourced dictionary**; @mohammad_turney_2013 generated it by presenting individual words to thousands of online participants and asking them to rate how much each word is "associated with the emotion surprise." The final dictionary includes all the words that were consistently reported to be at least moderately associated with surprise.
## Understand Your Dictionary {#sec-understand-your-dictionary}
In @sec-look-at-your-data, we emphasized the importance of reading through your data before conducting any analyses. The same is true for dictionaries: Before using any dictionary-based methods, always look through your dictionary and ask yourself two questions:
- How was my dictionary constructed?
- How context-dependent are the words in my dictionary?
Let's expand on each of these questions.
### How Was Your Dictionary Constructed?
The surprise dictionary we are using was generated by asking participants how much each word was "associated with the emotion surprise" [@mohammad_turney_2013]. A word can be "associated with" surprise because it reflects surprise (e.g. "suddenly"), but it can also be "associated with" surprise because it reflects the exact opposite of surprise. Indeed, if we **look through the dictionary**, we find words like "leisure" and "lovely".
```{r}
set.seed(8)
sample(surprise_dict$surprise, 20)
```
This means that we are not, in fact, measuring how surprising each story is. At best, we are measuring how much each story deals with surprise (or lack thereof) one way or another.
As you look through your dictionary, make sure you are aware of the process used to construct the dictionary. If it was generated by asking participants about individual words, how was the question formulated? How might that question have been interpreted by the participants?
### How Context-Dependent are the Words in Your Dictionary?
The participants generating our dictionary were asked about one word at a time. People presented with words out of context often fail to consider how words are actually used in natural discourse. For example, imagine that you are an online participant, and you are asked about your associations with the word "guess". Seeing "guess" by itself might sound like an imperative, calling to mind a situation in which someone is asking you to guess something about which you are unsure---perhaps a game show. Since this sort of situation generally results in a surprise when the truth is revealed, you report that "guess" is associated with surprise. In fact, though, "guess" is _much_ more frequently used in the phrase "I guess", which signifies reluctance and has very little to do with surprise. We can check how "guess" is used in our corpus with Quanteda's `kwic()` function, which gives a dataframe of Key Words In Context (KWIC).
```{r}
hippocorpus_tokens |>
kwic("guess") |>
mutate(text = paste(pre, keyword, post)) |>
pull(text)
```
With the possible exception of #6, none of these examples give the impression of an impending surprise. Nevertheless, "guess" does appear in the NRC surprise dictionary.
As you look through your dictionary, think about how each word might really be used in context. Are there ways to use the word that do not have to do with your construct?
## Raw Word Counts
At this point, you might be pretty skeptical about using the NRC surprise dictionary to measure surprise. Even so, let's try it out. To count how many times surprise words appear in each of our texts, we use the `dfm_lookup()` function.
```{r}
hippocorpus_surprise <- hippocorpus_dfm |>
dfm_lookup(surprise_dict)
hippocorpus_surprise
```
### Modeling Raw Word Counts {#sec-modeling-word-counts}
Recall that we wanted to test whether true autobiographical stories include more surprise than imagined stories. Now that we have counted the number of surprise words in each document, how do we test our hypothesis?
A good first step is to reattach the word counts to our original corpus. As we do this, we convert both to dataframes.
```{r}
#| output: false
hippocorpus_surprise_df <- hippocorpus_surprise |>
convert("data.frame") |> # convert to dataframe
right_join(
hippocorpus_corp |>
convert("data.frame") # convert to dataframe
)
```
It makes sense to control for the total number of words in each text, since longer texts have more opportunities to use surprise words^[We use total word count here for the sake of the example, but total word count may not always be the appropriate measure of text length. For example, you may want to measure the amount of surprise _relative to other emotional content_. In this case, it would be more appropriate to control for the total number of emotion-related words, as opposed to the total word count. Similarly, if you were measuring the number of first person singular pronouns, you may want to control for the total number of pronouns rather than the total word count.]. To count the total number of tokens in each text, we can use the `ntoken()` function on our DFM and add the result directly to the new dataframe.
```{r}
hippocorpus_surprise_df <- hippocorpus_surprise_df |>
mutate(wc = ntoken(hippocorpus_dfm))
```
We are now ready for modeling! When your dependent variable is a count of words, we recommend using negative binomial regression, available in R with the `MASS` package^[We use a simple count of words as the dependent variable here, but keep in mind that it may be more appropriate to apply a transformation such as Simple Good-Turing frequency estimation (@sec-smoothing).]. For extra sensitivity to the variable rates at which word frequencies grow with text length [see @baayen_2001], we include `wc` both as a predictor and as an offset, `offset(log(wc))`, in the regression (an offset is simply a predictor with its coefficient fixed at 1). We use `log()` to account for the fact that negative binomial regression links the predictors with the outcome variable through a log link. This means that including `offset(log(wc))` is equivalent to modeling the ratio of surprise words to total words (for a more detailed explanation of this dynamic, see the discussion [here](https://stats.stackexchange.com/questions/307369/how-to-interpret-glm-and-ols-with-offset)).
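The equivalence is easy to see by writing out the model. Letting $\mu_i$ denote the expected number of surprise words in document $i$, the model with the offset is

$$
\log \mu_i = \beta_0 + \beta_1\,\text{memType}_i + \beta_2\,\text{wc}_i + \log(\text{wc}_i)
$$

Subtracting the offset from both sides gives

$$
\log\!\left(\frac{\mu_i}{\text{wc}_i}\right) = \beta_0 + \beta_1\,\text{memType}_i + \beta_2\,\text{wc}_i
$$

so the coefficients describe the _rate_ of surprise words per token rather than the raw count.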
```{r}
surprise_mod <- MASS::glm.nb(surprise ~ memType + wc + offset(log(wc)),
data = hippocorpus_surprise_df)
summary(surprise_mod)
```
Looking at the p-values for the coefficients, we see that there was no significant difference between recalled and imagined stories (p = `r round(summary(surprise_mod)$coefficients["memTyperecalled","Pr(>|z|)"],3)`). There was, however, a significant difference between _retold_ and imagined stories, such that retold stories used fewer surprise words (p = `r round(summary(surprise_mod)$coefficients["memTyperetold","Pr(>|z|)"],3)`).
**An example of using raw word counts in research:** @simchon_etal_2023 collected Twitter activity over a three-month period from over 2.7 million users. Using a dictionary, they then counted the number of passive auxiliary verbs (e.g. "they **were** analyzed"; "my homework **will be** completed") in each user's activity. They found that users with more followers (indicating higher social status) used far fewer passive auxiliary verbs, controlling for total word count.
## Polarity {#sec-polarity}
How can we improve our measurement of surprise? As we saw above, one problem with the dictionary approach is that a word might be associated with a construct because it reflects the opposite of that construct. One solution to this problem is to measure the ratio between the target dictionary and its opposite. In sentiment analysis, this approach is called _polarity_. Polarity is most commonly used to analyze the overall valence of a text by comparing positive words (e.g. "happy", "great") with negative words (e.g. "disappointed", "terrible"). In principle though, we can use it to compare any sort of opposites.
What is the opposite of surprise? @plutchik_1962 argues that the opposite of surprise is _anticipation_. Luckily, the NRC Word-Emotion Association Lexicon also includes a dictionary of anticipation-associated words. Using this dictionary, we can measure how much a text is associated with surprise _as opposed to_ anticipation.
Quanteda's built-in function for polarity is `textstat_polarity()`. To use this function, we first have to set the "positive" and "negative" polarities of the dictionary, and then call `textstat_polarity()` on our DFM. By default, this outputs the log ratio of positive to negative counts for each document:
```{r}
library(quanteda.sentiment)
# subset dictionary
surprise_anticipation_dict <- data_dictionary_NRC[c("surprise", "anticipation")]
# set surprise and anticipation as polarity
polarity(surprise_anticipation_dict) <- list(pos = "surprise", neg = "anticipation")
# get polarity
hippocorpus_surprise_polarity <-
textstat_polarity(hippocorpus_dfm, surprise_anticipation_dict) |>
rename(surprise_vs_anticipation = sentiment)
```
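For reference, the default scoring function computes the polarity of each document $d$ as (roughly) the log ratio of positive to negative counts,

$$
\text{polarity}(d) = \log\!\left(\frac{\text{pos}_d + s}{\text{neg}_d + s}\right)
$$

where $\text{pos}_d$ and $\text{neg}_d$ are the counts of "positive" (here, surprise) and "negative" (here, anticipation) dictionary words in document $d$, and $s$ is a small smoothing constant that prevents taking the log of zero. Check `?textstat_polarity` for the exact defaults and alternative scoring functions.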
While `textstat_polarity()` can sometimes be useful for visualizations or downstream analyses, it is not helpful for modeling polarity as an outcome variable.
### Modeling Polarity
To test whether true autobiographical stories include more surprise _relative_ to anticipation than imagined stories, we first count the surprise and anticipation words in each document, and rejoin the results to the full dataset.
```{r}
# count surprise/anticipation words
hippocorpus_surprise_anticipation <- hippocorpus_dfm |>
dfm_lookup(surprise_anticipation_dict)
# convert to dataframe and join to full data
hippocorpus_surprise_anticipation_df <-
hippocorpus_surprise_anticipation |>
convert("data.frame") |>
right_join(
hippocorpus_corp |>
convert("data.frame"), # convert to dataframe
by = "doc_id"
) |>
mutate(wc = ntoken(hippocorpus_dfm))
```
Since we are still modeling word counts as an output, we again use negative binomial regression. Rather than controlling for the total word count, however, we can control for the total number of surprise and anticipation words. Because of the log link function (along with the endlessly useful properties of logarithms), entering this sum as a log offset (`offset(log(surprise + anticipation))`) is equivalent to modeling the ratio of surprise-related to anticipation-related words.
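Writing $\mu_i$ for the expected surprise count and $s_i$, $a_i$ for the surprise and anticipation counts in document $i$, the offset works just as before:

$$
\log \mu_i = \eta_i + \log(s_i + a_i) \quad\Longleftrightarrow\quad \log\!\left(\frac{\mu_i}{s_i + a_i}\right) = \eta_i
$$

where $\eta_i$ is the linear predictor. The coefficients now describe the proportion of counted emotion words that are surprise-related.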
```{r}
#| warning: false
# remove documents with zero dictionary words to avoid log(0) in the offset
hippocorpus_surprise_anticipation_df <-
hippocorpus_surprise_anticipation_df |>
filter(surprise + anticipation > 0)
set.seed(2024)
surprise_anticipation_mod <- MASS::glm.nb(
surprise ~ memType + wc + offset(log(surprise + anticipation)),
data = hippocorpus_surprise_anticipation_df,
# increase iterations to ensure model converges
control = glm.control(maxit = 10000)
)
summary(surprise_anticipation_mod)
```
There is no significant difference between true and imagined stories in the ratio of surprise to anticipation words.
## Lexical Norms {#sec-word-scoring}
So far we have covered raw word counts, which use one list of words to represent a construct, and we have covered polarities, which use two lists of words to represent a construct and its opposite. The third and final dictionary-based method takes a more nuanced approach than either of these: In lexical norms, words are allowed to represent the construct or its opposite to continuously varying degrees, represented by numbers on a scale. In `quanteda.sentiment`, this scale is called "valence", though elsewhere it can be called "lexical affinity" or "lexical association".
The same group that created the NRC Word-Emotion Association Lexicon also created a parallel dictionary with continuous scores: the [NRC Hashtag Emotion Lexicon](https://saifmohammad.com/WebPages/AccessResource.htm) [@mohammad_kiritchenko_2015]. Whereas the NRC Word-Emotion Association Lexicon was crowdsourced, the NRC Hashtag Emotion Lexicon was generated algorithmically from a corpus of Twitter posts that contained hashtags like "#anger" and "#surprise". The dictionary includes the words that were most predictive of each hashtag, with scores indicating the strength of their statistical association with the category (higher scores indicate stronger association). We can access the NRC Hashtag surprise dictionary from GitHub:
```{r}
path <- "https://raw.githubusercontent.com/bwang482/emotionannotate/master/lexicons/NRC-Hashtag-Emotion-Lexicon-v0.2.txt"
hashtag <- read_tsv(path, col_names = c("emotion", "token", "score"))
hashtag |>
filter(emotion == "surprise") |>
head()
# Create dictionary
surprise_dict_hashtag <- dictionary(
list(surprise = hashtag$token[hashtag$emotion == "surprise"])
)
# Set dictionary valence
valence(surprise_dict_hashtag) <- list(
surprise = hashtag$score[hashtag$emotion == "surprise"]
)
```
To measure surprise in the Hippocorpus data, we find the surprise score of each token and compute the average score across the tokens of each document. With `quanteda.sentiment`, we can do this by calling the `textstat_valence()` function on our DFM. Since a score of zero in the NRC Hashtag Emotion Lexicon represents zero surprise, we will add `normalization = "all"` to code non-dictionary words as zero by default.
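Concretely, with `normalization = "all"`, the score for document $d$ is the valence-weighted word count divided by the _total_ token count:

$$
\text{score}(d) = \frac{\sum_{w} v(w)\, c(w, d)}{n(d)}
$$

where $v(w)$ is the dictionary valence of word $w$, $c(w, d)$ is its count in document $d$, and $n(d)$ is the total number of tokens in $d$. Words outside the dictionary contribute zero to the numerator but still count toward $n(d)$.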
```{r}
#| output: false
# compute valence
hippocorpus_valence <- textstat_valence(
hippocorpus_dfm, # data
surprise_dict_hashtag, # dictionary
normalization = "all"
)
# rejoin to original data
hippocorpus_valence <- hippocorpus_valence |>
rename(surprise = sentiment) |>
right_join(
hippocorpus_corp |>
convert("data.frame") # convert to dataframe
)
```
### Modeling Norms
Norm scores, unlike raw word counts and polarities, can be reasonably modeled using standard linear regression. Furthermore, because the score is an average rather than a sum or count, there is no need to control for total word count. Let's test one more time whether true autobiographical stories include more surprise-related language than imagined stories:
```{r}
surprise_score_mod <- lm(surprise ~ memType, hippocorpus_valence)
summary(surprise_score_mod)
```
We found a significant difference between recalled and imagined stories (p < .001), such that recalled stories have more surprise-related language! This supports Sap et al.'s hypothesis that true autobiographical stories would include more surprising events than imagined stories. The new model also indicated a significant difference between retold and imagined stories, such that retold stories used _more_ surprise-related language---the opposite direction relative to our original finding with the crowdsourced dictionary (p = `r round(summary(surprise_score_mod)$coefficients["memTyperetold","Pr(>|t|)"],3)`).
## Sources of Dictionaries {#sec-dictionary-sources}
So far we have seen the NRC Word-Emotion Association Lexicon, which used a crowdsourcing approach to generate the dictionary, and the NRC Hashtag Emotion Lexicon, which used a corpus-based approach, relying on hashtags for labeling. Crowdsourcing and algorithmic corpus-based generation are far from the only ways to generate a dictionary. Here we review various types of dictionaries and where to find them.
### Crowdsourced Dictionaries
Besides the surprise dictionary, the NRC Word-Emotion Association Lexicon includes dictionaries for anger, fear, anticipation, trust, sadness, joy, and disgust. The same group has also produced other crowdsourced emotion dictionaries:
- [NRC VAD](https://saifmohammad.com/WebPages/nrc-vad.html) [@mohammad_2018] contains 20,007 words with ratings between 0 and 1 for valence, arousal, and dominance.
- [NRC Affect Intensity](http://saifmohammad.com/WebPages/AffectIntensity.htm) [@mohammad_2018b] contains 4192 words with ratings between 0 and 1 for anger, fear, sadness and joy.
Psychologists have used crowdsourcing questionnaires to create dictionaries (especially norms) for decades. As such, crowdsourced dictionaries exist for many psychologically interesting constructs:
- @brysbaert_etal_2014 used an internet questionnaire to obtain norms for concreteness (i.e. the extent to which a word refers to a perceptible entity). The result, including nearly 40,000 words and 2-grams, is available as an Excel file [here](https://static-content.springer.com/esm/art%3A10.3758%2Fs13428-013-0403-5/MediaObjects/13428_2013_403_MOESM1_ESM.xlsx).
- @kuperman_etal_2012 asked participants at what age they learned each word, resulting in age-of-acquisition norms for 30,000 English words.
- @warriner_etal_2013 crowdsourced norms for valence, arousal, and dominance, expanding on the ANEW dictionary included in `quanteda.sentiment`. The expanded norms are available as a zip file [here](https://static-content.springer.com/esm/art%3A10.3758%2Fs13428-012-0314-x/MediaObjects/13428_2012_314_MOESM1_ESM.zip).
- @stadthagen_davis_2006 collected norms for age-of-acquisition, familiarity, and imageability (the ease with which a word evokes mental images) by surveying undergraduates.
- @diveica_etal_2023 asked online participants to rate the social relevance of words. The resulting "socialness" norms are available [here](https://osf.io/yjz85/).
### Expert-Generated Dictionaries
Words are used in many contexts, sometimes with many possible meanings. To take these into account, some groups rely on experts to generate their dictionaries. By far the most prominent collection of expert-generated dictionaries is [LIWC](https://www.liwc.app) (pronounced "Luke"), which includes word lists for grammatical patterns, emotional content, cognitive processes, and more. With its rigorous approach, LIWC has dominated the field of dictionary-based analysis in psychology for decades. The most recent version of LIWC [@boyd_etal_2022] was generated by a team of experts who went through numerous stages of brainstorming, voting, and reliability analysis before arriving at the final word lists.
### Corpus-Based Dictionaries
Human raters are much better at judging full texts than individual words. Corpus-based dictionaries take advantage of this by extracting their word lists from corpora of full texts that have been rated by humans. We have already seen the [NRC Hashtag Emotion Lexicon](https://saifmohammad.com/WebPages/AccessResource.htm) [@mohammad_kiritchenko_2015], which used Twitter hashtags to gather a corpus of Tweets labeled with emotions by their original authors. A more classic example of corpus-based dictionary generation is @rao_etal_2014, who used a corpus of 1,246 news headlines, each rated manually for anger, disgust, fear, joy, sadness, and surprise on a scale from 0 to 100 [@strapparava_mihalcea_2007]. By correlating these ratings with frequencies of words (see @sec-dla), they extracted the words that were most representative of high ratings in each category. @araque_etal_2018 used a similar technique to create [DepecheMood](https://github.com/marcoguerini/DepecheMood), which includes ratings for each word on eight emotional dimensions: afraid, amused, angry, annoyed, don't care, happy, inspired, and sad. This base dictionary was updated with additional resources by @badaro_etal_2018 to create EmoWordNet, which can be accessed [through the Internet Archive](https://web.archive.org/web/20210906101337/http://oma-project.com/ArSenL/EmoWordNet1.0.txt).
Many statistical techniques have been used to extract dictionaries from labeled corpora, some of which will be covered briefly in @sec-dla and @sec-decontextualized-embeddings of this book. For a recent review of methods, see @bandhakavi_etal_2021.
### Other Approaches to Dictionary Generation
- **Thesaurus Mining:** @strapparava_valitutti_2004 started with a short list of strongly affect-related words (e.g. "anger", "doubt", "cry"), and used [WordNet](https://wordnet.princeton.edu), a database of conceptual relations between words, to find close synonyms of the original words on the list. The result was [WordNet Affect](https://wndomains.fbk.eu/wnaffect.html). @strapparava_mihalcea_2007 used WordNet Affect to generate short lists of words associated with anger, disgust, fear, joy, sadness, and surprise, downloadable from [here](https://web.eecs.umich.edu/~mihalcea/affectivetext/).
- **Decontextualized Embeddings:** In @sec-decontextualized-embeddings, we will cover a family of methods for measuring the similarities between words based on how frequently they appear together in text: decontextualized embeddings. These methods can be used on their own for measuring psychological constructs, but they can also be used as a tool for building dictionaries. For example, @buechel_etal_2020 started with a small seed lexicon and used word embeddings (@sec-word-embeddings) to find other words that are likely to appear in texts of the same topic. The result---including dictionaries for valence, arousal, dominance, joy anger, sadness, fear, and disgust---is [available for download online](https://zenodo.org/records/3756607).
- **Combined Methods:** @vandervegt_etal_2021 used a combination of expert input, thesaurus data from [WordNet](https://wordnet.princeton.edu), word embeddings (@sec-word-embeddings), and crowdsourcing from online participants to generate norms for numerous constructs associated with grievance-fueled violence (e.g. desperation, fixation, frustration, hate, weapons). The final product is available [here](https://github.com/Isabellevdv/grievancedictionary/tree/main).
---
::: {.callout-tip icon="false"}
## Advantages of Dictionary-Based Word Counts
- **Efficient Processing:** Counting is a simple operation for computers. For very large datasets, this can make a big difference.
- **Easy to Interpret:** Dictionaries for sentiment analysis are usually not more than a few hundred words long. This means that they are easy to read through and understand intuitively. The intuitive appeal is also good for explaining your research to others---"we counted the number of anger-related words" is a method that any non-expert can understand.
:::
::: {.callout-important icon="false"}
## Disadvantages of Dictionary-Based Word Counts
- **No Context:** Dictionary-based word counts treat texts as bags of words. This means they entirely ignore word order (aside from the order of any n-grams that might be included in the dictionary).
- **May Reflect Various Constructs:** Dictionaries are often generated by asking participants to identify associations with words. These associations do not necessarily reflect the construct in which the researcher is interested.
- **Unnuanced:** Words are either in a dictionary or they are not. Raw counts carry no nuance about the varying degrees to which different words may reflect the construct of interest. Norms can fix this problem, but are not available for many psychological dimensions.
- **Unnaturalistic Generation Process:** Dictionaries are generally crowdsourced by asking participants to report their associations with individual words. People presented with words out of context often fail to consider how words are actually used in natural discourse.
- **Limited Dictionaries Available:** Dictionaries are expensive and labor intensive to produce. Researchers are generally reliant on dictionaries already produced by other teams, which may not reflect the construct of interest precisely.
:::
---