# Transforming Word Counts {#sec-word-counting-improvements}
```{r setup}
#| echo: false
#| include: false
source("_common.R")
library(quanteda)
library(quanteda.textstats)
hippocorpus_corp <- read_csv("data/hippocorpus-u20220112/hcV3-stories.csv") |>
select(AssignmentId, story, memType, summary, WorkerId,
annotatorGender, openness, timeSinceEvent) |>
corpus(docid_field = "AssignmentId",
text_field = "story")
hippocorpus_dfm <- hippocorpus_corp |>
tokens(remove_punct = TRUE) |>
dfm() |>
dfm_remove("~")
crowdflower <- read_csv("data/text_emotion.csv") |>
rename(text = content)
crowdflower_corp <- crowdflower |>
corpus(docid_field = "tweet_id")
crowdflower_dfm <- crowdflower_corp |>
tokens(remove_punct = TRUE) |>
tokens_ngrams(n = c(1L, 2L)) |> # 1-grams and 2-grams
dfm()
surprise_pmi <- crowdflower_dfm |>
textstat_keyness(
docvars(crowdflower_dfm, "sentiment") == "surprise",
measure = "pmi"
)
surprise_pmi <- surprise_pmi |>
filter(n_target + n_reference > 10,
pmi > 1.5)
tweet_surprise_dict <- dictionary(
list(surprise = surprise_pmi$feature),
separator = "_" # in n-grams, tokens are separated by "_"
)
```
So far, we have used statistical methods that work with raw counts, like negative binomial regression in @sec-modeling-word-counts, and likelihood ratio testing in @sec-keyness (with the minor exception of @sec-word-scoring, which used averaged scores). Raw token counts are difficult to work with: They are not normally distributed, they do not usually change linearly with predictors, and they are overly sensitive to quirks of linguistic style and text length.
Researchers and engineers have proposed various ways to fix these problems by transforming the raw counts or by transforming the text before performing the counting. In this chapter, we introduce a few of the most common transformations. These transformations can be applied to any analysis of word counts, including dictionary-based analysis (@sec-word-counting) and open vocabulary methods (@sec-dla), to bypass certain problems or bring out certain features of the data. As always, each technique has its own advantages and disadvantages.
## Text Normalization
The simplest way to transform word counts is by transforming the text itself. We have already seen some simple examples of this in @sec-custom-preprocessing: removing punctuation, symbols, or URLs before tokenization. Such transformations are often called **text normalization**, since they get rid of the quirks in each text and ensure that everything follows a standard format.
### Occurrence Thresholds
Besides removing or standardizing certain types of tokens, like URLs, researchers commonly enforce an **occurrence threshold**, removing any token that occurs fewer than a certain number of times in the data. Occurrence thresholds can be calculated on the full dataset (term frequency) or across documents (document frequency; e.g. removing tokens used in fewer than 1% of documents). Using a document frequency threshold is often beneficial: sometimes a single document will use a token many times---either because it happens to be discussing a specific topic or because of a quirk of the author's language style---and drive up the overall frequency in a misleading way.
Occurrence thresholds can be applied to DFMs in Quanteda with the `dfm_trim()` function, using `min_termfreq`, `min_docfreq`, or both. Maximum frequency thresholds can also be imposed, as sketched below.
```{r}
# remove tokens used by fewer than 1% of documents
hippocorpus_dfm_threshold <- hippocorpus_dfm |>
dfm_trim(min_docfreq = 0.01, docfreq_type = "prop")
```
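The same function can also enforce a raw term-frequency threshold, with or without a maximum. A minimal sketch with arbitrary illustrative cut-offs:
```{r}
# remove tokens that occur fewer than 5 times in the corpus, and tokens that
# occur more than 10,000 times (arbitrary illustrative cut-offs)
hippocorpus_dfm_tf_threshold <- hippocorpus_dfm |>
  dfm_trim(min_termfreq = 5, max_termfreq = 10000)
```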
::: {.callout-tip icon="false"}
## Advantages of Occurrence Thresholds
- **Cleaner Results:** Occurrence thresholds are an easy way to remove quirks of individuals' writing styles, or very rare terms that complicate analysis without adding much information.
:::
::: {.callout-important icon="false"}
## Disadvantages of Occurrence Thresholds
- **Arbitrary:** It can be difficult to determine what threshold to use, and any choice risks excluding important information from the analysis.
:::
### Removing Stop Words {#sec-stopwords}
In natural language processing, a common step in text normalization is to remove "stop words," everyday words like "the" and "of" that do not contribute much to the meaning of the text. Indeed, Quanteda offers a built-in list of stop words:
```{r}
stopwords() |> head()
```
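If you do decide to remove stop words (for instance, in a topic-focused analysis), a minimal sketch using `dfm_remove()` looks like this:
```{r}
# A minimal sketch: drop English stop words from the Hippocorpus DFM.
# (We generally advise against this step for psychological analyses; see below.)
hippocorpus_dfm_nostop <- hippocorpus_dfm |>
  dfm_remove(stopwords("en"))
```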
Although removing stop words can be useful for analyzing the *topics* of texts, it is generally a bad idea when you are interested in the *psychology* of texts. This is because **the word forms people choose to write---including stop words---are often predictive of their personality**. For example, neurotic people tend to use more first-person singular pronouns [@mehl_etal_2006], and articles like "the" and "a" are highly predictive of being male, being older, and being high in openness [@schwartz_etal_2013].
The relationships between language and personality also extend to more subtle patterns. For example, extraverts tend to use longer words [@mehl_etal_2006], those high in openness tend to use more quotations [@sumner_etal_2011], and those high in neuroticism tend to use more acronyms [@holtgraves_2011]. So if you are looking for psychological differences, be gentle with the text normalization---you never know what strange predictors you might find.
::: {.callout-tip icon="false"}
## Advantages of Removing Stop Words
- **Intuitive Appeal:** Removing stop words focuses an analysis on content, rather than form. When people think of differences between texts, they generally think of differences in content.
:::
::: {.callout-important icon="false"}
## Disadvantages of Removing Stop Words
- **Removes Important Information:** While words like "the" and "a" may seem insignificant, they often carry important psychological information.
:::
## Binary (Boolean) Tokenization
In some cases, it makes sense to stop counting at one---each text either uses a given token or it does not. While this might seem like needlessly throwing away information, binary tokenization fixes a core problem with the bag-of-words (BOW) assumption. Recall from @sec-quanteda-dfms that BOW imagines that each author or topic has its characteristic bag of words, and speaking or writing is just a matter of pulling those words out of the bag one at a time at random. A central problem with this picture is that words are not pulled out one at a time at random---the word I am writing now is intimately tied to the words immediately before it. It may be very unlikely overall that I will write "parthenon," but if I write it once, it is very likely that I will write it again in the same paragraph. This is because I am probably writing about the Parthenon.
The non-independence of words in text means that the difference between zero occurrences of "parthenon" and one occurrence is much more meaningful than the difference between one and two. If a particular token sometimes occurs lots of times in text, statistical procedures like regression may be led to focus on that variance rather than on the more interesting first occurrence. Binary tokenization is the simplest way to avoid this problem.
In Quanteda, a DFM can be converted to binary tokenization with `dfm_weight(scheme = "boolean")`.
```{r}
hippocorpus_dfm_binary <- hippocorpus_dfm |>
dfm_weight(scheme = "boolean")
print(hippocorpus_dfm_binary, max_ndoc = 6, max_nfeat = 6)
```
To use binary tokenization in dictionary-based analysis, you can also perform the dictionary look-up first (@sec-word-counting) and then convert the resulting counts to binary with `mutate(surprise_binary = as.integer(surprise > 0))`.
Keep in mind: Once you convert your DFM to binary tokenization, you are no longer working with a count variable. This means that the negative binomial regression models we covered in @sec-word-counting are no longer appropriate. Instead, if you want to model the binary tokenization as a dependent variable, you will have to use a binary model like logistic regression.
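For example, here is a minimal sketch (one of several reasonable ways to set this up) that performs the look-up with the surprise dictionary from @sec-generating-dictionaries, converts the counts to binary, and fits a logistic regression. Note that only the unigram entries of `tweet_surprise_dict` can match this unigram DFM.
```{r}
# Dictionary look-up, then binary conversion, then logistic regression
surprise_binary_dat <- hippocorpus_dfm |>
  dfm_lookup(tweet_surprise_dict) |>
  convert("data.frame") |>
  mutate(
    surprise_binary = as.integer(surprise > 0),
    memType = docvars(hippocorpus_dfm, "memType")
  )
surprise_mod_binary <- glm(surprise_binary ~ memType,
                           data = surprise_binary_dat,
                           family = binomial())
summary(surprise_mod_binary)
```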
::: {.callout-tip icon="false"}
## Advantages of Binary Tokenization
- **Removes Non-Independence of Observations:** Raw word counts can be misleading (to both humans and statistical models) because the observations are not independent; the more a text uses a word, the more likely it is to use that word again. Binary tokenization avoids this problem by only counting one event per text.
:::
::: {.callout-important icon="false"}
## Disadvantages of Binary Tokenization
- **Devalues Common Tokens:** With binary tokenization, you might miss differences in common tokens like "the." Since almost every text uses "the" at least once, you won't be able to detect that one group uses "the" more often than another.
- **Difficult to Control for Text Length:** The longer a text is, the more likely any given word is to appear in it. But when we stop counting after the first occurrence, this relationship becomes especially difficult to characterize. When working with shorter texts, beware of mistaking differences in text length for differences in the probability of a word appearing!
:::
## Relative Tokenization {#sec-relative-tokenization}
The most common transformation that researchers apply to word counts is dividing them by the total number of words (or other tokens) in the text. The resulting ratio is referred to as a **relative frequency**. This strategy has intuitive appeal (e.g. what percentage of the words are surprise-related), and the value of intuitive appeal should not be discounted [see @sec-simplify-the-story]. But working with ratios or percentages can cause problems with statistical analysis. This is because dividing by the total number of words is no guarantee that the total number of words will be properly controlled for---these are two separate variables. Regression analyses on relative frequencies are likely to give false positives when predictors are correlated with text length [@kronmal_1993]. This is the same reason why we added total word count as both a log offset and a regular predictor in @sec-modeling-word-counts.
Before you start worrying about how to control for text length, make sure to stop and ponder: Do you want to control for text length? Usually we assume that longer documents are longer because the author is more verbose. But what if longer texts are longer because they cover multiple topics, and one of those topics is what we are interested in? In this case, controlling for text length---especially using relative tokenization---will make it look like the longer texts have less of that topic, when in fact they just have more of other topics.
If you decide to use relative tokenization, the process is simple. For dictionary-based word counts, divide your count by the total word count to get a relative frequency. If you want to convert a full DFM to relative tokenization, you can use `dfm_weight(scheme = "prop")`.
```{r}
hippocorpus_dfm_relative <- hippocorpus_dfm |>
dfm_weight(scheme = "prop")
print(hippocorpus_dfm_relative, max_ndoc = 6, max_nfeat = 4)
```
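For dictionary-based counts, a minimal sketch (again using the surprise dictionary from @sec-generating-dictionaries) divides the look-up counts by each document's total token count:
```{r}
# Relative frequency for dictionary-based counts
surprise_rel_dat <- hippocorpus_dfm |>
  dfm_lookup(tweet_surprise_dict) |>
  convert("data.frame") |>
  mutate(
    total_words = ntoken(hippocorpus_dfm),
    surprise_rel = surprise / total_words
  )
```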
**An example of relative tokenization in research:** Golder & Macy (2011) collected messages from all Twitter user accounts (\~2.4 million) created between February 2008 and April 2009, and measured positive and negative affect as the proportion of in-dictionary words (from LIWC) to total word count. These proportions were calculated by hour of the day, day of the week, and month of the year, revealing fluctuations in mood in line with circadian rhythms and seasonal changes.
::: {.callout-tip icon="false"}
## Advantages of Relative Tokenization
- **Intuitive Appeal:** Relative tokenization makes sense if longer documents are longer because of verbosity, or if the construct of interest does not fluctuate over the course of a longer text (e.g. personality).
:::
::: {.callout-important icon="false"}
## Disadvantages of Relative Tokenization
- **Discounts Longer Texts:** If texts are long because they cover multiple topics (or multiple emotions), relative tokenization will dilute true occurrences of the construct of interest.
- **Does Not Control for Text Length:** People often assume that using a percentage will control for the denominator in statistical analyses. This is wrong, which might make ratios like relative tokenization more trouble than they're worth.
- **Not Normally Distributed:** Dividing count variables by text length does not make them normally distributed. This can cause problems for certain statistical methods. Using the Anscombe transform (@sec-anscombe) can partially remedy these problems.
:::
## The Anscombe Transform {#sec-anscombe}
Word counts (whether divided by text length or not) are not normally distributed. In @sec-word-counting, we avoided this problem by using negative binomial regression. An alternative way to deal with this problem is to try to transform the counts to a normal distribution. Remember that complicated-looking formula from @schwartz_etal_2013 in @sec-simplify-the-story? The upper part of that formula is relative tokenization. The lower part is called the Anscombe transform, and it transforms a Poisson distribution (i.e. a well-behaved count variable) into an approximately normal distribution. The transformed variable can then be used in linear regression. In R, the Anscombe transform can be written as `2*sqrt(count + 3/8)`.
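To see what the transform does, here is a quick sketch applying it to a handful of example counts:
```{r}
# The Anscombe transform compresses larger counts and stabilizes the variance
# of (approximately) Poisson-distributed counts
counts <- c(0, 1, 2, 5, 10, 20)
2 * sqrt(counts + 3/8)
```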
The Anscombe transform can be useful for analyzing very large numbers of word count variables (as @schwartz_etal_2013 did) without having to run negative binomial regression each time. But if you are only analyzing a few variables, we recommend against it. This is because word counts do not follow a Poisson distribution and therefore will not be properly normally distributed even after the transform. The Poisson process assumes that words occur independently from each other, and with equal probability throughout a text. In other words, it makes the BOW assumption. As we've covered already, this assumption is problematic. In this case, the fact that words are not pulled randomly out of a bag makes word count distributions *overdispersed*, meaning that the variance of the distribution is higher than expected in a Poisson process. Negative binomial regression may be complicated and take longer to compute, but it tends to be robust to overdispersion.
::: {.callout-tip icon="false"}
## Advantages of the Anscombe Transform
- **Computational Efficiency**
- **Addresses Non-Normality of Word Frequencies**
:::
::: {.callout-important icon="false"}
## Disadvantages of the Anscombe Transform
- **Wrongly Assumes That Word Counts Follow a Poisson Distribution**
:::
## TF-IDF {#sec-tfidf}
Term frequency-inverse document frequency (TF-IDF) is one of the great triumphs of NLP in the last century. It was first introduced by @sparckjones_1972 and is still widely used half a century later, especially in search engines.
The idea of TF-IDF is to measure how *topical* a token is in a document. In other words, how representative is that token of the particular features of that document as opposed to other documents? Let's start with term frequency (TF). TF is just relative frequency: a word count divided by the total number of words in the document. The problem with relative frequency is that it emphasizes common words that don't tell us much about the meaning of the document---if a document has a high frequency of "the," should we conclude that "the" is very important to the meaning of the document? Of course not. To fix this, we multiply TF by inverse document frequency (IDF). Document frequency is the proportion of documents that contain at least one instance of our token (i.e. the average binary tokenization across documents). IDF is the log of the inverse of the document frequency. **IDF answers the question: How unusual is it for this token to appear at all in a document?** So "the," which appears in almost every document, will have a very low IDF---it is not unusual at all. Even though "the" appears a lot in our document (TF is high), its TF-IDF score will be low, reflecting the fact that "the" doesn't tell us much about the content of this particular document.
$$
TF \cdot IDF = relative\:frequency \times \log{\frac{total\:documents}{documents\:with\:token}}
$$
You might wonder: If we want to compare how common the token is in this document to how common it is in other documents, shouldn't we use the inverse *token* frequency (i.e. the average relative frequency of the token across documents in the corpus)? Why do we use the inverse document frequency? The answer: TF-IDF does not make the BOW assumption. If all documents were just bags of words, frequency within documents would be distributed similarly to frequency between documents (e.g. since the bigram "verbal reasoning" occurs in very few tweets, you would expect it to be nearly impossible for it to occur twice in the same document). In reality, though, it would not be surprising to see "verbal reasoning" even three times in the same tweet if that tweet were discussing verbal reasoning. Multiplying relative *term frequency* by inverse *document frequency* provides a measure of exactly how wrong BOW is in this case. In other words, how topical is the token?
In Quanteda, we can convert a DFM to TF-IDF with the `dfm_tfidf()` function:
```{r}
hippocorpus_dfm_tfidf <- hippocorpus_dfm |>
dfm_tfidf(scheme_tf = "prop")
print(hippocorpus_dfm_tfidf, max_ndoc = 6, max_nfeat = 4)
```
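To connect `dfm_tfidf()` back to the formula above, here is a minimal sketch that computes a single TF-IDF value by hand (assuming the token "suddenly" appears in the corpus; `dfm_tfidf()` uses log base 10 by default):
```{r}
# TF: relative frequency of "suddenly" in the first document
tf <- as.numeric(hippocorpus_dfm_relative[1, "suddenly"])
# IDF: log of (total documents / documents containing "suddenly")
idf <- log10(ndoc(hippocorpus_dfm) / docfreq(hippocorpus_dfm)["suddenly"])
tf * idf
# this should match the corresponding cell of the weighted DFM
as.numeric(hippocorpus_dfm_tfidf[1, "suddenly"])
```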
Overall, TF-IDF combines the advantages of many simpler word count transformations without many of their downsides. For example, removing stop words focuses the analysis on text content, but may remove important information. TF-IDF discounts uninformative tokens like stop words without removing them outright. Likewise, we saw that binary tokenization solves part of the problem with the BOW assumption, but throws out a lot of information in the process. Once again, TF-IDF turns this problem into a feature (quantifying how topical a token is) without throwing out any information.
Despite all of its benefits, TF-IDF is not widely used in psychology research. This is for two reasons. First, it is difficult to interpret intuitively, and researchers prize interpretability. Second, it focuses the analysis on the *topic* of texts rather than the subtleties of language style that are often associated with psychological differences (see @sec-stopwords). Nevertheless, if your construct is of a less subtle nature---more akin to the "topic" of the text---consider using TF-IDF.
::: {.callout-tip icon="false"}
## Advantages of TF-IDF
- **Computational Efficiency**
- **Discounts Uninformative Tokens Without Losing Information:** TF-IDF combines the advantages of stop word removal without removing tokens from the analysis.
- **Does Not Rely on BOW:** TF-IDF leverages the discrepancy between term frequency and document frequency to discover patterns of meaning.
- **Proven to Work:** TF-IDF has a long history of use in search engines and automated recommendations. It works surprisingly well for such a simple calculation.
:::
::: {.callout-important icon="false"}
## Disadvantages of TF-IDF
- **May Discount Psychologically Relevant Information:** TF-IDF operates on the assumption that tokens which are more frequent across documents are less relevant to the construct in question. For semantics this is largely true, but for latent psychological constructs it may not be.
:::
## Smoothing {#sec-smoothing}
In @sec-relative-tokenization, we mentioned that dividing by text length does not adequately control for text length. This is true of ratios in general, but for word frequencies in particular there is extra cause for concern.
Consider this short text written by an imaginary participant, Allison:
| _I was sitting at the table, and **suddenly** I understood._
The text has 10 words in total. One of these, "suddenly," is a surprise-related word. Given only this text, you might estimate the probability of surprise words in Allison's language at 1/10, the relative frequency. You would be wrong. To see why, imagine that we also wanted to measure the probability of anger-related words. There are no anger-related words in this text. Is the probability of anger-related words then 0/10 = 0? Of course not. If we read more of Allison's texts, we might very well encounter an anger-related word. Relative frequency therefore underestimates the probability of unobserved words. Conversely, it must overestimate the probability of observed words. So to leave room for the probability of unobserved words, we must admit that the true probability of anger-related words is likely to be a little less than 1/10.
The first solution to this kind of problem was offered by @laplace_1816, who proposed to simply add one to all the frequency counts before computing the relative frequency. This is called **Laplace smoothing**, or sometimes simply _add-one smoothing_. To perform Laplace smoothing in Quanteda, use the `dfm_smooth()` function to add one, and then call `dfm_weight()` as before.
```{r}
hippocorpus_dfm_laplace <- hippocorpus_dfm |>
dfm_smooth(smoothing = 1) |>
dfm_weight(scheme = "prop")
print(hippocorpus_dfm_laplace, max_ndoc = 6, max_nfeat = 4)
```
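To make this concrete, consider a single token in Allison's text. Writing $V$ for the number of token types in the vocabulary, add-one smoothing changes the estimated probabilities from the raw relative frequencies to
$$
\hat{P}(\text{suddenly}) = \frac{1 + 1}{10 + V},
\qquad
\hat{P}(\text{any unobserved word}) = \frac{0 + 1}{10 + V},
$$
so every unobserved word receives a small positive probability, and the estimates for observed words shrink accordingly.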
Laplace smoothing is better than nothing, but it is designed for a situation in which the number of possible token types is small and known. In the case of natural language, the number of possible tokens is extremely large and entirely unknown [@baayen_2001]. You can partially make up for this by adding a very small smoothing number instead of 1 (e.g. 1 divided by the number of token types in the corpus: `smoothing = 1/ncol(hippocorpus_dfm)`), but if you are willing to invest a bit more computation time, there are better ways.
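In code, that smaller smoothing constant is a one-line change to the steps above:
```{r}
# A minimal sketch: smooth by 1 / (number of token types) instead of adding 1
hippocorpus_dfm_small_smooth <- hippocorpus_dfm |>
  dfm_smooth(smoothing = 1 / ncol(hippocorpus_dfm)) |>
  dfm_weight(scheme = "prop")
```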
During World War II, the German navy encrypted and decrypted its communications using the Enigma machine, which scrambled and unscrambled messages according to an initial input setting. This input setting was updated each day, and part of the process required the operator of the machine to select a three-letter sequence from a large book "at random" [@good_2000]. Alan Turing, who led the British effort to decrypt naval Enigma traffic, quickly understood that the German operators' choices from this book were not entirely random, and that patterns in their choices could be exploited to narrow down the search. The problem became how to estimate the probability of each three-letter sequence based on a relatively small sample of previously decoded input settings---a sample much smaller than the number of three-letter sequences in the book (which, like the number of possible tokens in English text, was extremely large). Turing solved this problem---essentially the same problem posed by word frequencies in text---by developing a method that used frequencies of frequencies in the sample to estimate the total probability of yet-unseen sequences and correct the observed frequencies based on this estimate. The algorithm was later refined and published by Turing's assistant I. J. Good [-@good_1953], and has since seen many variations. Today, the most popular of these variations is the **Simple Good-Turing** algorithm [@gale_sampson_1995]. Unfortunately, Simple Good-Turing smoothing is not yet implemented in Quanteda, but that is expected to change in the coming months. As soon as it does, we will add it to the textbook.
While smoothing may seem unnecessarily complex, it can be an important safeguard against false positives when analyzing word counts or frequencies. This is because the bias that smoothing accounts for (due to unobserved tokens) is stronger for shorter texts, and grows in a non-linear pattern [@baayen_2001]. This bias can become a confounding factor in regression models, especially when a predictor variable is strongly correlated with text length.
::: {.callout-tip icon="false"}
## Advantages of Smoothing
- **Better Estimates of Token Probabilities**
- **Corrects Text-Length-Dependent Bias:** This bias can cause false positive results in regression analyses.
:::
::: {.callout-important icon="false"}
## Disadvantages of Smoothing
- **Computationally Inefficient:** Well-performing methods like Simple Good-Turing require extra computation; Laplace smoothing is much more efficient, but it does not perform as well.
:::
## Machine Learning Approaches {#sec-machine-learning-word-counts}
This chapter is dedicated to methods that can be carried out after computing word counts (@sec-tokenization) and before analysis (@sec-word-counting; @sec-dla). These methods can be thought of as transformations that we apply to word counts to make them better at measuring what we want them to measure. The last of these transformations is the most elaborate one: supervised machine learning.
For supervised machine learning, you need a training dataset of text that is already labeled with your construct of interest. We already saw a dataset like this for the emotion of surprise in @sec-generating-dictionaries: the [Crowdflower Emotion in Text dataset](https://data.world/crowdflower/sentiment-analysis-in-text). Many other good examples can be found in @sec-corpora. The labels could be generated by participants who read the texts, or by the authors of the texts themselves (e.g. in the form of questionnaires that measure their personality).
Once you have your training dataset, compute word counts for its texts (@sec-tokenization). These could be dictionary-based counts from a variety of different dictionaries, or they could be individual token counts in the form of a DFM. You can even transform these counts in one or more of the ways described in this chapter. These counts will become the predictor variables for your machine learning model.
This book is not the place for an in-depth tutorial on machine learning^[For a good overview of different methods, see @giuntini_etal_2020.], but we will give a brief example. Recall that in @sec-generating-dictionaries we used the [Crowdflower Emotion in Text dataset](https://data.world/crowdflower/sentiment-analysis-in-text) to find the words most indicative of surprise in tweets (based on PMI). There we used these words as a dictionary, but we could get more robust results by using them as predictor variables to train a regularized machine learning model---in this case ridge regression, although [there are many other options](https://topepo.github.io/caret/available-models.html).
```{r}
#| warning: false
# New Hippocorpus DFM
hippocorpus_dfm_ngrams <- hippocorpus_corp |>
tokens(remove_punct = TRUE) |>
tokens_ngrams(n = c(1L, 2L)) |> # 1-grams and 2-grams
dfm() |>
dfm_weight(scheme = "prop") # relative tokenization
# Select only high PMI tokens from Crowdflower
hippocorpus_dfm_surprise <- hippocorpus_dfm_ngrams |>
dfm_select(tweet_surprise_dict)
# Crowdflower DFM with only tokens that appear in Hippocorpus DFM
crowdflower_dfm <- crowdflower_dfm |>
dfm_weight(scheme = "prop") |> # relative tokenization
dfm_select(featnames(hippocorpus_dfm_surprise))
# Rejoin to Crowdflower DFM labeled data
crowdflower_train <- crowdflower |>
mutate(
doc_id = as.character(tweet_id),
label_surprise = as.integer(sentiment == "surprise")
) |>
select(doc_id, label_surprise) |>
left_join(
convert(crowdflower_dfm, "data.frame"),
by = "doc_id"
) |>
select(-doc_id)
# Balanced training dataset
# set.seed(2)
# crowdflower_train <- crowdflower_train |>
# group_by(label_surprise) |>
# slice_sample(n = sum(crowdflower_train$label_surprise==1))
# Ridge Regression (10-fold cross-validation)
library(caret)
tg <- expand.grid(
alpha = 0,
lambda = c(2 ^ seq(-1, -6, length = 20))
)
set.seed(2)
surprise_ridge <- train(
label_surprise ~ .,
data = crowdflower_train,
method = "glmnet",
tuneGrid = tg,
trControl = trainControl("cv", number = 10)
)
# Prepare Hippocorpus data to run model
hippocorpus_features <- hippocorpus_dfm_surprise |>
convert("data.frame") |>
left_join(
convert(hippocorpus_corp, "data.frame"),
by = "doc_id"
)
# Run model on Hippocorpus for surprise estimation
surprise_pred <- predict(surprise_ridge, newdata = hippocorpus_features)
hippocorpus_features <- hippocorpus_features |>
mutate(surprise_pred = surprise_pred)
```
Now that we have a new machine-learning-powered estimate of surprise for the Hippocorpus data, we can retest our hypothesis that true autobiographical stories include more surprise than imagined stories.
```{r}
surprise_mod_ml <- lm(surprise_pred ~ memType,
data = hippocorpus_features)
summary(surprise_mod_ml)
```
We find that imagined stories have significantly less surprise-related language than autobiographical stories (p = `r round(summary(surprise_mod_ml)$coefficients["memTyperecalled","Pr(>|t|)"],3)`)!
Despite the exciting result, we should be careful with this newfangled approach. As with dictionary-based methods, beware of problems with generalization---there is no guarantee that surprise in tweets will look similar to surprise in autobiographical accounts. Likewise, keep in mind all of the regular challenges of machine learning. Notice for example that the intercept of this model is extremely low (`r round(summary(surprise_mod_ml)$coefficients["(Intercept)","Estimate"],3)`; surprise is measured between 0 and 1). This reflects the fact that the Crowdflower dataset is not balanced; there are very few tweets labeled with surprise relative to the size of the dataset.
**An example of machine learning approaches in research:** @zamani_etal_2018 extracted n-grams from Facebook status updates. They then computed TF-IDF scores, binary tokenization, and LDA topics (@sec-lda), subjected all of these values to a dimensionality reduction process to reduce overfitting, and used the resulting features to train a ridge regression model to predict questionnaire-based measures of trustfulness. This regression model could then be used on novel texts to estimate the trustfulness of their authors.
::: {.callout-tip icon="false"}
## Advantages of Machine Learning Approaches
- **Accuracy**
- **Regularization:** Regularized machine learning algorithms mitigate the influence of outliers and reduce overfitting. This can sometimes help models generalize across datasets too.
- **Avoid Statistical Troubles:** The output of machine learning models is often continuous and more or less normally distributed. This means standard linear regression is usually sufficient for hypothesis testing.
:::
::: {.callout-important icon="false"}
## Disadvantages of Machine Learning Approaches
- **Require a Relevant Training Dataset**
- **Difficult to Interpret**
- **May Fail to Generalize Across Datasets**
:::
---