---
include-in-header:
  - text: |
      <style>
      .r-output code {
        word-break: break-word !important;
        white-space: pre-wrap !important;
      }
      </style>
---
# Navigating Vector Space {#sec-navigating-vectorspace}
```{r setup}
#| echo: false
#| include: false
source("_common.R")
library(quanteda)
library(word2vec)
source("vector_scripts.R")
# prep word2vec
word2vec_mod <- "data/GoogleNews-vectors-negative300.bin"
word2vec_mod <- read.word2vec(file = word2vec_mod, normalize = TRUE)
# prep Hippocorpus
hippocorpus_df <- read_csv("data/hippocorpus-u20220112/hcV3-stories.csv") |>
select(AssignmentId, story, memType, summary, WorkerId,
annotatorGender, openness, timeSinceEvent)
hippocorpus_corp <- hippocorpus_df |>
corpus(docid_field = "AssignmentId",
text_field = "story")
hippocorpus_dfm <- hippocorpus_corp |>
tokens(remove_punct = TRUE) |>
dfm() |>
dfm_remove("~")
# SBERT embeddings
hippocorpus_sbert <- readRDS("data/hippocorpus_sbert.rds")
hippocorpus_sbert <- hippocorpus_df |>
rename(ID = AssignmentId) |>
left_join(hippocorpus_sbert)
surprise_ccr <- readRDS("data/surprise_sbert.rds")
surprise_neg_ccr <- readRDS("data/surprise_neg_sbert.rds")
# for ANSI rendering
ansi_aware_handler <- function(x, options) {
paste0(
"<pre class=\"r-output\"><code>",
fansi::sgr_to_html(x = x, warn = FALSE, term.cap = "256"),
"</code></pre>"
)
}
knitr::knit_hooks$set(
output = ansi_aware_handler,
message = ansi_aware_handler,
warning = ansi_aware_handler,
error = ansi_aware_handler
)
```
## Representing Psychological Constructs
In @sec-decontextualized-embeddings we measured the surprise in texts by comparing their embeddings to that of a single word: "surprised". But does the embedding of the word "surprised" fully capture the concept of surprise as an emotion? Faced with this question of construct validity, we have two options:
1. **Conduct a Validation Study:** We could find or construct a dataset of texts that were rated by a human (or ideally, multiple humans) on the extent to which they reflect the emotion of surprise. We could then compare our embedding-based surprise scores to the human ratings.
2. **Use an Already-Validated Construct Definition:** Properly validating a new measure is hard work. When possible, psychology researchers often prefer to use an existing measure that has already been carefully validated in the past.
The second option may seem difficult: embeddings are very new to the field, so few if any validated vector representations of constructs are available. As it turns out, this is not a problem---any language-based psychological measure can be represented as a vector! Psychology has used language-based measures like dictionaries and questionnaires for over a century. To smoothly continue this research in the age of vector spaces, let's consider how to translate between the two.
### Distributed Dictionary Representation (DDR) {#sec-ddr}
Let's begin with a straightforward sort of psychological measure---the dictionary. We have already discussed dictionaries extensively in @sec-word-counting and noted that psychology researchers have been constructing, validating, and publicizing dictionaries for decades (@sec-dictionary-sources). But these dictionaries are designed for word counting---how do we apply them to a vector-based analysis? @garten_etal_2018 propose a simple solution: get word embeddings (@sec-word-embeddings) for each word in the dictionary and average them together to create a single Distributed Dictionary Representation (DDR). The dictionary construct can then be measured by comparing text embeddings to the DDR.
DDR cannot entirely replace word counts; for linguistic concepts like pronoun use or the passive voice, dictionary-based word counts are still necessary. But DDR is ideal for studies of abstract constructs like emotions, which refer to the general gist of a text rather than particular words. The rich representation of word embeddings allows DDR to capture even the subtlest associations between words and constructs, and to precisely reflect the _extent_ to which each word is associated with each construct. It can do this even for texts that do not contain any dictionary words. Because embeddings are continuous and already calibrated to the probabilities of word use in language, DDR also avoids the difficult statistical problems that arise from the strange distributions of word counts (@sec-word-counting-improvements).
@garten_etal_2018 found that DDR works best with smaller dictionaries containing only the words most directly connected to the construct being measured (around 30 words worked best in their experiments). Word embeddings give extra weight to informative words (@sec-embedding-magnitude)[^ddr-1]---a desirable property for raw texts, in which uninformative words tend to be very frequent. But dictionaries include only one of each word. In longer dictionaries with more infrequent, tangentially connected words, averaging word embeddings will therefore _overvalue_ those infrequent words and skew the DDR. This can be fixed with Garten et al.'s method of picking out only the most informative words. Alternatively, it can be fixed by measuring the frequency of each dictionary word in a corpus and weighting the average embedding by that frequency. This second method is actually more consistent with the way most dictionaries are validated: by counting the frequencies of dictionary words in text (@sec-word-counting).
[^ddr-1]: For more information on this property, see our [footnote](decontextualized-embeddings.html#fn7) in @sec-word2vec. Note that this property emerges naturally from the way decontextualized models like word2vec and GloVe are trained, and therefore may not hold true for contextualized embeddings.
Let's measure surprise in the Hippocorpus texts by computing a DDR of the NRC Word-Emotion Association Lexicon [@mohammad_turney_2010; @mohammad_turney_2013], which we used in @sec-word-counting. To correct for word informativeness, we will weight the dictionary word embeddings by their frequency in the corpus.
```{r}
#| output: false
# load surprise dictionary
surprise_dict <- quanteda.sentiment::data_dictionary_NRC["surprise"]
# estimate frequency of dictionary words
surprise_dict_freqs <- hippocorpus_dfm |>
dfm_keep(surprise_dict$surprise) |>
quanteda.textstats::textstat_frequency() |>
select(feature, frequency)
# word2vec embeddings of dictionary words
surprise_ddr <- predict(word2vec_mod, surprise_dict$surprise, type = "embedding") |>
as_tibble(rownames = "feature") |>
left_join(surprise_dict_freqs) |>
replace_na(list(frequency = 0))
# average dictionary embedding (weighted by frequency)
surprise_ddr <- surprise_ddr |>
summarise(across(V1:V300, ~weighted.mean(.x, frequency, na.rm = TRUE))) |>
select(V1:V300) |>
unlist()
# document embeddings
hippocorpus_word2vec <- hippocorpus_dfm |>
textstat_embedding(word2vec_mod)
# score documents by surprise
hippocorpus_surprise_ddr <- hippocorpus_word2vec |>
rowwise() |>
mutate(
surprise = cos_sim(c_across(V1:V300), surprise_ddr),
# transform cosine similarity to stay between 0 and 1
surprise = surprise/2 + 1/2
) |>
ungroup() |>
select(-c(V1:V300))
# rejoin docvars
hippocorpus_surprise_ddr <- hippocorpus_surprise_ddr |>
bind_cols(docvars(hippocorpus_corp))
```
With the new measure of surprise, we can retest the hypothesis that true autobiographical stories include more surprise than imagined stories.
```{r}
#| warning: false
# beta regression
surprise_mod_ddr <- betareg::betareg(
surprise ~ memType,
data = hippocorpus_surprise_ddr
)
summary(surprise_mod_ddr)
```
We again find a significant difference in surprise between imagined and recalled stories, though in the opposite of the expected direction. This differs from our results in @sec-word-counting, where we tested the same hypothesis with the same dictionary but used word counts rather than embeddings.
#### DDR for Word-by-Word Analysis
Another advantage of DDR over dictionary-based word counts is that DDR enables word-by-word analysis of text. Counting surprise words one word at a time is not very informative (each word either is or is not in the dictionary), but we can compare the embedding of each word to the surprise DDR---how close are they in the vector space? This allows us to see how a construct spreads out within a single text. As an example, let's take a single story from the Hippocorpus:
```{r}
# full text as string
story <- word(hippocorpus_df$story[3], end = 140L)
cat(story)
```
To visualize surprise within this text, we can separate it into words and find the embedding of each word. Rather than averaging all of these embeddings together to get the embedding of the full text, we can compute a rolling average, averaging each word's embedding with those of its neighbors.
```{r}
# separate into vector of tokens
story <- word(hippocorpus_df$story[3], end = 140L) |>
tokens() |> as.character()
# rolling average of embeddings
story_surprise <- as_tibble(predict(word2vec_mod, story, type = "embedding")) |>
mutate(
across(
V1:V300,
~zoo::rollapply(
.x, 4, mean, na.rm = TRUE,
align = "center",
fill = c(head(.x, 1), NA, tail(.x, 1))
)
)
)
# vector of computed surprise (cosine similarity)
story_surprise <- story_surprise |>
rowwise() |>
mutate(surprise = cos_sim(c_across(V1:V300), surprise_ddr)) |>
pull(surprise)
```
We can now visualize the surprise in each word of the text. Since `ggplot2` makes it difficult to plot dynamically colored text in one continuous chunk, we will use ANSI color codes to print the text directly to the console.
```{r}
#| code-overflow: wrap
# (see https://www.hackitu.de/termcolor256/ for info on ANSI colors)
# blue-red heat scale
ansi_scale <- c(
063, 105, 147, 189, 188, 230, 223,
224, 217, 210, 203, 196, 160, 124
)
# turn scale value into ANSI color code
map_to_ansi <- function(x, ansi_scale){
  # rescale x to index positions 1..length(ansi_scale)
  x_new <- (x - min(x, na.rm = TRUE)) *
    ((length(ansi_scale) - 1)/diff(range(x, na.rm = TRUE))) + 1
  ansi_scale[round(x_new)]
}
story_surprise <- map_to_ansi(story_surprise, ansi_scale)
# print
for (i in 1:length(story_surprise)) {
if(is.na(story_surprise[i])){
cat(story[i], " ")
}else{
cat(paste0("\033[48;5;", story_surprise[i], "m", story[i], " \033[0m"))
}
}
```
::: {.callout-tip icon="false"}
## Advantages of DDR
- **Richer, More Robust Construct Representation Than Word Counting**
- **Avoids Statistical Problems With Word Count Distributions**
- **Enables Word-by-Word Analysis**
- **Works Well With Short Dictionaries:** DDR only needs a dictionary that captures the essence of the construct being measured. For many constructs, this could be only a few words. You can even ask an LLM chatbot to generate a list of words that people high or low in a certain construct might use.
:::
::: {.callout-important icon="false"}
## Disadvantages of DDR
- **Can Implicitly Encode Associated Constructs:** For example, if surprised texts tend to have positive valence in the data used to train the word embedding model, the DDR for surprise may embed some positive valence as well. This can be remedied by constructing a DDR for positive valence too and using it as a statistical control when testing hypotheses (see the sketch after this callout).
- **May Not Work With Contextualized Embeddings:** Even if we assume that contextualized embeddings (@sec-contextualized-embeddings) conform to the geometrical properties associated with word embeddings, LLMs are not designed to embed single words, which is required for DDR.
- **Not Appropriate for Linguistic Measures:** Word embeddings encode the general gist of a text, whereas constructs like passive voice or pronoun use refer to specific words.
:::
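For instance, here is a minimal sketch of the statistical-control approach mentioned in the first disadvantage above. It assumes a hypothetical `valence` column that has been scored the same way as `surprise`, but with a DDR built from a positive-valence dictionary:

```{r}
#| eval: false
# sketch only: `valence` is a hypothetical column, scored like `surprise`
# above but using a DDR built from a positive-valence dictionary
surprise_mod_controlled <- betareg::betareg(
  surprise ~ memType + valence,
  data = hippocorpus_surprise_ddr
)
summary(surprise_mod_controlled)
```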
### Contextualized Construct Representation (CCR) {#sec-ccr}
Dictionaries are not the only validated psychological measures that we can apply using embeddings. With contextualized embeddings, we can extract the gist of any text and compare it to that of any other text (@sec-contextualized-embeddings). @atari_etal_2023 propose to do this with the most popular form of psychometric scale: the questionnaire. Psychologists have been using questionnaires to measure psychological constructs for over a century, and tens of thousands of validated questionnaires are now available [online](https://www.apa.org/pubs/databases/psyctests). The LLM embedding of a questionnaire is referred to as a Contextualized Construct Representation (CCR).
We can use CCR to measure surprise in the Hippocorpus texts. For our questionnaire, we will use an adapted version of the surprise scale used by @choi_choi_2010 and @choi_nisbett_2000.
```{r}
surprise_items <- c(
"I was extremely surprised by the outcome of the event.",
"The outcome of the event was extremely interesting.",
"The outcome of the event was extremely new."
)
```
::: {.callout-important}
## Beware of Reverse Coding!
Many questionnaires include reverse-coded items (e.g. "I often feel happy" on a depression questionnaire). The easiest way to deal with these is to manually add negations to flip their meaning (e.g. "I _do not_ often feel happy").
:::
The first step in using CCR is to compute contextualized embeddings for the texts in the Hippocorpus dataset. We already did this in @sec-contextualized-embeddings. The next step is to compute contextualized embeddings for the items in the questionnaire, and average them to produce a CCR.
```{r}
#| eval: false
# embed items (using the same model as we used before)
library(text)
surprise_sbert <- textEmbed(
surprise_items,
model = "sentence-transformers/all-MiniLM-L12-v2", # model name
layers = -2, # second to last layer (default)
tokens_select = "[CLS]", # use only [CLS] token
dim_name = FALSE,
keep_token_embeddings = FALSE
)
# compute CCR by averaging item embeddings
surprise_ccr <- surprise_sbert$texts[[1]] |>
summarise(across(everything(), mean)) |>
unlist()
```
We can now measure surprise in the Hippocorpus texts by computing the cosine similarity between their embeddings and the surprise CCR.^[Cosine similarity is appropriate here because our contextualized embeddings were generated by an SBERT model, which was designed to be used with cosine similarity. If we had used another model such as RoBERTa, Euclidean distance might be more appropriate.]
```{r}
#| warning: false
# score documents by surprise
hippocorpus_surprise_ccr <- hippocorpus_sbert |>
rowwise() |>
mutate(
surprise = cos_sim(c_across(Dim1:Dim384), surprise_ccr),
# transform cosine similarity to stay between 0 and 1
surprise = surprise/2 + 1/2
) |>
ungroup() |>
select(-c(Dim1:Dim384))
# beta regression
surprise_mod_ccr <- betareg::betareg(
surprise ~ memType,
hippocorpus_surprise_ccr
)
summary(surprise_mod_ccr)
```
Once again, we find a significant difference in surprise between imagined and recalled stories, in the opposite of the expected direction. However, CCR has a fundamental problem that needs to be addressed.
Embeddings capture the overall "vibes" of a text, including its tone and dialect. With CCR, we are comparing the "vibes" of a questionnaire written by academics to the "vibes" of narratives written by Hippocorpus participants. By comparing these vectors, we are not just measuring how much surprise is in each text---we are also measuring the extent to which each text is in the style of a questionnaire written by academics. This introduces a confounding variable into our analysis---questionnaire-ness.
The questionnaire-ness problem means that CCR is most effective for analyzing texts that bear a strong similarity to the questionnaire itself. For example, if you are analyzing participant descriptions of their own values, and your questionnaire items are first-person statements about values (as many questionnaires are), CCR is likely to work well, especially with the improvements described in @sec-dimension-projection and @sec-ccr-improvement. With this method, you can compare participant responses to the questionnaire without actually administering the questionnaire itself; participants can answer in their own words, which CCR will compare to the wording of the questionnaire.
::: {.callout-tip icon="false"}
## Advantages of CCR
- **Can Apply Existing Questionnaires**
- **Effectively Uses Contextualized Embeddings**
- **Allows Free Response Items:** Compares free-written participant responses with questionnaire wording.
:::
::: {.callout-important icon="false"}
## Disadvantages of CCR
- **Limited Applicability:** Less effective on texts whose content and wording do not resemble the questionnaire's
- **Risks Measuring Questionnaire-ness:** This risk can be mitigated by using an anchored vector (@sec-ccr-improvement)
:::
## Reasoning in Vector Space: Beyond Cosine Similarity and Dot Products
### Additive Analogies {#sec-parallelograms}
Nearly every introduction to word embeddings opens with their analogical property. This is for good reason: it is extremely cool. Embeddings can be added to each other in order to arrive at new concepts. Here's an example, using word2vec embeddings reduced to two dimensions with PCA:
```{r}
#| echo: false
analogy_words <- c("uncle", "aunt", "man", "woman")
analogy_pca <- predict(word2vec_mod, analogy_words, type = "embedding") |>
as_tibble(rownames = "word") |>
reduce_dimensionality(V1:V300, reduce_to = 2)
woman_vec <- analogy_pca[analogy_words == "woman",2:3]
man_vec <- analogy_pca[analogy_words == "man",2:3]
aunt_vec <- analogy_pca[analogy_words == "aunt",2:3]
gender <- woman_vec - man_vec
parentsibling <- woman_vec - aunt_vec
analogy_pca |>
ggplot(aes(PC1, PC2, label = word)) +
geom_segment(
aes(xend = PC1 - gender$PC1,
yend = PC2 - gender$PC2),
arrow = arrow(type = "closed"),
linewidth = 2,
color = "navyblue",
data = analogy_pca |> filter(word %in% c("woman", "aunt"))
) +
geom_segment(
aes(xend = PC1 - parentsibling$PC1,
yend = PC2 - parentsibling$PC2),
arrow = arrow(type = "closed"),
linewidth = 2,
color = "red4",
data = analogy_pca |> filter(word %in% c("woman", "man"))
) +
geom_label() +
coord_cartesian(xlim = c(-10, 10), ylim = c(-6, 6)) +
theme_bw()
```
If we subtract the embedding of "woman" from the embedding of "man", we get the vector shown in blue. This vector represents the move from female to male gender. A vector between two embeddings is called an **anchored vector**. So when we add this man-minus-woman anchored vector to the embedding of "aunt", we get very close to the embedding of "uncle". This property was first noted in word2vec [@mikolov_etal_2013], and GloVe [@pennington_etal_2014] was specifically designed with it in mind.
::: {.callout-tip}
## Additive Analogies in Contextualized Embeddings
Notice that the analogical property relies on the magnitude of the vectors---if some vectors were shorter or longer than necessary, the parallelogram would not fit. This means that analogical reasoning may not be applicable to LLM embeddings, which are often organized in nonlinear patterns [@cai_etal_2021; @ethayarajh_2019; @gao_etal_2019]. Even specialized models like SBERT are generally not designed with the additive analogical property in mind [@reimers_gurevych_2019]. Although some geometrically motivated methods work fairly well in LLM embeddings, as we will see in @sec-ccr-improvement, there is plenty of room for improvement in this area.[^analogies-1]
:::
[^analogies-1]: There are some promising methods for getting more geometrically regular embeddings out of LLMs. For example, averaging the last two layers of the model seems to help [@li_etal_2020]. Taking a different approach, @ethayarajh_2019 created static word embeddings from an LLM by running it on a large corpus and taking the set of each word’s contextualized representations from all the places it appears in the corpus. The loadings of the first principal component of this set represent the dimensions along which the meaning of the word changes across different contexts. These loadings can themselves be used as a vector embedding which can out-perform GloVe and FastText embeddings on many word vector benchmarks, including analogy solving. This approach worked best for embeddings from the early layers of the LLM.
The simplest application of the analogical property is to complete analogies like "telescope is to astronomy as ________ is to psychology." You can find word2vec's answer to this puzzle by subtracting the embedding of "telescope" from the embedding of "astronomy", adding the result to the embedding of "psychology", and finding the embedding with the lowest Euclidean distance to that vector.
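Here is a minimal sketch of that procedure, assuming the `word2vec_mod` object loaded above. Note that `as.matrix()` pulls the full vocabulary matrix into memory, which is very large for the GoogleNews vectors.

```{r}
#| eval: false
# sketch of analogy completion: astronomy - telescope + psychology
analogy_emb <- predict(
  word2vec_mod, c("telescope", "astronomy", "psychology"),
  type = "embedding"
)
target <- analogy_emb["astronomy", ] - analogy_emb["telescope", ] +
  analogy_emb["psychology", ]
# Euclidean distance from the target vector to every word in the vocabulary
# (memory-heavy: the GoogleNews vocabulary is ~3 million words x 300 dimensions)
vocab <- as.matrix(word2vec_mod)
dists <- sqrt(rowSums(sweep(vocab, 2, target)^2))
# nearest words (the input words themselves tend to rank near the top)
head(sort(dists), 10)
```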
### Anchored Vectors For Better Construct Representations {#sec-dimension-projection}
There is a fundamental problem with all embeddings that additive analogical reasoning can help us solve. Consider the embeddings for "happy" and "sad". These may seem like opposites, but actually they are likely to be very close to each other in vector space because they both relate to emotional valence. This means that if we try to measure the happiness of words by comparing their embeddings to the embedding for "happy", we will actually be measuring the extent to which the words relate to emotion in general. The word "depression" might seem happier than the word "table", since depression is more emotion-related. This problem can be solved by using **anchored vectors**. Just like we created an anchored vector between "man" and "woman" to represent masculinity (as opposed to femininity), we can create an anchored vector between "happy" and "sad" to represent happiness (as opposed to sadness). As we saw in @sec-parallelograms, anchored vectors can be applied wherever necessary in embedding space.
```{r}
happy_vec <- predict(word2vec_mod, "happy", type = "embedding")
sad_vec <- predict(word2vec_mod, "sad", type = "embedding")
happiness_anchor <- happy_vec - sad_vec
```
To measure constructs with an anchored vector, take the dot product of your text embeddings with the anchored vector. This is equivalent to "projecting" the embeddings down onto the scale between one end of the anchored vector and the other.[^anchored-vecs-1]
[^anchored-vecs-1]: For an intuitive explanation of why the dot product is equivalent to a projection, see [3blue1brown's video on the subject](https://youtu.be/LyGKycYT2v0?si=86cfrN9DP9xw5HUx). Incidentally, the dot product with the anchored vector is also equivalent to the dot product with the positive embedding (e.g. "happy") minus the dot product with the negative embedding (e.g. "sad").
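
As a quick illustration (a sketch using the `word2vec_mod`, `happy_vec`, `sad_vec`, and `happiness_anchor` objects defined above), the dot product with the anchored vector and the difference of the two separate dot products give identical scores:

```{r}
#| eval: false
# score a few words on the happy-sad anchor via the dot product
example_emb <- predict(
  word2vec_mod, c("ecstatic", "depressed", "table"), type = "embedding"
)
drop(example_emb %*% t(happiness_anchor))
# equivalent: dot product with "happy" minus dot product with "sad"
drop(example_emb %*% t(happy_vec)) - drop(example_emb %*% t(sad_vec))
```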
```{r}
#| echo: false
#| warning: false
example_words <- c(
"ecstatic", "angry", "depressed", "frustrated", "pleased", "satisfied",
"bored", "glad", "excited", "happy", "sad", "devastated"
)
happiness_pca <- predict(word2vec_mod, example_words, type = "embedding") |>
as_tibble(rownames = "word") |>
reduce_dimensionality(V1:V300, reduce_to = 2)
happy_vec_pca <- happiness_pca |>
filter(word == "happy") |>
select(-word) |>
unlist()
sad_vec_pca <- happiness_pca |>
filter(word == "sad") |>
select(-word) |>
unlist()
happiness_pca |>
bind_cols(
project_points_onto_line(
happy_vec_pca,
sad_vec_pca,
happiness_pca |> select(-word)
)
) |>
ggplot(
aes(
PC1, PC2,
xend = X1, yend = X2,
label = word,
fill = word %in% c("happy", "sad")
)
) +
geom_segment(
aes(xend = lead(PC1), yend = lead(PC2)),
data = happiness_pca |> filter(word %in% c("happy", "sad"))
) +
geom_segment(color = "skyblue2", linetype = 2) +
geom_point(aes(X1, X2), color = "skyblue3") +
geom_label() +
scale_fill_manual(values = c("white", "skyblue2")) +
coord_equal(xlim = c(-14, 14), ylim = c(-12, 12)) +
guides(fill = "none") +
theme_bw()
```
By projecting each embedding down onto the anchored vector between happy and sad, we create a scale from happy to sad.[^anchored-vecs-2] This is sometimes referred to as **semantic projection** [@grand_etal_2022].
[^anchored-vecs-2]: Taking the dot product with an anchored vector yields an unstandardized version of this scale. If you want "sad" to be 0 and "happy" to be 1 on the scale, use the `anchored_sim()` function included in [our Github repo](https://github.com/rimonim/ds4psych/blob/main/vector_scripts.R).
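
The arithmetic behind such a standardized scale is straightforward. Below is a minimal sketch with a hypothetical helper function (the packaged `anchored_sim()` may differ in its details):

```{r}
#| eval: false
# project x onto the line from `neg` to `pos`: 0 at `neg`, 1 at `pos`
# (hypothetical helper; see anchored_sim() in vector_scripts.R for the packaged version)
anchored_scale <- function(x, pos, neg) {
  anchor <- pos - neg
  sum((x - neg) * anchor) / sum(anchor^2)
}
anchored_scale(
  predict(word2vec_mod, "ecstatic", type = "embedding")[1, ],
  happy_vec[1, ],
  sad_vec[1, ]
)
```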
#### Improving DDR With Anchored Vectors {#sec-ddr-improvement}
In @sec-polarity, we used two dictionaries to measure surprise _as opposed to anticipation_ with word counts. By creating an anchored vector between surprise and anticipation, we can now replicate that analysis using DDR. The first step is to create a DDR for each dictionary. Since we already have one for surprise from @sec-ddr, we just need to replicate the process for anticipation.
```{r}
#| output: false
# get dictionary
anticipation_dict <- quanteda.sentiment::data_dictionary_NRC$anticipation
# estimate frequency of dictionary words
anticipation_dict_freqs <- hippocorpus_dfm |>
dfm_keep(anticipation_dict) |>
quanteda.textstats::textstat_frequency() |>
select(feature, frequency)
# word2vec embeddings of dictionary words
anticipation_ddr <- predict(word2vec_mod, anticipation_dict, type = "embedding") |>
as_tibble(rownames = "feature") |>
left_join(anticipation_dict_freqs) |>
replace_na(list(frequency = 0))
# average dictionary embedding (weighted by frequency)
anticipation_ddr <- anticipation_ddr |>
summarise(across(V1:V300, ~weighted.mean(.x, frequency, na.rm = TRUE))) |>
select(V1:V300) |>
unlist()
```
Now that we have DDRs for both surprise and anticipation, we can create an anchored vector between them:
```{r}
surprise_ddr_anchored <- surprise_ddr - anticipation_ddr
```
We can now score the Hippocorpus texts by the dot product between their word2vec embeddings and the anchored vector, effectively projecting each one onto a scale between anticipation and surprise.
```{r}
# score documents by surprise
hippocorpus_surprise_ddr_anchored <- hippocorpus_word2vec |>
rowwise() |>
mutate(surprise = dot_prod(c_across(V1:V300), surprise_ddr_anchored)) |>
ungroup() |>
select(-c(V1:V300))
# rejoin docvars
hippocorpus_surprise_ddr_anchored <- hippocorpus_surprise_ddr_anchored |>
bind_cols(docvars(hippocorpus_corp))
```
Since the scale is theoretically unbounded (a text could show even more surprise than the average dictionary embedding for surprise), we can analyze it with a standard linear regression.
```{r}
surprise_mod_ddr_anchored <- lm(
surprise ~ memType,
data = hippocorpus_surprise_ddr_anchored
)
summary(surprise_mod_ddr_anchored)
```
We found no significant difference between imagined and recalled stories, but we did find a significant difference between imagined and retold stories, such that retold stories showed slightly less surprise (relative to anticipation).
#### Improving CCR With Anchored Vectors {#sec-ccr-improvement}
Remember the questionnaire-ness problem with CCR from @sec-ccr? Anchored vectors can help us solve this problem. This time, let's just negate each item from the surprise questionnaire, like this:
```{r}
surprise_items_pos <- c(
"I was extremely surprised by the outcome of the event.",
"The outcome of the event was extremely interesting.",
"The outcome of the event was extremely new."
)
surprise_items_neg <- c(
"I was not surprised at all by the outcome of the event.",
"The outcome of the event was not interesting at all.",
"The outcome of the event was not new at all."
)
```
This approach has the advantage of maintaining most of the original wording. By creating an anchored vector between the positive and negative CCRs, we can disregard this questionnaire-y wording, focusing only on the direction between lots of surprise and no surprise at all. Even though this approach makes big assumptions about the linearity of the contextualized embedding space (@sec-parallelograms), it has been shown to work fairly well for a variety of constructs and models [@grand_etal_2022]. It is particularly applicable to the Hippocorpus data, since the texts are first-person narratives about an event, just like the questionnaire items.
Let's create the new anchored CCR and use it to reanalyze the Hippocorpus data.
```{r}
#| eval: false
# embed items (using the same model as we used before)
library(text)
surprise_neg_sbert <- textEmbed(
surprise_items_neg,
model = "sentence-transformers/all-MiniLM-L12-v2", # model name
layers = -2, # second to last layer (default)
tokens_select = "[CLS]", # use only [CLS] token
dim_name = FALSE,
keep_token_embeddings = FALSE
)
# compute negative CCR by averaging item embeddings
surprise_neg_ccr <- surprise_neg_sbert$texts[[1]] |>
summarise(across(everything(), mean)) |>
unlist()
```
```{r}
surprise_ccr_anchored <- surprise_ccr - surprise_neg_ccr
# score documents by surprise
hippocorpus_surprise_ccr_anchored <- hippocorpus_sbert |>
rowwise() |>
mutate(surprise = dot_prod(c_across(Dim1:Dim384), surprise_ccr_anchored)) |>
ungroup() |>
select(-c(Dim1:Dim384))
# linear regression
surprise_mod_ccr_anchored <- lm(
surprise ~ memType,
hippocorpus_surprise_ccr_anchored
)
summary(surprise_mod_ccr_anchored)
```
We found a significant difference between imagined and recalled stories such that recalled stories had more surprising content (p < .001)! We also found that retold stories had more surprising content than imagined stories (p < .001). These results support Sap et al.'s hypothesis that true autobiographical stories would include more surprising events than imagined stories.
**An example of using anchored vectors and CCR in research:** @simchon_etal_2023 collected 10,000 posts from the [r/depression](https://www.reddit.com/r/depression) subreddit, along with a control group of 100 posts each from 100 randomly selected subreddits. They then used a variant of SBERT, `all-MiniLM-L6-v2` (see @sec-contextualized-embeddings), to compute CCR embeddings of a psychological questionnaire measuring "locus of control," the feeling that you have control over your own life. The questionnaire included items measuring an internal locus of control ("I have control"), and items measuring an external locus of control ("External forces have control"). Simchon et al. constructed an anchored vector to capture the direction between internal and external locus of control, and projected embeddings of the Reddit posts onto that vector to measure how much each post reflected an internal vs. an external locus of control. They found that posts in r/depression exhibited a more external locus of control than posts in the control group.
### Correlational Anchored Vectors {#sec-correlational-anchors}
In @sec-generating-dictionaries, we used the [Crowdflower Emotion in Text dataset](https://data.world/crowdflower/sentiment-analysis-in-text) to generate a new dictionary for the emotion of surprise. We can use a similar approach to generate an anchored vector. Remember that the anchored vector for surprise is simply a direction in the embedding space. Rather than finding this direction by subtracting a negative construct embedding from a positive one (as we did in @sec-ddr-improvement and @sec-ccr-improvement), we can use machine learning to find the direction that best represents surprise in a training dataset.
To train an anchored vector on the Crowdflower dataset, we will first need to embed its 40,000 Twitter posts. We will do this just as we did for the Hippocorpus texts in @sec-word2vec.
```{r}
#| output: false
# data from https://data.world/crowdflower/sentiment-analysis-in-text
crowdflower <- read_csv("data/text_emotion.csv") |>
rename(text = content) |>
mutate(
doc_id = as.character(tweet_id),
surprise = if_else(sentiment == "surprise", "surprise", "no surprise"),
surprise = factor(surprise, levels = c("no surprise", "surprise"))
)
crowdflower_dfm <- crowdflower |>
corpus() |>
tokens(remove_punct = TRUE, remove_url = TRUE) |>
dfm()
# word2vec document embeddings
crowdflower_word2vec <- crowdflower_dfm |>
textstat_embedding(word2vec_mod)
crowdflower <- crowdflower |>
left_join(crowdflower_word2vec, by = "doc_id")
```
With Partial Least Squares (PLS) regression [@mevik_wehrens_2007; @wold_etal_2001], which finds directions in the feature space that best correlate with the dependent variable (in this case, surprise), we can create a **correlational anchored vector**.
```{r}
library(caret)
set.seed(2024)
pls_surprise <- train(
surprise ~ .,
data = select(crowdflower, surprise, V1:V300),
method = "pls",
scale = FALSE, # keep original embedding dimensions
trControl = trainControl("cv", number = 10), # cross-validation
tuneLength = 1 # only 1 component (our anchored vector)
)
surprise_anchored_pls <- pls_surprise$finalModel$projection[,1]
```
With the new correlational anchored vector, we can redo our analysis from @sec-ddr.
```{r}
# score documents by surprise
hippocorpus_surprise_anchored_pls <- hippocorpus_word2vec |>
rowwise() |>
mutate(surprise = dot_prod(c_across(V1:V300), surprise_anchored_pls)) |>
ungroup() |>
select(-c(V1:V300))
# rejoin docvars
hippocorpus_surprise_anchored_pls <- hippocorpus_surprise_anchored_pls |>
bind_cols(docvars(hippocorpus_corp))
surprise_mod_anchored_pls <- lm(
surprise ~ memType,
data = hippocorpus_surprise_anchored_pls
)
summary(surprise_mod_anchored_pls)
```
Once again we find significant results in support of @sap_etal_2022!
### Machine Learning Methods {#sec-machine-learning-methods}
After @sec-correlational-anchors, you may wonder why we stopped at a single direction in embedding space. Why not go all out with the machine learning? If you wondered this, great job! Psychologists are increasingly training machine learning algorithms on text embeddings to quantify relevant constructs [@kjell_etal_2022]. Indeed, this is the approach used to generate [the cover of this book](https://github.com/rimonim/ds4psych/blob/main/cover.R).
With machine learning approaches, the nonlinearity of contextualized embedding spaces becomes less of a problem. Given enough training data, we can specify a model that can capture nonlinear patterns, such as a [support vector machine](https://rpubs.com/uky994/593668). We could also simultaneously use embeddings from multiple layers of the LLM with `aggregation_from_layers_to_tokens = "concatenate"` in `textEmbed()`. Some research advises using both the `[CLS]` token and an average of the other token embeddings as input to the machine learning model [@lee_etal_2023]. There is no blanket rule about which machine learning algorithms work best with embeddings, but @kjell_etal_2022 recommend ridge regression for continuous outputs, and random forest for classification. If you are not comfortable fitting machine learning algorithms in R, you can use the convenience function, `textTrain()`, provided by the `text` package. In the example code below, we train a random forest model on the Crowdflower dataset, and use it to identify surprise in the Hippocorpus texts.
```{r}
#| eval: false
library(text)
# embed Hippocorpus texts
hippocorpus_subset_distilroberta <- textEmbed(
hippocorpus_df$story,
model = "distilroberta-base",
layers = c(-2, -1), # last two layers
# aggregate token embeddings in each layer, then concatenate layers
aggregation_from_tokens_to_texts = "mean",
aggregation_from_layers_to_tokens = "concatenate",
dim_name = FALSE,
keep_token_embeddings = FALSE
)
# load training set
set.seed(2024)
crowdflower_subset <- crowdflower |>
select(doc_id, text, surprise) |>
group_by(surprise) |>
slice_sample(n = 2000)
# embed training set
crowdflower_subset_distilroberta <- textEmbed(
crowdflower_subset$text,
model = "distilroberta-base",
layers = c(-2, -1), # last two layers
# aggregate token embeddings in each layer, then concatenate layers
aggregation_from_tokens_to_texts = "mean",
aggregation_from_layers_to_tokens = "concatenate",
dim_name = FALSE,
keep_token_embeddings = FALSE
)
# fit random forest model
surprise_randomforest <- textTrain(
x = crowdflower_subset_distilroberta$texts$texts,
y = crowdflower_subset$surprise
)
# predict on Hippocorpus texts
surprise_pred <- textPredict(
surprise_randomforest,
hippocorpus_subset_distilroberta$texts$texts
)
```
**An example of using embedding-based machine learning models in research:** @chersoni_etal_2021 used PLS regression to map word embeddings from various models (including word2vec, fastText, GloVe, and BERT) to human-rated semantic features derived from research in cognitive psychology. By comparing the performance of the different models, they could draw inferences about the types of information encoded in words. They found that cognition, causal reasoning, and social content were best predicted across models. General categories (e.g. vision, arousal) tended to be better predicted than specific characteristics (e.g. dark, light, happy, sad).