# Contextualization With Large Language Models {#sec-contextualized-embeddings}
```{r setup}
#| echo: false
#| include: false
source("_common.R")
hippocorpus_df <- read_csv("data/hippocorpus-u20220112/hcV3-stories.csv") |>
  select(AssignmentId, story, memType, summary, WorkerId,
         annotatorGender, openness, timeSinceEvent)
hippocorpus_bert <- readRDS("data/hippocorpus_bert.rds")
hippocorpus_bert <- hippocorpus_df |>
  rename(ID = AssignmentId) |>
  left_join(hippocorpus_bert)
hippocorpus_sbert <- readRDS("data/hippocorpus_sbert.rds")
hippocorpus_sbert <- hippocorpus_df |>
  rename(ID = AssignmentId) |>
  left_join(hippocorpus_sbert)
```
The models we discussed in @sec-word-embeddings represent the meaning of each token as a point in multidimensional space: a word embedding. Word embeddings generated by models like word2vec or GloVe are often referred to as **decontextualized embeddings**. This name is a bit confusing, since as we saw in @sec-word-embeddings, the whole point of those models is to associate tokens with the contexts in which they tend to appear. A better name might be _average context embeddings_, since the best they can hope to represent is the average of the contexts in which a token appears throughout the training corpus. For example, consider the following uses of the token "short".
> My dad is very _short_.

> My blender broke because of a _short_ circuit.

> That video was anything but _short_.

> I can't pay because I'm _short_ on cash at the moment.
Any speaker of English can easily see that the word "short" means something different in each one of these examples. But because word2vec and similar models are trained to predict the context based on only a single word at a time, their representation of the word _short_ will only capture that word's average meaning.
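To see why this is a problem, here is a toy illustration with made-up three-dimensional vectors (not output from any real model): if each sense of "short" belongs in a different region of the embedding space, a decontextualized model can only store something close to their average.

```{r}
# toy illustration with made-up 3-dimensional vectors (not real embeddings):
# one vector per sense of "short"
short_height  <- c(0.9, 0.1, 0.0)  # "low in height" sense
short_circuit <- c(0.0, 0.2, 0.9)  # "electrical fault" sense
short_money   <- c(0.1, 0.9, 0.1)  # "lacking money" sense

# a decontextualized model keeps a single vector for "short",
# which ends up as a blend of all three senses
colMeans(rbind(short_height, short_circuit, short_money))
```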
How can we move beyond the average meaning and capture the different meanings words take on in different contexts? You are probably already familiar with Large Language Models (LLMs) like ChatGPT and Claude. At their core, much of what these models do is exactly this: They find the intricate relationships between tokens in a text and use them to develop a new understanding of what these tokens mean in the particular context of that text. For this reason, embeddings produced by these models are often referred to as **contextualized embeddings**.
Even if you are familiar with ChatGPT, you may not have realized that it uses embeddings. What do embeddings have to do with generating text? The core of all modern LLMs is a model called the _transformer_. We will not cover exactly how transformers work---for an intuitive introduction, see [3blue1brown's video explanation](https://youtu.be/wjZofJX0v4M?si=Jx0fvQns3nqukk6e). For the purposes of this book, all you need to know is this: Transformers start by converting all the words in a text into word embeddings, just like word2vec or GloVe. At the start, these word embeddings represent the average meaning of each word. The transformer then estimates how each word in the text might be relevant for better understanding the meaning of the other words. For example, if "circuit" appears right after "short", the embedding of "short" should probably be tweaked. Once it has identified this connection, the transformer computes what "circuit" should add to a word that it is associated with, moving the "short" embedding closer to embeddings for electrical concepts. A full LLM has many _layers_. In each layer, the LLM identifies more connections between embeddings and shifts the embeddings in the vector space to add more nuance to their representations. When it gets to the final layer, the LLM uses the enriched embeddings of the words in the text for whatever task it was trained to do (e.g. predicting what the next word will be, or identifying whether the text is spam or not).
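As a rough intuition (with made-up numbers rather than anything a real transformer computes), each layer can be thought of as nudging a token's embedding toward the embeddings of the context words it attends to:

```{r}
# continuing the toy example: an invented "attention weight" determines how far
# the blended, decontextualized "short" vector is pulled toward "circuit"
short_avg <- c(0.3, 0.4, 0.3)   # roughly the blended "short" vector from above
circuit   <- c(0.0, 0.2, 0.9)   # made-up embedding of "circuit"

attention_weight <- 0.6
short_in_context <- (1 - attention_weight) * short_avg + attention_weight * circuit
short_in_context  # now much closer to the "electrical fault" region of the space
```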
Even though LLMs can be extremely complex and capture many nonlinear relationships between concepts, the transformer architecture forces them to organize their embeddings in a roughly linear space in which each direction has a consistent meaning---a critical property for analyzing embeddings with cosine similarity and other straightforward methods. In order to extract an LLM's rich, contextualized embeddings, all we need to do is run it on a text and stop it before it finishes predicting the next word (or whatever else it was trained to do). This way, we can read the LLM's mind, capturing all of the rich associations it has with the text.
## Hugging Face and the `text` Package
Leading commercial LLMs like GPT-4 are hidden behind APIs that keep their inner workings secret, so we cannot access the embeddings of these high-profile models. Nevertheless, plenty of models that are almost as good are open source and easily accessible through [Hugging Face Transformers](https://huggingface.co/docs/transformers/index). New open source models are added to Hugging Face every day, often by leading companies like Google and Meta. Any text-based transformer model can be accessed in R using the [`text`](https://r-text.org) package [@kjell_etal_2021]. The `text` package makes running models and extracting embeddings easy even for those of us who barely understand how the models work.
The `text` package runs Python code behind the scenes, so you will have to set up a Python environment for it to run properly. For instructions on how to do this, see [here](https://www.r-text.org/articles/huggingface_in_r_extended_installation_guide.html). Once you have the package installed and working, you can begin generating contextualized embeddings for texts with the `textEmbed()` function. Let's use the second to last layer of the [`all-MiniLM-L12-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2) model to embed the Hippocorpus texts as 384-dimensional vectors. Since the model includes preprocessing and tokenization built in, we can feed it the raw texts as a character vector. By default, `textEmbed()` creates the full text embedding by averaging the contextualized embeddings of each token in the text. This aggregation can take some time (especially for long texts like the Hippocorpus stories), so here we'll just use the embedding of the `[CLS]` (classification) token. The `[CLS]` token is a special token that models based on [BERT](https://jalammar.github.io/illustrated-bert/) add to each text. Because the `[CLS]` token does not have a "real" meaning, but rather is inserted at the same place in every text, its contextualized embedding represents the gist of each text as a whole. In training, BERT models use the contextualized embedding of the `[CLS]` token to predict whether a given text does or does not come after the input text. This makes it a good general use embedding for when aggregating across all tokens is too time-consuming^[Nevertheless, using `[CLS]` embeddings is slightly inferior to averaging the contextualized embeddings across tokens [@reimers_gurevych_2019].]. Even when the output is limited to the `[CLS]` token, `textEmbed()` can take a few hours to run.
```{r}
#| eval: false
library(text)

# full texts (as character vector)
hippocorpus_texts <- hippocorpus_df$story

# embed the texts
hippocorpus_sbert <- textEmbed(
  hippocorpus_texts,
  model = "sentence-transformers/all-MiniLM-L12-v2", # model name
  layers = -2,              # second to last layer (default)
  tokens_select = "[CLS]",  # use only [CLS] token
  dim_name = FALSE,
  keep_token_embeddings = FALSE
)

# text embeddings as dataframe
hippocorpus_sbert <- hippocorpus_sbert$texts[[1]] |>
  mutate(ID = hippocorpus_df$AssignmentId)

# rejoin other variables
hippocorpus_sbert <- hippocorpus_df |>
  rename(ID = AssignmentId) |>
  left_join(hippocorpus_sbert)
```
### Managing Computational Load
Running large neural networks on your personal computer can be time-consuming at best. At worst, your computer runs out of memory and your R session crashes. If you run into problems like these, here are some ways to lessen the computational load when calling `textEmbed()`:
- Use a smaller model. Hugging Face model pages generally state how many parameters a model has, which is a good indication of how much computational capacity it needs. For example, consider using `distilbert-base-uncased` (67M params) or `albert-base-v2` (11.8M params) instead of `bert-base-uncased` (110M params).
- Make sure you are asking for only one layer at a time (e.g. `layers = -2`).
- If you do not need individual token embeddings, set `keep_token_embeddings = FALSE`.
- To avoid aggregation costs, request the embedding of a single token only (e.g. `tokens_select = "[CLS]"`). Not every model uses `[CLS]`, so make sure you specify a token that is used by the model you are running. If you're not sure which special tokens your model uses, try embedding a single text with `tokens_select = NULL` and `keep_token_embeddings = TRUE` and examining the results.
- Run the model on your GPU with `device = 'gpu'` (not available for Apple M1 and M2 chips).
- If running the full dataset at once is too much for you or your computer, you can break it up into smaller groups of texts, run each on its own, and join them back together afterward.
Before you run `textEmbed()` on your full dataset, always try running it on two or three texts first. This way, you can get a sense of how long it will take, and make sure that the output is to your liking.
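For example, a trial run might look something like this (a sketch that reuses the `textEmbed()` call from above on just the first three stories, wrapped in `system.time()` to see how long it takes):

```{r}
#| eval: false
library(text)

# trial run: embed only the first three stories and time it
# before committing to the full dataset
test_texts <- head(hippocorpus_df$story, 3)

system.time(
  test_embeddings <- textEmbed(
    test_texts,
    model = "sentence-transformers/all-MiniLM-L12-v2",
    layers = -2,
    tokens_select = "[CLS]",
    dim_name = FALSE,
    keep_token_embeddings = FALSE
  )
)

# inspect the structure of the output before scaling up
str(test_embeddings$texts, max.level = 2)
```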
### Choosing the Right Model
New LLMs are published on Hugging Face every day, and choosing one for your research can be daunting. Part of the beauty of LLMs is their ability to generalize---most popular models nowadays are trained on enormous datasets with a wide variety of content, and even the smaller models perform more than well enough to capture straightforward psychological concepts like emotional valence [@kjell_etal_2022]. Even if your model isn't perfect on every text, research generally relies on statistical patterns across many texts, so with a large enough sample, occasional errors should wash out. With that disclaimer in mind, we can give a few recommendations:
- [BERT](https://huggingface.co/google-bert/bert-base-uncased) [@devlin_etal_2019] and [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) [@sanh_etal_2020] models are reliable and well-studied. The BERT architecture was designed to be applicable to various tasks (not just next-word prediction, like GPT models), so it is likely to generalize well to whatever you need it for. Also, they use the `[CLS]` token, which provides a good general-purpose embedding for cases in which aggregating all token embeddings is too computationally demanding.
- [RoBERTa](https://huggingface.co/FacebookAI/roberta-base) [@liu_etal_2019] is a refined version of BERT that is known to perform better at identifying personal characteristics of the author of a text. @ganesan_etal_2021 found that embeddings from the second to last layer of RoBERTa (averaged across tokens) outperformed BERT, [XLNet](https://huggingface.co/xlnet/xlnet-base-cased), and [GPT-2](https://huggingface.co/openai-community/gpt2) on predicting gender, age, income, openness, extraversion, and suicide risk from social media posts and student essays. @matero_etal_2022 likewise found that RoBERTa was better than other models at predicting depression, and specifically recommended layer 19 when using the `roberta-large` model (see the sketch after this list). [DistilRoBERTa](https://huggingface.co/distilbert/distilroberta-base) is a lightweight alternative with only slightly worse performance [@matero_etal_2022], and [XLM-RoBERTa](https://huggingface.co/docs/transformers/en/model_doc/xlm-roberta) is favored for analysis of non-English texts. RoBERTa models use the `<s>` (start) token instead of `[CLS]`. Nevertheless, since RoBERTa is generally not trained on next-sentence prediction, the behavior of the `<s>` token will depend on how the particular model you are using was trained.
- [MentalRoBERTa](https://huggingface.co/mental/mental-roberta-base) [@ji_etal_2021] is a version of RoBERTa that was fine-tuned on posts from Reddit communities dealing with mental health issues (e.g. r/depression, r/SuicideWatch, r/Anxiety).
- [SBERT](https://www.sbert.net/docs/pretrained_models.html): Word2vec, GloVe, and related models (@sec-word-embeddings) have architectures that guarantee vector spaces with consistent geometric properties, allowing researchers to confidently compute averages between vectors and interpret linear directions as encoding unique semantic meanings (@sec-embedding-magnitude). In contrast, the ways that LLMs organize meaning in their embedding spaces are not well understood and may not always lend themselves to simple measures like those described in this book [@cai_etal_2021; @li_etal_2020; @reif_etal_2019; @zhou_etal_2022]. Some researchers have tried to fix this problem by creating models that encourage embeddings to spread out evenly in the embedding space, or that explicitly optimize for reliable cosine similarity metrics. One popular line of such models is [SBERT](https://www.sbert.net/docs/pretrained_models.html). These models, including [`all-MiniLM-L12-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2), which we use here, have been demonstrated to work well for directional measures like cosine similarity.
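Switching to one of these alternatives only requires changing the `model` argument of `textEmbed()` (and the `layers` argument, if you follow a layer recommendation). For example, here is a sketch of embedding the stories with layer 19 of `roberta-large`, keeping `textEmbed()`'s default averaging across tokens; the model name is its Hugging Face identifier, so double-check it on the model page before running.

```{r}
#| eval: false
# sketch: RoBERTa-large embeddings from layer 19 (per Matero et al., 2022),
# aggregated across tokens with textEmbed()'s default averaging
hippocorpus_roberta <- textEmbed(
  hippocorpus_df$story,
  model = "FacebookAI/roberta-large",
  layers = 19,
  dim_name = FALSE,
  keep_token_embeddings = FALSE
)
```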
## Dimensionality Reduction
Some studies suggest that using Principal Component Analysis (PCA; see @sec-lsa-variations) to reduce the dimensionality of a set of contextualized embeddings can improve their ability to quantify psychological characteristics. Specifically, @ganesan_etal_2021 found that reducing RoBERTa embeddings to 64 dimensions using PCA is sometimes better and never worse than using them raw (at least when mapping them to psychological characteristics using machine learning techniques; see @sec-machine-learning-methods). This may be because PCA recenters the embeddings and emphasizes the salient differences between the documents of a particular dataset, which are otherwise fairly similar to each other. To reduce the dimensionality of your embeddings, use the function provided here:
```{r}
# `data`: a dataframe with one embedding per row
# `cols`: tidyselect - columns that contain numeric embedding values
# `reduce_to`: number of dimensions to keep
# `scale`: perform scaling in addition to centering?
reduce_dimensionality <- function(data, cols, reduce_to, scale = FALSE){
  in_dat <- dplyr::select(data, {{ cols }})
  pca <- stats::prcomp(~., data = in_dat, scale = scale, rank. = reduce_to)
  out_dat <- as.data.frame(pca$x)
  dplyr::bind_cols(dplyr::select(data, -{{ cols }}), out_dat)
}

# reduce dimensionality of SBERT embeddings from 384 to 64
hippocorpus_sbert_64d <- reduce_dimensionality(
  hippocorpus_sbert,
  Dim1:Dim384,
  reduce_to = 64
)
```
Dimensionality reduction can also be useful for visualizing your embeddings. Let's plot our Hippocorpus texts by reducing their 384-dimensional embeddings to 2 dimensions that can be mapped to the x and y axes.
```{r}
# reduce dimensionality of SBERT embeddings from 384 to 2
hippocorpus_sbert_2d <- reduce_dimensionality(
  hippocorpus_sbert,
  Dim1:Dim384,
  reduce_to = 2
)

# plot
hippocorpus_sbert_2d |>
  ggplot(aes(PC1, PC2, color = memType)) +
  geom_point(alpha = .8) +
  scale_color_brewer(palette = "Paired") +
  labs(title = "Hippocorpus Story Embeddings, Reduced Dimensionality",
       color = "Autobiographical\nStory Type",
       x = "Principal Component 1",
       y = "Principal Component 2") +
  theme_minimal()
```
The first two PCA components are the directions along which the embeddings in the dataset spread out the most. There do seem to be slightly more imagined stories in the bottom right, but otherwise story type seems mostly unrelated to these dimensions. This means that the main ways in which the stories are different from one another are not the ways that imagined, recalled, and retold stories are different from one another. This makes sense---the stories are about all sorts of events, and may even use slightly different dialects of English. The differences that we are interested in are more subtle.
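If you want to know how much of the total variation these first two components actually capture, you can run the PCA directly and inspect its variance summary (a quick sketch, assuming `hippocorpus_sbert` contains the embedding columns `Dim1:Dim384` as above):

```{r}
#| eval: false
# proportion of variance explained by the first two principal components
pca_fit <- stats::prcomp(dplyr::select(hippocorpus_sbert, Dim1:Dim384), rank. = 2)
summary(pca_fit)  # see the "Proportion of Variance" row for PC1 and PC2
```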
---
::: {.callout-tip icon="false"}
## Advantages of Contextualized Embeddings
- **Can Represent Multiple Senses of Each Word:** This is especially important for analyzing shorter texts---since the sample of words is small, the need to correctly characterize each one is greater.
- **Sensitive to Word Order and Negation:** LLMs can tell the difference between "he's funny but not smart" and "he's smart but not funny."
:::
::: {.callout-important icon="false"}
## Disadvantages of Contextualized Embeddings
- **Computationally Expensive**
- **Mysterious Vector Spaces:** Unlike simple word embedding models (@sec-word-embeddings), LLMs may sometimes organize embeddings in nonlinear patterns. For example, GPT models tend to arrange their embeddings in a spiral [@cai_etal_2021]. This is why BERT text embeddings actually perform worse than word2vec and GloVe when analyzed using cosine similarity [@reimers_gurevych_2019]. Specialized models like [SBERT](https://www.sbert.net/docs/pretrained_models.html) can help with this problem, but are unlikely to solve it entirely.
:::
---