intro.qmd

# Introduction: Why Does Psychology Need Natural Language?

## "The Mind's Messenger"

Psychology is often referred to as a "behavioral science". Humans engage in many behaviors: eating, sleeping, pressing buttons, coming in five minutes early to work... All sorts of behaviors can reveal secrets of the mind, but one type of behavior in particular has always struck people as greater than all others in its ability to express the richness of thought: language. The 11th century philosopher Bahaye ibn Paquda wrote:

> Speech and the orderly arrangement of words \[are the ways in which a human\] gives expression to what is in his mind and soul and understands the conditions of others. The tongue is the heart's pen and the mind's messenger. Without speech, there would be no social relations between one person and another.[^intro-1]

[^intro-1]: Duties of the Heart, Second Treatise on Examination (Chapter 5), trans. Rabbi Moses Hyamson, New York, 1925

Somewhat more recently, @tausczik_pennebaker_2010 expressed the same sentiment:

> Language is the most common and reliable way for people to translate their internal thoughts and emotions into a form that others can understand. Words and language, then, are the very stuff of psychology and communication.

What exactly is the relationship between psychology and language? The 14th century philosopher William of Ockham proposed that thought itself has an essentially linguistic structure, with subjects, objects, and verbs. According to Ockham then, spoken or written language is a sort of rough reflection of inner language. Theories like Ockham's, in which language is a straightforward representation of inner psychological life, have been common over the history of philosophy. Nowadays, studies of neurological patients have made it clear that linguistic abilities are not necessary for complex thought [@fedorenko_varley_2016]. Likewise, now-classic research has made it clear that people cannot be trusted to accurately report their own thought processes [@nisbett_wilson_1977]. Nevertheless, language is without doubt a centrally important mode of human behavior. Humans are constantly talking to each other or listening to each other talk. Even if this talk cannot be construed as reliable reporting about internal states, it must reflect those states in some way, in the same way that any other behavior must do so.

## Problems With Language Data, and Why Data Science Solves Them

Psychologists have long insisted that talk therapy can heal, and that questionnaires can accurately measure psychological phenomena. These are language-based techniques, which rely on the assumption that language processing is linked to more fundamental internal states. Even so, psychological research has generally been unable to study *naturalistic language*, i.e. the sort of language that people produce in their day-to-day lives. There are three good reasons for this. Let's go through each one and explain how data science solves the problem.

### Language is Hard to Record

Before the invention of audio recording technology, using language in scientific research was nearly impossible. Early efforts by linguists to record language samples from representative populations were heroic; starting in 1896, Edmond Edmont spent four years traveling around France on a bicycle conducting specially designed interviews to collect data for the *Atlas linguistique de la France*. He collected data from 700 participants in total [@crystal_1997]. Since then, microphones have made it easier to record speech, but even simple quantitative measurements like word counts have still required painstaking hours of listening to recordings and marking each occurrence of the word.

The advent of transformer neural networks has made working with audio data much easier than it once was, but the largest boon to our language recording abilities has come through a different medium: text.

Only a few decades ago, public access to text was limited to highly edited long-form productions like books, magazines, and newspapers. Psychologists tend to be more interested in accessing people's thoughts and feelings as they happen, so these texts held little interest for them. Some psychologists studied diaries or personal letters [e.g. @allport_1942; @creegan_1944], but personal documents like these are hard to collect at scale. This all changed with the advent of the Internet. Now more than ever before, people communicate through text---not just in long-distance correspondence, but for day-to-day socializing with friends and family. Moreover, much of this textual communication is synchronous and shares many of the same features as face-to-face spoken conversation [@placinski_zywiczynski_2023]. Most importantly, much of this textual communication is freely available to researchers, through social media platforms like Twitter, Reddit, and YouTube. Data science techniques allow researchers to access these texts and transform them into manageable datasets with APIs ([@sec-apis]) and web scraping ([@sec-scraping]).

### Language is Hard to Quantify

Even when interesting texts were available to psychologists of the past, they were rarely able to make use of them in quantitative analysis. Language is complex, with near-infinite ways to describe the same thing. There are no clear rules for measuring the extent to which a text reflects depression, anxiety, mania, introspection, or any other psychological construct. The few researchers who tried to extract quantitative psychological dimensions of text were nearly as heroic as Edmond Edmont on his four year journey around France. For example, @peterson_seligman_1984 administered a questionnaire that prompted participants to write short explanations of various hypothetical scenarios. They then carefully read each response, noted each time a phrase like "because of" or "as a result of" was mentioned, and marked the accompanying explanation. These explanations were then typed by hand and shown one at a time to four trained judges who rated them on various 7-point scales. Finally, the agreement between the judges was assessed and their ratings were aggregated into the final variable used in their analysis of risk factors for depression. Today, this sort of analysis could be performed in a matter of seconds using dictionary-based word counts ([@sec-word-counting]), neural embeddings ([@sec-contextualized-embeddings]), or other methods covered in this book.

### Language is Hard to Control

Language is a social phenomenon. People do not write or speak in a vacuum, they participate in conversations or group discussions, considering their audiences as they form their words. For the researcher, this means that language is full of uncontrolled, confounding variables: Is the speaker responding to another speaker? Who is the other speaker? How many participants are there in the conversation? Researchers in the field of psycholinguistics have tried to solve these problems by isolating speakers in a laboratory setting, contriving situations in which participants process and produce speech without the uncontrolled variability of conversational partners [@oconnell_kowal_2003]. Nevertheless, the inherently social nature of language has made it difficult to analyze language behavior in even remotely naturalistic settings.

Today, the highly structured nature of interaction on social media has made the social context of utterances easily measurable. For example, comments on Reddit are always associated with a well-defined community, are responding to a known original post, and are directly responding to previous comments in a tree-like structure ([@sec-reddit-threads]). Researchers can leverage this structure to provide robust statistical control by using it in tandem with new methods for quantifying the relationships between utterances. A few decades ago the question "How similar are Daniel's utterances to Amos's utterances?" would have seemed hopelessly ill-defined. Similar in what way? Today, answering this question is simple with vector-based semantic embeddings ([Chapters -@sec-vectorspace-intro]---[-@sec-contextualized-embeddings]). Methods like these can now make sense of nuanced features of language use in dialogue [see @duran_etal_2019 for an example in psycholinguistics].

This book focuses primarily on how to extract psychological dimensions from text. The statistical analysis of these dimensions will not be covered in depth here. Even so, the methods presented in the book can be used in tandem with modern methods of statistical analysis [e.g. @kenny_etal_2006] to draw inferences from complex social interactions between pairs or even larger groups of people.

## What This Textbook is Useful For

The tools presented in this book are useful for many fields of psychology, cognitive science, and neuroscience. Most of these fields are related to language, but not all. The following are a few examples of ways to apply these tools in practice:

-   Enhance experimental control by matching word stimuli according to semantic similarity [e.g. @gagne_etal_2005]. See @sec-decontextualized-embeddings.
-   Find similarities between large language models and neural processing in the brain [e.g. @millet_etal_2022]. See @sec-contextualized-embeddings.
-   Measure the degree to which a therapist and patient build mutual understanding over the course of a session [cf. @duran_etal_2019]. See @sec-vectorspace-intro.
-   Analyze emotional responses to current events on social media [e.g. @simchon_etal_2020]. See @sec-apis and @sec-machine-learning-methods.
-   Find links between group members' language and their probability of leaving the group [@ashokkumar_pennebaker_2022]. See @sec-corpora and @sec-word-counting.
-   Validate personality assessments with individuals' behavior on social media [@schwartz_etal_2013]. See @sec-dla.

------------------------------------------------------------------------