-
Notifications
You must be signed in to change notification settings - Fork 181
/
Copy pathREADME.Rmd
190 lines (140 loc) · 7.7 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r}
#| include = FALSE
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%",
message = FALSE
)
suppressPackageStartupMessages(library(ggplot2))
theme_set(theme_light())
```
# tidytext: Text mining using tidy tools <img src="man/figures/tidytext.png" align="right" />
**Authors:** [Julia Silge](https://juliasilge.com/), [David Robinson](http://varianceexplained.org/)<br/>
**License:** [MIT](https://opensource.org/licenses/MIT)
<!-- badges: start -->
[![R-CMD-check](https://github.com/juliasilge/tidytext/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/juliasilge/tidytext/actions/workflows/R-CMD-check.yaml)
[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/tidytext)](https://cran.r-project.org/package=tidytext)
[![Codecov test coverage](https://codecov.io/gh/juliasilge/tidytext/branch/main/graph/badge.svg)](https://app.codecov.io/gh/juliasilge/tidytext?branch=main)
[![DOI](https://zenodo.org/badge/22224/juliasilge/tidytext.svg)](https://zenodo.org/badge/latestdoi/22224/juliasilge/tidytext)
[![JOSS](https://joss.theoj.org/papers/10.21105/joss.00037/status.svg)](https://joss.theoj.org/papers/10.21105/joss.00037)
[![Downloads](https://cranlogs.r-pkg.org/badges/tidytext)](https://CRAN.R-project.org/package=tidytext)
[![Total Downloads](https://cranlogs.r-pkg.org/badges/grand-total/tidytext?color=orange)](https://CRAN.R-project.org/package=tidytext)
<!-- badges: end -->
Using [tidy data principles](https://doi.org/10.18637/jss.v059.i10) can make many text mining tasks easier, more effective, and consistent with tools already in wide use. Much of the infrastructure needed for text mining with tidy data frames already exists in packages like [dplyr](https://cran.r-project.org/package=dplyr), [broom](https://cran.r-project.org/package=broom), [tidyr](https://cran.r-project.org/package=tidyr), and [ggplot2](https://cran.r-project.org/package=ggplot2). In this package, we provide functions and supporting data sets to allow conversion of text to and from tidy formats, and to switch seamlessly between tidy tools and existing text mining packages. Check out [our book](https://www.tidytextmining.com/) to learn more about text mining using tidy data principles.
### Installation
You can install this package from CRAN:
```{r}
#| eval = FALSE
install.packages("tidytext")
```
Or you can install the development version from GitHub with [remotes](https://github.com/r-lib/remotes):
```{r}
#| eval = FALSE
library(remotes)
install_github("juliasilge/tidytext")
```
### Tidy text mining example: the `unnest_tokens` function
The novels of Jane Austen can be so tidy! Let's use the text of Jane Austen's 6 completed, published novels from the [janeaustenr](https://cran.r-project.org/package=janeaustenr) package, and transform them to a tidy format. janeaustenr provides them as a one-row-per-line format:
```{r}
library(janeaustenr)
library(dplyr)
original_books <- austen_books() %>%
group_by(book) %>%
mutate(line = row_number()) %>%
ungroup()
original_books
```
To work with this as a tidy dataset, we need to restructure it as **one-token-per-row** format. The `unnest_tokens()` function is a way to convert a dataframe with a text column to be one-token-per-row:
```{r}
library(tidytext)
tidy_books <- original_books %>%
unnest_tokens(word, text)
tidy_books
```
This function uses the [tokenizers](https://docs.ropensci.org/tokenizers/) package to separate each line into words. The default tokenizing is for words, but other options include characters, n-grams, sentences, lines, paragraphs, or separation around a regex pattern.
Now that the data is in a one-word-per-row format, we can manipulate it with tidy tools like dplyr. We can remove stop words (available via the function `get_stopwords()`) with an `anti_join()`.
```{r}
tidy_books <- tidy_books %>%
anti_join(get_stopwords())
```
We can also use `count()` to find the most common words in all the books as a whole.
```{r}
tidy_books %>%
count(word, sort = TRUE)
```
Sentiment analysis can be implemented as an inner join. Three sentiment lexicons are available via the `get_sentiments()` function. Let's examine how sentiment changes across each novel. Let's find a sentiment score for each word using the Bing lexicon, then count the number of positive and negative words in defined sections of each novel.
```{r}
#| fig.width = 8,
#| fig.height = 10
library(tidyr)
get_sentiments("bing")
janeaustensentiment <- tidy_books %>%
inner_join(get_sentiments("bing"), by = "word", relationship = "many-to-many") %>%
count(book, index = line %/% 80, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
janeaustensentiment
```
Now we can plot these sentiment scores across the plot trajectory of each novel.
```{r}
#| fig.width = 7,
#| fig.height = 7,
#| fig.alt = "Sentiment scores across the trajectories of Jane Austen's six published novels",
#| warning = FALSE
library(ggplot2)
ggplot(janeaustensentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(vars(book), ncol = 2, scales = "free_x")
```
For more examples of text mining using tidy data frames, see the tidytext vignette.
### Tidying document term matrices
Some existing text mining datasets are in the form of a DocumentTermMatrix class (from the tm package). For example, consider the corpus of 2246 Associated Press articles from the topicmodels dataset.
```{r}
library(tm)
data("AssociatedPress", package = "topicmodels")
AssociatedPress
```
If we want to analyze this with tidy tools, we need to transform it into a one-row-per-term data frame first with a `tidy()` function. (For more on the tidy verb, [see the broom package](https://broom.tidymodels.org/)).
```{r}
tidy(AssociatedPress)
```
We could find the most negative documents:
```{r}
ap_sentiments <- tidy(AssociatedPress) %>%
inner_join(get_sentiments("bing"), by = c(term = "word")) %>%
count(document, sentiment, wt = count) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative) %>%
arrange(sentiment)
```
Or we can join the Austen and AP datasets and compare the frequencies of each word:
```{r}
#| fig.height = 8,
#| fig.width = 8,
#| fig.alt = 'Scatterplot for word frequencies in Jane Austen vs. AP news articles. Some words like "cried" are only common in Jane Austen, some words like "national" are only common in AP articles, and some word like "time" are common in both.'
comparison <- tidy(AssociatedPress) %>%
count(word = term) %>%
rename(AP = n) %>%
inner_join(count(tidy_books, word)) %>%
rename(Austen = n) %>%
mutate(AP = AP / sum(AP),
Austen = Austen / sum(Austen))
comparison
library(scales)
ggplot(comparison, aes(AP, Austen)) +
geom_point(alpha = 0.5) +
geom_text(aes(label = word), check_overlap = TRUE,
vjust = 1, hjust = 1) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
geom_abline(color = "red")
```
For more examples of working with objects from other text mining packages using tidy data principles, see the [vignette](https://juliasilge.github.io/tidytext/articles/tidying_casting.html) on converting to and from document term matrices.
### Community Guidelines
This project is released with a [Contributor Code of Conduct](https://github.com/juliasilge/tidytext/blob/main/CONDUCT.md). By participating in this project you agree to abide by its terms. Feedback, bug reports (and fixes!), and feature requests are welcome; file issues or seek support [here](https://github.com/juliasilge/tidytext/issues).