---
title: "textshape"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  md_document:
    toc: true
---
```{r, echo=FALSE, message=FALSE, warning=FALSE}
library(knitr)
knitr::opts_chunk$set(fig.path = "tools/figure/")
combine <- textshape::combine
library(tidyverse)
library(magrittr)
library(ggstance)
library(textshape)
library(gridExtra)
library(viridis)
library(quanteda)
library(gofastr)
## desc <- suppressWarnings(readLines("DESCRIPTION"))
## regex <- "(^Version:\\s+)(\\d+\\.\\d+\\.\\d+)"
## loc <- grep(regex, desc)
## ver <- gsub(regex, "\\2", desc[loc])
## verbadge <- sprintf('<a href="https://img.shields.io/badge/Version-%s-orange.svg"><img ## src="https://img.shields.io/badge/Version-%s-orange.svg" alt="Version"/></a></p>', ver, ver)
## verbadge <- ''
```
```{r, echo=FALSE}
knit_hooks$set(htmlcap = function(before, options, envir) {
if(!before) {
paste('<p class="caption"><b><em>',options$htmlcap,"</em></b></p>",sep="")
}
})
knitr::opts_knit$set(self.contained = TRUE, cache = FALSE)
```
[![Project Status: Inactive – The project has reached a stable, usable state but is no longer being actively developed; support/maintenance will be provided as time allows.](http://www.repostatus.org/badges/latest/inactive.svg)](https://www.repostatus.org/)
[![](http://cranlogs.r-pkg.org/badges/textshape)](https://cran.r-project.org/package=textshape)
![](tools/textshape_logo/r_textshape.png)
**textshape** is a small suite of text reshaping and restructuring functions. Many of these functions are descended from tools in the [**qdapTools**](https://github.com/trinker/qdapTools) package. This brings reshaping tools under one roof, with the package's functionality limited to text reshaping.
Other R packages provide some of the same functionality. **textshape** differs from these packages in that it is designed to help the user take unstructured (or implicitly structured) data, extract it into a structured format, and then restructure it into common text analysis formats for use in the next stage of the text analysis pipeline. The implicit structure of seemingly unstructured data is often detectable/expressible by the researcher. **textshape** provides tools (e.g., `split_match`) to enable the researcher to convert this tacit knowledge into a form that can be used to reformat data into more structured formats. This package is meant to be used jointly with the [**textclean**](https://github.com/trinker/textclean) package, which provides cleaning and text normalization functionality.
# Functions
Most of the functions split, expand, grab, or tidy a `vector`, `list`, `data.frame`, or `DocumentTermMatrix`. The `combine`, `duration`, `mtabulate`, & `flatten` functions are notable exceptions. The table below describes the functions and their use:
| Function | Used On | Description |
|------------------|--------------------------------|--------------------------------------------------------------|
| `combine` | `vector`, `list`, `data.frame` | Combine and collapse elements |
| `tidy_list` | `list` of `vector`s or `data.frame`s | Row bind a list and repeat list names as id column |
| `tidy_vector` | `vector` | Column bind a named atomic `vector`'s names and values |
| `tidy_table` | `table` | Column bind a `table`'s names and values |
| `tidy_matrix` | `matrix` | Stack values, repeat column & row names accordingly |
| `tidy_dtm`/`tidy_tdm` | `DocumentTermMatrix` | Tidy format `DocumentTermMatrix`/`TermDocumentMatrix` |
| `tidy_colo_dtm`/`tidy_colo_tdm` | `DocumentTermMatrix` | Tidy format of collocating words from a `DocumentTermMatrix`/`TermDocumentMatrix` |
| `duration` | `vector`, `data.frame` | Get duration (start-end times) for turns of talk in n words |
| `from_to` | `vector`, `data.frame` | Prepare speaker data for a flow network |
| `mtabulate` | `vector`, `list`, `data.frame` | Dataframe/list version of `tabulate` to produce count matrix |
| `flatten` | `list` | Flatten nested, named list to single tier |
| `unnest_text` | `data.frame` | Unnest a nested text column |
| `split_index` | `vector`, `list`, `data.frame` | Split at specified indices |
| `split_match` | `vector` | Split vector at specified character/regex match |
| `split_portion` | `vector`\* | Split data into portioned chunks |
| `split_run` | `vector`, `data.frame` | Split runs (e.g., "aaabbbbcdddd") |
| `split_sentence` | `vector`, `data.frame` | Split sentences |
| `split_speaker` | `data.frame` | Split combined speakers (e.g., "Josh, Jake, Jim") |
| `split_token` | `vector`, `data.frame` | Split words and punctuation |
| `split_transcript` | `vector` | Split speaker and dialogue (e.g., "greg: Who me") |
| `split_word` | `vector`, `data.frame` | Split words |
| `grab_index` | `vector`, `data.frame`, `list` | Grab from an index up to a second index |
| `grab_match` | `vector`, `data.frame`, `list` | Grab from a regex match up to a second regex match |
| `column_to_rownames` | `data.frame` | Add a column as rownames |
| `cluster_matrix` | `matrix` | Reorder column/rows of a matrix via hierarchical clustering |
\****Note:*** *Text vector accompanied by aggregating `grouping.var` argument, which can be in the form of a `vector`, `list`, or `data.frame`*
# Installation
To download the development version of **textshape**:
Download the [zip ball](https://github.com/trinker/textshape/zipball/master) or [tar ball](https://github.com/trinker/textshape/tarball/master), decompress and run `R CMD INSTALL` on it, or use the **pacman** package to install the development version:
```r
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/textshape")
```
# Contact
You are welcome to:
* submit suggestions and bug-reports at: <https://github.com/trinker/textshape/issues>
# Contributing
Contributions are welcome from anyone subject to the following rules:
- Abide by the [code of conduct](https://github.com/trinker/textshape/blob/master/CODE_OF_CONDUCT.md).
- Follow the style conventions of the package (indentation, function & argument naming, commenting, etc.)
- All contributions must be consistent with the package license (GPL-2)
- Submit contributions as a pull request. Clearly state what the changes are and try to keep the number of changes per pull request as low as possible.
- If you make big changes, add your name to the 'Author' field.
# Examples
The main shaping functions can be broken into the categories of (a) binding, (b) combining, (c) tabulating, (d) spanning, (e) splitting, (f) grabbing & (g) tidying. The majority of functions in **textshape** fall into the category of splitting and expanding (the semantic opposite of combining). These sections provide example uses of the functions from **textshape** within these categories.
# Loading Dependencies
```r
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, magrittr, ggstance, viridis, gridExtra, quanteda)
pacman::p_load_current_gh('trinker/gofastr', 'trinker/textshape')
```
## Tidying
The `tidy_xxx` functions convert untidy structures into [tidy format](https://www.jstatsoft.org/article/view/v059i10). Tidy formatted text data structures are particularly useful for interfacing with **ggplot2**, which expects this form.
The `tidy_list` function is used in the style of `do.call(rbind, list(x1, x2))` as a convenient way to bind together multiple named `data.frame`s or `vector`s into a single `data.frame` with the `list` `names` acting as an id column. The `data.frame` bind is particularly useful for binding transcripts from different observations. Additionally, `tidy_vector` and `tidy_table` are provided for `cbinding` a `table`'s or named atomic `vector`'s values and names as separate columns in a `data.frame`. Lastly, `tidy_dtm`/`tidy_tdm` provide convenient ways to tidy a `DocumentTermMatrix` or `TermDocumentMatrix`.
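As a rough illustration of the row-binding idea described above, the following base R sketch binds a named list of `data.frame`s and repeats the list names as an id column. The helper `bind_with_id` and the toy data are hypothetical and are not part of **textshape**; `tidy_list` itself may differ in column naming and return class.

```r
# Hypothetical helper (not a textshape function): row bind a named list of
# data.frames and repeat each list name once per row as an id column.
bind_with_id <- function(x, id.name = "id") {
    stopifnot(is.list(x), !is.null(names(x)))
    bound <- do.call(rbind, unname(x))
    id <- rep(names(x), vapply(x, nrow, integer(1)))
    out <- data.frame(id = id, bound, row.names = NULL, stringsAsFactors = FALSE)
    names(out)[1] <- id.name
    out
}

# Toy transcripts from two observations (made-up data)
obs <- list(
    day1 = data.frame(person = c("sam", "greg"), n = c(3L, 5L)),
    day2 = data.frame(person = "sally", n = 2L)
)

res <- bind_with_id(obs)
res
```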
#### A Vector
```{r}
x <- list(p=1:500, r=letters)
tidy_list(x)
```
#### A Dataframe
```{r}
x <- list(p=mtcars, r=mtcars, z=mtcars, d=mtcars)
tidy_list(x)
```
#### A Named Vector
```{r}
x <- setNames(
sample(LETTERS[1:6], 1000, TRUE),
sample(state.name[1:5], 1000, TRUE)
)
tidy_vector(x)
```
#### A Table
```{r}
x <- table(sample(LETTERS[1:6], 1000, TRUE))
tidy_table(x)
```
#### A Matrix
```{r}
mat <- matrix(1:16, nrow = 4,
dimnames = list(LETTERS[1:4], LETTERS[23:26])
)
mat
tidy_matrix(mat)
```
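For intuition, stacking a matrix into long form can be sketched in base R as below. This is an illustration under assumptions rather than the **textshape** implementation; the column names `row`, `col`, and `value` are chosen here for the example, and the row order of `tidy_matrix` output may differ.

```r
mat <- matrix(1:16, nrow = 4,
    dimnames = list(LETTERS[1:4], LETTERS[23:26])
)

# as.vector() unrolls the matrix column by column, so row names cycle
# fastest and each column name is repeated once per row.
long <- data.frame(
    row = rep(rownames(mat), times = ncol(mat)),
    col = rep(colnames(mat), each = nrow(mat)),
    value = as.vector(mat),
    stringsAsFactors = FALSE
)

head(long)
```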
With clustering (column and row reordering) via the `cluster_matrix` function.
```{r}
## plot heatmap w/o clustering
wo <- mtcars %>%
cor() %>%
tidy_matrix('car', 'var') %>%
ggplot(aes(var, car, fill = value)) +
geom_tile() +
scale_fill_viridis(name = expression(r[xy])) +
theme(
axis.text.y = element_text(size = 8) ,
axis.text.x = element_text(size = 8, hjust = 1, vjust = 1, angle = 45),
legend.position = 'bottom',
legend.key.height = grid::unit(.1, 'cm'),
legend.key.width = grid::unit(.5, 'cm')
) +
labs(subtitle = "Without Clustering")
## plot heatmap w clustering
w <- mtcars %>%
cor() %>%
cluster_matrix() %>%
tidy_matrix('car', 'var') %>%
mutate(
var = factor(var, levels = unique(var)),
car = factor(car, levels = unique(car))
) %>%
group_by(var) %>%
ggplot(aes(var, car, fill = value)) +
geom_tile() +
scale_fill_viridis(name = expression(r[xy])) +
theme(
axis.text.y = element_text(size = 8) ,
axis.text.x = element_text(size = 8, hjust = 1, vjust = 1, angle = 45),
legend.position = 'bottom',
legend.key.height = grid::unit(.1, 'cm'),
legend.key.width = grid::unit(.5, 'cm')
) +
labs(subtitle = "With Clustering")
grid.arrange(wo, w, ncol = 2)
```
#### A DocumentTermMatrix
The `tidy_dtm` and `tidy_tdm` functions convert a `DocumentTermMatrix` or `TermDocumentMatrix` into a tidied data set.
```{r, warning=FALSE}
my_dtm <- with(presidential_debates_2012, q_dtm(dialogue, paste(time, tot, sep = "_")))
tidy_dtm(my_dtm) %>%
tidyr::extract(doc, c("time", "turn", "sentence"), "(\\d)_(\\d+)\\.(\\d+)") %>%
mutate(
time = as.numeric(time),
turn = as.numeric(turn),
sentence = as.numeric(sentence)
) %>%
as_tibble() %T>%
print() %>%
group_by(time, term) %>%
summarize(n = sum(n)) %>%
group_by(time) %>%
arrange(desc(n)) %>%
slice(1:10) %>%
mutate(term = factor(paste(term, time, sep = "__"), levels = rev(paste(term, time, sep = "__")))) %>%
ggplot(aes(x = n, y = term)) +
geom_barh(stat='identity') +
facet_wrap(~time, ncol=2, scales = 'free_y') +
scale_y_discrete(labels = function(x) gsub("__.+$", "", x))
```
#### A DocumentTermMatrix of Collocations
The `tidy_colo_dtm` and `tidy_colo_tdm` functions convert a `DocumentTermMatrix` or `TermDocumentMatrix` into a collocation matrix and then a tidied data set.
```{r}
my_dtm <- with(presidential_debates_2012, q_dtm(dialogue, paste(time, tot, sep = "_")))
sw <- unique(c(
lexicon::sw_jockers,
lexicon::sw_loughran_mcdonald_long,
lexicon::sw_fry_1000
))
tidy_colo_dtm(my_dtm) %>%
as_tibble() %>%
filter(!term_1 %in% c('i', sw) & !term_2 %in% sw) %>%
filter(term_1 != term_2) %>%
unique_pairs() %>%
filter(n > 15) %>%
complete(term_1, term_2, fill = list(n = 0)) %>%
ggplot(aes(x = term_1, y = term_2, fill = n)) +
geom_tile() +
scale_fill_gradient(low= 'white', high = 'red') +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
## Combining
The `combine` function acts like `paste(x, collapse=" ")` on vectors and lists of vectors. On dataframes, multiple text cells are pasted together within grouping variables.
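The two behaviours described above can be approximated in base R as follows. This is only a sketch with made-up data: it collapses all rows per group with `aggregate`, which is an assumption about grouping and not necessarily how `combine` treats a `data.frame`.

```r
# Vector case: collapse elements into a single string
x <- c("Computer", "is", "fun", ".")
collapsed <- paste(x, collapse = " ")

# Data frame case: paste text cells together within a grouping variable
turns <- data.frame(
    person = c("sam", "sam", "greg", "sam"),
    state = c("Computer is fun.", "Not too fun.", "No it's not.", "You liar."),
    stringsAsFactors = FALSE
)
by_person <- aggregate(state ~ person, data = turns, FUN = paste, collapse = " ")
by_person
```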
#### A Vector
```{r}
x <- c("Computer", "is", "fun", ".", "Not", "too", "fun", ".")
combine(x)
```
#### A Dataframe
```{r}
(dat <- split_sentence(DATA))
combine(dat[, 1:5, with=FALSE])
```
## Tabulating
`mtabulate` allows the user to transform data types into a dataframe of counts.
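To make the idea of a count matrix concrete, here is a minimal base R sketch that tabulates each list element against a shared set of levels. It illustrates the concept only and is not the `mtabulate` source; the toy list is invented for the example.

```r
x <- list(w = c("a", "b", "a"), z = c("b", "c"))

# Shared set of categories across all list elements
lev <- sort(unique(unlist(x)))

# One row of counts per list element, one column per category
counts <- t(vapply(
    x,
    function(el) tabulate(match(el, lev), nbins = length(lev)),
    integer(length(lev))
))
colnames(counts) <- lev
counts
```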
#### A Vector
```{r}
(x <- list(w=letters[1:10], x=letters[1:5], z=letters))
mtabulate(x)
## Dummy coding
mtabulate(mtcars$cyl[1:10])
```
#### A Dataframe
```{r}
(dat <- data.frame(matrix(sample(c("A", "B"), 30, TRUE), ncol=3)))
mtabulate(dat)
t(mtabulate(dat))
```
## Spanning
Often it is useful to know the duration (start-end) of turns of talk. The `duration` function calculates start-end durations as n words.
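The start-end arithmetic can be sketched in base R with a cumulative sum of word counts. This assumes words are whitespace-delimited and that positions are 1-based and inclusive; both are assumptions made for the illustration and may not match `duration` exactly.

```r
x <- c("Mr. Brown comes!", "He says hello.", "go there")

# Words per turn, splitting on runs of whitespace
n_words <- lengths(strsplit(trimws(x), "\\s+"))

# Each turn ends at the running total and starts right after the previous end
end <- cumsum(n_words)
start <- end - n_words + 1L

data.frame(turn = seq_along(x), word.count = n_words, start = start, end = end)
```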
#### A Vector
```{r}
(x <- c(
"Mr. Brown comes! He says hello. i give him coffee.",
"I'll go at 5 p. m. eastern time. Or somewhere in between!",
"go there"
))
duration(x)
# With grouping variables
groups <- list(group1 = c("A", "B", "A"), group2 = c("red", "red", "green"))
duration(x, groups)
```
#### A Dataframe
```{r}
duration(DATA)
```
#### Gantt Plot
```{r}
library(ggplot2)
ggplot(duration(DATA), aes(x = start, xend = end, y = person, yend = person, color = sex)) +
geom_segment(size=4) +
xlab("Duration (Words)") +
ylab("Person")
```
## Splitting
The following section provides examples of available splitting functions.
### Indices
`split_index` allows the user to supply the integer indices of where to split a data type.
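One way to picture splitting at indices is to start a new group at each supplied position. The base R sketch below does this with `cumsum` and `split`; it is an illustration of the idea under that assumption, not the `split_index` implementation.

```r
x <- LETTERS
indices <- c(4, 10, 16)

# Group id increases by one at each index, so each index opens a new chunk
grp <- cumsum(seq_along(x) %in% indices)
pieces <- unname(split(x, grp))

lengths(pieces)
```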
#### A Vector
```{r}
split_index(
LETTERS,
indices = c(4, 10, 16),
names = c("dog", "cat", "chicken", "rabbit")
)
```
#### A Dataframe
Here I calculate the indices of every time the `vs` variable in the `mtcars` data set changes and then split the dataframe on those indices. The `change_index` function is handy for extracting the indices of changes in runs within an atomic vector.
```{r}
(vs_change <- change_index(mtcars[["vs"]]))
split_index(mtcars, indices = vs_change)
```
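For readers curious how change points in a run can be located, a small base R sketch follows. It compares each element with its predecessor and reports the positions where the value differs; the vector `runs` is made up, and `change_index` may handle edge cases (e.g., `NA` values) differently.

```r
runs <- c(0, 0, 1, 1, 1, 0, 1)

# Position i is a change point when runs[i] differs from runs[i - 1]
change_at <- which(runs[-1] != runs[-length(runs)]) + 1L
change_at
```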
### Matches
`split_match` splits on elements that match exactly or via a regular expression match.
#### Exact Match
```{r}
set.seed(15)
(x <- sample(c("", LETTERS[1:10]), 25, TRUE, prob=c(.2, rep(.08, 10))))
split_match(x)
split_match(x, split = "C")
split_match(x, split = c("", "C"))
## Don't include
split_match(x, include = 0)
## Include at beginning
split_match(x, include = 1)
## Include at end
split_match(x, include = 2)
```
#### Regex Match
Here I use the regex `"^I"` to break on any elements that begin with the capital letter I.
```{r}
split_match(DATA[["state"]], split = "^I", regex=TRUE, include = 1)
```
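The regex case can be approximated in base R by flagging matching elements and opening a new chunk at each one. This sketch keeps the matching element at the start of its chunk, loosely mirroring `include = 1`; the sample vector is invented and the correspondence to `split_match` is an assumption.

```r
x <- c("I like it", "no way", "It works", "fine", "ok")

# TRUE wherever an element begins with a capital I
hits <- grepl("^I", x)

# Each hit starts a new chunk
pieces <- unname(split(x, cumsum(hits)))
pieces
```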
### Portions
At times it is useful to split texts into portioned chunks, operate on the chunks and aggregate the results. `split_portion` allows the user to do this sort of text shaping. We can split into n chunks per grouping variable (via `n.chunks`) or into chunks of n length (via `n.words`).
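Chunking into pieces of n words can be sketched in base R as below. The variable `n.words` is borrowed from the argument name above purely for readability; the code is an assumption-based illustration that ignores grouping variables and is not the `split_portion` source.

```r
txt <- "Computer is fun. Not too fun. No it's not, it's dumb."
n.words <- 4

# Whitespace-delimited words, then a chunk id for every block of n.words
words <- strsplit(txt, "\\s+")[[1]]
chunk_id <- ceiling(seq_along(words) / n.words)

chunks <- unname(vapply(
    split(words, chunk_id), paste, character(1), collapse = " "
))
chunks
```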
#### A Vector
```{r}
with(DATA, split_portion(state, n.chunks = 10))
with(DATA, split_portion(state, n.words = 10))
```
#### A Dataframe
```{r}
with(DATA, split_portion(state, list(sex, adult), n.words = 10))
```
### Runs
`split_run` allows the user to split up runs of identical characters.
```{r}
x1 <- c(
"122333444455555666666",
NA,
"abbcccddddeeeeeffffff",
"sddfg",
"11112222333"
)
x <- c(rep(x1, 2), ">>???,,,,....::::;[[")
split_run(x)
```
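A run of identical characters can be matched with a backreference. The following base R sketch shows that idea on a single string; it is not taken from the `split_run` source, and it does not handle `NA` input.

```r
x <- "aaabbbbcdddd"

# (.) captures one character and \\1* consumes its immediate repeats
runs <- regmatches(x, gregexpr("(.)\\1*", x, perl = TRUE))[[1]]
runs
```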
#### A Dataframe
```{r}
DATA[["run.col"]] <- x
split_run(DATA)
## Reset the DATA dataset
DATA <- textshape::DATA
```
### Sentences
`split_sentence` provides a mapping + regex approach to splitting sentences. It is less accurate than the Stanford parser but more accurate than a simple regular expression approach alone.
#### A Vector
```{r}
(x <- paste0(
"Mr. Brown comes! He says hello. i give him coffee. i will ",
"go at 5 p. m. eastern time. Or somewhere in between!go there"
))
split_sentence(x)
```
#### A Dataframe
```{r}
split_sentence(DATA)
```
### Speakers
Often speakers may talk in unison. This is often displayed in a single cell as a comma separated string of speakers. Some analysis may require this information to be parsed out and replicated as one turn per speaker. The `split_speaker` function accomplishes this.
```{r}
DATA$person <- as.character(DATA$person)
DATA$person[c(1, 4, 6)] <- c("greg, sally, & sam",
"greg, sally", "sam and sally")
DATA
split_speaker(DATA)
## Reset the DATA dataset
DATA <- textshape::DATA
```
### Tokens
The `split_token` function splits data into words and punctuation.
#### A Vector
```{r}
(x <- c(
"Mr. Brown comes! He says hello. i give him coffee.",
"I'll go at 5 p. m. eastern time. Or somewhere in between!",
"go there"
))
split_token(x)
```
#### A Dataframe
```{r}
split_token(DATA)
```
### Transcript
The `split_transcript` function splits `vector`s with speaker prefixes (e.g., `c("greg: Who me", "sarah: yes you!")`) into a two column `data.frame`.
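The speaker/dialogue separation can be pictured as a split at the first colon only, so that later colons (e.g., in a time such as 4:30) stay in the dialogue. The base R sketch below assumes every element has a speaker prefix and is an illustration, not the `split_transcript` implementation; the column names are chosen for the example.

```r
x <- c("greg: Who me", "dan: Ok let's meet at 4:30 pm for drinks")

# [^:]+ cannot cross a colon, so the first group stops at the first colon
pattern <- "^([^:]+):\\s*(.*)$"
person <- sub(pattern, "\\1", x)
dialogue <- sub(pattern, "\\2", x)

data.frame(person = person, dialogue = dialogue, stringsAsFactors = FALSE)
```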
#### A Vector
```{r}
(x <- c(
"greg: Who me",
"sarah: yes you!",
"greg: well why didn't you say so?",
"sarah: I did but you weren't listening.",
"greg: oh :-/ I see...",
"dan: Ok let's meet at 4:30 pm for drinks"
))
split_transcript(x)
```
### Words
The `split_word` function splits data into words.
#### A Vector
```{r}
(x <- c(
"Mr. Brown comes! He says hello. i give him coffee.",
"I'll go at 5 p. m. eastern time. Or somewhere in between!",
"go there"
))
split_word(x)
```
#### A Dataframe
```{r}
split_word(DATA)
```
## Grabbing
The following section provides examples of available grabbing (from a starting point up to an ending point) functions.
### Indices
`grab_index` allows the user to supply the integer indices of where to grab (from - up to) a data type.
#### A Vector
```{r}
grab_index(DATA$state, from = 2, to = 4)
grab_index(DATA$state, from = 9)
grab_index(DATA$state, to = 3)
```
#### A Dataframe
```{r}
grab_index(DATA, from = 2, to = 4)
```
#### A List
```{r}
grab_index(as.list(DATA$state), from = 2, to = 4)
```
### Matches
`grab_match` grabs (from - up to) elements that match a regular expression.
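Conceptually, grabbing between two matches amounts to locating the first element that matches each pattern and subsetting the range between them. The base R sketch below uses an invented vector and assumes both patterns match and that the start match precedes the end match; `grab_match` offers more control (e.g., `from.n` and `to.n`) than is shown here.

```r
x <- c("intro", "What are you doing?", "nothing much", "you liar", "the end")

from <- grep("^What are", x)[1]  # first element matching the start pattern
to <- grep("liar", x)[1]         # first element matching the end pattern

grabbed <- x[from:to]
grabbed
```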
#### A Vector
```{r}
grab_match(DATA$state, from = 'dumb', to = 'liar')
grab_match(DATA$state, from = '^What are')
grab_match(DATA$state, to = 'we do[?]')
grab_match(DATA$state, from = 'no', to = 'the', ignore.case = TRUE,
from.n = 'last', to.n = 'first')
```
#### A Dataframe
```{r}
grab_match(DATA, from = 'dumb', to = 'liar')
```
#### A List
```{r}
grab_match(as.list(DATA$state), from = 'dumb', to = 'liar')
```
## Putting It Together
Eduardo Flores blogged about [What the candidates say, analyzing republican debates using R](https://www.r-bloggers.com/2015/11/what-the-candidates-say-analyzing-republican-debates-using-r/), where he demonstrated some scraping and analysis techniques. Here I combine **textshape** tools to scrape and structure the text from 4 of the 2015 Republican debates within a [**magrittr**](https://github.com/tidyverse/magrittr) pipeline. The result is a single [**data.table**](https://github.com/Rdatatable/data.table) containing the dialogue from all 4 debates. The code highlights the conciseness and readability of **textshape** by replacing parts of Flores's scraping code with **textshape** functions.
```{r}
if (!require("pacman")) install.packages("pacman")
pacman::p_load(rvest, magrittr, xml2)
debates <- c(
wisconsin = "110908",
boulder = "110906",
california = "110756",
ohio = "110489"
)
lapply(debates, function(x){
xml2::read_html(paste0("http://www.presidency.ucsb.edu/ws/index.php?pid=", x)) %>%
rvest::html_nodes("p") %>%
rvest::html_text() %>%
textshape::split_index(., grep("^[A-Z]+:", .)) %>%
#textshape::split_match("^[A-Z]+:", TRUE, TRUE) %>% #equal to line above
textshape::combine() %>%
textshape::split_transcript() %>%
textshape::split_sentence()
}) %>%
textshape::tidy_list("location")
```