The goal of startrek is to access Star Trek transcripts in a data.frame for easy analysis. All transcripts have been parsed from text files to a tidy data format.
Keep in mind that this is a data package: the data is stored locally, and the package does not include any functions that scrape data from a source. The package is currently about 17.7 MB.
If the size isn’t a concern, you can install the development version from GitHub:
devtools::install_github("tylurp/startrek")
Or, download the data to disk from the data folder in this repository.
To access an episode transcript from The Next Generation series, see the tng list:
library(startrek)
library(tibble)
library(dplyr)
library(tidyr)
tng$`The Inner Light`
#> # A tibble: 410 x 6
#> id perspective setting character description line
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 83 3 EXT. SPACE … at warp. PICARD (V… <NA> Captain's l…
#> 2 94 4 INT. BRIDGE PICARD, RIKER,… PICARD <NA> The last ti…
#> 3 99 4 INT. BRIDGE PICARD, RIKER,… GEORDI <NA> Nine hours.…
#> 4 101 4 INT. BRIDGE PICARD, RIKER,… PICARD <NA> "The entire…
#> 5 104 4 INT. BRIDGE PICARD, RIKER,… RIKER <NA> That's a li…
#> 6 107 4 INT. BRIDGE PICARD, RIKER,… PICARD <NA> And for me.…
#> 7 115 4 CONTINUED: PICARD, RIKER,… WORF <NA> Sir, sensor…
#> 8 120 4 CONTINUED: PICARD, RIKER,… PICARD <NA> On screen.
#> 9 126 5 ANGLE - VIE… An alien objec… PICARD <NA> Magnify.
#> 10 130 5 ANGLE - VIE… The object spr… PICARD <NA> Mister Data?
#> # … with 400 more rows
You can also work with the entire series at once and explore the data in creative ways. For example, we might infer which episodes center on a particular character by counting the number of lines each character has in each episode:
tng %>%
  bind_rows(.id = "episode") %>%
  select(episode, everything()) %>%
  group_by(episode) %>%
  count(character, sort = TRUE)
#> # A tibble: 4,227 x 3
#> # Groups: episode [176]
#> episode character n
#> <chr> <chr> <int>
#> 1 All Good Things... PICARD 348
#> 2 Encounter at Farpoint PICARD 224
#> 3 Interface GEORDI 197
#> 4 Future Imperfect RIKER 183
#> 5 Frame of Mind RIKER 177
#> 6 The Outcast RIKER 173
#> 7 Suspicions BEVERLY 172
#> 8 Captain's Holiday PICARD 171
#> 9 Bloodlines PICARD 168
#> 10 Remember Me BEVERLY 165
#> # … with 4,217 more rows
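Building on the counts above, one way to pick out the most prominent speaker in each episode is to keep only the top row per episode. The sketch below assumes dplyr 1.0.0 or later (for slice_max()). It uses a small made-up list, toy_series (a hypothetical name, not part of the package), with the same shape as tng so that it runs without the package installed. Swap in tng to get real results.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for `tng`: a named list with one tibble per episode.
# Only the `character` and `line` columns are mimicked here.
toy_series <- list(
  `Episode A` = tibble(
    character = c("PICARD", "PICARD", "RIKER"),
    line      = c("Engage.", "Make it so.", "Aye, sir.")
  ),
  `Episode B` = tibble(
    character = c("DATA", "GEORDI", "DATA", "DATA"),
    line      = c("Intriguing.", "Try it now.", "Confirmed.", "Yes, sir.")
  )
)

top_speakers <- toy_series %>%
  bind_rows(.id = "episode") %>%
  # One row per episode/character pair with its number of lines.
  count(episode, character, name = "n_lines") %>%
  group_by(episode) %>%
  # Keep exactly one row per episode: the character with the most lines.
  slice_max(n_lines, n = 1, with_ties = FALSE) %>%
  ungroup()

top_speakers
```

Setting with_ties = FALSE guarantees a single row per episode. Leave it at the default if you would rather see every character tied for the top spot.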
The Deep Space Nine series is also available:
ds9$Chimera
#> # A tibble: 415 x 6
#> id perspective setting character description line
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 79 2 INT. RUNA… ODO is in the c… O'BRIEN (moving to t… How long wa…
#> 2 81 2 INT. RUNA… ODO is in the c… ODO <NA> Almost two …
#> 3 86 2 INT. RUNA… O'Brien's surpr… O'BRIEN (noticing) You dropped…
#> 4 90 2 INT. RUNA… O'Brien's surpr… ODO (nods) We entered …
#> 5 94 2 INT. RUNA… O'Brien's surpr… O'BRIEN (taking a se… What's that?
#> 6 96 2 INT. RUNA… O'Brien's surpr… ODO <NA> The shopkee…
#> 7 99 2 INT. RUNA… O'Brien's surpr… O'BRIEN <NA> I didn't kn…
#> 8 104 2 CONTINUED: O'Brien's surpr… ODO <NA> It's a pres…
#> 9 108 2 CONTINUED: O'Brien's featu… ODO (misundersta… You don't t…
#> 10 110 2 CONTINUED: O'Brien's featu… O'BRIEN <NA> I'm sure sh…
#> # … with 405 more rows
If you want both datasets together, one approach is to create a nested data frame:
# Stack every episode of a series into one data frame and label the series
all_episodes <- function(.data, series_name) {
  .data %>%
    bind_rows(.id = "episode") %>%
    mutate(series = series_name) %>%
    select(series, everything())
}
tng_all <- all_episodes(tng, "TNG")
ds9_all <- all_episodes(ds9, "DS9")
# Combine both series, then nest to get one row per episode
bind_rows(tng_all, ds9_all) %>%
  group_by(series, episode) %>%
  nest()
#> # A tibble: 349 x 3
#> series episode data
#> <chr> <chr> <list>
#> 1 TNG Encounter at Farpoint <tibble [805 × 6]>
#> 2 TNG The Naked Now <tibble [405 × 6]>
#> 3 TNG Code of Honor <tibble [438 × 6]>
#> 4 TNG Haven <tibble [421 × 6]>
#> 5 TNG Where None Have Gone Before <tibble [409 × 6]>
#> 6 TNG The Last Outpost <tibble [493 × 6]>
#> 7 TNG Lonely Among Us <tibble [450 × 6]>
#> 8 TNG Justice <tibble [452 × 6]>
#> 9 TNG The Battle <tibble [523 × 6]>
#> 10 TNG Hide And Q <tibble [363 × 6]>
#> # … with 339 more rows
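A nested data frame like the one above can then be summarised one episode at a time by mapping over the list-column. The sketch below is one possible approach, assuming purrr is installed alongside dplyr and tidyr. It uses a small invented table, toy_lines (a hypothetical name), in place of the combined TNG and DS9 data so that it runs on its own.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)

# Hypothetical stand-in for the combined TNG/DS9 data built above.
toy_lines <- tibble(
  series    = c("TNG", "TNG", "TNG", "DS9", "DS9"),
  episode   = c("Episode A", "Episode A", "Episode B", "Episode C", "Episode C"),
  character = c("PICARD", "RIKER", "DATA", "ODO", "O'BRIEN"),
  line      = c("Engage.", "Aye, sir.", "Intriguing.", "Hmph.", "What's that?")
)

# One row per series/episode, with the remaining columns in `data`.
nested <- toy_lines %>%
  group_by(series, episode) %>%
  nest() %>%
  ungroup()

episode_summary <- nested %>%
  mutate(
    # Each element of `data` is a tibble holding one episode's rows.
    n_lines      = map_int(data, nrow),
    n_characters = map_int(data, ~ n_distinct(.x$character))
  ) %>%
  select(-data)

episode_summary
```

The same pattern works with any function that takes one episode's tibble and returns a single value, which keeps per-episode summaries next to the series and episode labels.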
The columns are arranged so that each row reads naturally from left to right or, when using glimpse(), from top to bottom. For example:
ds9$Chimera %>%
  .[5, ] %>%
  glimpse()
#> Observations: 1
#> Variables: 6
#> $ id <int> 94
#> $ perspective <chr> "2 INT. RUNABOUT"
#> $ setting <chr> "O'Brien's surprised to hear he was asleep that long…
#> $ character <chr> "O'BRIEN"
#> $ description <chr> "(taking a seat)"
#> $ line <chr> "What's that?"
The raw text files were parsed using the scripts found in the data-raw folder of this repository. Below is an example of a single parsed row:
ds9$Emissary %>%
  .[26, ] %>%
  glimpse()
#> Observations: 1
#> Variables: 6
#> $ id <int> 289
#> $ perspective <chr> "10 INT. SISKO'S QUARTERS (OPTICAL)"
#> $ setting <chr> "Destroyed... an explosion has ripped a hole in the …
#> $ character <chr> "SISKO"
#> $ description <chr> "(calm, controlled)"
#> $ line <chr> "It's gonna be okay... I'll get you out of there...…
- Transcripts were taken from Star Trek Minutiae