---
title: "Exploring rtweet"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
collapse = TRUE,
comment = "#>",
out.width = "100%"
)
options(width = 110)
library(tidyverse)
library(ggridges)
library(httr)      # GET(), add_headers() for the 30-day API calls
library(infer)
library(janitor)
library(jsonlite)  # fromJSON() for parsing API responses
library(magick)
library(rtweet)
library(here)
```
## Machine learning flashcards
Looking for a way to reinforce machine learning concepts, I happened upon Chris Albon's [Machine Learning Flashcards](https://machinelearningflashcards.com/) on Twitter. After reading the accompanying website, I noticed that Chris links to [a Python repo](https://github.com/Dpananos/GetCards) that scrapes the flashcard images from his Twitter feed. I thought I would try to do the same using R, so here we go:
### Pull tweets and inspect the output
First, we have a go at pulling tweets from Chris's feed via `rtweet`:
```{r}
albon_tweets <- get_timeline(user = "chrisalbon", n = 3200) # max return value
head(albon_tweets)
```
Okay, after reading Twitter's standard search API [documentation](https://developer.twitter.com/en/docs/tweets/search/overview/standard), we see that it returns only a sampling of Tweets published in the past **7 days**. So, we will be retrieving just a sample of Chris's handy flashcards.
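As a quick sanity check on that window, we can inspect the span of `created_at` timestamps in what actually came back (assuming the pull above succeeded; the exact range will vary by when this is knit):
```{r, eval=FALSE}
# inspect the date range of the tweets we received
range(albon_tweets$created_at)
```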
Also, nearly 90 variables were returned, so let's take a `glimpse` and zero in on what is essential for this specific task:
```{r}
glimpse(albon_tweets)
```
Right off the bat I see `machinelearningflashcards.com` under the `urls_url` column, which will be helpful in filtering the data. Additionally, `media_url` conveniently contains the flashcard image URL.
I will be sure to grab those variables, in addition to the Twitter handle and accompanying text, for reference. Using a series of `stringr` functions, I strip the name of the flashcard out of `text` for use later on.
Lastly, a glance at the `flashcard_name` column shows that some flashcards are repeated. So, let's use `distinct` to keep only the unique records, noting the convenient `.keep_all = TRUE` argument to retain all columns of the dataframe.
```{r}
flash_df <- albon_tweets %>%
  select(screen_name, text, urls_url, media_url) %>%
  unnest(urls_url, .preserve = media_url) %>%
  filter(str_detect(urls_url, "machinelearning"), # keep tweets containing the flashcard url
         !is.na(media_url)) %>%                   # drop any tweets lacking an image
  unnest(media_url) %>%
  mutate(
    flashcard_name = text %>%
      str_extract(".+?(?=\\shttps)") %>%
      str_to_lower() %>%
      str_replace_all("\\s", "-")
  ) %>%
  distinct(flashcard_name, .keep_all = TRUE)
flash_df
```
### Read and write the flashcards
Now that we have the `flash_df` dataframe containing the image URLs and names, let's write a quick function to read and write the flashcard images using `magick`, then feed the URLs and names into `pwalk` to iterate through the flashcards we identified above:
```{r, eval=FALSE}
grab_flash <- function(flash_url, flash_name, folder) {
  flash_url %>%
    image_read() %>%
    image_write(here(folder, str_c(flash_name, ".png")))
}
params <- list(pull(flash_df, media_url), pull(flash_df, flashcard_name))
pwalk(params, ~grab_flash(.x, .y, "ml-flashcard-images"))
```
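To confirm everything landed on disk, we can list what was written (a quick check, assuming the `ml-flashcard-images` folder used above):
```{r, eval=FALSE}
# list the flashcard images written by pwalk() above
fs::dir_ls(here("ml-flashcard-images"), glob = "*.png")
```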
## You draw an X how?!?!
Trawling through [#rstats](https://twitter.com/hashtag/rstats) Twitter, I came across this tweet soliciting responses on how people draw an X:
```{r echo=FALSE, out.width="40%"}
knitr::include_graphics("how-to-draw-x_image.png")
```
Interestingly, @SMASEY observed: "General consensus is that Americans do 7 & 8 while UK does 5 & 6. Probably how we were taught." Is that the case? Let's see what the data say.
### Pull responses and wrangle a tidy dataframe
The standard search API only returns data from the previous 7 days, which presents a problem since this tweet is from January 20, 2019. There is, however, a way forward using the [30-Day API](https://developer.twitter.com/en/docs/tweets/search/quick-start/premium-30-day), which requires a registered developer account, a registered app, and a developer environment setup. [See here for a walkthrough of that process](https://rud.is/books/21-recipes/using-oauth-to-access-twitter-apis.html). Note that while we can still search for free, we are now constrained to 100 tweets per request with a cap of 250 requests per month.
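One prerequisite: the premium endpoints authenticate with a bearer token passed as an `Authorization` header, which is what the `tkn` argument below expects. A minimal sketch of obtaining one via Twitter's app-only OAuth2 endpoint with `httr` (the key/secret placeholders are yours to supply; `rtweet::bearer_token()` wraps a similar exchange):
```{r, eval=FALSE}
# exchange the app's key/secret for a bearer token (app-only OAuth2)
auth <- POST("https://api.twitter.com/oauth2/token",
             authenticate("<YOUR_API_KEY>", "<YOUR_API_SECRET>"),
             body = list(grant_type = "client_credentials"),
             encode = "form")
# the full header value expected as `tkn` below
tkn <- str_c("Bearer ", content(auth)$access_token)
```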
Digging into Twitter's historical data means that we will have to leave the simplicity of `rtweet` behind and roll our own function to pull data from the [30-Day API](https://developer.twitter.com/en/docs/tweets/search/quick-start/premium-30-day).
```{r, eval=FALSE}
pull_tweets <- function(url,               # 30-day search stem
                        dev_env,           # <YOUR_DEV_ENV_NAME>
                        tkn,               # <YOUR_BEARER_TOKEN> (see rtweet::bearer_token)
                        search = NULL,     # search terms
                        start_date = NULL, # <YYYYMMDDHHmm>
                        stop_date = NULL,  # <YYYYMMDDHHmm>
                        max_req = Inf      # limit on number of requests (250/month cap)
                        ) {
  # construct url
  thirty_url <- str_c(url, dev_env, ".json")
  # initial call
  res <- GET(thirty_url,
             query = list(query = search, fromDate = start_date, toDate = stop_date),
             add_headers(Authorization = tkn))
  out <- fromJSON(read_lines(res[["content"]]), flatten = TRUE) %>% .[["results"]]
  nxt_tkn <- fromJSON(read_lines(res[["content"]])) %>% .[["next"]]
  output_init <- list(list(df = out, nxt = nxt_tkn))
  print("call_1")
  # loop until max_req limit reached or `next` token unavailable
  i <- 1
  output_loop <- list()
  while (!is.null(nxt_tkn) && i <= max_req - 1) {
    res <- GET(thirty_url,
               query = list(query = search, fromDate = start_date, toDate = stop_date, `next` = nxt_tkn),
               add_headers(Authorization = tkn))
    out <- fromJSON(read_lines(res[["content"]]), flatten = TRUE) %>% .[["results"]]
    nxt_tkn <- fromJSON(read_lines(res[["content"]])) %>% .[["next"]]
    output_loop[[i]] <- list(df = out, nxt = nxt_tkn)
    i <- i + 1
    print(str_c("call_", i))
  }
  append(output_init, output_loop)
}
```
Let's use `pull_tweets` to grab all responding tweets from the US and the UK. Boilerplate code is provided below should you wish to try using your own credentials.
```{r, eval=FALSE}
pull_tweets(url = "https://api.twitter.com/1.1/tweets/search/30day/",
            dev_env = <YOUR_DEV_ENV_NAME>,
            tkn = <YOUR_BEARER_TOKEN>,
            search = "to:SMASEY place_country:US", # `place_country:GB` for UK tweets
            max_req = 10) # set a limit on the number of requests
```
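The wrangling chunk below reads `.rds` files from `how-to-draw-x_data`, so each `pull_tweets` result gets saved along these lines (a sketch; the object and file names are illustrative):
```{r, eval=FALSE}
# persist a pull for later wrangling; repeat for the GB search
write_rds(us_tweets, here("how-to-draw-x_data", "us_tweets.rds"))
```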
Here we read in the data collected using `pull_tweets`, wrangle the output into a tidy dataframe, and use some coarse regex to extract the answer, which should be an integer between 1 and 8.
```{r, cache=TRUE}
tweets_df <- fs::dir_ls(here("how-to-draw-x_data"), glob = "*.rds") %>%
  map(read_rds) %>%       # read in output files
  map_depth(2, "df") %>%  # grab df inside each nested list
  flatten_dfr() %>%       # flatten list and row bind
  filter(str_detect(text, "@SMASEY\\shttps+", negate = TRUE)) %>%
  mutate(ans = text %>%
           str_extract("\\s\\d{1}(?!\\d)") %>%
           str_squish()) %>%
  na.omit()
```
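As a quick check that the regex behaved, a tally of the extracted answers should show only values 1 through 8 (assuming the wrangling above ran):
```{r, eval=FALSE}
# tally extracted answer choices by country
count(tweets_df, country_code, ans)
```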
### Time to plot the data
In total, it looks like we grabbed `r nrow(tweets_df)` tweets: `r nrow(tweets_df[tweets_df$country_code == "US", ])` from the US and `r nrow(tweets_df[tweets_df$country_code == "GB", ])` from the UK. Let's take a quick look at the distribution of answer choices by country.
```{r, out.width="75%", fig.align="center"}
# histograms
ggplot(tweets_df, aes(x = as.numeric(ans), fill = country_code)) +
  geom_histogram(binwidth = 1, color = "white") +
  geom_text(stat = "count", aes(label = ..count..), vjust = -0.2) +
  scale_x_continuous(breaks = seq(1, 8, 1)) +
  coord_cartesian(clip = "off") +
  facet_wrap(~country_code, nrow = 2) +
  theme_minimal(base_size = 10) +
  labs(x = "answer choice", y = "count") +
  guides(fill = FALSE, color = FALSE)
```
### Is location associated with how one draws an X?
Okay, it is clear that 7 & 8 are the most popular answers in both countries, followed by 5 & 6. Answer choices 1-4 are infrequent (less than ~4% of responses in both countries). However, it is not clear what association, if any, exists between country and answer choice.
Let's use the `infer` package to run a simulation-based test to investigate whether there is an association between location and answer choice. The `infer` package's intuitive design makes it straightforward to: (1) calculate our chi-squared statistic; (2) simulate a null distribution through permutation; and (3) determine significance by calculating the proportion of replicates with a chi-squared statistic at least as extreme as the observed one.
```{r, out.width="75%", fig.align="center"}
# for reproducibility
set.seed(2)
# set up df
chisq_df <- tweets_df %>%
  mutate(ans_fct = fct_lump(ans, prop = 0.05), # use fct_lump() for infrequent factor levels
         country_code = factor(country_code))
# calculate test statistic
obs_chisq <- chisq_stat(chisq_df, ans_fct ~ country_code)
obs_chisq
# generate null distribution
null_dist_chisq <- chisq_df %>%
  specify(ans_fct ~ country_code) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 3000, type = "permute") %>%
  calculate(stat = "Chisq")
# inspect test statistic within null distribution
visualize(null_dist_chisq) +
  shade_p_value(obs_stat = obs_chisq, direction = "greater")
```
```{r}
# grab p-value
pval_chisq <- get_p_value(null_dist_chisq, obs_stat = obs_chisq, direction = "greater")
pval_chisq
```
After generating a simulation-based null distribution of chi-squared statistics, we see that there is a `r str_c(round(pval_chisq * 100, 2), "%")` chance of observing a chi-squared value at least as large as `r round(pull(obs_chisq, stat), 1)` in a world where there is no association between `country` and `ans`. So, we observe strong evidence of an association between location and how one draws an X.
### What about # 7 & 8 being particularly American?
Well, we've seen that there is some association between how an X is drawn and location, so let's be a bit more specific and tease apart the observation that 7 & 8 might be particularly American. To do this, we will set up a dataframe with a collapsed factor variable for answers 7 & 8, take a peek at the proportions of answers 7 & 8 by location, and then test any observed difference in proportions across countries.
```{r}
# set up df
props_df <- tweets_df %>%
  mutate(ans_clps = fct_collapse(ans,
                                 one_six = c("1", "2", "3", "4", "5", "6"),
                                 svn_egt = c("7", "8")),
         country_code = factor(country_code))
# calculate counts and proportions
props_df %>%
  tabyl(country_code, ans_clps) %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns()
```
We observe a difference of ~17 percentage points for answers # 7 & 8 between US and UK responses. Let's employ the same `infer` workflow from before to implement a simulation-based test on this observed difference in proportions.
```{r, out.width="75%", fig.align="center"}
# for reproducibility
set.seed(20)
# calculate test statistic
d_hat <- props_df %>%
  specify(ans_clps ~ country_code, success = "svn_egt") %>%
  calculate(stat = "diff in props", order = c("US", "GB"))
# generate null distribution
null_dist_props <- props_df %>%
  specify(ans_clps ~ country_code, success = "svn_egt") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 3000, type = "permute") %>%
  calculate(stat = "diff in props", order = c("US", "GB"))
# inspect test statistic within null distribution
visualize(null_dist_props) +
  shade_p_value(obs_stat = d_hat, direction = "two_sided")
```
```{r}
# grab p-value
pval_props <- get_p_value(null_dist_props, obs_stat = d_hat, direction = "two_sided")
pval_props
```
We see that there is a `r str_c(round(pval_props * 100, 2), "%")` chance of a test statistic at least as extreme as ±`r round(pull(d_hat, stat), 3)` in a world where there is no difference in proportions by location. So, we observe strong evidence of a difference in the proportion of 7 & 8 responses by location, with a higher proportion of 7 & 8's coming from the US.