---
title: "Exploring rtweet"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
collapse = TRUE,
comment = "#>",
out.width = "100%"
)
options(width = 110)
library(tidyverse)
library(ggridges)
library(httr)      # GET(), add_headers() for the 30-day API calls
library(infer)
library(janitor)
library(jsonlite)  # fromJSON() for parsing API responses
library(magick)
library(rtweet)
library(here)
```
## Machine learning flashcards
Looking for a way to reinforce machine learning concepts, I happened upon Chris Albon's [Machine Learning Flashcards](https://machinelearningflashcards.com/) on Twitter. After reading the accompanying website, I noticed that Chris links to [a Python repo](https://github.com/Dpananos/GetCards) that scrapes the flashcard images from his Twitter feed. I thought I would try to do the same using R, so here we go:
### Pull tweets and inspect the output
First, we have a go at pulling tweets from Chris's feed via `rtweet`:
```{r}
albon_tweets <- get_timeline(user = "chrisalbon", n = 3200) # max return value
head(albon_tweets)
```
Okay, after reading Twitter's standard search API [documentation](https://developer.twitter.com/en/docs/tweets/search/overview/standard), we see that it returns only a sampling of Tweets published in the past **7 days**. So, we will be retrieving just a sample of Chris's handy flashcards.
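As a quick sanity check on that window, we can inspect the span of `created_at` timestamps in what actually came back (assuming the pull above succeeded; the exact range will vary by when this is knit):
```{r, eval=FALSE}
# inspect the date range of the tweets we received
range(albon_tweets$created_at)
```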
Also, nearly 90 variables were returned, so let's take a `glimpse` and zero in on what is essential for this specific task:
```{r}
glimpse(albon_tweets)
```
Right off the bat I see `machinelearningflashcards.com` under the `urls_url` column, which will be helpful in filtering the data. Additionally, `media_url` conveniently contains the flashcard image URL.
I will be sure to grab those variables, in addition to the Twitter handle and accompanying text, for reference. Using a series of `stringr` functions, I strip the name of the flashcard out of `text` for use later on.
Lastly, a glance at the `flashcard_name` column shows that some flashcards are repeated. So, let's use `distinct` to keep only the unique records, noting the convenient `.keep_all = TRUE` argument to retain all columns of the dataframe.
```{r}
flash_df <- albon_tweets %>%
  select(screen_name, text, urls_url, media_url) %>%
  unnest(urls_url, .preserve = media_url) %>%
  filter(str_detect(urls_url, "machinelearning"), # keep tweets containing the flashcard url
         !is.na(media_url)) %>%                   # drop any tweets lacking an image
  unnest(media_url) %>%
  mutate(
    flashcard_name = text %>%
      str_extract(".+?(?=\\shttps)") %>%
      str_to_lower() %>%
      str_replace_all("\\s", "-")
  ) %>%
  distinct(flashcard_name, .keep_all = TRUE)
flash_df
```
### Read and write the flashcards
Now that we have the `flash_df` dataframe containing the image URLs and names, let's write a quick function to read and write the flashcard images using `magick`, then feed the URLs and names into `pwalk` to iterate through the flashcards we identified above:
```{r, eval=FALSE}
grab_flash <- function(flash_url, flash_name, folder) {
  flash_url %>%
    image_read() %>%
    image_write(here(folder, str_c(flash_name, ".png")))
}
params <- list(pull(flash_df, media_url), pull(flash_df, flashcard_name))
pwalk(params, ~grab_flash(.x, .y, "ml-flashcard-images"))
```
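To confirm everything landed on disk, we can list what was written (a quick check, assuming the `ml-flashcard-images` folder used above):
```{r, eval=FALSE}
# list the flashcard images written by pwalk() above
fs::dir_ls(here("ml-flashcard-images"), glob = "*.png")
```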
## You draw an X how?!?!
Trawling through [#rstats](https://twitter.com/hashtag/rstats) Twitter, I came across this tweet soliciting responses on how people draw an X:
```{r echo=FALSE, out.width="40%"}
knitr::include_graphics("how-to-draw-x_image.png")
```
Interestingly, @SMASEY observed: "General consensus is that Americans do 7 & 8 while UK does 5 & 6. Probably how we were taught." Is that the case? Let's see what the data say.
### Pull responses and wrangle a tidy dataframe
The standard search API only returns data from the previous 7 days, which presents a problem since this tweet is from January 20, 2019. There is, however, a way forward using the [30-Day API](https://developer.twitter.com/en/docs/tweets/search/quick-start/premium-30-day), which requires a registered developer account, a registered app, and a developer environment setup. [See here for a walkthrough of that process](https://rud.is/books/21-recipes/using-oauth-to-access-twitter-apis.html). Note that while we can still search for free, we are now constrained to 100 tweets per request with a cap of 250 requests per month.
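One prerequisite: the premium endpoints authenticate with a bearer token passed as an `Authorization` header, which is what the `tkn` argument below expects. A minimal sketch of obtaining one via Twitter's app-only OAuth2 endpoint with `httr` (the key/secret placeholders are yours to supply; `rtweet::bearer_token()` wraps a similar exchange):
```{r, eval=FALSE}
# exchange the app's key/secret for a bearer token (app-only OAuth2)
auth <- POST("https://api.twitter.com/oauth2/token",
             authenticate("<YOUR_API_KEY>", "<YOUR_API_SECRET>"),
             body = list(grant_type = "client_credentials"),
             encode = "form")
# the full header value expected as `tkn` below
tkn <- str_c("Bearer ", content(auth)$access_token)
```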
Digging into Twitter's historical data means that we will have to leave the simplicity of `rtweet` behind and roll our own function to pull data from the [30-Day API](https://developer.twitter.com/en/docs/tweets/search/quick-start/premium-30-day).
```{r, eval=FALSE}
pull_tweets <- function(url,               # 30-day search stem
                        dev_env,           # <YOUR_DEV_ENV_NAME>
                        tkn,               # <YOUR_BEARER_TOKEN> (see rtweet::bearer_token)
                        search = NULL,     # search terms
                        start_date = NULL, # <YYYYMMDDHHmm>
                        stop_date = NULL,  # <YYYYMMDDHHmm>
                        max_req = Inf      # limit on number of requests (250/month cap)
                        ) {
  # construct url
  thirty_url <- str_c(url, dev_env, ".json")
  # initial call
  res <- GET(thirty_url,
             query = list(query = search, fromDate = start_date, toDate = stop_date),
             add_headers(Authorization = tkn))
  out <- fromJSON(read_lines(res[["content"]]), flatten = TRUE) %>% .[["results"]]
  nxt_tkn <- fromJSON(read_lines(res[["content"]])) %>% .[["next"]]
  output_init <- list(list(df = out, nxt = nxt_tkn))
  print("call_1")
  # loop until max_req limit reached or `next` token unavailable
  i <- 1
  output_loop <- list()
  while (!is.null(nxt_tkn) && i <= max_req - 1) {
    res <- GET(thirty_url,
               query = list(query = search, fromDate = start_date, toDate = stop_date, `next` = nxt_tkn),
               add_headers(Authorization = tkn))
    out <- fromJSON(read_lines(res[["content"]]), flatten = TRUE) %>% .[["results"]]
    nxt_tkn <- fromJSON(read_lines(res[["content"]])) %>% .[["next"]]
    output_loop[[i]] <- list(df = out, nxt = nxt_tkn)
    i <- i + 1
    print(str_c("call_", i))
  }
  append(output_init, output_loop)
}
```
Let's use `pull_tweets` to grab all responding tweets from the US and the UK. Boilerplate code is provided below should you wish to try using your own credentials.
```{r, eval=FALSE}
pull_tweets(url = "https://api.twitter.com/1.1/tweets/search/30day/",
            dev_env = <YOUR_DEV_ENV_NAME>,
            tkn = <YOUR_BEARER_TOKEN>,
            search = "to:SMASEY place_country:US", # `place_country:GB` for UK tweets
            max_req = 10) # set a limit on the number of requests
```
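The wrangling chunk below reads `.rds` files from `how-to-draw-x_data`, so each `pull_tweets` result gets saved along these lines (a sketch; the object and file names are illustrative):
```{r, eval=FALSE}
# persist a pull for later wrangling; repeat for the GB search
write_rds(us_tweets, here("how-to-draw-x_data", "us_tweets.rds"))
```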
Here we read in the data collected using `pull_tweets`, wrangle the output into a tidy dataframe, and use some coarse regex to extract the answer, which should be an integer between 1 and 8.
```{r, cache=TRUE}
tweets_df <- fs::dir_ls(here("how-to-draw-x_data"), glob = "*.rds") %>%
  map(read_rds) %>%       # read in output files
  map_depth(2, "df") %>%  # grab df inside each nested list
  flatten_dfr() %>%       # flatten list and row bind
  filter(str_detect(text, "@SMASEY\\shttps+", negate = TRUE)) %>%
  mutate(ans = text %>%
           str_extract("\\s\\d{1}(?!\\d)") %>%
           str_squish()) %>%
  na.omit()
```
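As a quick check that the regex behaved, a tally of the extracted answers should show only values 1 through 8 (assuming the wrangling above ran):
```{r, eval=FALSE}
# tally extracted answer choices by country
count(tweets_df, country_code, ans)
```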
### Time to plot the data
In total, it looks like we grabbed `r nrow(tweets_df)` tweets: `r nrow(tweets_df[tweets_df$country_code == "US", ])` from the US and `r nrow(tweets_df[tweets_df$country_code == "GB", ])` from the UK. Let's take a quick look at the distribution of answer choices by country.
```{r, out.width="75%", fig.align="center"}
# histograms
ggplot(tweets_df, aes(x = as.numeric(ans), fill = country_code)) +
  geom_histogram(binwidth = 1, color = "white") +
  geom_text(stat = "count", aes(label = ..count..), vjust = -0.2) +
  scale_x_continuous(breaks = seq(1, 8, 1)) +
  coord_cartesian(clip = "off") +
  facet_wrap(~country_code, nrow = 2) +
  theme_minimal(base_size = 10) +
  labs(x = "answer choice", y = "count") +
  guides(fill = FALSE, color = FALSE)
```
### Is location associated with how one draws an X?
Okay, it is clear that 7 & 8 are the most popular answers in both countries, followed by 5 & 6. Answer choices 1-4 are infrequent (less than ~4% of responses in both countries). However, it is not clear what association, if any, exists between country and answer choice.
Let's use the `infer` package to run a simulation-based test to investigate whether there is an association between location and answer choice. The `infer` package's intuitive design makes it straightforward to: (1) calculate our chi-squared statistic; (2) simulate a null distribution through permutation; and (3) determine significance by calculating the proportion of replicates with a chi-squared statistic at least as extreme as the observed one.
```{r, out.width="75%", fig.align="center"}
# for reproducibility
set.seed(2)
# set up df
chisq_df <- tweets_df %>%
  mutate(ans_fct = fct_lump(ans, prop = 0.05), # use fct_lump() for infrequent factor levels
         country_code = factor(country_code))
# calculate test statistic
obs_chisq <- chisq_stat(chisq_df, ans_fct ~ country_code)
obs_chisq
# generate null distribution
null_dist_chisq <- chisq_df %>%
  specify(ans_fct ~ country_code) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 3000, type = "permute") %>%
  calculate(stat = "Chisq")
# inspect test statistic within null distribution
visualize(null_dist_chisq) +
  shade_p_value(obs_stat = obs_chisq, direction = "greater")
```
```{r}
# grab p-value
pval_chisq <- get_p_value(null_dist_chisq, obs_stat = obs_chisq, direction = "greater")
pval_chisq
```
After generating a simulation-based null distribution of chi-squared statistics, we see that there is a `r str_c(round(pval_chisq * 100, 2), "%")` chance of observing a chi-squared value at least as large as `r round(pull(obs_chisq, stat), 1)` in a world where there is no association between `country` and `ans`. So, we observe strong evidence of an association between location and how one draws an X.
### What about # 7 & 8 being particularly American?
Well, we've seen that there is some association between how an X is drawn and location, so let's be a bit more specific and tease apart the observation that 7 & 8 might be particularly American. To do this, we will set up a dataframe with a collapsed factor variable for answers 7 & 8, take a peek at the proportions of answers 7 & 8 by location, and then test any observed difference in proportions across countries.
```{r}
# set up df
props_df <- tweets_df %>%
  mutate(ans_clps = fct_collapse(ans,
                                 one_six = c("1", "2", "3", "4", "5", "6"),
                                 svn_egt = c("7", "8")),
         country_code = factor(country_code))
# calculate counts and proportions
props_df %>%
  tabyl(country_code, ans_clps) %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns()
```
We observe a difference of ~17 percentage points for answers # 7 & 8 between US and UK responses. Let's employ the same `infer` workflow from before to implement a simulation-based test on this observed difference in proportions.
```{r, out.width="75%", fig.align="center"}
# for reproducibility
set.seed(20)
# calculate test statistic
d_hat <- props_df %>%
  specify(ans_clps ~ country_code, success = "svn_egt") %>%
  calculate(stat = "diff in props", order = c("US", "GB"))
# generate null distribution
null_dist_props <- props_df %>%
  specify(ans_clps ~ country_code, success = "svn_egt") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 3000, type = "permute") %>%
  calculate(stat = "diff in props", order = c("US", "GB"))
# inspect test statistic within null distribution
visualize(null_dist_props) +
  shade_p_value(obs_stat = d_hat, direction = "two_sided")
```
```{r}
# grab p-value
pval_props <- get_p_value(null_dist_props, obs_stat = d_hat, direction = "two_sided")
pval_props
```
We see that there is a `r str_c(round(pval_props * 100, 2), "%")` chance of a test statistic at least as extreme as ±`r round(pull(d_hat, stat), 3)` in a world where there is no difference in proportions by location. So, we observe strong evidence of a difference in the proportion of 7 & 8 responses by location, with a higher proportion of 7 & 8's coming from the US.