delta8tweets.Rmd

---
title: "delta8tweets"
author: "Drew Walker"
date: "6/9/2021"
output: html_document
bibliography: references.bib
---

# Topic modeling tweets

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
#remotes::install_github("slu-openGIS/postmastr")
library(postmastr)
library(textmineR)
library(tidyverse)
library(tidytext)
library(knitr) #Create nicely formatted output tables
library(kableExtra) #Create nicely formatted output tables
library(formattable) #For the color_tile function
library(lubridate)
library(academictwitteR)
library(tm)
library(tidygeocoder)
library(ggpubr)

APIToken <- read_csv("apikeys.csv")
api_key <- as.character(APIToken[1,1])
api_secret_key <- as.character(APIToken[1,2])
bearer_token <- as.character(APIToken[1,3])
```

# Resources for topic modeling

Tidy Text Mining With R (Julia Silge) <https://www.tidytextmining.com/topicmodeling.html>

Predominantly used: has some older code that i had to fix, may try to convert to Julia's topicmodels package <https://towardsdatascience.com/beginners-guide-to-lda-topic-modelling-with-r-e57a5a8e7a25>

# Pull tweet full archive tweet data

-   Add search by hashtag

```{r apitokens}


# d8_tweets_jan_2021 <-
#   get_all_tweets("delta 8 thc",
#                  "2021-01-01T00:00:00Z",
#                  "2021-06-09T00:00:00Z",
#                  bearer_token)
# 
# write_rds(d8_tweets_jan_2021, "d8_tweets_since_jan_2021.rds")


d8_tweets_jan_2020 <-  get_all_tweets("delta 8 thc",
                 "2020-01-01T00:00:00Z",
                 "2021-06-30T00:00:00Z",
                 bearer_token)

write_rds(d8_tweets_jan_2020, "d8_tweets_since_jan_2020.rds")


users_d8_tweets_jan_2020 <-
  get_user_profile(unique(d8_tweets_jan_2020$author_id), bearer_token)

users_d8_tweets_jan_2020 <- users_d8_tweets_jan_2020 %>% 
  rename(author_id = id)


d8_tweets_and_user_info_jan_2020 <- left_join(d8_tweets_jan_2020, users_d8_tweets_jan_2020, by = "author_id")

write_rds(d8_tweets_and_user_info_jan_2020, "d8_tweets_and_user_info_jan_2020.rds")

#Get tweets with Delta 8 thc or delta 8 hashtag

hashtag_delta8_tweets_jan_2020 <-  get_all_tweets("#delta8",
                 "2020-01-01T00:00:00Z",
                 "2021-06-30T00:00:00Z",
                 bearer_token)
users_hashtag_d8_tweets_jan_2020 <-
  get_user_profile(unique(hashtag_delta8_tweets_jan_2020$author_id), bearer_token)

users_hashtag_d8_tweets_jan_2020 <- users_hashtag_d8_tweets_jan_2020 %>% 
  rename(author_id = id)


hashtag_delta8_tweets_and_user_info_jan_2020 <- left_join(hashtag_delta8_tweets_jan_2020, users_hashtag_d8_tweets_jan_2020, by = "author_id")

write_rds(hashtag_delta8_tweets_and_user_info_jan_2020, "hashtag_delta8_tweets_and_user_info_jan_2020.rds")
```

# Tweet data NLP and trend analysis

```{r delta8tweets,}
d8_tweets <- readRDS("d8_tweets_and_user_info_jan_2020.rds")
d8_tweets <- d8_tweets
```

# prepping for nlp

Decision was made to remove stopwords post hoc, due to inclusion of bigrams, and recent reseearch recommending for adhoc removeal instead. [@schofield2017]

-   [@boon-itt2020] removed urls, emojis, special characters, retweets, hashtag symbols, hyperlinks.

    -   Also removed stopwords, lower cased, and words like corona and virus removed

```{r, preprocessing-tweets}

d8_tweets_text_id <- d8_tweets %>% 
  select(text,id)

#Token unnesting 

#Remove @ and RT tweet notation
d8_tweets$text <- gsub("RT.*:", "", d8_tweets$text)
d8_tweets$text <- gsub("@.* ", "", d8_tweets$text)
# sub out digits , punctuation
#Should digits be included in case of delta 8, delta 10? 
#text_cleaning_tokens$word <- gsub('[[:digit:]]+', '', text_cleaning_tokens$word)

d8_tweets$text <- gsub('[[:punct:]]+', '', d8_tweets$text)
text_cleaning_tokens <- d8_tweets %>% 
  tidytext::unnest_tokens(word, text) %>% 
  left_join(d8_tweets_text_id, by = "id") %>% 
  mutate(raw_text = text)
#remove words? like 
text_cleaning_tokens$word <- gsub('https', '', text_cleaning_tokens$word)
#remove anything where word is only 1 character like a i d, remove stopwords
#text_cleaning_tokens <- text_cleaning_tokens %>% filter(!(nchar(word) == 1))%>% 
  #anti_join(stop_words)
#Stem/lemmatizer?
# https://blogs.cornell.edu/cornellnlp/2019/02/09/choose-your-words-wisely-for-topic-models/ 
# may not need to, is often done to save resources, or combine multiple words to mean same thing. May try to do as a sensitivity check 

#remove commonly occurring words


#remove spaces
tokens <- text_cleaning_tokens %>% filter(!(word==""))
tokens <- tokens %>% mutate(ind = row_number())
tokens <- tokens %>% group_by(id) %>% mutate(ind = row_number()) %>%
  tidyr::pivot_wider(names_from = ind, values_from = word)

https_check <- text_cleaning_tokens %>% 
  filter(word == "https")
```

# Create document term matrix

```{r, create-dtm}
#create DTM
dtm <- CreateDtm(tokens$text, 
                 doc_names = tokens$id, 
                 ngram_window = c(1, 2))
#explore the basic frequency
tf <- TermDocFreq(dtm = dtm)
original_tf <- tf %>% select(term, term_freq,doc_freq)
rownames(original_tf) <- 1:nrow(original_tf)
# Eliminate words appearing less than 2 times or in more than half of the
# documents
#vocabulary <- tf$term[tf$term_freq > 1 & tf$doc_freq < nrow(dtm) / 2]
vocabulary <- tf$term[tf$term_freq > 50]
```

# Running lda on up to 20 clusters, determining number of ks for analysis by highest coherence score

-   We evaluated lda models using 1-20 clusters, and chose to conduct the analysis using 3 clusters due to the highest coherence score.

-   We tried others as a sensitivity analysis

```{r, lda}
set.seed(550055)
k_list <- seq(1, 20, by = 1)
model_dir <- paste0("models_", digest::digest(vocabulary, algo = "sha1"))
dir.create(model_dir)
model_list <- TmParallelApply(X = k_list, FUN = function(k){
  filename = file.path(model_dir, paste0(k, "_topics.rda"))
  
  if (!file.exists(filename)) {
    m <- FitLdaModel(dtm = dtm, k = k, iterations = 500)
    m$k <- k
    m$coherence <- CalcProbCoherence(phi = m$phi, dtm = dtm, M = 5)
    save(m, file = filename)
  } else {
    load(filename)
  }
  
  m
}, export=c("dtm", "model_dir")) # export only needed for Windows machines
#model tuning
#choosing the best model
coherence_mat <- data.frame(k = sapply(model_list, function(x) nrow(x$phi)), 
                            coherence = sapply(model_list, function(x) mean(x$coherence)), 
                            stringsAsFactors = FALSE)
ggplot(coherence_mat, aes(x = k, y = coherence)) +
  geom_point() +
  geom_line(group = 1)+
  ggtitle("Best Topic by Coherence Score") + theme_minimal() +
  scale_x_continuous(breaks = seq(1,20,1)) + ylab("Coherence")
```

# LDA on highest coherence model

```{r, lda}
# get top 
#model <- model_list[which.max(coherence_mat$coherence)][[1]]
#model <- model_list[which.max(coherence_mat$coherence)][[1]]
model <- model_list[[4]]
model$top_terms <- GetTopTerms(phi = model$phi, M = 25)
top20_wide <- as.data.frame(model$top_terms)
kable(top20_wide)
```

# assessing prevalence of topic

```{r, prevalence}
model$prevalence <- colSums(model$theta) / sum(model$theta) * 100

# prevalence should be proportional to alpha
plot(model$prevalence, model$alpha, xlab = "prevalence", ylab = "alpha")

```

# generating label topics

A label topic procedure was used using the textmineR GetProbableTerms function, which extracted terms (uni and bigrams) which are "more probable [in a set of topics] than a corpus overall".

```{r, coherence-table-modeling}
model$labels <- LabelTopics(assignments = model$theta > 0.05, 
                            dtm = dtm,
                            M = 1)

head(model$labels)


# put them together, with coherence into a summary table
model$summary <- data.frame(topic = rownames(model$phi),
                            label = model$labels,
                            coherence = round(model$coherence,3),
                            prevalence = round(model$prevalence,3),
                            top_terms = apply(model$top_terms, 2, function(x){
                              paste(x, collapse = ", ")
                            }),
                            stringsAsFactors = FALSE)

model$summary
```

# Sentiment Analysis

```{r, nrc}
nrc <- get_sentiments("nrc") # get specific sentiment lexicons in a tidy format

nrc_words <- tf %>%
  rename(word = term) %>% 
  inner_join(nrc, by="word")

pie_words<- nrc_words %>%
  group_by(sentiment) %>% # group by sentiment type
  tally %>% # counts number of rows
  arrange(desc(n)) # arrange sentiments in descending order based on frequency

ggpubr::ggpie(pie_words, "n", label = "sentiment", 
      fill = "sentiment", color = "white", 
      palette = "Spectral")
```