---
title: "Workshop Analyzing Social Media Data I: Twitter Data"
author: "Tiago Ventura"
date: "10/21/2020"
toc: true
format:
  html:
    theme: cosmo
    html-math-method: katex
    code-tools: true
    self-contained: true
execute:
  warning: false
  message: false
  error: false
  cache: true
editor_options: 
  chunk_output_type: console
---

## Introduction

Social media data comes in many flavors and from many different sources, so a one-hour workshop cannot cover every type in depth. In addition, social media companies give researchers different levels of access to their data.

There is no "one way to rule them all" when it comes to working with social media data.

For this reason, I decided to start this workshop with the most widely used and most easily accessible social media data for researchers: Twitter data. However, even though we will spend most of our time working with Twitter data, several of the techniques I hope to cover here are not restricted to it. These are general techniques that can be applied to pretty much any type of social media data.

With Twitter data, we will cover:

-   Analyzing and understanding network structure of social media data.
-   Some extra endpoints from the Twitter API (timelines, friends list, among others). 

To save some time, I stored all the data I am using in this tutorial [here](https://www.dropbox.com/s/ifevgl21kjotdc9/workshop_data_compress.zip?dl=0). This notebook should run if you place all the data in the same working directory. 

## Getting Access to the Twitter APIs

In order to get access to Twitter data, you need to first [apply for a Twitter developer account](https://developer.twitter.com/en/apply-for-access). Once your developer application has been approved, you get access to the standard product track by default. However, if you are an academic researcher and meet certain requirements, you can [apply to the academic research product track](https://developer.twitter.com/en/portal/petition/academic/is-it-right-for-you) which will give you elevated access to the Twitter API v2 including access to historical public Tweets for free.

We will be using the academic research access to the Twitter V2 API.

### Standard Access

- Search for Tweets from the last 7 days by specifying queries using supported operators (more on building queries in later sections)
- Stream Tweets in real-time as they are happening by specifying rules to filter for Tweets that you are interested in.
- Get Tweets from a user’s timeline (up to 3200 most recent Tweets)
- Build the full Tweet objects from a Tweet ID, or a set of Tweet IDs
- Look up follower relationships

These are just some examples of what you can get from the standard product track, relevant to academics. 

Currently, you can get up to 500,000 Tweets per month using the standard product track. This limit does not apply to the sampled stream endpoint, which gives a 1% sample of public Tweets in real time.

### Academic Research Product Track

This track includes:

- Ability to get historical Tweets from the entire archive of public conversation on Twitter, dating back to 2006 (using the full-archive search endpoint)
- Higher monthly Tweet volume cap of 10 million Tweets per month
- More advanced filter options to return relevant data, including a longer query length, support for more concurrent rules (for filtered stream endpoint), and additional operators that are only supported in this product track (more on this later)

For a complete list of available endpoints in the V2 API, check out the [Twitter API documentation](https://developer.twitter.com/en/docs/twitter-api).

For a complete course on accessing the Twitter V2 API, I suggest you take a look at the [101 course](https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research) prepared by the Twitter API team.

## Twitter Data: Presidential Elections in Brazil

To collect Twitter data from the V2 API, we will use the [`academictwitteR`](https://github.com/cjbarrie/academictwitteR) package developed by Chris Barrie. For R users, this is an amazing package because it allows you to easily query the API and process the data into a format that is easy to read in R.

If you prefer Python, I suggest you check out the [Twarc](https://twarc-project.readthedocs.io/en/latest/) library.


### Access tweets from the archive

Let's start by collecting some data with a textual query through the search endpoint. Academic access to the V2 API allows you to query the full Twitter archive and retrieve data from far back in time.

We will collect tweets about the recent presidential elections in Brazil.


```{r}
# Call packages using pacman
#install.packages("pacman")
pacman::p_load(here, jsonlite, tidyverse, academictwitteR)
```


```{r, eval=FALSE}
# Use academictwitteR to register your bearer token
set_bearer()

# Collect data
tweets <-
  get_all_tweets(
    query = "(eleicoes2022 OR lula OR bolsonaro OR ciro OR tebet)", # query
    start_tweets = "2022-10-01T00:00:00Z", # start time
    end_tweets = "2022-10-04T00:00:00Z", # end time
    file = "br_elections", # file to save
    data_path = "data_br/", # folder where all data will be stored as JSON files
    n = 200000, # maximum number of tweets
    lang = "pt" # keep only tweets in Portuguese
  )
```
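
To see how these arguments become a single query string, here is a minimal sketch. It assumes that `build_query()` is available in your installed version of `academictwitteR` and that, as the package documentation describes, `get_all_tweets()` passes extra filters such as `lang` on to it. The search terms come from the query above; `is_retweet = FALSE` is an extra filter added only for illustration. No API call is made.

```{r}
library(academictwitteR)

# Build the query string locally (no request is sent to the API)
toy_query <- build_query(
  query = c("eleicoes2022", "lula", "bolsonaro"), # character vector of search terms
  lang = "pt",                                    # language filter
  is_retweet = FALSE                              # illustration only: exclude retweets
)

# Inspect the resulting string before spending any of your monthly Tweet cap
toy_query
```

Checking the string first is a cheap way to catch a malformed query before starting a long collection.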

This data is stored as a series of smaller JSON files. The `academictwitteR` package has a specific function, `bind_tweets`, that easily combines these JSON files into a single data frame in the tidy format.

If you collect the data through other packages or access the API directly, you get long JSON files as responses. JSON files are basically a set of nested lists and can be tricky to clean, so the `bind_tweets` function can be very handy.

Another option, which is very common if you have a consistent data pipeline, is to build your own cleaning function that puts the data and variables in the format your project needs.

```{r eval=FALSE}
# data processing
tweets_tidy <- bind_tweets("./data_br", output_format = "tidy")
glimpse(tweets_tidy)
```


```{r echo=FALSE}
# data processing
tweets_tidy <- read_rds("tweets_tidy.rds")
glimpse(tweets_tidy)
```

That is a lot of data, but it is only a portion of what comes through the API. You can also process the complete raw data, as the next chunk does. This is a really nice feature of this package: the JSON data is stored in smaller pieces, which makes it easier to process later.

```{r eval=FALSE}
# examining the data
tweets_raw <- bind_tweets("./data_br", output_format = "raw")
str(tweets_raw, max.level=1)

```

```{r echo=FALSE}
# data processing
tweets_raw <- read_rds("tweets_raw_tidy.rds")
str(tweets_raw, max.level=1)
```
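
The custom cleaning function mentioned earlier can be sketched as follows. This is a minimal illustration on a toy JSON string, not the real API response: the field names are simplified assumptions, and `clean_toy_tweets()` is a hypothetical helper, not part of any package.

```{r}
library(jsonlite)

# Toy JSON string that mimics, in simplified form, one page of an API response
toy_json <- '{
  "data": [
    {"id": "1", "author_id": "10", "text": "first tweet #eleicoes2022",
     "public_metrics": {"retweet_count": 5, "like_count": 12}},
    {"id": "2", "author_id": "11", "text": "second tweet",
     "public_metrics": {"retweet_count": 0, "like_count": 3}}
  ]
}'

# Hypothetical cleaning function: parse the JSON and keep only the fields we need
clean_toy_tweets <- function(json_string) {
  parsed <- fromJSON(json_string)  # nested objects become nested data frames
  tweets <- parsed$data
  data.frame(
    tweet_id      = tweets$id,
    author_id     = tweets$author_id,
    text          = tweets$text,
    retweet_count = tweets$public_metrics$retweet_count,
    like_count    = tweets$public_metrics$like_count
  )
}

toy_clean <- clean_toy_tweets(toy_json)
toy_clean
```

Writing the function once and applying it to every stored file keeps the variables consistent across a project.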


### Network Analysis with Twitter Data

There are many different ways you can analyze Twitter data. You can analyze the text, the images, the geolocations, the links shared, among many other things. 

A favorite way for computational social scientists to analyze social media data is to look at user connections using some sort of network model. This is not limited to Twitter data: pretty much any social media application has some sort of network structure.

So let's start with some basics of network analysis in R. 

A network has two core elements: nodes and edges. On Twitter this means:

- Nodes are Twitter users

- Edges are any sort of connections these users make: a reply, a friendship, or, most commonly, a retweet.

We will be using retweets and quote tweets to give some examples of network analysis in R. We will use the `igraph` package to analyze network data in R. This package has many different features, but it stores data in a very distinct way. So let's work through it. 
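
Before working with the real data, here is a minimal sketch of how `igraph` represents nodes and edges. The user names and retweets are made up for illustration.

```{r}
library(igraph)

# Toy edge list: each row is one retweet, from the retweeting user to the original author
toy_edges <- rbind(
  c("ana",   "carla"),
  c("bruno", "carla"),
  c("diego", "carla"),
  c("diego", "ana")
)

# Build a directed graph directly from the two-column matrix
toy_net <- graph_from_edgelist(toy_edges, directed = TRUE)

V(toy_net)$name   # the nodes (users)
E(toy_net)        # the edges (retweets)
vcount(toy_net)   # number of nodes
ecount(toy_net)   # number of edges
```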


### Building a Network

### Step 1: Filter Nodes


```{r}
# Keep only tweets that reference a source tweet (such as retweets and quote tweets)
tweets_tidy_rt <- tweets_tidy %>%
  filter(!is.na(sourcetweet_type))

dim(tweets_tidy_rt)

# Visualize the data
tweets_tidy_rt %>%
  select(user_username, sourcetweet_author_id) %>%
  head()

```


### Step 2: Create an edge list

```{r}
# Create an edge list
# using the user id on both sides here to keep the same unit

data <- cbind(tweets_tidy_rt$author_id, tweets_tidy_rt$sourcetweet_author_id)
dim(data)
head(data)

```

Notice that we have two different types of users here. Hubs are the users who retweet, and authorities are the users who receive retweets.

### Step 3: Create your network structure

```{r}
pacman::p_load(igraph)

# Create an empty (directed) network
net <- make_empty_graph()

# Add nodes
net <- add_vertices(net,
                    length(unique(c(data))), # number of nodes
                    name = as.character(unique(c(data)))) # unique names

# Add edges (each row of the edge list is one directed edge)
net <- add_edges(net, t(data))

# summary
summary(net)

```

Your output:

- Igraph object
- 79886 unique nodes
- 154583 edges

### Step 4: Add information to your network object

Information comes in two flavors: information at the edge level (`E(object)`) and at the node level (`V(object)`). Let's see how it works.

```{r}
library(urltools)

# Edges 
E(net)$text <- tweets_tidy_rt$text
E(net)$idauth <- tweets_tidy_rt$sourcetweet_author_id
E(net)$namehub <- tweets_tidy_rt$user_username


# Capturing hashtags
E(net)$hash <- str_extract_all(tweets_tidy_rt$text, "#\\S+")


# grab expanded and unwound_url
entities <- tweets_raw$tweet.entities.urls

tidy_entities <- entities %>% 
                    # get columns we need
                    select(tweet_id, unwound_url) %>% 
                    #extract domains
                    mutate(unwound_url=domain(unwound_url)) %>%
                    # remove nas and combine multiple links
                    filter(!is.na(unwound_url)) %>%
                    group_by(tweet_id) %>%
                    summarise(domain=paste0(unwound_url, collapse=" -- "))

# Merge back by tweet id
tweets_tidy_rt <- left_join(tweets_tidy_rt, tidy_entities, by = "tweet_id")

# add to the network
E(net)$domain <- tweets_tidy_rt$domain
```
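
The key assumption in the chunk above is that attributes are assigned by position: the i-th value goes to the i-th edge (or node). A minimal sketch on a toy graph, with made-up users and texts, makes this explicit.

```{r}
library(igraph)

toy_edges_attr <- rbind(c("ana", "carla"), c("bruno", "carla"))
toy_net_attr <- graph_from_edgelist(toy_edges_attr, directed = TRUE)

# Edge-level information: one value per edge, in the same order as the edge list
E(toy_net_attr)$text <- c("RT great point #eleicoes2022", "RT agreed")

# Node-level information: one value per node, in the order of V(toy_net_attr)
V(toy_net_attr)$is_media <- c(FALSE, TRUE, FALSE)

# Export the edges together with their attributes as a regular data frame
toy_edge_df <- igraph::as_data_frame(toy_net_attr, what = "edges")
toy_edge_df
```

This is why the edge attributes in the notebook are taken from `tweets_tidy_rt`, the same data frame, in the same row order, that produced the edge list.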


## Network Statistics, Communities and Layout

Two very common concepts in network science are in-degree and out-degree. In-degree is the number of links pointing to a user; in our case, it shows how many retweets the user has received. Out-degree is the opposite: the number of links leaving a user, which here means how many retweets the user has made.

A user is called an authority when their in-degree is high; that is, the user has received many retweets from others. We call a user a hub when their out-degree is high, as this user retweets very often.

Accounts that are considered bots usually show a huge difference between in-degree and out-degree: nobody retweets them, they just retweet a lot, and usually very quickly.

```{r}
# Calculate in degree and out degree
V(net)$outdegree<-degree(net, mode="out")
V(net)$indegree<-degree(net, mode="in")
summary(net)
```
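
To make the two definitions concrete, here is a small sketch on a toy retweet network with made-up users, where each edge points from the retweeting user to the original author.

```{r}
library(igraph)

toy_rt <- graph_from_edgelist(
  rbind(c("ana",   "carla"),
        c("bruno", "carla"),
        c("diego", "carla"),
        c("diego", "ana")),
  directed = TRUE
)

toy_in  <- degree(toy_rt, mode = "in")   # retweets received
toy_out <- degree(toy_rt, mode = "out")  # retweets made

data.frame(user = V(toy_rt)$name, indegree = toy_in, outdegree = toy_out)
```

In this toy network, `carla` is the authority (in-degree of 3) and `diego` is the hub (out-degree of 2).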

### Layout

Networks have no natural spatial coordinates; they always live in a latent space. To visualize them, we usually resort to algorithms that model some dynamics of the network and return a layout for visualization. You can try out different algorithms, but Fruchterman-Reingold is a popular choice.

```{r eval=FALSE}
l <- layout_with_fr(net, grid = c("nogrid"))
#saveRDS(l, "layout.rds")
head(l)
```

```{r echo=FALSE}
l <- read_rds("layout.rds")
head(l)
```
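
The layout object is simply a matrix of coordinates, with one row per node. A minimal sketch on a toy graph (a ring of ten nodes, chosen only for illustration) shows its shape and how to use it.

```{r}
library(igraph)

set.seed(123)                        # the algorithm starts from random positions
toy_ring <- make_ring(10)            # toy graph: 10 nodes connected in a circle
toy_layout <- layout_with_fr(toy_ring)

dim(toy_layout)                      # one row per node, two columns (x and y)
plot(toy_ring, layout = toy_layout)  # pass the coordinates to the plotting function
```

Because the starting positions are random, setting a seed (or saving the layout to disk, as done above) keeps the figure reproducible.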

### Communities


Community detection is a big part of network analysis. The idea of these techniques is to find clusters of connections across your entire network space. Community detection is a huge subfield of network science. Here is a nice review piece by [Porter et al](https://www.ams.org/notices/200909/rtx090901082p.pdf). 

My take here is that for large networks, like the ones we usually deal with in social media, most of the algorithms will do the job you need, which in general is to identify the core communities in a network. An important point is to always validate these communities and use some qualitative analysis to verify the results.

We will use a random walk algorithm for community detection.

```{r eval=FALSE}
# Walktrap: community detection based on short random walks
my.com.fast <- cluster_walktrap(net)
str(my.com.fast, max.level = 1)
```

```{r echo=FALSE}
my.com.fast <- read_rds("community_detection.rds")
str(my.com.fast, max.level = 1)
```
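
To get a feel for what the algorithm returns, here is a minimal sketch on a toy graph where the community structure is known by construction: two groups of five fully connected nodes, joined by a single bridge.

```{r}
library(igraph)

# Toy graph: two cliques of five nodes each, connected by one edge
toy_groups <- disjoint_union(make_full_graph(5), make_full_graph(5))
toy_groups <- add_edges(toy_groups, c(1, 6))

toy_comm <- cluster_walktrap(toy_groups)

membership(toy_comm)   # community label of each node
sizes(toy_comm)        # number of nodes per community
modularity(toy_comm)   # quality score of the partition
```

Running the algorithm on a case like this, where you know the answer, is a simple first step toward the validation recommended above.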


Add the layout and the community membership to your igraph object:

```{r}
V(net)$l1 <- l[,1]
V(net)$l2 <- l[,2]
V(net)$membership <- my.com.fast$membership
```

## What are the largest communities?

```{r}

comunidades <- tibble(membership = V(net)$membership)

# Five largest communities by number of nodes
comunidades %>%
    count(membership) %>%
    slice_max(n, n = 5)


```


## Who are the main authorities in each community?


```{r autoridade}
# Create a data frame with the authorities
# (communities 11, 4, and 8 are the three analyzed in this notebook)
authorities <- tibble(name = V(net)$name, ind = V(net)$indegree,
                      membership = V(net)$membership) %>%
  filter(membership %in% c(11, 4, 8)) %>%
  group_by(membership) %>%
  arrange(desc(ind)) %>%
  slice(1:10)

```

The V2 API does not return the name of the author of the original tweet. We can collect this information using the function `get_user_profile()` 


```{r}
# To save time, look up only the top 10 authorities in each community
users_most_retweets <- authorities %>%
  mutate(data_user = map(name, get_user_profile)) %>%
  unnest(data_user)

# Helper: bar plot of the main authorities in one community
plot_authorities <- function(df, community) {
  df %>%
    filter(membership == community) %>%
    ggplot(aes(x = reorder(username, ind), y = ind)) +
    geom_col(width = .5, color = "black") +
    coord_flip() +
    xlab("") + ylab("") +
    theme_minimal(base_size = 12) +
    theme(plot.title = element_text(size = 22, face = "bold"),
          axis.title = element_text(size = 16),
          axis.text = element_text(size = 12, face = "bold")) +
    facet_grid(~membership)
}

# Main communities
plot_authorities(users_most_retweets, 4)
plot_authorities(users_most_retweets, 11)
plot_authorities(users_most_retweets, 8)
```

## Visualizing communities

```{r}

# A function that plots the nodes with density contours. Nice to visualize as well.
library(KernSmooth)

my.den.plot <- function(l, new.color, ind, legend, color){
  est <- bkde2D(l, bandwidth=c(10, 10))
  plot(l, cex=log(ind+1)/4, col=new.color, pch=16, xlim=c(-160,140), ylim=c(-140,160), xlab="", ylab="", axes=FALSE)
  legend("topright", legend[1:3], pch=16, col=color[1:3])
  contour(est$x1, est$x2, est$fhat, col=gray(.6), add=TRUE)
}

# Add colors for each community
# Start from an empty container: nodes outside the three communities stay NA and are not drawn
new.color <- rep(NA_character_, length(V(net)$membership))
new.color[V(net)$membership==11] <- "Yellow"
new.color[V(net)$membership==8] <- "pink"
new.color[V(net)$membership==4] <- "red"

# Save as a variable in the network object
V(net)$new.color <- new.color

# Plot
my.den.plot(l=cbind(V(net)$l1,V(net)$l2), new.color=V(net)$new.color, ind=V(net)$indegree,
            legend=c("Pro-Bolsonaro", "Anti-Bolsonaro I", "Anti-Bolsonaro II"),
            color=c("Yellow", "red", "pink"))

```


## Hashtags by communities

### Most Popular Hashtags

```{r}
hashtags <- tweets_tidy_rt %>%
            mutate(hashtags=str_extract_all(text, "#\\S+")) %>%
            unnest(hashtags) %>%
            count(hashtags) %>%
            arrange(desc(n)) %>%
            slice(1:30) %>%
            drop_na()

# Counting the hashtags
ggplot(hashtags, aes(x=reorder(hashtags, n), y=n)) +
  geom_col(width=.5, color="black", fill="steelblue") +
  coord_flip() +
  xlab("") + ylab("") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(size = 22, face = "bold"),
        axis.title=element_text(size=16),
        axis.text = element_text(size=12, face="bold"))


```
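
The hashtag extraction relies on a regular expression, so it is worth checking what the pattern captures. This sketch uses made-up tweets; the stricter `#\\w+` pattern is shown only as an alternative to consider and is not the one used in this notebook.

```{r}
library(stringr)

toy_text <- c("Vote today! #eleicoes2022 #Brasil",
              "No hashtags in this one",
              "Results are in: #eleicoes2022, finally")

# Pattern used above: "#" followed by any run of non-whitespace characters
toy_loose <- str_extract_all(toy_text, "#\\S+")
toy_loose

# Stricter alternative: "#" followed by word characters only,
# which drops trailing punctuation such as the comma in the third tweet
toy_strict <- str_extract_all(toy_text, "#\\w+")
toy_strict
```

With `#\\S+`, the third tweet yields `#eleicoes2022,` (comma included), which would be counted separately from `#eleicoes2022`.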

### Heterogeneous Activation of Hashtags


```{r}

# get communities
auth <- tibble(name=V(net)$name, membership=V(net)$membership)

# merge
data_hash <- tweets_tidy_rt %>%
  left_join(auth, by=c("author_id"="name")) %>%
  mutate(hashtags=str_extract_all(text, "#\\S+")) %>%
  unnest(hashtags) %>%
  drop_na(hashtags)

# Count by community and keep the 20 most frequent community-hashtag pairs
hash_by_comm <- data_hash %>%
  filter(membership %in% c(11, 8, 4)) %>%
  count(membership, hashtags) %>%
  slice_max(n, n = 20)

ggplot(hash_by_comm, aes(x=reorder(hashtags, n),
                         y=n, fill=as.factor(membership))) +
  geom_col(width=.5, color="black") +
  coord_flip() +
  xlab("") + ylab("") +
  scale_fill_manual(name="Communities", values=c("Red", "pink", "Yellow")) +
  theme_minimal(base_size = 10) +
  facet_wrap(~membership, scales="free")
```


## Sharing news on Twitter

Let me present one final analysis of Twitter data. It comes from a paper I wrote with Ernesto Calvo and Natalia Aruguete, [News Sharing, Gatekeeping, and Polarization: A Study of the #Bolsonaro Election](https://www.tandfonline.com/doi/full/10.1080/21670811.2020.1852094). The paper proposes a model to estimate behavioral parameters for news sharing with social media data. But before we get to the model, we show descriptively how news organizations are activated differently by each community on Twitter. This is an interesting way to visualize the formation of echo chambers on Twitter as a result of the activation of content rather than of sorting.

A lot of this code was developed by Ernesto Calvo. Thanks, Ernesto!

#### Counting news domains at the node level

```{r}
summary(net)

# First, select the most activated news domains
keynews <- head(sort(table(unlist(E(net)$domain)),decreasing=TRUE),20)
keynews.names <- names(keynews)

N <- length(keynews.names)
count.keynews <- array(0,dim=c(length(E(net)),N))
str(count.keynews)

# Loop over the domains: flag the edges (retweets) that share each domain
for(i in 1:N){
  temp <- grepl(keynews.names[i], E(net)$domain, ignore.case = TRUE)
  count.keynews[temp==TRUE,i] <- 1
}

# Setting the names of the media
colnames(count.keynews) <- keynews.names
count.keynews[1:5, 1:3]
```
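
The loop above builds an indicator matrix with one row per edge and one column per news domain. A minimal sketch with made-up edge data shows the structure. It uses `fixed = TRUE`, which is an assumption that differs from the loop above: it matches each domain literally, so the `.` in a domain name is not treated as a regular-expression wildcard.

```{r}
# Toy edge-level domains, in the " -- " separated format created earlier
toy_domains <- c("g1.globo.com",
                 "folha.uol.com.br -- g1.globo.com",
                 NA,
                 "estadao.com.br")
toy_keys <- c("g1.globo.com", "folha.uol.com.br")

# One column per domain: 1 if the edge shares it, 0 otherwise (NA counts as 0)
toy_counts <- sapply(toy_keys,
                     function(k) as.integer(grepl(k, toy_domains, fixed = TRUE)))
toy_counts
```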

#### Recovering the neighborhood matrix for nodes

```{r, eval=FALSE}

# Recover, for every node, its incident edges (el) and its neighbors (al)
el <- as_adj_edge_list(net, mode="all")
al <- as_adj_list(net, mode="all")
```

```{r, echo=FALSE}
# Recovering all nodes connections
el <- read_rds("el.rds")
al <- read_rds("al.rds")

```

#### Function to estimate the propagation of content at edge level

```{r}

# Function to measure the spread of content using the adjacency lists:
# for each node, sum the indicator `var` over the node's incident edges
fomfE <- function(var, adjV, adjE){
  # adjV (the neighbor list) is kept in the signature but is not used here
  stemp <- map_dbl(adjE, function(x) sum(var[x]))
  cbind(stemp)
}

# container
result_hash <- array(0,dim=c(length(V(net)),N))
# Repeat for each news domain
for(i in 1:N){
  bb <- fomfE(count.keynews[,i],al,el)
  bb[is.nan(bb[,1]),1] <- 0
  result_hash[,i] <- bb[,1]
}
colnames(result_hash) <- keynews.names

```
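
The aggregation from edges to nodes can be checked by hand on a toy graph. In this sketch, the graph and the edge-level indicator are made up: edges 1 and 3 share a given news domain, and each node receives the sum of the indicator over its incident edges.

```{r}
library(igraph)

# Toy directed graph with three edges: a->b (edge 1), a->c (edge 2), b->c (edge 3)
toy_g <- graph_from_edgelist(rbind(c("a", "b"), c("a", "c"), c("b", "c")),
                             directed = TRUE)

# Edge-level indicator: 1 if the edge shares the domain, 0 otherwise
toy_flag <- c(1, 0, 1)

# For each node, the ids of all edges that touch it (incoming or outgoing)
toy_adj_edges <- as_adj_edge_list(toy_g, mode = "all")

# Sum the indicator over each node's incident edges
toy_exposure <- vapply(toy_adj_edges,
                       function(e) sum(toy_flag[as.integer(e)]),
                       numeric(1))
toy_exposure
```

Node `b` touches both flagged edges and gets a value of 2, while `a` and `c` each touch one and get 1.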


#### Visualization


```{r}

par(mfrow=c(2,2))

# One panel per selected news domain (columns 1, 7, 9, and 10 of result_hash)
for (j in c(1, 7, 9, 10)) {
  plot(V(net)$l1, V(net)$l2, pch=16,
       col=V(net)$new.color,
       cex=log(result_hash[,j]+1),
       xlim=c(-100,150), ylim=c(-100,150), xlab="", ylab="",
       main=colnames(result_hash)[j], cex.main=1)
}

```

```{r, echo=FALSE}
dev.off()
```

#### Extra notes

With the matrix of users and links, we can do many things. For example, in the paper [News by Popular Demand](https://journals.sagepub.com/doi/abs/10.1177/19401612211057068), which I also co-authored with Ernesto Calvo and Natalia Aruguete, we show how to estimate the importance of ideology, reputation, and attention for sharing behavior on Twitter.

In this working paper by [Eady et al](https://osf.io/ch8gj/), the same matrix is used to estimate ideal points for users and media organizations from Twitter. The authors use a sophisticated Bayesian count model to estimate the ideal points. But, as you can see from this paper by [Pablo Barbera](http://pablobarbera.com/static/PSS-supplementary-materials.pdf), one can get very similar results using a simple singular value decomposition on a matrix of normalized residuals.
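
The singular value decomposition approach can be sketched in a few lines. This is a minimal illustration on a made-up matrix of users by news outlets, using the standardized residuals from correspondence analysis as one common way to normalize; it is not necessarily the exact procedure of the cited papers.

```{r}
# Toy count matrix: rows are users, columns are outlets, cells are shares
toy_m <- rbind(
  user1 = c(10, 8, 0, 1),
  user2 = c(9, 7, 1, 0),
  user3 = c(0, 1, 9, 10),
  user4 = c(1, 0, 8, 9)
)
colnames(toy_m) <- c("outletA", "outletB", "outletC", "outletD")

# Standardized residuals: observed minus expected proportions under independence
toy_p <- toy_m / sum(toy_m)
toy_expected <- outer(rowSums(toy_p), colSums(toy_p))
toy_resid <- (toy_p - toy_expected) / sqrt(toy_expected)

# The first singular vectors give one-dimensional positions for users and outlets
toy_svd <- svd(toy_resid)
toy_user_scores <- toy_svd$u[, 1]
toy_outlet_scores <- toy_svd$v[, 1]

round(toy_user_scores, 2)
round(toy_outlet_scores, 2)
```

In this toy example, users 1 and 2 land on one side of the first dimension and users 3 and 4 on the other, mirroring the two groups of outlets they share. The sign of the dimension is arbitrary.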

## Other API Endpoints

Most of our work with the Twitter API relies on the capacity to query it with search terms. For this reason, the search endpoint (and the filter endpoint, for live data collection) are the most popular.

However, there are a few other endpoints from the Twitter API that can also be very useful for research purposes. Let's walk through them briefly.

## Getting user id

Imagine a research project in which you have the Twitter accounts of elites and want to collect their Twitter data. The first step is to collect their ids.

```{r}
# getting some Twitter Ids
pelosi <- get_user_id("SpeakerPelosi")
```

## Getting whom a user follows

```{r}
pelosi_network <- get_user_following(pelosi)
glimpse(pelosi_network)
```

## Estimate user ideology


```{r}
#devtools::install_github("pablobarbera/twitter_ideology/pkg/tweetscores")
library(tweetscores)
results <- estimateIdeology("SpeakerPelosi", pelosi_network$id, verbose = FALSE)
plot(results)
```

## User timeline

```{r}
pelosi_tl <- get_user_timeline(pelosi,
                               start_tweets = "2022-01-01T00:00:00Z",
                               end_tweets = "2022-10-22T00:00:00Z",
                               n = 100) # limit
glimpse(pelosi_tl)
```

## Tweets liked by a user

```{r}
pelosi_likes <- get_liked_tweets(pelosi)
glimpse(pelosi_likes) # she mostly liked her own tweets

```