# Web APIs {#sec-apis}
```{r setup}
#| echo: false
#| include: false
source("_common.R")
```
In @sec-corpora, we listed numerous places to find datasets online. The sorts of datasets discussed there are generally formatted as a .csv file---or something like it---which you can download directly to your computer and import into R using `read_csv()`. Lots of online data, however, are not pre-packaged in this way. When datasets are very large or complex, or are updated regularly, the hosts of the data will instead provide a **web API**.
You can think of web APIs as little programming languages that are written inside a URL (URLs are the addresses you type at the top of your web browser to tell it which web page you want to access). When you access your custom-written URL, the host performs whatever computations are necessary to retrieve your data and sends it to you in the same way that it would send you a website.
Web APIs are sometimes difficult to use because each one works differently. Learning to use a new API can be like learning a new programming language. For this reason, web APIs sometimes have associated **wrappers**, packages that allow you to communicate with the API in a more familiar format.
In the first part of this chapter, we introduce the [`vosonSML` package](https://github.com/vosonlab/vosonSML), an R-friendly wrapper that provides easy access to the Reddit, Twitter, Mastodon, and YouTube APIs. At the end of the chapter, we will discuss web APIs that do not have a convenient wrapper in R.
## API Basic Concepts
- **Requests:** Each time you visit a URL associated with an API, you are submitting a *request* for data.
- **Endpoints:** Every API has at least one *endpoint*, a contact point for particular types of requests.
- **Rate Limits:** Many APIs set limits on the number of requests you can make per minute (or per second). This is because processing requests costs time and money for the host. If you go beyond the rate limit, the API will return an error like *"429 Too Many Requests."*
- **Authentication:** Some APIs are not open to the public, instead requiring users to apply for access or pay for a subscription. When accessing these APIs, you need an *API key* or an *access token*. This is your password for the API. (The sketch below shows how these pieces fit together.)
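To make these concepts concrete, here is a sketch of how a typical request URL is assembled. The endpoint, parameter names, and key below are hypothetical---every API defines its own:
```{r}
#| eval: false
# Anatomy of a hypothetical API request URL
request_url <- paste0(
  "https://api.example.com/v1/posts", # endpoint
  "?query=depression",                # first parameter follows "?"
  "&limit=100",                       # further parameters follow "&"
  "&key=YOUR_API_KEY"                 # authentication (if required)
)
# Visiting this URL (e.g. in a browser) submits one request
```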
### vosonSML
```{r}
library(vosonSML)
```
For accessing social media APIs with `vosonSML`, you only need two functions:
- `Authenticate()` creates a credential object that contains any keys or access tokens needed to access a particular API. This credential object can be reused as long as your credentials don't change.
- `Collect()` initiates a series of API requests and stores the results as a dataframe or list of dataframes. The sketch below shows the general pattern.
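For example, here is a minimal sketch of the two-step pattern, reusing one credential object across two collections (the subreddits are placeholders, and the `Collect()` parameters are explained in the sections below):
```{r}
#| eval: false
# Create a credential object once...
reddit_auth <- Authenticate("reddit")
# ...then reuse it for multiple Collect() calls
posts_a <- reddit_auth |>
  Collect(endpoint = "listing", subreddits = "dataisbeautiful", max = 10)
posts_b <- reddit_auth |>
  Collect(endpoint = "listing", subreddits = "SampleSize", max = 10)
```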
`vosonSML` also provides tools for working with network data (i.e. the ways in which users or posts are connected to one another), but these will not be covered in this textbook.
## Reddit
Reddit [generated over 3 billion posts and comments in 2022](https://www.statista.com/statistics/1319008/reddit-content-created/). Many of these contain long-form text. And its API is free. These traits make it very useful to researchers.
Reddit content exists on three levels:
- **Communities**, called "subreddits," are spaces for users to post about a specific topic. Individual subreddits are referred to as "r/SUBREDDIT". For example, [r/dataisbeautiful](https://www.reddit.com/r/dataisbeautiful/) is for data visualizations, [r/NaturalLanguage](https://www.reddit.com/r/NaturalLanguage/) is for posts about natural language processing, and [r/SampleSize](https://www.reddit.com/r/SampleSize/) is a place to gather volunteer participants for surveys and polls. Communities are policed by *moderators*, users who can remove posts or ban other users from the community.
- **Posts** are submitted by users to a particular subreddit. Each post has a **title**, which is always text, and **content**, which can contain text, images, and videos.
- **Comments** are responses to posts, responses to responses to posts, responses to responses to responses to posts, etc. These are always text.
### The Reddit Algorithm
Reddit data are not representative samples of the global population. They are not even representative samples of Reddit users. This is partly due to the dynamics of the Reddit ranking algorithm, which gives precedence to viral content. The details of the algorithm are [no longer public](https://medium.com/hacking-and-gonzo/how-reddit-ranking-algorithms-work-ef111e33d0d9), but it is largely based on "upvotes" and "downvotes" from the community, and probably also incorporates the time since posting. The ranking system for comments is almost certainly different from the ranking system for posts. Reddit also has a "Karma" system, by which users who post popular content get subsequent content boosted. This creates an incentive system that is [sometimes exploited by the advertising industry](https://www.reddit.com/r/ModSupport/comments/162p502/why_is_reddit_doing_nothing_to_handle_the_obvious/). The bottom line: **Reddit posts are disproportionately viral**. To partially counteract this when gathering Reddit data, set the API to sort by recency (`sort = "new"`) rather than the default, "best" ([@sec-reddit-threads]).
### Communities
To retrieve the posts from a Reddit community, call `Authenticate("reddit")` followed by `Collect()` with `endpoint = "listing"`. To specify the particular community (or communities) in which you are interested, use the `subreddits` parameter. For example, the following code retrieves the 20 most recent posts from r/RedditAPIAdvocacy, a subreddit devoted to fighting restriction of the Reddit API.
```{r}
#| warning: false
APIAdvocacy_posts <-
Authenticate("reddit") |>
Collect(
endpoint = "listing",
subreddits = "RedditAPIAdvocacy",
sort = "new", # newest posts first
period = "all", # all time
max = 20, # 20 most recent posts
verbose = TRUE
)
head(APIAdvocacy_posts)
```
To psychology researchers, the most interesting subreddits are often those devoted to psychological disorders, for example r/depression, r/socialanxiety, r/SuicideWatch, r/bipolarreddit, and r/opiates. Subreddits devoted to intellectual discussion, such as r/changemyview, r/IAmA, and r/ExplainLikeImFive, are also interesting subjects for research. Lastly, much research is devoted to the behavior of Redditors in political communities.
**An example of using Reddit's political communities in research:** In a series of experiments, @ashokkumar_pennebaker_2022 used the Reddit API and other sources to develop a dictionary-based measure of group identity strength. A dictionary is a list of words associated with a given psychological or other construct, for example associating "sleepy" and "down" with depression (see [@sec-word-counting]). Ashokkumar and Pennebaker had participants write short, free-form responses and fill out questionnaires on group identity strength. They then identified existing dictionaries that were associated with the questionnaire-based measures, and used these to construct a composite measure of group identity strength. When applied to text, they called this measure "unquestioning affiliation." Among college students, they showed that unquestioning affiliation in writing could predict whether students would drop out of college one year later. Finally, they applied their method to naturalistic data retrieved from the Reddit API, from the political communities r/The_Donald and r/hillaryclinton, and showed that users' unquestioning affiliation predicted how long they would stay in the community before leaving.
### Threads {#sec-reddit-threads}
A post with all of its associated comments is called a **thread**. For example, Hadley Wickham, the founder of [the tidyverse](https://www.tidyverse.org), ran a thread on the r/dataisbeautiful subreddit in 2015, in which he answered commenters' questions. To retrieve that thread, first find its URL. Do this by finding the thread on Reddit and copying the link from your web browser. Then call `Authenticate("reddit")` and `Collect()`, like so:
```{r}
#| warning: false
# List of thread urls (in this case only one)
threads <- c("https://www.reddit.com/r/dataisbeautiful/comments/3mp9r7/im_hadley_wickham_chief_scientist_at_rstudio_and/")
# Retrieve the data
## Since the Reddit API is open, we don't need
## to give any passwords to Authenticate()
hadley_threads <-
Authenticate("reddit") |>
Collect(
threadUrls = threads,
sort = "new", # newest comments first
verbose = TRUE # give updates while running
)
# Peek at Hadley's responses
hadley_threads |>
filter(user == "hadley") |>
select(structure, comment) |>
head()
```
The resulting dataframe has many columns. The most useful are the following:
- `comment` is the text of the comment itself.
- `user` is the user who posted the comment.
- `structure` is the tree structure leading to the comment. For example, "137_6_2" is the 2nd comment on the 6th comment on the 137th comment on the original post.
- `comm_date` is the date and time of the comment, in UTC. Since this is in character format, it needs to be converted to a datetime with `lubridate::as_datetime()`, as sketched below.
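For example, a minimal sketch of that conversion (assuming `as_datetime()` can parse the character format directly):
```{r}
#| eval: false
library(lubridate)
# convert the character timestamps to datetimes
hadley_threads <- hadley_threads |>
  mutate(comm_date = as_datetime(comm_date))
```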
By processing the `structure` values, we can conceptualize the thread as a tree, with the original post as the root and comments as branches:
```{r}
#| warning: false
library(ggraph)
hadley_threads |>
mutate(
level = str_count(structure, "_") + 1L,
parent = str_remove(structure, "_[[:digit:]]+$"),
parent = if_else(level == 1, "0", parent)
) |>
select(parent, structure) |>
tidygraph::as_tbl_graph() |>
ggraph(layout = 'tree', circular = FALSE) +
geom_edge_diagonal(alpha = .2, linewidth = 1) +
geom_node_point(shape = 21, fill = "orangered") +
theme_void()
```
**An example of using Reddit threads in research:** @xiao_mensah_2022 collected threads from r/changemyview, a community in which the original poster (OP) makes a claim, commenters make arguments against that claim, and the OP responds with "delta points" to indicate how much their view has been changed. @xiao_mensah_2022 analyzed the frequency of delta points at each level of the thread tree, and found that the OP's view tended to change most after the 2nd, 4th, 6th, 8th, and 10th levels of comments---in other words, every other level. They then analyzed the semantic similarity between comments using cosine similarity ([@sec-cosine-similarity]) between simple word counts. The results suggested that every-other-level comments tend to elaborate and refine the comments immediately before them, so that those preceding comments are perceived to be more persuasive.
### Other Reddit Features
For most applications, `vosonSML` is a sufficient Reddit API wrapper. For slightly more advanced functionality like searching for subreddits with a keyword or retrieving a user's comment history, we recommend [Reddit Extractor](https://github.com/ivan-rivera/RedditExtractor).
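For instance, a sketch of those two tasks (function names as documented in the RedditExtractoR package at the time of writing---check its README before relying on them):
```{r}
#| eval: false
library(RedditExtractoR)
# search for subreddits matching a keyword
nlp_subreddits <- find_subreddits("natural language processing")
# retrieve a user's posting and commenting history
user_history <- get_user_content("hadley")
```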
::: {.callout-tip icon="false"}
## Advantages of Reddit Data
- **Explicit Communities:** Reddit communities are clearly defined and explicit about their purposes. Reddit includes communities devoted to fictional storytelling, factual storytelling, personal reflection, technical advice, political discussion, joke telling, and much more. This makes it easy to gather a domain-specific sample.
- **Long-form Text Responses:** While some social media platforms have character limits for posts or comments, Reddit has many communities devoted to long-form text. Longer documents can make characterization of their content more accurate.
- **Anonymity:** Reddit users can remain relatively anonymous, which might encourage more honest and open sharing of experiences.
:::
::: {.callout-important icon="false"}
## Disadvantages of Reddit Data
- **Selection Bias:** Certain subreddits may attract specific demographics, leading to potential selection bias in the data.
- **Anonymity:** In some cases, anonymity may make user behavior less representative. For example, many users have multiple accounts, which they use for different activities on the platform, making users seem disproportionately narrow in their interests.
:::
## Twitter / X
Twitter has been called the "model organism" of big data research [@tufekci_2014]. This is because, until 2023, Twitter was the largest source of free, open social text data. In contrast to Facebook, almost all Twitter activity is public. Twitter's standardized character limit (originally 140, now 280), along with its simple network structure, makes structuring and analyzing the data easy.
### Followers, Mentions, and Retweets
Whereas Reddit users subscribe to communities, Twitter users subscribe to other users. This is called "following." They can also "retweet" other users' posts to their own feed, or explicitly mention other users in posts using the "\@" symbol. This focus on connections between individuals means that each user's communal associations can be characterized in a rich, nuanced way.
**An example of using Twitter networks in research:** @mosleh_etal_2021 collected Twitter IDs of participants who filled out a set of questions with intuitively compelling but incorrect answers, testing analytic thinking. They then analyzed the Twitter accounts that each participant followed, and found a large cluster of accounts followed preferentially by people with less analytic thinking. This group most prominently featured Twitter accounts of retail brands.
### Geolocation on Twitter
Twitter allows users to attach a geographical location to their posts. While these user-provided locations can sometimes be difficult to interpret, they allow analyses by geographic location.
**An example of using Twitter geolocation in research:** @takhteyev_etal_2012 used Twitter's public feed to gather a sample of users. The researchers also collected the profiles of all the users followed by those in the original sample. They then analyzed the distribution of these ties over geographical distance, national boundaries, and language differences. They found that Twitter users are most strongly tied to people in their own city, and that between regions, the number of airline flights was the best predictor of ties.
::: {.callout-tip icon="false"}
## Advantages of Twitter Data
- **Character Limit:** Twitter's character limit means that Tweets vary less in their length. This can decrease variance in measurements.
- **Pure Network Structure:** As opposed to Reddit's distinct communities and hierarchical comment trees, Twitter has a network-like social structure in which users are linked to other individual users or posts. This structure can reveal underlying communities and the strength of the connections between them.
:::
::: {.callout-important icon="false"}
## Disadvantages of Twitter Data
- **Character Limit:** The character limit on tweets may limit the depth of responses and the ability to convey complex psychological experiences. The smaller sample of words in each document can also make measurement less reliable.
- **Limited Context:** Tweets may lack context, making it challenging to fully understand the meaning behind short messages.
- **No more free API:** While lots of historical Twitter data are available on the internet (see @sec-corpora), [the API has been prohibitively expensive since 2023](https://www.wired.com/story/twitter-data-api-prices-out-nearly-everyone/).
:::
## Mastodon
In some ways, you can think of Mastodon as open source Twitter. Indeed, a large portion of the Mastodon user base started using it as a replacement for Twitter after [controversial events in 2022](https://en.wikipedia.org/wiki/Mastodon_(social_network)#2022_Twitter-related_spikes_in_adoption). Like Twitter users, Mastodon users publish short posts to their feed, in which they can mention other users with "\@" or use hashtags with "\#." Like Twitter users, Mastodon users can follow other users and be followed by them.
Despite their similarities, networks of users on Mastodon and Twitter are not the same. The biggest difference is that each Mastodon user is associated with a *server* (also known as an *instance*). Mastodon servers can have hundreds of thousands of users, or just a few. Each server runs the Mastodon source code independently and hosts all of its users' content. Many servers have a theme based on a specific interest. It is also common for servers to be based around a particular locality, region, ethnicity, or country. For example, [tech.lgbt](https://tech.lgbt) is for LGBTQIA+ technologists, [SFBA.social](https://sfba.social) is for the San Francisco Bay Area in California, and [fediscience.org](https://fediscience.org) is for active scientists.
Because of their topic-specific organization, Mastodon servers can function a bit like Reddit communities. Like Reddit communities, Mastodon servers can set their own rules for content. They can even change the default 500-character limit. The key difference is that on Reddit, *posts* are associated with a community, whereas on Mastodon, *users* are associated with a community. In other words, community affiliation on Mastodon is less related to the content of its posts and more related to the identity of its users. This feature can be useful when collecting user activity. For example, if we wanted to study the overall behavior of members of the LGBTQIA+ community, we could collect the activity of users affiliated with the [tech.lgbt](https://tech.lgbt) and other community-specific servers.
### Servers
Mastodon server feeds are organized chronologically (unlike Reddit or Twitter, which have complex ordering algorithms). To retrieve the most recent posts on a particular server using `vosonSML`, we call `Authenticate("mastodon")` followed by `Collect()` with the "search" endpoint. The server is specified with the `instance` parameter.
```{r}
#| eval: false
mast_science <-
Authenticate("mastodon") |>
Collect(
endpoint = "search",
instance = "fediscience.org",
local = TRUE,
numPosts = 100, # number of most recent posts
verbose = TRUE
)
```
The global feed (i.e. across all servers) can be accessed by setting `local = FALSE`. You can also retrieve posts with a specific hashtag using the `hashtag` parameter (e.g. `hashtag = "rstats"`).
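For instance, a sketch of a hashtag search across the global feed (note that the hashtag is given without the "#"):
```{r}
#| eval: false
rstats_posts <-
  Authenticate("mastodon") |>
  Collect(
    endpoint = "search",
    instance = "fediscience.org",
    local = FALSE,      # search the global feed
    hashtag = "rstats", # posts tagged #rstats
    numPosts = 100,
    verbose = TRUE
  )
```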
In all cases, the call results in a list of two dataframes, "posts" and "users." In the "posts" dataframe we have columns such as:
- `content.text` is the text of the post itself.
- `account` is a column of one-row dataframes containing information about the user who posted each post. You can convert this into a simple account ID column with `mutate(account_id = sapply(account, function(c) pull(c,id)))`.
- `id` is a unique ID for the post.
- `in_reply_to_id` is the ID of the post that this post is directly responding to. This can be used to build tree structures, as we did in @sec-reddit-threads, but see @sec-mastodon-threads for an alternative way to do this.
- `in_reply_to_account_id` is the account ID of the poster to whom this post is directly responding. This can be used to build networks of users as opposed to networks of threads.
- `created_at` is the date and time of the post in UTC, already in datetime format.
The "users" dataframe has the same information as the `account` column of "posts", but formatted as a single dataframe with one row per user.
### Threads {#sec-mastodon-threads}
Since Mastodon users can mark their posts as a reply to an earlier post, these posts are essentially "comments" on the original post. A chain of posts that reply to an original post is called a thread. We can collect the contents of one or more threads with `endpoint = "thread"`, by specifying the URL of the first post in the thread with `threadUrls`. For example, we can retrieve [a thread by one of the authors of this book, Almog](https://fediscience.org/@almogsi/110513658753942590):
```{r}
#| output: false
almog_thread <-
Authenticate("mastodon") |>
Collect(
endpoint = "thread",
threadUrls = c("https://fediscience.org/@almogsi/110513658753942590"),
verbose = TRUE
)
almog_thread$posts |>
select(created_at, content.text) |>
head()
```
### Other Mastodon Features
For most applications, `vosonSML` is a sufficient Mastodon API wrapper. For slightly more advanced functionality like searching accounts or retrieving a user's follows and followers, we recommend [rtoot](https://gesistsa.github.io/rtoot/).
::: {.callout-tip icon="false"}
## Advantages of Mastodon Data
- **Free:** Because Mastodon's servers are decentralized (i.e. there is no single company controlling its operation), the API is free to use and there is little concern that its data will become unavailable.
- **No Feed Algorithm:** Mastodon does not use a black-box algorithm to decide what users see on their feed. Users have total control over their own subscriptions (e.g. users who follow more accounts have more content on their feeds). This eliminates a possible confounding factor present in many social media platforms.
- **Domain-Specific Servers:** Users are associated with a particular server, often associated with an identity or region. This allows more targeted data collection.
:::
::: {.callout-important icon="false"}
## Disadvantages of Mastodon Data
- **Sampling Bias:** Because Mastodon is largely populated by users who fled from Twitter or Facebook, their demographics may be especially unrepresentative.
- **Smaller User Base:** Mastodon has a much smaller user base compared to major platforms like Twitter and Reddit.
:::
## YouTube
YouTube is structurally similar to Twitter in that neither users nor posts are explicitly associated with a community. Users subscribe directly to other users' channels. Each user can upload posts to their own channel. As on Reddit, these posts generally have a title, a description, and a space for comments.
YouTube is different from other social media platforms in a few ways. Most obviously, posts consist primarily of videos. This means that analyzing a post's description will not give an accurate representation of the post's content. Analyzing the content of the video directly is generally impossible, since YouTube does not allow videos to be downloaded through their API. One possible workaround uses YouTube's automatically generated video transcriptions---though YouTube does not allow access to full transcriptions for standard API key bearers, there are [some workarounds available](https://github.com/jdepoix/youtube-transcript-api). Nevertheless, we will limit ourselves here to text data. This primarily consists of comments on videos.
### Video Comments
Another important difference between YouTube and other social media platforms is in the structure of the comments. Whereas Reddit comments (or Twitter/Mastodon threads) can respond one to another, generating a deep tree-like structure, YouTube comments have only two levels of depth. In other words, comments either respond directly to a video (top-level) or they respond to a top-level comment (level two). This feature can sometimes make conversations hard to track, since level two comments may be responding to another level two comment, even if they are not explicitly marked as such. On the other hand, this feature may constrain YouTube comments to respond more directly to the video.
The interpretation of YouTube comments as direct responses to the original video is enhanced by the rich video format of the stimulus, which likely draws users in more than a text-only post would [cf. @yadav_etal_2011]. For this reason, YouTube can be a good platform for studying responses to shared stimuli in a social context.
To access the YouTube API, you will need an API key. You can get an API key in a matter of seconds by [registering your project with Google Cloud](https://www.youtube.com/watch?v=of0nzBJrT20). Once you have the key, you can access the API through `vosonSML` by calling `youtube_auth <- Authenticate("youtube", apiKey = "xxxxxxxxxxxxxx")`, followed by `Collect()`. In the code below, we'll collect the comments on [3Blue1Brown's excellent explanation of deep learning algorithms](https://youtu.be/aircAruvnKk?si=XKJeVfrB2GGrBx3H).
```{r}
#| output: false
# retrieve youtube_auth from separate R script
# (so as not to publicize my key)
source("~/Projects/sandbox/youtube_auth.R")
# this could also be a vector of URLs
video_url <- "https://www.youtube.com/watch?v=aircAruvnKk"
# collect comments
deep_learning_comments <-
youtube_auth |>
Collect(videoIDs = video_url,
maxComments = 100,
verbose = TRUE)
```
This call results in a dataframe with many columns, including the following:
- `Comment` is the text of the comment itself.
- `AuthorDisplayName` is the username of the commenter.
- `AuthorChannelUrl` is the URL for the commenter's own channel.
- `PublishedAt` is the date and time of the comment, in UTC. Since this is in character format, it needs to be converted to a datetime with `lubridate::as_datetime()`.
- `CommentID` is a unique ID for the comment.
- `ParentID` is the ID of the comment to which this comment is responding.
If we visualize these comments like we visualized the Reddit comments in @sec-reddit-threads, we see that the tree has only two levels:
```{r}
#| warning: false
set.seed(2)
deep_learning_comments |>
select(ParentID, CommentID) |>
tidygraph::as_tbl_graph() |>
ggraph::ggraph(layout = 'tree', circular = FALSE) +
ggraph::geom_edge_diagonal(alpha = .2, linewidth = 1) +
ggraph::geom_node_point(shape = 21, fill = "orangered") +
theme_void()
```
**An example of using YouTube comments in research:** @rosenbusch_etal_2019 collected comments from 20 vlogs each from 110 vloggers on YouTube, along with the transcripts of those videos. They used dictionary-based word counts (@sec-word-counting) to measure the emotional content of both video transcripts and video comments. Using a multilevel model, they found that both video- and channel-level emotional content independently predict commenter emotions.
::: {.callout-tip icon="false"}
## Advantages of YouTube Data
- **Rich Stimuli:** YouTube videos are extended and multimodal (i.e. they include both audio and visual stimuli). Video comments can be usefully construed as responses to that stimulus.
- **Constrained Structure:** YouTube limits its comment trees to two levels, which can simplify analyses.
:::
::: {.callout-important icon="false"}
## Disadvantages of YouTube Data
- **Missing Context:** YouTube comments respond to videos. Since videos are generally unavailable for download and are in any case of a different medium than text-based comments, measuring the relationship between a comment and the video it refers to can be difficult.
- **Constrained Structure:** Because YouTube limits its comment trees to two levels, complex referential structures can be difficult to decode.
:::
## Other Web APIs {#sec-other-apis}
In this chapter, we have covered four social media platforms and---with the exception of Twitter---how to access their APIs using `vosonSML`. Nevertheless, this is far from an exhaustive list of social media platforms that provide APIs. [Many other sources of data exist](https://en.wikipedia.org/wiki/Comparison_of_microblogging_and_similar_services), each with their own advantages and disadvantages. Many of these do not have custom R wrappers for their APIs, but the need to read a little API documentation shouldn't deter you from pursuing your ideal dataset. The following are some of our favorite sources of psychologically interesting data accessible with public APIs.
- [StackExchange](https://stackexchange.com) is a network of question-and-answer sites like StackOverflow (for programmers), MathOverflow (for mathematicians), Arqade (for videogamers), Cross Validated (for statistics and machine learning), Science Fiction & Fantasy (for fans), and [many more](https://stackexchange.com/sites?view=list#users). StackExchange provides [a public API](https://api.stackexchange.com/docs) for retrieving questions, answers, and comments from their sites.
- [Wikipedia](https://www.wikipedia.org) provides [a public API](https://www.mediawiki.org/wiki/API:Main_page#Uses_for_the_MediaWiki_Action_API) for retrieving content, including [individual contributor data](https://www.mediawiki.org/wiki/API:Users).
- [Semantic Scholar](https://www.semanticscholar.org) is a website for navigating academic literature. It offers [a public API](https://www.semanticscholar.org/product/api) for accessing abstracts, references, citations, and other information about academic publications. The API also allows access to their contextualized semantic embeddings of publications (see @sec-contextualized-embeddings). Obtaining an API key requires a simple registration.
- Like Semantic Scholar, [CORE](https://core.ac.uk/services/dataset) offers access to full text articles and metadata for open access research papers. Obtaining access requires a simple registration.
- Facebook advertises a [research-oriented API](https://fort.fb.com/researcher-apis), which requires registration.
- Mastodon is not the only open source, decentralized social media platform with an API. [There are many](https://the-federation.info), though most are small.
To access one of these APIs through R, we recommend using the [`httr2`](https://httr2.r-lib.org) package. Most APIs send data in JSON format. As we mentioned in @sec-corpora, a JSON file is like a list in R, but formatted slightly differently. The `httr2` package also has functions for converting JSON to R objects. In the example below, we write a function to retrieve all questions with a particular tag from a particular site on StackExchange, and format them as a tibble.
```{r}
#| eval: false
library(httr2)
# StackExchange API search endpoint
endpoint <- request("https://api.stackexchange.com/2.2/search")
# Function to retrieve posts with a given tag
get_tagged_posts <- function(tag, ...,
# default params
site = "stackoverflow",
pagesize = 100,
user = NULL,
order = "desc",
sort = "activity",
page = 1) {
# define request params
params <- list(
...,
order = order,
sort = sort,
tagged = tag,
site = site,
pagesize = pagesize,
page = page
)
# add user agent (this is polite for APIs)
req <- endpoint |>
req_user_agent(user)
# add query parameters
req <- req |>
req_url_query(!!!params)
# perform request + convert to list
resp <- req |>
req_perform() |>
resp_body_json()
# warn the user if the API sent a "backoff" message
if(!is.null(resp$backoff)){
warning("Received backoff request from API.",
"Wait at least ", resp$backoff, " seconds",
" before trying again!")
}
# Convert list to tibble,
# keeping only relevant variables
tibble::tibble(
title = sapply(
resp$items,
function(x) x$title
),
is_answered = sapply(
resp$items,
function(x) x$is_answered
),
view_count = sapply(
resp$items,
function(x) x$view_count
),
creation_date = sapply(
resp$items,
function(x) x$creation_date
),
user_id = sapply(
resp$items,
function(x) x$user_id
),
link = sapply(
resp$items,
function(x) x$link
)
)
}
# EXAMPLE USAGE: get 10 "teen"-related questions
# from the "parenting" StackExchange site
teen_posts <- get_tagged_posts(
"teen", site = "parenting",
user = "Data Science for Psychology (ds4psych@gmail.com)",
pagesize = 10
)
```
For further details on accessing APIs with `httr2` (including how to deal with rate limits), see [the vignette](https://httr2.r-lib.org/articles/wrapping-apis.html).
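As a preview, a sketch of two `httr2` helpers for staying within rate limits:
```{r}
#| eval: false
req <- request("https://api.stackexchange.com/2.2/search") |>
  req_throttle(rate = 30 / 60) |> # at most 30 requests per minute
  req_retry(max_tries = 3)        # retry transient failures up to 3 times
```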
------------------------------------------------------------------------