---
output:
  md_document:
    variant: gfm
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
```
# icews
[![CRAN status](https://www.r-pkg.org/badges/version/icews)](https://cran.r-project.org/package=icews)
[![R-CMD-check](https://github.com/andybega/icews/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/andybega/icews/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/andybega/icews/branch/master/graph/badge.svg)](https://codecov.io/gh/andybega/icews?branch=master)
_Note: the ICEWS data were discontinued on 11 April 2023. However, you can still use this package to download the data from Dataverse._
Get the ICEWS event data from the Harvard Dataverse repos at [https://doi.org/10.7910/DVN/28075](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/28075) (historic data) and [https://doi.org/10.7910/DVN/QI2T9A](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QI2T9A) (weekly updates).
The icews package provides these major features:
- get the ICEWS event data without having to deal with Dataverse
- use raw data files (tab-separated values, .tsv) or a database (SQLite3) or both as the storage backend
- set options so that in future R sessions icews knows where your data lives
- icews keeps the local data in sync with the latest versions on Dataverse
## Installation
The package is not on CRAN, so you will have to install it from GitHub via remotes or devtools:
```{r github-install, eval = FALSE}
library("remotes")
remotes::install_github("andybega/icews")
```
The **icews** package relies on the R [**dataverse** client](https://github.com/IQSS/dataverse-client-r). Note that this package requires a Dataverse API token to work correctly. See the package README and https://guides.dataverse.org/en/latest/user/account.html#api-token.
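As a small convenience, the token can be checked before any download is attempted. The sketch below uses only base R; `dataverse_key_is_set()` is a hypothetical helper written for this example and is not part of icews or dataverse.

```r
# Hypothetical helper (not part of icews or dataverse): TRUE when the
# environment variable that holds the API token is non-empty.
dataverse_key_is_set <- function(var = "DATAVERSE_KEY") {
  nzchar(Sys.getenv(var))
}

if (!dataverse_key_is_set()) {
  message("DATAVERSE_KEY is not set; set it with Sys.setenv() or in .Renviron")
}
```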
## Usage
**tl;dr**: the code below sets up a SQLite database with the current events on Dataverse; read on for more details.
```{r, eval = FALSE}
Sys.setenv(DATAVERSE_KEY = "{api key}") # see ?dataverse_api_token
Sys.setenv(DATAVERSE_SERVER = "dataverse.harvard.edu")
library("icews")
library("DBI")
library("dplyr")
setup_icews(data_dir = "/where/should/data/be", use_db = TRUE, keep_files = TRUE,
            r_profile = TRUE)
# this will give instructions for what to add to .Rprofile so that settings
# persist between R sessions
update_icews(dryrun = TRUE)
# Should list proposed downloads, ingests, etc.
update_icews(dryrun = FALSE)
# Wait until everything is done; this can take 45 minutes or more the first time
# The events will be in a table called "events".
# To get the data:
query_icews("SELECT count(*) AS n FROM events;")
# or
con <- connect()
DBI::dbGetQuery(con, "SELECT count(*) AS n FROM events;")
# or
con <- connect()
dplyr::tbl(con, "events") %>% summarize(n = n())
# or,
# read all 17+ million rows into memory
events <- read_icews()
```
### Files only
In the simplest use case, you can use the package to download the ICEWS event data, which come in several tab-separated value (TSV) files, without having to deal with Dataverse.
```{r, eval = FALSE}
Sys.setenv(DATAVERSE_SERVER = "dataverse.harvard.edu")
library("icews")
dir.create("~/Downloads/icews")
download_data("~/Downloads/icews")
```
This will conveniently also re-use and update any files already in the same directory. Zipped files will be unzipped. For example, the yearly data files come as zipped files named with the pattern "events.2018.yyyymmddhhmmss.tab", where the "yyyymmddhhmmss" part may change if the data have been updated. The downloader can identify this and will replace the old file with the new one.
Just in case, you can do a dry run that will not actually make any changes:
```{r, eval = FALSE}
download_data("~/Downloads/icews", dryrun = TRUE)
```
```bash
Found 25 local data file(s)
Downloading 2 new file(s)
Removing 1 old local file(s)
Plan:
Download 'events.1995.20150313082510.tab.zip'
Download 'events.1996.20150313082528.tab.zip'
Remove 'events.1996.20140313082528.tab'
```
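The plan above can be reproduced in miniature with base R. This is only a sketch of the idea (compare local and remote file names, treating the timestamp as a version), not the package's actual implementation; `strip_zip()` and `file_year()` are hypothetical helpers, and the file lists are taken from the dry-run output above.

```r
# Hypothetical helpers for this sketch only.
strip_zip <- function(x) sub("\\.zip$", "", x)
file_year <- function(x) {
  # "events.1996.20140313082528.tab" -> "1996"
  vapply(strsplit(x, ".", fixed = TRUE), `[`, character(1), 2)
}

local_files  <- "events.1996.20140313082528.tab"
remote_files <- c("events.1995.20150313082510.tab.zip",
                  "events.1996.20150313082528.tab.zip")

remote_unzipped <- strip_zip(remote_files)

# Remote files with no identical (unzipped) local copy need downloading.
to_download <- remote_files[!remote_unzipped %in% local_files]

# Local files for a year that has a different remote version are outdated.
to_remove <- local_files[
  !local_files %in% remote_unzipped &
    file_year(local_files) %in% file_year(remote_unzipped)
]

to_download
to_remove
```

Here `to_download` holds both remote files and `to_remove` holds the older 1996 file, matching the plan printed by the dry run.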
The events come in (zipped) tab-separated files. To load all of these into memory in a big combined data frame with about 16 million rows (~2.5 GB):
```{r, eval = FALSE}
events <- read_icews("~/Downloads/icews")
```
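Conceptually, combining the files amounts to reading each tab-separated file and stacking the results. The base R sketch below illustrates that idea on two tiny temporary files; it is not how `read_icews()` is implemented, and the file names and columns (`event_id`, `year`) are invented for the example.

```r
# Write two toy tab-separated files to a fresh temporary directory.
tmp <- tempfile("icews-sketch-")
dir.create(tmp)
write.table(data.frame(event_id = 1:2, year = 1995L),
            file.path(tmp, "toy.1995.tab"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(data.frame(event_id = 3:5, year = 1996L),
            file.path(tmp, "toy.1996.tab"),
            sep = "\t", row.names = FALSE, quote = FALSE)

# Read every .tab file and stack them into one data frame.
files  <- list.files(tmp, pattern = "\\.tab$", full.names = TRUE)
events <- do.call(rbind, lapply(files, read.delim, stringsAsFactors = FALSE))

nrow(events)

# Clean up the temporary files.
unlink(tmp, recursive = TRUE)
```

For the full data set, a faster reader and explicit column types would likely matter; this sketch ignores both.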
Beyond this basic usage, the goal is to abstract as many little pains away as possible. To that end:
### Persist the data directory location
The package can keep track of the data location via variables stored in the package options. The easiest way is to add these to an ".Rprofile" file so that they are available each time R starts up.
```{r, eval = FALSE}
setup_icews(data_dir = "/where/should/data/be", use_db = TRUE,
            keep_files = TRUE, r_profile = TRUE)
```
This will open the ".Rprofile" file and tell you what to add to it (requires [usethis](https://cran.r-project.org/package=usethis) to be installed). From now on the package knows where your data lives, and most of the functions can be called without specifying any directory or path arguments.
Under the hood, this will set three R options that the package uses in the downloader functions:
```{r, eval = FALSE}
# ICEWS data location and options
options(icews.data_dir = "~/path/to/icews_data")
options(icews.use_db = TRUE)
options(icews.keep_files = TRUE)
```
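Because these are ordinary R options, they can be inspected with `getOption()`. The following base R sketch sets the three options for the current session, reads them back, and then restores the previous values; the path is the placeholder from the snippet above.

```r
# Set the three options, keeping the previous values so they can be restored.
old <- options(
  icews.data_dir   = "~/path/to/icews_data",
  icews.use_db     = TRUE,
  icews.keep_files = TRUE
)

# Read them back; `default` is used when an option has not been set.
data_dir   <- getOption("icews.data_dir")
use_db     <- getOption("icews.use_db", default = FALSE)
keep_files <- getOption("icews.keep_files", default = FALSE)

# Restore whatever was set before this example ran.
options(old)
```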
### Use a SQLite database that keeps in sync with Dataverse
To set up and populate a database with the current version on Dataverse, use this command:
```{r, eval = FALSE}
# assumes setup_icews with use_db = TRUE has already been called
update_icews(dryrun = FALSE)
```
This will download any data files needed from Dataverse, and create and populate a SQLite database with them. The events will be in a table called "events". To connect, use `connect()`; this returns an RSQLite database connection. From then on, it can be used like this:
```{r, eval = FALSE}
library("DBI")
library("dplyr")
con <- connect()
dbGetQuery(con, "SELECT count(*) FROM events;")
# or
tbl(con, "events") %>% summarize(n = n())
```
When done, it is good etiquette to close the database connection:
```{r, eval = FALSE}
DBI::dbDisconnect(con)
```
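The connect, query, disconnect pattern can be tried without downloading anything by using an in-memory SQLite database. This sketch assumes DBI and RSQLite are installed; the toy "events" table and its columns are invented and do not reflect the real ICEWS schema.

```r
library("DBI")

# In-memory SQLite database standing in for the one icews creates.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Toy table; these columns are made up for the example.
toy <- data.frame(event_id = 1:3, year = c(1995L, 1995L, 1996L))
dbWriteTable(con, "events", toy)

# Count events per year with plain SQL.
res <- dbGetQuery(
  con,
  "SELECT year, count(*) AS n FROM events GROUP BY year ORDER BY year;"
)

# Close the connection when done.
dbDisconnect(con)

res
```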
### Bonus: CAMEO codes
Also included is a dictionary of the CAMEO codes for event types. This includes quad and penta category mappings as well.
```{r}
data("cameo_codes")
str(cameo_codes)
```
And, a dictionary mapping Goldstein scores to CAMEO codes.
```{r}
data("goldstein_mappings")
str(goldstein_mappings)
```