README.Rmd

---
output:
  md_document:
    variant: gfm
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
```

# icews

[![CRAN status](https://www.r-pkg.org/badges/version/icews)](https://cran.r-project.org/package=icews)
[![R-CMD-check](https://github.com/andybega/icews/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/andybega/icews/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/andybega/icews/branch/master/graph/badge.svg)](https://codecov.io/gh/andybega/icews?branch=master)

_Note: the ICEWS data were discontinued on 11 April 2023. You can still use this package to download the data from dataverse however._

Get the ICEWS event data from the Harvard Dataverse repos at [https://doi.org/10.7910/DVN/28075](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/28075) (historic data) and [https://doi.org/10.7910/DVN/QI2T9A](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QI2T9A) (weekly updates).

The icews package provides these major features:

- get the ICEWS event data without having to deal with Dataverse
- use raw data files (tab-separated variables, .tsv) or a database (SQLite3) or both as the storage backend
- set options so that in future R sessions icews knows where your data lives
- icews keeps the local data in sync with the latest versions on Dataverse

## Installation

Not on CRAN, so you will have to install via remotes or devtools:

```{r github-install, eval = FALSE}
library("remotes")
remotes::install_github("andybega/icews")
```

The **icews** package relies on the R [**dataverse** client](https://github.com/IQSS/dataverse-client-r). Note that this package requires a dataverse API token to correctly work. See the package README and https://guides.dataverse.org/en/latest/user/account.html#api-token. 

## Usage

**tl;dr**: get a SQLite database with the current events on Dataverse with this code; otherwise read below for more details. 

```{r, eval = FALSE}
Sys.setenv(DATAVERSE_KEY = "{api key}")  # see ?dataverse_api_token
Sys.setenv(DATAVERSE_SERVER = "dataverse.harvard.edu")
library("icews")
library("DBI")
library("dplyr")

setup_icews(data_dir = "/where/should/data/be", use_db = TRUE, keep_files = TRUE,
            r_profile = TRUE)
# this will give instructions for what to add to .Rprofile so that settings
# persist between R sessions

update_icews(dryrun = TRUE)
# Should list proposed downloads, ingests, etc.
update_icews(dryrun = FALSE)
# Wait until all is done; like 45 minutes or more the first time around

# The events will be in a table called "events". 
# To get the data:
query_icews("SELECT count(*) AS n FROM events;")
# or
con <- connect()
DBI::dbGetQuery(con, "SELECT count(*) AS n FROM events;")
# or
con <- connect()
dplyr::tbl(con, "events") %>% summarize(n = n())
# or, 
# read all 17+ million rows into memory
events <- read_icews()
```


### Files only

In the most simple use case, you can use the package to download the ICEWS event data, which comes in several tab-serparated value (TSV) files, without having to deal with Dataverse. 

```{r, eval = FALSE}
Sys.setenv(DATAVERSE_SERVER = "dataverse.harvard.edu")
library("icews")

dir.create("~/Downloads/icews")
download_data("~/Downloads/icews")
```

This will conventiently also re-use and update any files already in the same directory. Zipped files will be unzipped. E.g. the yearly data files come in zipped a file with the pattern "events.2018.yyyymmddhhmmss.tab", where the "yyyymmddhhmmss" part changes might change if data have been updated. The downloader can identify this and will replace the old with the new file. 

Just in case, you can do a dry run that will not actually make any changes:

```{r, eval = FALSE}
download_data("~/Downloads/icews", dryrun = TRUE)
```

```bash
Found 25 local data file(s)
Downloading 2 new file(s)
Removing 1 old local file(s)

Plan:
Download 'events.1995.20150313082510.tab.zip'
Download 'events.1996.20150313082528.tab.zip'
Remove   'events.1996.20140313082528.tab'
```

The events come in (zipped) tab-separated files. To load all of these into memory in a big combined data frame with about 16 million rows (~2.5Gb):

```{r, eval = FALSE}
events <- read_icews("~/Downloads/icews")
```

Beyond this basic usage, the goal is to abstract as many little pains away as possible. To that end: 

### Persist the data directory location

The package can keep track of the data location via variables stored in the package options. The easiest way is to add these to an ".Rprofile" file so that they are available each time R starts up.

```{r, eval = FALSE}
setup_icews(data_dir = "/where/should/data/be", use_db = TRUE, 
            keep_files = TRUE, r_profile = TRUE)
```

This will open the ".Rprofile" file and tell you what to add to it (requires [usethis](https://cran.r-project.org/package=usethis) to be installed). From now on the package knows where your data lives, and most of the functions can be called without specifying any directory or path arguments.  

Under the hood, this will set three R options that the package uses in the downloader functions:

```{r, eval = FALSE}
# ICEWS data location and options
options(icews.data_dir   = "~/path/to/icews_data")
options(icews.use_db     = TRUE)
options(icews.keep_files = TRUE)
```

### Use a SQLite database that keeps in sync with Dataverse

To setup and populate a database with the current version on Dataverse, use this command:

```{r, eval = FALSE}
# assumes setup_icews with use_db = TRUE has already been called
update_icews(dryrun = FALSE)
```

This will download any data files needed from Dataverse, and create and populate a SQLite database with them. The events will be in a table called "events". To connect, use `connect()`; this returns a RSQLite database connection. From then on, it can be used like this:

```{r, eval = FALSE}
library("DBI")
library("dplyr")

con <- connect()

dbGetQuery(con, "SELECT count(*) FROM events;")
# or
tbl(con, "events") %>% summarize(n = n())
```

When done, it is good etiquette to close to the database connection:

```{r, eval = FALSE}
DBI::dbDisconnect(con)
```

### Bonus: CAMEO codes

Also included is a dictionary of the CAMEO code for event types. This includes quad and penta category mappings as well. 

```{r}
data("cameo_codes")
str(cameo_codes)
```

And, a dictionary mapping Goldstein scores to CAMEO codes. 

```{r}
data("goldstein_mappings")
str(goldstein_mappings)
```