CREST is created to help researchers working on causal/counterfactual relation extraction/classification, commonsense reasoning, and reading comprehension in natural language communicate more easily and leverage the data resources scattered around this topic. CREST is a user-friendly, machine-readable format stored as a pandas DataFrame.
Each relation in a CREST-formatted DataFrame has the following fields/values:
- `original_id`: the id of a relation in the original dataset, if such an id exists.
- `span1`: a list of strings of the first span/argument of the relation.
- `span2`: a list of strings of the second span/argument of the relation.
- `signal`: a list of strings of signals/markers of the relation in context, if any.
- `context`: a text string of the context in which the relation appears.
- `idx`: a dictionary in the form of `{'span1': [], 'span2': [], 'signal': []}` that stores the indexes of span1, span2, and signal in context. Each value in the `idx` dictionary is a list of lists of start and end indexes of the spans and the signal. For example, if span1 consists of multiple sub-spans in context with `start:end` indexes 2:5 and 10:13, respectively, span1's value in `idx` will be `[[2, 5], [10, 13]]`. Lists are sorted by start index.
- `label`: label of the relation, `0: non-causal`, `1: causal`.
- `direction`: direction between span1 and span2, `0: span1 => span2`, `1: span1 <= span2`, `-1: not-specified`.
- `source`: id of the source dataset (ids are listed in the table below).
- `split`: `0: train`, `1: dev`, `2: test`. This is the split to which the relation belongs in the original dataset. If no split is specified for a relation in the original dataset, we assign the relation to the `train` split by default.
Note: The reason we save a list of strings instead of a single string for span1, span2, and signal is that these text spans may contain multiple non-consecutive sub-spans in context.
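To make the schema concrete, the snippet below builds a single CREST-formatted relation as one row of a pandas DataFrame. This is an illustrative sketch only: the example sentence, the values, and the assumption that `idx` holds character-level offsets into `context` are ours, not taken from any of the converted datasets.

```python
import pandas as pd

# A single CREST-formatted relation. All values are made up for illustration;
# the indexes in `idx` are assumed here to be character-level start/end
# offsets into `context`.
relation = {
    "original_id": "42",
    "span1": ["heavy rain"],
    "span2": ["flooding"],
    "signal": ["caused"],
    "context": "The heavy rain caused flooding in the valley.",
    "idx": {"span1": [[4, 14]], "span2": [[22, 30]], "signal": [[15, 21]]},
    "label": 1,      # 1: causal
    "direction": 0,  # 0: span1 => span2
    "source": 2,     # e.g., SemEval 2010 task 8 (see the table below)
    "split": 0,      # 0: train
}

df = pd.DataFrame([relation])
print(df[["span1", "span2", "signal", "label", "direction"]])
```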
We provide helper methods to convert CREST-formatted data to popular formats and annotation schemes, mainly formats used across relation extraction/classification tasks. Below is the list of formats for which we have already developed CREST converter methods:
- `brat`: we provide helper methods to convert CREST-formatted data frames to brat (brat-to-CREST converters will be added soon). brat is a popular web-based annotation tool that has been used for a variety of relation extraction NLP tasks. We use brat for two main reasons: 1) better visualization of causal and non-causal relations and their arguments, and 2) modifying relation annotations, if needed, and adding new annotations to the provided context. Below is a sample of a CREST-formatted relation converted to brat (the example is taken from the CaTeRS dataset).
- `TACRED`: TACRED is a large-scale relation extraction dataset. We convert samples from CREST to TACRED since TACRED-formatted data can easily be used as input to many transformer-based language models.
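As a rough illustration of what a CREST-to-TACRED conversion involves, the sketch below maps one CREST relation (as a Python dict, e.g., a DataFrame row converted with `to_dict()`) to a minimal TACRED-style record. This is not the repository's converter; the helper name `crest_to_tacred_example`, the naive whitespace tokenization, and the subject/object field mapping are assumptions for illustration only.

```python
# Hypothetical sketch, not the repository's actual converter.
def crest_to_tacred_example(relation: dict) -> dict:
    """Map one CREST relation to a minimal TACRED-style record."""
    tokens = relation["context"].split()  # naive whitespace tokenization

    def token_span(span_text: str):
        # Find the first occurrence of the span's tokens in the token list.
        words = span_text.split()
        for i in range(len(tokens) - len(words) + 1):
            if tokens[i:i + len(words)] == words:
                return i, i + len(words) - 1
        return -1, -1

    subj_start, subj_end = token_span(relation["span1"][0])
    obj_start, obj_end = token_span(relation["span2"][0])
    return {
        "id": relation["original_id"],
        "token": tokens,
        "subj_start": subj_start,
        "subj_end": subj_end,
        "obj_start": obj_start,
        "obj_end": obj_end,
        "relation": "causal" if relation["label"] == 1 else "non-causal",
    }
```

A full converter would also need to handle multi-sub-span arguments and the `direction` field, which this sketch ignores.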
List of data resources already converted to CREST format:
Id | Data resource | Samples | Causal | Non-causal | Availability |
---|---|---|---|---|---|
1 | SemEval 2007 task 4 | 1,529 | 114 | 1,415 | Public |
2 | SemEval 2010 task 8 | 10,717 | 1,331 | 9,386 | Public |
3 | EventCausality | 485 | 485 | - | Public |
4 | Causal-TimeBank | 318 | 318 | - | Public |
5 | EventStoryLine v1.5 | 2,608 | 2,608 | - | Public |
6 | CaTeRS | 2,502 | 308 | 2,194 | Public |
7 | BECauSE v2.1 | 729 | 554 | 175 | Partially Public |
8 | Choice of Plausible Alternatives (COPA) | 2,000 | 1,000 | 1,000 | Public |
9 | The Penn Discourse Treebank (PDTB) 3.0 | 7,991 | 7,991 | - | Not Public |
10 | BioCause Corpus | 844 | 844 | - | Public |
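Since each relation carries its `source` and `split` ids, a loaded CREST DataFrame can be sliced per resource. Below is a small sketch, assuming `df` is a CREST-formatted DataFrame obtained elsewhere (the exact loading path depends on how you store the data).

```python
# Assumes `df` is a CREST-formatted pandas DataFrame loaded elsewhere.
semeval_2010_train = df[(df["source"] == 2) & (df["split"] == 0)]  # SemEval 2010 task 8, train
copa_dev = df[(df["source"] == 8) & (df["split"] == 1)]            # COPA, dev
print(len(semeval_2010_train), "SemEval 2010 train relations,", len(copa_dev), "COPA dev relations")
```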
- Are there any related datasets you don't see in the list? Let us know or feel free to submit a Pull Request (PR); we actively check PRs and appreciate them ☺️
- Is there a well-known or widely used machine-readable format you think should be added? Let us know so we can add the helper methods for conversion, or submit a PR.
For now, please cite our arXiv paper:
```bibtex
@article{hosseini2021predicting,
  title={Predicting Directionality in Causal Relations in Text},
  author={Hosseini, Pedram and Broniatowski, David A and Diab, Mona},
  journal={arXiv preprint arXiv:2103.13606},
  year={2021}
}
```