Coreference resolution is the task of clustering mentions in text that refer to the same underlying real-world entities.
Example:

                   +-----------+
                   |           |
    "I voted for Obama because he was most aligned with my values", she said.
     |                                                  |            |
     +--------------------------------------------------+------------+

"I", "my", and "she" belong to one cluster, and "Obama" and "he" belong to another.
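One common way to store such clusters is as lists of token spans. The snippet below is a minimal sketch of that idea for the example sentence; the tokenization, the inclusive `(start, end)` span convention, and the helper names `cluster_texts` and `mention_to_cluster` are assumptions made for illustration, not part of any particular dataset format.

```python
# Hypothetical tokenization of the example sentence (quotation marks dropped).
tokens = ["I", "voted", "for", "Obama", "because", "he", "was", "most",
          "aligned", "with", "my", "values", ",", "she", "said", "."]

# Each cluster is a list of (start, end) token spans, with end inclusive.
clusters = [
    [(0, 0), (10, 10), (13, 13)],  # "I", "my", "she"
    [(3, 3), (5, 5)],              # "Obama", "he"
]


def cluster_texts(tokens, clusters):
    """Return the surface string of every mention, grouped by cluster."""
    return [[" ".join(tokens[start:end + 1]) for start, end in cluster]
            for cluster in clusters]


def mention_to_cluster(clusters):
    """Map each mention span to the index of the cluster that contains it."""
    return {span: idx for idx, cluster in enumerate(clusters) for span in cluster}


print(cluster_texts(tokens, clusters))
print(mention_to_cluster(clusters)[(5, 5)])
```

The span-to-cluster mapping is convenient when checking whether two mentions corefer: they do exactly when they map to the same index.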
Experiments are conducted on the data of the CoNLL-2012 shared task, which uses OntoNotes coreference annotations. Papers report the precision, recall, and F1 of the MUC, B³, and CEAFφ4 metrics, computed with the official CoNLL-2012 evaluation scripts. The main evaluation metric is the average F1 of the three metrics.
Model | Avg F1 | Paper / Source | Code |
---|---|---|---|
Joshi et al. (2019)[1] | 79.6 | SpanBERT: Improving Pre-training by Representing and Predicting Spans | Official |
Joshi et al. (2019)[2] | 76.9 | BERT for Coreference Resolution: Baselines and Analysis | Official |
Kantor and Globerson (2019) | 76.6 | Coreference Resolution with Entity Equalization | Official |
Fei et al. (2019) | 73.8 | End-to-end Deep Reinforcement Learning Based Coreference Resolution | |
(Lee et al., 2017)+ELMo (Peters et al., 2018)+coarse-to-fine & second-order inference (Lee et al., 2018) | 73.0 | Higher-order Coreference Resolution with Coarse-to-fine Inference | Official |
(Lee et al., 2017)+ELMo (Peters et al., 2018) | 70.4 | Deep contextualized word representations | |
Lee et al. (2017) | 67.2 | End-to-end Neural Coreference Resolution | |
[1] Joshi et al. (2019): (Lee et al., 2017)+coarse-to-fine & second-order inference (Lee et al., 2018)+SpanBERT (Joshi et al., 2019)
[2] Joshi et al. (2019): (Lee et al., 2017)+coarse-to-fine & second-order inference (Lee et al., 2018)+BERT (Devlin et al., 2019)
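
As a rough illustration of how the Avg F1 column is formed, the sketch below computes B³ for two clusterings and then averages three F1 scores. It is a minimal sketch, not the official CoNLL-2012 scorer: `b_cubed` and `conll_average` are hypothetical names, mentions are plain strings, both clusterings are assumed to be non-empty, and the official scripts may treat mentions that appear in only one clustering differently. The F1 values passed to `conll_average` are made-up numbers, not results from any paper.

```python
def b_cubed(key, response):
    """B-cubed precision, recall, and F1 for gold (key) and predicted (response) clusters."""
    key_of = {m: set(c) for c in key for m in c}
    resp_of = {m: set(c) for c in response for m in c}

    # Recall: per gold mention, the fraction of its gold cluster that was recovered.
    recall = sum(len(k & resp_of.get(m, set())) / len(k)
                 for m, k in key_of.items()) / len(key_of)
    # Precision: per predicted mention, the fraction of its predicted cluster that is correct.
    precision = sum(len(r & key_of.get(m, set())) / len(r)
                    for m, r in resp_of.items()) / len(resp_of)

    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def conll_average(muc_f1, b3_f1, ceaf_f1):
    """Unweighted mean of the three F1 scores."""
    return (muc_f1 + b3_f1 + ceaf_f1) / 3


key = [["I", "my", "she"], ["Obama", "he"]]
response = [["I", "my"], ["she"], ["Obama", "he"]]  # "she" wrongly split off

precision, recall, f1 = b_cubed(key, response)
print(f"B3 precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
print(conll_average(80.0, 70.0, 60.0))  # made-up inputs
```

Splitting "she" into its own cluster leaves precision at 1.0 but lowers recall, which is the usual behaviour of B³ when a system under-merges.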
Experiments are conducted on the GAP dataset. The metrics are the F1 score on Masculine (M) examples, on Feminine (F) examples, and Overall, plus a Bias factor calculated as F / M.
Model | Overall F1 | Masculine F1 (M) | Feminine F1 (F) | Bias (F/M) | Paper / Source | Code |
---|---|---|---|---|---|---|
Attree et al. (2019) | 92.5 | 94.0 | 91.1 | 0.97 | Gendered Ambiguous Pronouns Shared Task: Boosting Model Confidence by Evidence Pooling | GREP |
Chada et al. (2019) | 90.2 | 90.9 | 89.5 | 0.98 | Gendered Pronoun Resolution using BERT and an extractive question answering formulation | CorefQA |
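
The Bias column can be checked directly from the M and F columns. The snippet below is a small sketch that recomputes F / M for the two rows above; the function name `bias_factor` is an assumption made for illustration.

```python
def bias_factor(masculine_f1, feminine_f1):
    """Ratio of Feminine F1 to Masculine F1; 1.0 means no gap between the two."""
    return feminine_f1 / masculine_f1


# (Masculine F1, Feminine F1) taken from the table above.
rows = {
    "Attree et al. (2019)": (94.0, 91.1),
    "Chada et al. (2019)": (90.9, 89.5),
}

for model, (m_f1, f_f1) in rows.items():
    print(f"{model}: bias = {bias_factor(m_f1, f_f1):.2f}")
```

A value below 1 indicates that the model scores lower on Feminine examples than on Masculine ones.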