forked from EleutherAI/lm-evaluation-harness
Kirill Semin committed Mar 2, 2024
1 parent 694b40a · commit 31310b4
Showing 3 changed files with 73 additions and 0 deletions.
@@ -0,0 +1,49 @@
# HellaSwag

### Paper

Title: `HellaSwag: Can a Machine Really Finish Your Sentence?`

Abstract: https://arxiv.org/abs/1905.07830

Recent work by Zellers et al. (2018) introduced a new task of commonsense natural language inference: given an event description such as "A woman sits at a piano," a machine must select the most likely followup: "She sets her fingers on the keys." With the introduction of BERT, near human-level performance was reached. Does this mean that machines can perform human level commonsense inference?
In this paper, we show that commonsense inference still proves difficult for even state-of-the-art models, by presenting HellaSwag, a new challenge dataset. Though its questions are trivial for humans (>95% accuracy), state-of-the-art models struggle (<48%). We achieve this via Adversarial Filtering (AF), a data collection paradigm wherein a series of discriminators iteratively select an adversarial set of machine-generated wrong answers. AF proves to be surprisingly robust. The key insight is to scale up the length and complexity of the dataset examples towards a critical 'Goldilocks' zone wherein generated text is ridiculous to humans, yet often misclassified by state-of-the-art models.
Our construction of HellaSwag, and its resulting difficulty, sheds light on the inner workings of deep pretrained models. More broadly, it suggests a new path forward for NLP research, in which benchmarks co-evolve with the evolving state-of-the-art in an adversarial way, so as to present ever-harder challenges.

Homepage: `https://rowanzellers.com/hellaswag/`

### Citation

```
@inproceedings{zellers2019hellaswag,
    title={HellaSwag: Can a Machine Really Finish Your Sentence?},
    author={Zellers, Rowan and Holtzman, Ari and Bisk, Yonatan and Farhadi, Ali and Choi, Yejin},
    booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
    year={2019}
}
```

### Groups and Tasks

#### Groups

- Not part of a group yet

#### Tasks

- `hellaswag` (see the usage sketch below)

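The task is selected by the name above. As a rough sketch only, assuming a harness version that exposes `lm_eval.simple_evaluate` and using `EleutherAI/pythia-160m` purely as a stand-in model, an evaluation run from Python might look like this:

```python
# Illustrative sketch, not part of the task files: assumes lm_eval.simple_evaluate
# is available and that the stand-in Hugging Face model can be downloaded locally.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"]["hellaswag"])  # per-task metrics such as accuracy
```
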
### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?

If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
File renamed without changes.
@@ -0,0 +1,24 @@
import datasets
import re


def preprocess(text):
    text = text.strip()
    # NOTE: Brackets are artifacts of the WikiHow dataset portion of HellaSwag.
    text = text.replace(" [title]", ". ")
    text = re.sub("\\[.*?\\]", "", text)
    # Collapse the double spaces left behind by the substitutions above.
    text = text.replace("  ", " ")
    return text

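# For illustration only (hypothetical input): a WikiHow-style string such as
#   "How to make lemonade [title] Squeeze the lemons. [step] Add sugar and water."
# comes out of preprocess() as
#   "How to make lemonade. Squeeze the lemons. Add sugar and water."
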
def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process_doc(doc):
        # Join the two context fields, capitalizing the second segment so the
        # query reads as a single passage.
        ctx = doc["ctx_a"] + " " + doc["ctx_b"].capitalize()
        out_doc = {
            "query": preprocess(doc["activity_label"] + ": " + ctx),
            "choices": [preprocess(ending) for ending in doc["endings"]],
            # "label" is the index of the correct ending.
            "gold": int(doc["label"]),
        }
        return out_doc

    return dataset.map(_process_doc)
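
To make the transformation concrete, here is a minimal sketch of what `process_docs` yields for a single document. The field values below are invented for illustration, and the import assumes the module above is importable as `utils`:

```python
# Hypothetical example: the document values are made up, and the import path
# assumes the module above is available as `utils` on the Python path.
import datasets

from utils import process_docs

raw = datasets.Dataset.from_list(
    [
        {
            "activity_label": "Making lemonade",
            "ctx_a": "A woman places lemons on a cutting board.",
            "ctx_b": "she",
            "endings": [
                "cuts the lemons in half and squeezes them into a pitcher.",
                "throws the cutting board out the window.",
                "paints each lemon a different color.",
                "drives the pitcher to a football game.",
            ],
            "label": "0",
        }
    ]
)

processed = process_docs(raw)
print(processed[0]["query"])    # Making lemonade: A woman places lemons on a cutting board. She
print(processed[0]["choices"])  # the four cleaned endings, in order
print(processed[0]["gold"])     # 0
```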