From c961dd80c715e367c9f86cda0f098f2e7f9d54d7 Mon Sep 17 00:00:00 2001
From: minervaning
Date: Sun, 17 Mar 2024 21:03:13 +0800
Subject: [PATCH] .
---
README.md | 79 ++++++++++++++++++++++++++++++------------------------
index.html | 38 +++++++++++++-------------
2 files changed, 62 insertions(+), 55 deletions(-)
diff --git a/README.md b/README.md
index e22d9ad..b5e1946 100644
--- a/README.md
+++ b/README.md
@@ -11,62 +11,73 @@
- [๐ Introduction](#-introduction)
- [๐ Dataset](#-dataset)
- [Data Description](#data-description)
- - [Data Scale](#data-scale)
- [๐ Evaluation \& Submission](#-evaluation--submission)
- [๐ License](#-license)
- [๐ Citation](#-citation)
- - [๐ฎ Contact](#-contact)
- - [Acknowledgement](#acknowledgement)
+- [๐ฎ Contact](#-contact)
+- [Acknowledgement](#acknowledgement)
# ๐ Introduction
- Welcome to our GitHub repository for the "Evaluating Open-QA Evaluation" [paper](https://arxiv.org/abs/2305.12421), a comprehensive study on the evaluating of evaluation methods in Open Question Answering (Open-QA) systems.
-
- Open-QA systems, which generate answers to questions with a vast range of possible topics, have become an increasingly significant research field in recent years. However, accurately evaluating these systems remains challenging, and currently lacks robust, reliable methods.
-
- In response to this, we introduce the QA Evaluation task (QA-Eval), a new task that rigorously tests various evaluation methods for their ability to accurately assess the relevance of machine-generated answers to a set of gold standard answers within an Open-QA context. This task requires the evaluating method to discern whether a machine-generated answer aligns with the gold standard answer, with performance evaluated against human-annotated results.
-
- We sourced our data from the test sets of two well-established QA datasets, Natural Questions (NQ) and TriviaQA. We ask several representative models, including FiD, ChatGPT-(3.5/4), GPT-3.5 and BingChat, to answer the questions. We then manually annotated the correctness of each question-answer pair.
-
- Through this work, we hope to foster a deeper understanding of Open-QA systems, their evaluations, and aid the research community in developing more reliable automatic evaluation tools.
+
+ **NovelQA** is a benchmark for evaluating the long-text understanding and retrieval abilities of LLMs. The dataset is constructed by manually collecting questions and answers about English novels that are longer than 50,000 words. Moreover, most of the questions are designed either to focus on minor details in the novel or to require information spanning multiple chapters, both of which are inherently challenging for LLMs. We welcome submissions from any LLM with long-context abilities!
+
+ Due to the possible instability of the platform, we have also created an [official leaderboard](https://novelqa.github.io/) that combines the Codabench leaderboard with manually checked results. We encourage participants to refer to this leaderboard for their rankings.
+
+ The rapid advancement of Large Language Models (LLMs) has introduced a new frontier in natural language processing, particularly in understanding and processing long-context information. However, the evaluation of these models' long-context abilities remains a challenge due to the limitations of current benchmarks.
+ To address this gap, we introduce NovelQA, a benchmark specifically designed to test the capabilities of LLMs with extended texts. Constructed from English novels, NovelQA offers a unique blend of complexity, length, and narrative coherence, making it an ideal tool for assessing deep textual understanding in LLMs.
+
+
# ๐ Dataset
## Data Description
-
+
Each data point in our dataset is represented as a dictionary with the following keys:
```
-
+ {
+ "Question": The input question,
+ "Options": [
+ Option A,
+ Option B,
+ Option C,
+ Option D
+ ],
+ "Complex": "mh",
+ "Aspect": "times"
+ }
```
Here is an example of a data point:
```json
-
+ {
+ "Question": "How many times has Robert written letters to his sister?",
+ "Options": [
+ "11",
+ "9",
+ "12",
+ "10"
+ ],
+ "Complex": "mh",
+ "Aspect": "times"
+ }
```
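+
+ For illustration, the following minimal sketch shows how one data point could be turned into a multiple-choice prompt. It assumes the questions are stored as a JSON list of such dictionaries; the file name, prompt wording, and helper name are only hypothetical and not part of the official pipeline.
+
+ ```python
+ import json
+
+ # Hypothetical file name; see the Huggingface page for the actual distribution format.
+ with open("novelqa_questions.json", "r", encoding="utf-8") as f:
+     questions = json.load(f)  # assumed to be a list of dictionaries with the keys shown above
+
+ def build_multichoice_prompt(novel_text: str, item: dict) -> str:
+     """Format one data point as a multiple-choice prompt for a long-context LLM."""
+     options = "\n".join(
+         f"{label}. {text}" for label, text in zip("ABCD", item["Options"])
+     )
+     return (
+         f"{novel_text}\n\n"
+         f"Question: {item['Question']}\n"
+         f"{options}\n"
+         "Answer with the letter of the correct option."
+     )
+ ```
+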
- ## Data Scale
- The scale of our dataset is detailed in the table below:
-
- |models | Natural Questions| TriviaQA |
- |------------------------------|------------------------------|------------------------------|
- |DPR+FiD |3610|2000|
- |GPT-3.5 |3610|2000|
- |ChatGPT-3.5 |3610|2000|
- |ChatGPT-4 |3610|2000|
- |Bing Chat |3610|2000|
# ๐ Evaluation & Submission
-
- The work flow of our online benchmark is as follows.
+ Due to confidentiality considerations, the submission procedure is split into multiple steps across several platforms. An overview of the submission process is shown in the following flowchart.
```mermaid
-
graph LR
-
A[[๐ค Huggingface]] --(Input Data)--> B[[๐คYour Model]]
B --(Model output)--> C[[โ๏ธCodabench]]
C --(Accuracy Score)--> D[[๐ณ๏ธGoogle Form]]
D ----> E[[๐Leaderboard Website]]
-
```
+ Our input data (including the novels, questions, and options) is open-sourced on the [๐ค Huggingface]() platform. Participants who want to evaluate their model should first download the data through Huggingface. You may either run the generative subtask with only the novel and question, or run the multiple-choice subtask by inputting the novel, question, and options. Warning: the input data is for internal evaluation use only. Please do not spread the input data publicly online. The competition hosts are not responsible for any violation of novel copyright caused by participants spreading the input data publicly online.
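+
+ As a rough sketch, downloading and iterating over the data with the `datasets` library might look like the following. This is not the official pipeline: the repository id, split name, and field names are placeholders, and `my_model_generate` stands in for your own inference code; check the Huggingface page for the actual details.
+
+ ```python
+ from datasets import load_dataset
+
+ def my_model_generate(prompt: str) -> str:
+     """Stand-in for your own model call; replace with your LLM inference code."""
+     raise NotImplementedError
+
+ # Placeholder repository id and split name; use the ones given on the Huggingface page.
+ dataset = load_dataset("placeholder-org/NovelQA", split="test")
+
+ outputs = []
+ for item in dataset:
+     # Generative subtask: prompt with the novel and the question only.
+     # Multiple-choice subtask: additionally append the four options to the prompt.
+     prompt = f"{item['novel']}\n\nQuestion: {item['Question']}"  # field names are assumptions
+     outputs.append(my_model_generate(prompt))
+ ```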
+
+ After running the data through your model and obtaining its output, you are expected to submit the model output to the [โ๏ธ Codabench]() platform for evaluation. This procedure is in place to preserve the confidentiality of the gold answers. The Codabench platform automatically evaluates your results and generates an accuracy score, typically within about 5 minutes. If your submission fails, or if your score is clearly above average, you may email us the results and we will run the evaluation for you manually. For details about the Codabench platform and the evaluation procedure, see the instructions on our Codabench page.
+
+ If you evaluate your results through Codabench, you are further expected to submit your accuracy score to us through the [๐ณ๏ธ Google Form]() so that we can update it on our [๐ Leaderboard](). Our leaderboard presents the top 7 models on each of the two subtasks separately.
+
# ๐ License
This dataset is released under the [Apache-2.0 License](LICENSE).
@@ -77,12 +88,10 @@ If you use this dataset in your research, please cite it as follows:
```bibtex
```
-## ๐ฎ Contact
+# ๐ฎ Contact
We welcome contributions to improve this dataset!
If you have any questions or feedback, please feel free to reach out at wangcunxiang@westlake.edu.cn.
-## Acknowledgement
-
-This leaderboard adopts the style of [bird-bench](https://github.com/bird-bench/bird-bench.github.io).
+# Acknowledgement
-[Official Site](https://novelqa.github.io/)
\ No newline at end of file
+This leaderboard adopts the style of [bird-bench](https://github.com/bird-bench/bird-bench.github.io).
\ No newline at end of file
diff --git a/index.html b/index.html
index 6232e26..10be8a7 100644
--- a/index.html
+++ b/index.html
@@ -129,35 +129,33 @@ Data Description
Each data point in our dataset is represented as a dictionary with the following keys:
- "novel title": [
- {
- "ques": question input,
- "ops": [
+ {
+ "Question": The input question,
+ "Options": [
Option A,
Option B,
Option C,
Option D
- ]
- },
- ...
- ]
+ ],
+ "Complex": "mh",
+ "Aspect": "times"
+ }
Here is an example of a data point:
- "Ayala's Angel": [
- {
- "ques": "In novel Ayala's Angel, has someone ever screamed in the novel? If so, how many times have they screamed?",
- "ops": [
- "No, never",
- "Yes, 1",
- "Yes, 2",
- "Yes, 3"
- ]
- },
- ...
- ]
+ {
+ "Question": "How many times has Robert written letters to his sister?",
+ "Options": [
+ "11",
+ "9",
+ "12",
+ "10"
+ ],
+ "Complex": "mh",
+ "Aspect": "times"
+ }