Skip to content

Commit

Permalink
test
Browse files Browse the repository at this point in the history
  • Loading branch information
ruoxining committed Mar 17, 2024
1 parent 7306c2e commit c468c7b
Showing 1 changed file with 93 additions and 3 deletions.
96 changes: 93 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,10 +1,100 @@
# novelqa.github.io
<div align="center">
<h1> NovelQA </h1>

![Data Version](https://img.shields.io/badge/Data%20Version-1.0.0-blue.svg?style=for-the-badge&logo=appveyor)
[![License: Apache-2.0](https://img.shields.io/crates/l/Ap?style=for-the-badge)](https://opensource.org/licenses/Apache-2.0)
[![contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=for-the-badge)](https://github.com/NovelQA/novelqa.github.io/issues)
</div>

# 📌 Table of Contents
- [📌 Table of Contents](#-table-of-contents)
- [🚀 Introduction](#-introduction)
- [📝 Dataset](#-dataset)
- [Data Description](#data-description)
- [Data Scale](#data-scale)
- [📜 License](#-license)
- [📚 Citation](#-citation)
- [📮 Contact](#-contact)

# 🚀 Introduction
Welcome to our GitHub repository for the "Evaluating Open-QA Evaluation" [paper](https://arxiv.org/abs/2305.12421), a comprehensive study on the evaluating of evaluation methods in Open Question Answering (Open-QA) systems.

Open-QA systems, which generate answers to questions with a vast range of possible topics, have become an increasingly significant research field in recent years. However, accurately evaluating these systems remains challenging, and currently lacks robust, reliable methods.

In response to this, we introduce the QA Evaluation task (QA-Eval), a new task that rigorously tests various evaluation methods for their ability to accurately assess the relevance of machine-generated answers to a set of gold standard answers within an Open-QA context. This task requires the evaluating method to discern whether a machine-generated answer aligns with the gold standard answer, with performance evaluated against human-annotated results.

We sourced our data from the test sets of two well-established QA datasets, Natural Questions (NQ) and TriviaQA. We ask several representative models, including FiD, ChatGPT-(3.5/4), GPT-3.5 and BingChat, to answer the questions. We then manually annotated the correctness of each question-answer pair.

Through this work, we hope to foster a deeper understanding of Open-QA systems, their evaluations, and aid the research community in developing more reliable automatic evaluation tools.


```mermaid
graph TD
A --> B[[🤗]]
```


# 📝 Dataset
## Data Description

Each data point in our dataset is represented as a dictionary with the following keys:
```
"question": The question asked in the Open-QA task.
"golden_answer": The gold standard answer to the question.
"answer_fid", "answer_gpt35", "answer_chatgpt", "answer_gpt4", "answer_newbing": The answers generated by different models (FiD, GPT-3.5, ChatGPT-3.5, GPT-4, and New Bing, respectively).
"judge_fid", "judge_gpt35", "judge_chatgpt", "judge_gpt4", "judge_newbing": Boolean values indicating whether the corresponding model's answer was judged to be correct or incorrect (True for correct, False for incorrect) by human.
"improper": Boolean flag indicating whether the question was inappropriate or not (True for inappropriate, False for proper).
```
Here is an example of a data point:
```json
{
"question": "who got the first nobel prize in physics",
"golden_answer": "Wilhelm Conrad R\u00f6ntgen",
"answer_fid": "Wilhelm R\u00f6ntgen",
"judge_fid": true,
"answer_gpt35": "The first Nobel Prize in Physics was awarded to Wilhelm R\u00f6ntgen in 1901.",
"judge_gpt35": true,
"answer_chatgpt": "The first Nobel Prize in Physics was awarded in 1901 to Wilhelm R\u00f6ntgen for his discovery of X-rays.",
"judge_chatgpt": true,
"answer_gpt4": "The first Nobel Prize in Physics was awarded in 1901 to Wilhelm Conrad R\u00f6ntgen, a German physicist. He received the prize for his discovery of X-rays, a groundbreaking achievement that revolutionized the fields of medicine, physics, and chemistry.",
"judge_gpt4": true,
"answer_newbing": "According to Wikipedia,Wilhelm Conrad R\u00f6ntgen of Germany got the first Nobel Prize in Physics in 1901 for his discovery of X-rays. He received 150,782 SEK (Swedish krona) as the prize money.",
"judge_newbing": true,
"improper": false
}
```
## Data Scale
The scale of our dataset is detailed in the table below:

|models | Natural Questions| TriviaQA |
|------------------------------|------------------------------|------------------------------|
|DPR+FiD |3610|2000|
|GPT-3.5 |3610|2000|
|ChatGPT-3.5 |3610|2000|
|ChatGPT-4 |3610|2000|
|Bing Chat |3610|2000|

# 📜 License

This leaderboard adopts the style of [bird-bench](https://github.com/bird-bench/bird-bench.github.io).
This dataset is released under the [Apache-2.0 License](LICENSE).

# 📚 Citation

If you use this dataset in your research, please cite it as follows:
```bibtex
```
## 📮 Contact
We welcome contributions to improve this dataset!
If you have any questions or feedback, please feel free to reach out at wangcunxiang@westlake.edu.cn.


[Official Site](https://novelqa.github.io/)

This leaderboard adopts the style of [bird-bench](https://github.com/bird-bench/bird-bench.github.io).

[Official Site](https://novelqa.github.io/)

![Workflow](asset/flowchart.png)

0 comments on commit c468c7b

Please sign in to comment.