[ACL 2024] MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues


MT-Bench-101

[ACL 2024] MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues

📃 Paper • 🏆 Leaderboard (WIP)

Todo

  • Release the research paper.
  • Release the evaluation code.
  • Release the dataset.
  • Develop and launch an online leaderboard.

💥What's New

  • [2024.05.28] Code and dataset are now available (See Installation for details). 🎉🎉🎉
  • [2024.05.15] MT-Bench-101 has been accepted to the ACL 2024 main conference. 🎉🎉🎉
  • [2024.02.22] Our paper is now accessible at https://arxiv.org/abs/2402.14762. 🎉🎉🎉

About MT-Bench-101

MT-Bench-101 is designed to evaluate the fine-grained abilities of LLMs in multi-turn dialogues. Based on a detailed analysis of real multi-turn dialogue data, we construct a three-tier hierarchical ability taxonomy. The benchmark comprises 4208 turns across 1388 multi-turn dialogues in 13 distinct tasks.

Installation

We integrated MT-Bench-101 into our forked OpenCompass. OpenCompass is a comprehensive platform for large model evaluation that provides a unified, easy-to-use interface for evaluating various tasks.

Create virtual env

Create a virtual environment for OpenCompass, then clone and install the OpenCompass code. See the OpenCompass website if you have any questions.

conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
conda activate opencompass
git clone https://github.com/sefira/opencompass opencompass
cd opencompass
pip install -e .

Data Preparation

The dataset is stored at the following path in this repo.

# Dataset file in this repo
data/subjective/mtbench101.jsonl

Copy the data file from this repo to the same relative path inside your OpenCompass checkout.

# Download the dataset from this repo and copy it into the OpenCompass folder
# Run after 'cd opencompass'; $PATH_THIS_REPO is the local path of this repo
mkdir -p data/subjective/
cp "$PATH_THIS_REPO/data/subjective/mtbench101.jsonl" data/subjective/
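
Before launching an evaluation, it can help to confirm that the copied file is intact. The following is a minimal sketch, not part of this repo or of OpenCompass: the helper name `check_jsonl` is hypothetical, and it only assumes that the file is in JSON Lines format (one JSON value per line), as the `.jsonl` extension suggests. It makes no assumption about field names or the number of records.

```python
import json
from pathlib import Path


def check_jsonl(path):
    """Count the non-empty lines of a JSON Lines file.

    Raises ValueError naming the first line that is not valid JSON.
    (Hypothetical helper for illustration; not part of OpenCompass.)
    """
    count = 0
    with Path(path).open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{lineno}: invalid JSON ({exc})") from exc
            count += 1
    return count


if __name__ == "__main__":
    # Path taken from the copy step above, relative to the OpenCompass root.
    target = Path("data/subjective/mtbench101.jsonl")
    if target.exists():
        print(f"{target}: {check_jsonl(target)} records parsed")
    else:
        print(f"{target} not found; run this from the OpenCompass root after copying the file")
```

Run it from the OpenCompass root directory. A `ValueError` usually points to a truncated or partially copied file.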

Evaluation

# run
python run.py configs/eval_subjective_mtbench101.py
# debug
python run.py configs/eval_subjective_mtbench101.py --debug

Leaderboard

[Image: MT-Bench-101 leaderboard]

Citation

If you find our work helpful, please consider citing our paper.

@article{bai2024mt,
  title={MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues},
  author={Bai, Ge and Liu, Jie and Bu, Xingyuan and He, Yancheng and Liu, Jiaheng and Zhou, Zhanhui and Lin, Zhuoran and Su, Wenbo and Ge, Tiezheng and Zheng, Bo and others},
  journal={arXiv preprint arXiv:2402.14762},
  year={2024}
}
