TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles

TurtleBench is a dynamic evaluation benchmark that assesses the reasoning capabilities of large language models (LLMs) through real-world yes/no puzzles. It emphasizes logical reasoning over knowledge recall and draws on user-generated data from a Turtle Soup puzzle platform.

Highlights

Objective and Unbiased: Eliminates the need for background knowledge, focusing purely on reasoning abilities.
Quantifiable Results: Clear, measurable outcomes (correct/incorrect/unknown) for easy comparison.
Constantly Evolving: Draws on real user-generated questions rather than a fixed test set, making the benchmark difficult to "game".
Language Understanding: Tests the model's ability to comprehend context and make logical inferences.
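
To make the "Quantifiable Results" point concrete, here is a minimal sketch of how judgments over the three outcomes could be tallied into an accuracy figure. The label strings and the `score` helper are illustrative assumptions; they are not taken from `eval.py` or `analyst.py`, which may represent and aggregate results differently.

```python
from collections import Counter

# Hypothetical label set; the strings used by the benchmark itself may differ.
LABELS = ("correct", "incorrect", "unknown")


def score(gold, predicted):
    """Return (accuracy, per-label tally of predictions) for two label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("nothing to score")
    unexpected = (set(gold) | set(predicted)) - set(LABELS)
    if unexpected:
        raise ValueError(f"unexpected labels: {sorted(unexpected)}")
    # A prediction counts as a hit only when it matches the gold label exactly.
    hits = sum(g == p for g, p in zip(gold, predicted))
    return hits / len(gold), Counter(predicted)


gold = ["correct", "incorrect", "incorrect", "unknown"]
predicted = ["correct", "incorrect", "correct", "unknown"]
accuracy, tally = score(gold, predicted)
print(f"accuracy={accuracy:.2f}", dict(tally))
```

Because every judgment falls into one of a small number of discrete labels, models can be compared with a single accuracy number, and the tally shows whether a model leans toward one label.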

Quick Start

# Install dependencies
conda create -n turtle python=3.10
conda activate turtle
pip install -r requirements.txt

# Set up configuration
cp config_example.ini config.ini
# Edit config.ini to add your API key

# Run evaluations
python eval.py --shot 0 --models Claude_3_5_Sonnet --language zh --save_interval 10 --time_delay 2

# Analyze results
python analyst.py

Project Structure

.
├── README.md                  # Project description and documentation
├── analyst.py                 # Data analysis script
├── archived                   # Archived result files
├── config.ini                 # Configuration file (for actual run)
├── config_example.ini         # Example configuration file
├── data                       # Directory containing project data
│   ├── en                     # Subdirectory for English data
│   └── zh                     # Subdirectory for Chinese data
├── eval.py                    # Evaluation script
├── insights                   # Analysis and visualization-related scripts
│   ├── case_handler.py        # Error-case handling script for OpenAI o1
│   ├── plots.ipynb            # Data visualization and plot generation
│   ├── token_boxplot.py       # Box plot for token usage
│   └── token_cal.py           # Token calculation script
├── logs                       # Directory for storing logs
├── models.py                  # Model definition script
├── outputs                    # Directory for storing output results
├── prompts                    # Files related to prompt generation
├── requirements.txt           # Project dependencies list
└── stats                      # Statistics and result-related files

Results

Note: You can find detailed experiment results in the archived directory.

Citation

@article{TurtleBench,
      title={TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles}, 
      author={Qingchen Yu and Shichao Song and Ke Fang and Yunfeng Shi and Zifan Zheng and Hanyu Wang and Simin Niu and Zhiyu Li},
      journal={arXiv preprint arXiv:2410.05262},
      year={2024},
}

TODOs

feat: Add LLM loading method
feat: Optimize answer extraction & matching methods
