This repository contains the code for the paper [Prediction-Powered Ranking of Large Language Models](https://arxiv.org/abs/2402.17826).
All the code is written in Python 3.11.2.
To create a virtual environment and install the project dependencies, run the following commands:
```
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```
To run the experiments:
```
python3 scripts/llm-ranking.py <file_config>
```
where `file_config` is a json file with the following configuration parameters (a minimal example is sketched after this list):

- `seed`: seed used for random sampling.
- `iterations`: number of times each experiment is run.
- `human_file`: dataset containing pairwise comparisons by humans.
- `llm_files`: list of datasets containing pairwise comparisons by strong LLMs (one for each).
- `experiments_base_dir`: folder where the output will be stored.
- `judges`: list of names of the strong LLMs (same order as their corresponding files in `llm_files`).
- `n`: number of comparisons to subsample from `human_file`.
- `alpha`: error probability parameter.
- `ignore_ties`: default 0. If 1, ignore comparisons where the verdict is a tie.
- `methods`: list of methods to construct rank-sets, among `baseline`, `human only`, `llm` and `ppr`.
- `models`: list of models to be ranked. If `[]`, all models in `human_file` are ranked.
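For illustration, a minimal configuration could look as follows. The parameter names come from the list above, and the data file names from the `data` folder described below; all values are hypothetical placeholders, not necessarily the settings used in the paper:
```json
{
    "seed": 42,
    "iterations": 100,
    "human_file": "data/human.json",
    "llm_files": ["data/gpt-4-0125-preview.json",
                  "data/claude-3-opus-20240229.json",
                  "data/gpt-3.5-turbo.json"],
    "experiments_base_dir": "experiments",
    "judges": ["gpt-4-0125-preview", "claude-3-opus-20240229", "gpt-3.5-turbo"],
    "n": 1000,
    "alpha": 0.05,
    "ignore_ties": 0,
    "methods": ["baseline", "human only", "llm", "ppr"],
    "models": []
}
```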
The file `config.json` contains the configuration parameters we used for our experimentation.
The folder `data` contains the datasets used for our experimentation:
- `human.json`: pairwise comparisons by humans.
- `gpt-4-0125-preview.json`: pairwise comparisons by GPT-4.
- `claude-3-opus-20240229.json`: pairwise comparisons by Claude 3.
- `gpt-3.5-turbo.json`: pairwise comparisons by GPT-3.5.
The folder `scripts` contains the code to construct rank-sets and run experiments:
- `llm-ranking.py`: main file.
- `data_process.py`: inputs and subsamples from datasets.
- `estimate.py`: implements Algorithms 1, 3 and 4 from the paper to compute $\hat{\theta}$ and $\widehat{\Sigma}$.
- `ranksets.py`: implements Algorithm 2 from the paper to construct rank-sets.
- `run_experiments.py`: runs experiments for all input parameters.
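To give a flavor of the estimation step, here is a minimal sketch of a prediction-powered estimate of a pairwise win rate, assuming the paper follows the standard prediction-powered inference construction. The function and variable names are hypothetical, and this is not the repository's actual implementation of Algorithms 1, 3 and 4:
```python
import numpy as np

def ppr_win_rate(human_votes, llm_votes_small, llm_votes_large):
    """Prediction-powered mean estimator (a sketch): the mean of the strong
    LLM's verdicts on a large unlabeled set, corrected by the human-vs-LLM
    discrepancy measured on the small human-labeled set."""
    rectifier = np.mean(human_votes) - np.mean(llm_votes_small)
    return np.mean(llm_votes_large) + rectifier

# Hypothetical example: 1 = model A wins the comparison, 0 = model B wins.
rng = np.random.default_rng(0)
human = rng.binomial(1, 0.60, size=200)        # n human comparisons
llm_small = rng.binomial(1, 0.65, size=200)    # LLM verdicts on the same pairs
llm_large = rng.binomial(1, 0.65, size=20000)  # LLM verdicts on many more pairs
print(ppr_win_rate(human, llm_small, llm_large))
```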
The folder `plots` contains the code to create the plots:
- `create_plots.py`: generates all plots.
- `Result.py`: class that computes metrics for each experiment.
- `ExperimentCollection.py`: class that contains multiple experiments.
- `PlotRanksets.py`: code to plot figures 3, 4, 9 and 10.
- `PlotIntersectSize.py`: code to plot figures 1, 2, 6, 7 and 8.
The results are stored in directory `experiments_base_dir`. For every combination of `n` and `alpha`, a new child folder is created inside `experiments_base_dir`. For example, for `n=1000` and `alpha=0.05`, folder `experiments_base_dir/n1000_a05` will be created. Inside each child folder, multiple json files are created (as many as the number of `iterations`). Each json file is named `x.json`, where `x` is the iteration number.
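For instance, with `n=1000`, `alpha=0.05` and three iterations, the layout would look like the sketch below (whether iteration numbering starts at 0 or 1 is an assumption here):
```
experiments_base_dir/
└── n1000_a05/
    ├── 0.json
    ├── 1.json
    └── 2.json
```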
These json files contain the rank-sets of their respective iteration, in the following format:
```
{
    "method 1": { "model 1": [low rank, up rank],
                  ...
                  "model k": [low rank, up rank]
                },
    ...
    "method m": { "model 1": [low rank, up rank],
                  ...
                  "model k": [low rank, up rank]
                }
}
```
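As an illustration, here is a minimal sketch of how such a result file could be loaded and inspected; the file path is a hypothetical example:
```python
import json
from pathlib import Path

# Hypothetical result file: iteration 0 of the n=1000, alpha=0.05 run.
result_file = Path("experiments_base_dir/n1000_a05/0.json")

with result_file.open() as f:
    ranksets = json.load(f)

# Print the rank-set [low rank, up rank] of every model under every method.
for method, models in ranksets.items():
    for model, (low, up) in models.items():
        print(f"{method}: {model} is ranked between {low} and {up}")
```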
First, run the experiments via `llm-ranking.py` using `config.json`.
Then, install the plot code requirements:
```
pip install -r plots/plot_requirements.txt
```
Then, run:
```
python3 plots/create_plots.py
```
Figures 3, 4, 9 and 10 are stored in folder `plots/ranksets`.
Figures 1, 2, 6, 7 and 8 are stored in folder `plots/intersect_size`.
If you use parts of the code in this repository for your own research purposes, please consider citing:
```
@article{chatzi2024predictionpowered,
  title={Prediction-Powered Ranking of Large Language Models},
  author={Ivi Chatzi and Eleni Straitouri and Suhas Thejaswi and Manuel Gomez Rodriguez},
  year={2024},
  journal={arXiv preprint arXiv:2402.17826}
}
```