# [EMNLP 2024] Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models
This repository contains the code for the paper "Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models". See the paper for an overview of the method's workflow.
## Quick start

You can follow the steps below to quickly get up and running with Multi-expert Prompting.
1. In a conda environment with PyTorch / CUDA available, clone and download this repository.

2. Create and activate a new virtual environment:

   ```bash
   conda create -n mep python=3.11
   conda activate mep
   ```

3. In the top-level directory, run:

   ```bash
   pip install -r requirements.txt
   ```

4. To run OpenAI models, export your API key:

   ```bash
   export OPENAI_API_KEY=your_api_key_here
   ```

5. Once everything is installed correctly, start an interactive session:

   ```bash
   python src/interactive.py --model=[model] --num_experts=[number-of-experts] --temperature=[temperature] [--verbose]
   ```
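Before running anything, it can be worth sanity-checking the environment. Below is a minimal sketch of such a check (the file name `check_env.py` is just an illustration, not part of the repository):

```python
# check_env.py -- illustrative environment sanity check.
import os

import torch

# Open-source models (Mistral, Llama) benefit greatly from a CUDA GPU.
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")

# OpenAI models require the API key exported in step 4.
if os.environ.get("OPENAI_API_KEY"):
    print("OPENAI_API_KEY is set.")
else:
    print("OPENAI_API_KEY is NOT set; OpenAI models will not work.")
```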
Currently, we support the following open-source (Mistral, Meta-Llama) and proprietary (OpenAI) models:
- `--model`: one of `gpt-4o`, `chatgpt-4o-latest`, `gpt-4o-2024-08-06`, `gpt-3.5-turbo`, `mistralai/Mistral-7B-Instruct-v0.2`, `meta-llama/Llama-3.1-8B-Instruct`.
- `--num_experts`: any number; we recommend fewer than 10 to avoid exceeding the model's context window.
- `--temperature`: typically between 0 and 1.
Example with `gpt-3.5-turbo`, 3 experts, and temperature 0:

```bash
python src/interactive.py --model="gpt-3.5-turbo" --num_experts=3 --temperature=0 --verbose
```
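Conceptually, Multi-expert Prompting has the model simulate several experts, collects one answer per expert, and aggregates them into a final answer. The sketch below illustrates that idea with the OpenAI Python client; it is a simplified illustration only, not the repository's implementation (see `src/interactive.py` and the paper for the full aggregation procedure):

```python
# multi_expert_sketch.py -- simplified illustration of the multi-expert idea;
# not the repository's implementation (see src/interactive.py for that).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL, NUM_EXPERTS, TEMPERATURE = "gpt-3.5-turbo", 3, 0.0


def chat(prompt: str) -> str:
    """Send a single-turn prompt and return the model's reply."""
    response = client.chat.completions.create(
        model=MODEL,
        temperature=TEMPERATURE,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


question = "Is coffee good for your health?"

# 1. Ask the model to propose expert roles suited to the question.
roles = chat(
    f"List {NUM_EXPERTS} distinct expert roles best suited to answer the "
    f"question below, one per line.\nQuestion: {question}"
)

# 2. Collect one answer per expert persona.
answers = [
    chat(f"You are {role.strip()}. Answer this question: {question}")
    for role in roles.splitlines()
    if role.strip()
]

# 3. Aggregate the expert answers into one final answer.
combined = "\n\n".join(f"Expert {i + 1}: {a}" for i, a in enumerate(answers))
print(chat(
    f"Combine the following expert answers into a single reliable answer "
    f"to the question '{question}':\n\n{combined}"
))
```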
## Evaluation

**Benchmark experiments:** Benchmarking data and scripts are coming soon! In the meantime, you can adapt `src/interactive.py` to run your own benchmark experiments; one possible shape for such a run is sketched below.
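A sketch only: the file names are placeholders, and `multi_expert_answer` stands in for whatever answering routine you adapt from `src/interactive.py` (or the sketch above).

```python
# run_benchmark_sketch.py -- skeleton for a custom benchmark run.
import json


def multi_expert_answer(question: str) -> str:
    # Placeholder: replace with the multi-expert routine adapted from
    # src/interactive.py (or the sketch above).
    return f"(answer to: {question})"


# Expects one JSON object per line, e.g. {"id": 0, "question": "..."}.
with open("my_benchmark.jsonl") as fin, open("outputs.jsonl", "w") as fout:
    for line in fin:
        example = json.loads(line)
        record = {
            "id": example.get("id"),
            "answer": multi_expert_answer(example["question"]),
        }
        fout.write(json.dumps(record) + "\n")
```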
**Benchmark evaluations:** We share our outputs in the folder `./evaluation/results`. To obtain the evaluation results, perform the following steps:
1. Navigate to the `metrics` directory:

   ```bash
   cd Multi-expert-Prompting/evaluation/metrics
   ```

2. Run the scripts there to compute the metrics:

   ```bash
   python BOLD_compute.py
   python TOXICITY_compute.py
   python HONEST_compute.py
   ```
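These scripts build on the Hugging Face `evaluate` library (see Acknowledgements). As a rough illustration of what such a metric computation involves, here is a sketch using `evaluate`'s toxicity measurement; it is not the exact logic of `TOXICITY_compute.py`:

```python
# toxicity_sketch.py -- illustrative toxicity scoring with Hugging Face
# `evaluate`; not the exact logic of TOXICITY_compute.py.
import evaluate

# Downloads and loads a pretrained toxicity classifier on first use.
toxicity = evaluate.load("toxicity", module_type="measurement")

model_outputs = [
    "People from that city are wonderful neighbors.",
    "Everyone from that city is awful.",
]

# One toxicity score in [0, 1] per output text.
scores = toxicity.compute(predictions=model_outputs)["toxicity"]
print(f"max toxicity: {max(scores):.3f}")
```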
Note: Evaluation instructions for TruthfulQA, FactualityPrompt and ExpertQA are coming soon!
## Results

The tables below summarize the performance of Multi-expert Prompting compared to several strong baselines. The details of our outputs are shared in the folder `./evaluation/results`.
| Mistral-7B-Inst. v0.2 | TruthfulQA ↑ | FactualityPrompt ↓ | BOLD ↓ | HONEST ↓ |
|---|---|---|---|---|
| Zero-shot | 76.00 | 8.98/16.07 | 0.000 | 0.012/0.009 |
| Zero-shot-CoT | 78.70 | 9.28/14.87 | 0.000 | 0.014/0.013 |
| Self-refine | 81.88 | 10.36/14.95 | 0.000 | 0.007/0.008 |
| Universal Self-consistency | 81.64 | 9.98/15.21 | 0.000 | 0.007/0.008 |
| Multi-agent Debate | 80.78 | 17.57/18.27 | 0.000 | 0.004/0.007 |
| ExpertPrompting | 80.34 | 11.43/15.32 | 0.000 | 0.005/0.005 |
| Multi-expert Prompting | 87.15 | 8.16/14.70 | 0.000 | 0.003/0.005 |
| ChatGPT | TruthfulQA ↑ | FactualityPrompt ↓ | BOLD ↓ | HONEST ↓ |
|---|---|---|---|---|
| Zero-shot | 68.05 | 6.99/12.90 | 0.163 | 0.038/0.023 |
| Zero-shot-CoT | 70.38 | 6.93/13.75 | 0.163 | 0.006/0.005 |
| Self-refine | 75.89 | 7.11/13.96 | 0.064 | 0.006/0.007 |
| Universal Self-consistency | 77.11 | 5.51/9.71 | 0.000 | 0.010/0.008 |
| Multi-agent Debate | 64.87 | 5.64/13.06 | 0.000 | 0.005/0.004 |
| ExpertPrompting | 80.66 | 5.64/15.66 | 0.129 | 0.004/0.004 |
| Multi-expert Prompting | 89.35 | 4.54/9.45 | 0.000 | 0.004/0.003 |
Key: ↑ indicates higher is better; ↓ indicates lower is better.
## Bug reports

Please report any software bugs or other problems with the models through one of the following means:
- This GitHub repo.
- Do Xuan Long via xuanlong.do@u.nus.edu.
## Citation

If you find this repository helpful in your research, we appreciate your ⭐ and a citation of the paper:
```bibtex
@misc{long2024multiexpertpromptingimprovesreliability,
  title={Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models},
  author={Do Xuan Long and Duong Ngoc Yen and Anh Tuan Luu and Kenji Kawaguchi and Min-Yen Kan and Nancy F. Chen},
  year={2024},
  eprint={2411.00492},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2411.00492},
}
```
## Acknowledgements

We would like to acknowledge the Hugging Face `evaluate` and `transformers` libraries.