This project evaluates the performance of different large language models (LLMs) on the Mirror-Consistency metric using various datasets. The experiments are conducted using four major LLMs and can be customized by altering the model or dataset configurations.
This work has been accepted to the Findings of EMNLP 2024 as a short paper.
We have utilized four LLMs for our experiments:

- **gpt3.5-turbo-0613** - To use this model, provide the corresponding API by replacing the `_gpt35_api` function in `model.py` (see the sketch after this list).
- **qwen-turbo** - This model also requires the corresponding API, which should be replaced in the `_qwen_turbo_api` function in `model.py`.
- **Llama3-8B-Instruct** - To use this model, download the Hugging Face version of the model parameters and set the `model_path` parameter in `run.py` to the path of the downloaded model weights.
- **Llama3-70B-Instruct** - As with the Llama3-8B model, make sure to download the model parameters and reference their path correctly in `run.py`.
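As a point of reference, here is a minimal sketch of what the `_gpt35_api` replacement in `model.py` might look like. Only the function name comes from this repository; the signature, the OpenAI SDK usage, and the parameters are illustrative assumptions, not the project's actual implementation.

```python
# Illustrative sketch only: the signature and SDK calls are assumptions,
# not the repository's actual _gpt35_api implementation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def _gpt35_api(prompt: str, temperature: float = 0.7) -> str:
    """Send a single prompt to gpt-3.5-turbo-0613 and return the text reply."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0613",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content
```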
To switch between these models, modify `config.model_name` in `run.py`.
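For example, switching to one of the Llama models might look like the following; apart from `config.model_name` and `model_path`, which the instructions above mention, the exact attribute layout and accepted strings are assumptions about the project's config.

```python
# Hypothetical values: check run.py for the exact names the config expects.
config.model_name = "Llama3-8B-Instruct"
config.model_path = "/path/to/Meta-Llama-3-8B-Instruct"  # only needed for the local Llama models
```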
Our experiments utilize the following datasets:

- **GSM8K**: A dataset of grade-school math word problems to test arithmetic reasoning.
- **SVAMP**: A dataset designed to test the robustness of mathematical problem-solving models.
- **Date Understanding**: A dataset focusing on the comprehension of date and time expressions in natural language.
- **StrategyQA**: A question-answering dataset that requires multi-hop reasoning and strategy.

To use a different dataset, update `config.dataset_name` in `run.py`.
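For instance (the exact dataset-name strings accepted by the config are assumptions; check `run.py` for the values it recognizes):

```python
# Hypothetical value: the accepted strings may differ in run.py.
config.dataset_name = "gsm8k"  # e.g. "svamp", "date_understanding", or "strategyqa"
```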
For Mirror-Consistency experiments:

- **Set Up the Model**: Follow the instructions in the Models section to configure the desired model.
- **Modify Parameters**: Adjust `run.py` with the desired model and dataset parameters.
- **Execute Script**: Run `python run.py` directly to start the experiments (see the example after this list). We also provide tools for detailed performance analysis in `complete_evaluate.py`.
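A typical run might therefore look like the commands below; invoking `complete_evaluate.py` as a standalone script is an assumption, so check the file for its actual interface.

```bash
# Start the Mirror-Consistency experiments with the model/dataset configured in run.py
python run.py

# Detailed performance analysis afterwards (assumed invocation)
python complete_evaluate.py
```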
`check_pipeline.ipynb`: A Jupyter notebook that serves as a simple example of the generation process using the configured models.