Issue with Fine-tuning Mistral 7B Model - Results Discrepancy #29

Open
lyf-00 opened this issue Feb 28, 2024 · 4 comments

@lyf-00

lyf-00 commented Feb 28, 2024

Hello,
I attempted to replicate the experiment by fine-tuning Mistral-7B on the MetaMathQA dataset, but the results I obtained do not match the ones reported in the repository.

Reproduction steps

I used the following parameters in run_mistral.sh.

export MODEL_PATH='mistralai/Mistral-7B-v0.1'
export SAVE_PATH='0224_mistral-7b-metamath395'
export MASTER_ADDR="localhost"
export MASTER_PORT="1231"
export GLOO_SOCKET_IFNAME="lo"
export NCCL_SOCKET_IFNAME="lo"
export WANDB_DISABLED=true
export HF_TOKEN="token of your huggingface"
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 -m torch.distributed.launch --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} --nproc_per_node=8 --use_env train_math.py \
    --model_name_or_path $MODEL_PATH \
    --data_path MetaMathQA-395K.json \
    --data_length 10000000 \
    --bf16 True \
    --output_dir $SAVE_PATH \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100000 \
    --save_total_limit 0 \
    --learning_rate 5e-6 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'MistralDecoderLayer' \
    --tf32 True
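# Note: effective global batch size = 8 GPUs x 2 per-device x 8 accumulation steps = 128 sequences per optimizer step.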
python eval_gsm8k.py --model $SAVE_PATH --data_file ./data/test/GSM8K_test.jsonl
python eval_math.py --model $SAVE_PATH --data_file ./data/test/MATH_test.jsonl

and I get
gsm8k acc==== 0.6618650492797574
math acc==== 0.2274

i.e. roughly 66.2 on GSM8K and 22.7 on MATH, which is well below the reported 77.7 and 28.2.

Environment details

Here are the details of my Python environment:

transformers==4.34.0
wandb==0.15.3
torch==2.0.1
sentencepiece==0.1.99
tokenizers==0.14
accelerate==0.21.0
bitsandbytes==0.40.0
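
For completeness, these pins correspond to a single install command (assuming a fresh Python environment; package names and versions are exactly as listed above):

pip install transformers==4.34.0 wandb==0.15.3 torch==2.0.1 \
    sentencepiece==0.1.99 tokenizers==0.14 accelerate==0.21.0 bitsandbytes==0.40.0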

I would appreciate any guidance or suggestions you could provide to help resolve this discrepancy. Thank you in advance for your time and assistance.

Best regards,
lyf-00

@nuochenpku

Same problem for me! Did you solve it? I also cannot reproduce the results with either Mistral or LLaMA.

@ytyz1307zzh

I encountered the same issue when trying to train Mistral-7B on MetaMathQA. My environment is:

transformers==4.34.0
torch==2.0.1
sentencepiece==0.1.99
tokenizers==0.14.1
accelerate==0.21.0

I only got 69% accuracy on GSM8K and 24% on MATH after 3 epochs with LR 5e-6 and a global batch size of 128. Due to my limited computational resources, I added gradient checkpointing and flash attention to the original code and changed the per_device_batch_size to 1 (so gradients accumulate for 16 steps on 8 GPUs), but I don't think these modifications should cause a significant difference in performance.
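
For reference, a minimal sketch of what the launch command from run_mistral.sh might look like with these modifications (per-device batch size 1, 16 gradient-accumulation steps, gradient checkpointing). The flag names assume standard Hugging Face TrainingArguments; enabling flash attention additionally requires a change inside train_math.py where the model is loaded, which is not shown here.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 -m torch.distributed.launch --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} --nproc_per_node=8 --use_env train_math.py \
    --model_name_or_path $MODEL_PATH \
    --data_path MetaMathQA-395K.json \
    --bf16 True \
    --output_dir $SAVE_PATH \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --gradient_checkpointing True \
    --learning_rate 5e-6 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'MistralDecoderLayer' \
    --tf32 True
# Effective global batch size stays at 8 GPUs x 1 per-device x 16 accumulation steps = 128.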

@Nipers

Nipers commented Nov 4, 2024

My result with llama-factory and the hyperparameters reported in the paper is 72.2% on GSM8K. I do not understand why there are so many failures when trying to reproduce the result.

@AaronZLT

> My result with llama-factory and the hyperparameters reported in the paper is 72.2% on GSM8K. I do not understand why there are so many failures when trying to reproduce the result.

I just can't get such a high score, and I don't think this is a unique case.
