sasarkar/qwen finetuning bucketing #1130

Open · wants to merge 12 commits into main

Conversation

@ssarkar2 (Collaborator) commented Jul 9, 2024

What does this PR do?

Add bucketing to Qwen training

earlier version: #1128
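
For context, bucketing here means padding each batch up to one of a small, fixed set of sequence lengths instead of the raw batch maximum, so the HPU graph is compiled for only a handful of shapes. A minimal sketch of the idea (not the exact sft_trainer.py change; the evenly spaced bucket boundaries are an assumption):

# Sketch: pad each batch to the nearest of num_buckets fixed lengths so the
# accelerator sees only a small set of shapes and avoids recompiling for
# every new sequence length. Even bucket spacing is an assumption here.
import math

def make_buckets(max_seq_length, num_buckets):
    step = math.ceil(max_seq_length / num_buckets)
    return [min(step * (i + 1), max_seq_length) for i in range(num_buckets)]

def bucketed_length(batch_max_len, buckets):
    # smallest bucket that fits the batch; fall back to the largest
    for b in buckets:
        if batch_max_len <= b:
            return b
    return buckets[-1]

buckets = make_buckets(max_seq_length=512, num_buckets=8)  # [64, 128, ..., 512]
print(bucketed_length(200, buckets))  # 256 -> this batch would be padded to 256 tokens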

results:

time python3 sft.py  --model_name_or_path /root/sasarkar/Qwen2_7B/ --dataset_name "philschmid/dolly-15k-oai-style"  --streaming False  --bf16 True  --subset ''  --output_dir ./model_qwen  --num_train_epochs 1  --per_device_train_batch_size 4  --evaluation_strategy "no"  --save_strategy "no"  --learning_rate 3e-4  --warmup_ratio  0.03  --lr_scheduler_type "cosine"  --max_grad_norm  0.3  --logging_steps 1  --do_train  --do_eval  --use_habana  --use_lazy_mode  --throughput_warmup_steps 3  --lora_r 4  --lora_alpha=16  --lora_dropout=0.05  --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj"  --max_seq_length 512  --adam_epsilon 1e-08 --packing False --num_bucket 8
{'train_runtime': 1218.7741, 'train_samples_per_second': 12.185, 'train_steps_per_second': 3.046, 'train_loss': 1.5069085590719007, 'epoch': 1.0, 'memory_allocated (GB)': 31.57, 'max_memory_allocated (GB)': 39.44, 'total_memory_available (GB)': 94.62}
***** train metrics *****
  epoch                       =         1.0
  max_memory_allocated (GB)   =       39.44
  memory_allocated (GB)       =       31.57
  total_flos                  = 230347401GF
  total_memory_available (GB) =       94.62
  train_loss                  =      1.5069
  train_runtime               =  0:20:18.77
  train_samples_per_second    =      12.185
  train_steps_per_second      =       3.046
100%|██████████| 94/94 [00:22<00:00,  4.13it/s]
***** eval metrics *****
  epoch                       =        1.0
  eval_loss                   =     1.5511
  eval_runtime                = 0:00:22.83
  eval_samples                =        751
  eval_samples_per_second     =     32.141
  eval_steps_per_second       =      4.023
  max_memory_allocated (GB)   =      75.66
  memory_allocated (GB)       =      31.57
  perplexity                  =     4.7165
  total_memory_available (GB) =      94.62

real    21m31.841s
user    35m22.686s
sys     42m57.211s

Numbers without this branch:

{'train_runtime': 8899.1643, 'train_samples_per_second': 1.611, 'train_steps_per_second': 0.403, 'train_loss': 1.5117962875673847, 'epoch': 1.0, 'memory_allocated (GB)': 27.85, 'max_memory_allocated (GB)': 34.25, 'total_memory_available (GB)': 94.62}
100%|██████████| 3565/3565 [2:28:19<00:00,  2.50s/it]
***** train metrics *****
  epoch                       =         1.0
  max_memory_allocated (GB)   =       34.25
  memory_allocated (GB)       =       27.85
  total_flos                  = 184625734GF
  total_memory_available (GB) =       94.62
  train_loss                  =      1.5118
  train_runtime               =  2:28:19.16
  train_samples_per_second    =       1.611
  train_steps_per_second      =       0.403
100%|██████████| 94/94 [00:49<00:00,  1.89it/s]
***** eval metrics *****
  epoch                       =        1.0
  eval_loss                   =     1.5515
  eval_runtime                = 0:00:49.63
  eval_samples                =        751
  eval_samples_per_second     =     14.713
  eval_steps_per_second       =      1.842
  max_memory_allocated (GB)   =      49.32
  memory_allocated (GB)       =      27.85
  perplexity                  =     4.7184
  total_memory_available (GB) =      94.62

real    150m3.097s
user    308m25.621s
sys     106m33.356s

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@ssarkar2 ssarkar2 requested a review from regisss as a code owner July 9, 2024 20:38
@ssarkar2 ssarkar2 changed the title sasarkar/qwen bucketing sasarkar/qwen finetuning bucketing Jul 9, 2024
@ssarkar2 ssarkar2 mentioned this pull request Jul 9, 2024
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@libinta libinta added the run-test Run CI for PRs from external contributors label Jul 16, 2024
@regisss (Collaborator) commented Jul 22, 2024

@ssarkar2 This will only work for Qwen? Or is it also applicable to other models?

@ssarkar2 (Collaborator Author)

@ssarkar2 This will only work for Qwen? Or is it also applicable to other models?

I haven't tried other models, but it should work for them since the change is in the SFT trainer.

@ssarkar2 (Collaborator Author) commented Jul 23, 2024

8x (8-card) command:

time DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python3 ../gaudi_spawn.py --world_size 8 --use_deepspeed sft.py  --model_name_or_path /root/sasarkar/qwen2_7b/  --dataset_name "philschmid/dolly-15k-oai-style"  --streaming False  --bf16 True  --subset "''"  --output_dir ./model_qwen_70b  --num_train_epochs 1  --per_device_train_batch_size 8 --evaluation_strategy "no"  --save_strategy "no"  --learning_rate 3e-4  --warmup_ratio  0.03  --lr_scheduler_type "cosine"  --max_grad_norm  0.3  --logging_steps 1  --do_train  --do_eval  --use_habana  --use_lazy_mode  --throughput_warmup_steps 3  --lora_r 4  --lora_alpha=16  --lora_dropout=0.05  --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" --max_seq_length 512  --adam_epsilon 1e-08 --packing False --num_bucket 8 --max_steps 100

{'train_runtime': 324.0838, 'train_samples_per_second': 30.12, 'train_steps_per_second': 0.471, 'train_loss': 1.5799240708351134, 'epoch': 0.45, 'memory_allocated (GB)': 39.92, 'max_memory_allocated (GB)': 84.57, 'total_memory_available (GB)': 94.62}
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [05:24<00:00, 3.24s/it]
***** train metrics *****
  epoch                       =     0.4484
  max_memory_allocated (GB)   =      84.57
  memory_allocated (GB)       =      39.92
  total_flos                  = 135047763GF
  total_memory_available (GB) =      94.62
  train_loss                  =     1.5799
  train_runtime               = 0:05:24.08
  train_samples_per_second    =      30.12
  train_steps_per_second      =      0.471

***** eval metrics *****
  epoch                       =     0.4484
  eval_loss                   =     1.5758
  eval_runtime                = 0:00:15.60
  eval_samples                =        751
  eval_samples_per_second     =    193.217
  eval_steps_per_second       =      3.111
  max_memory_allocated (GB)   =      84.57
  memory_allocated (GB)       =      39.92
  perplexity                  =     4.8347
  total_memory_available (GB) =      94.62

time to run: 6m40.188s

Without bucketing:
{'train_runtime': 2761.4812, 'train_samples_per_second': 2.327, 'train_steps_per_second': 0.036, 'train_loss': 1.5813338887691497, 'epoch': 0.45, 'memory_allocated (GB)': 39.09, 'max_memory_allocated (GB)': 83.52, 'total_memory_available (GB)': 94.62}
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [46:01<00:00, 27.61s/it]
***** train metrics *****
  epoch                       =     0.4484
  max_memory_allocated (GB)   =      83.52
  memory_allocated (GB)       =      39.09
  total_flos                  = 102272664GF
  total_memory_available (GB) =      94.62
  train_loss                  =     1.5813
  train_runtime               = 0:46:01.48
  train_samples_per_second    =      2.327
  train_steps_per_second      =      0.036
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:34<00:00, 2.85s/it]
***** eval metrics *****
  epoch                       =     0.4484
  eval_loss                   =     1.5763
  eval_runtime                = 0:00:38.30
  eval_samples                =        751
  eval_samples_per_second     =     23.697
  eval_steps_per_second       =      0.382
  max_memory_allocated (GB)   =      83.52
  memory_allocated (GB)       =      39.09
  perplexity                  =     4.8372
  total_memory_available (GB) =      94.62

Run time: 48 min

@ssarkar2 (Collaborator Author)

TODO: make Qwen2-72B run using LoRA.

The command I was using earlier didn't have --deepspeed; maybe that's why?

python3 ../gaudi_spawn.py --use_deepspeed --world_size 8 sft.py \
    --model_name_or_path "/root//mnt/weka/data/Qwen/Qwen2-72B/" \
    --dataset_path "/root/litang/QWEN/qwen2_finetune_bench_scripts/belle_chat_ramdon_10k.json" \
    --bf16 True \
    --output_dir ./model_lora_qwen \
    --num_train_epochs 5 \
    --per_device_train_batch_size 10 \
    --gradient_accumulation_steps 8  \
    --gradient_checkpointing True \
    --evaluation_strategy "no" \
    --save_strategy "no" \
    --learning_rate 3e-4 \
    --warmup_ratio  0.03 \
    --lr_scheduler_type "cosine" \
    --max_grad_norm  0.3 \
    --logging_steps 1 \
    --do_train \
    --do_eval \
    --use_habana \
    --use_peft True \
    --pipelining_fwd_bwd True \
    --use_lazy_mode \
    --throughput_warmup_steps 3 \
    --lora_r 4 \
    --lora_alpha=16 \
    --lora_dropout=0.05 \
    --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" \
    --max_seq_length 2048 \
    --adam_epsilon 3e-4 --use_flash_attention True --deepspeed /root/litang/QWEN/qwen2_finetune_bench_scripts/ds_config.json 2>&1 | tee log8x.txt

@ssarkar2 (Collaborator Author) commented Jul 24, 2024

time python3 ../gaudi_spawn.py --use_deepspeed --world_size 8  sft.py  --model_name_or_path /mnt/weka/data/Qwen/Qwen2-72B/ --dataset_name "philschmid/dolly-15k-oai-style"  --streaming False  --bf16 True  --subset "''"  --output_dir ./model_qwen  --num_train_epochs 1  --per_device_train_batch_size 4  --evaluation_strategy "no"  --save_strategy "no"  --learning_rate 3e-4  --warmup_ratio  0.03  --lr_scheduler_type "cosine"  --max_grad_norm  0.3  --logging_steps 1  --do_train  --do_eval  --use_habana  --use_lazy_mode  --throughput_warmup_steps 3  --lora_r 4  --lora_alpha=16  --lora_dropout=0.05  --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj"  --max_seq_length 512  --adam_epsilon 1e-08 --packing False  --gradient_checkpointing True --pipelining_fwd_bwd True --deepspeed ds_config.json --max_steps 50
{'train_runtime': 485.8911, 'train_samples_per_second': 4.279, 'train_steps_per_second': 0.134, 'train_loss': 1.2488236403465272, 'epoch': 0.11, 'memory_allocated (GB)': 25.32, 'max_memory_allocated (GB)': 83.81, 'total_memory_available (GB)': 94.62}
100%|██████████| 50/50 [08:05<00:00,  9.72s/it]
***** train metrics *****
  epoch                       =     0.1121
  max_memory_allocated (GB)   =      83.81
  memory_allocated (GB)       =      25.32
  total_flos                  =    52339GF
  total_memory_available (GB) =      94.62
  train_loss                  =     1.2488
  train_runtime               = 0:08:05.89
  train_samples_per_second    =      4.279
  train_steps_per_second      =      0.134
100%|██████████| 12/12 [00:33<00:00,  2.83s/it]
***** eval metrics *****
  epoch                       =     0.1121
  eval_loss                   =     1.2832
  eval_runtime                = 0:00:36.53
  eval_samples                =        751
  eval_samples_per_second     =     21.228
  eval_steps_per_second       =      0.342
  max_memory_allocated (GB)   =      83.81
  memory_allocated (GB)       =      27.64
  perplexity                  =     3.6081
  total_memory_available (GB) =      94.62


time: 14m21.839s
{'train_runtime': 342.6403, 'train_samples_per_second': 6.969, 'train_steps_per_second': 0.218, 'train_loss': 1.2489857983589172, 'epoch': 0.11, 'memory_allocated (GB)': 25.9, 'max_memory_allocated (GB)': 93.88, 'total_memory_available (GB)': 94.62}
100%|██████████| 50/50 [05:42<00:00,  6.85s/it]
***** train metrics *****
  epoch                       =     0.1121
  max_memory_allocated (GB)   =      93.88
  memory_allocated (GB)       =       25.9
  total_flos                  =    66837GF
  total_memory_available (GB) =      94.62
  train_loss                  =      1.249
  train_runtime               = 0:05:42.64
  train_samples_per_second    =      6.969
  train_steps_per_second      =      0.218
100%|██████████| 12/12 [00:33<00:00,  2.80s/it]
***** eval metrics *****
  epoch                       =     0.1121
  eval_loss                   =     1.2833
  eval_runtime                = 0:00:36.79
  eval_samples                =        751
  eval_samples_per_second     =     21.827
  eval_steps_per_second       =      0.351
  max_memory_allocated (GB)   =      93.88
  memory_allocated (GB)       =      28.79
  perplexity                  =     3.6086
  total_memory_available (GB) =      94.62

time: 9m3.384s

Collaborator:

We already have some tests for sft.py in https://github.com/huggingface/optimum-habana/blob/main/tests/test_examples.py. Is it possible to integrate this one there?

Collaborator Author:

Some extra command-line args that I have:

--subset
--max_steps (to keep the test short)
--lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj"
--lr_scheduler_type
--use_peft
--packing False
etc.

Not sure how to fit these into test_examples.py.

Collaborator:

You should be able to add any specific extra arguments in the extra_arguments field of the JSON baseline files; here is an example for Qwen:

"extra_arguments": [

Collaborator:

Let me know if that's good enough, otherwise we'll just keep the current version.

Collaborator Author:

@regisss
OK, I think I can almost make test_examples.py work.

I have a variation in the command line:

if "72" in model_name:
        command += [
            "--max_steps",
            "50",
            "--gradient_checkpointing",
            "True",
            "--pipelining_fwd_bwd",
            "True",
        ] 
    else:
        command += ["--max_steps", "100"]

Also, I have a particular DeepSpeed config file for the Qwen2-72B model, which I create as a tmp file, dump, and use:

if "72" in model_name:
            command += ["--deepspeed", fp.name]

Any suggestions on how these might be incorporated into test_examples.py? For the if/else in the command line based on the 7B or 72B model, I could enter the Qwen2-7B settings in Qwen2_7B.json and create a new Qwen2-72B.json for the 72B model. However, MODELS_TO_TEST_MAPPING has the entry "qwen2": [("Qwen/Qwen2-7B", "Habana/qwen")], so then I'd maybe have to change that as well?

Just checking if you have an easy/obvious solution; otherwise I'll stick with the test file I have now.
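
For reference, the temp-file approach described above could look roughly like the sketch below; the config contents are assumptions, since the actual ds_config.json used for Qwen2-72B isn't shown in this thread:

# Hedged sketch: dump a DeepSpeed config to a temporary file and pass it via
# --deepspeed. The ZeRO stage and other fields here are assumptions.
import json
import tempfile

ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

command = ["python3", "sft.py"]  # placeholder for the command the test builds

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as fp:
    json.dump(ds_config, fp)
    fp.flush()
    command += ["--deepspeed", fp.name]  # fp.name is the temp file's path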

Collaborator:

For the DeepSpeed config, you can also store it here and pass an argument like this one:

"--deepspeed tests/configs/deepspeed_zero_3_gaudi1.json"

But that's up to you, using a temporary file is fine too.

For the if/else in the command line based on the 7B or 72B model, I could enter the Qwen2-7B settings in Qwen2_7B.json and create a new Qwen2-72B.json for the 72B model?

Yes, that sounds good!

However, MODELS_TO_TEST_MAPPING has the entry "qwen2": [("Qwen/Qwen2-7B", "Habana/qwen")], so then I'd have to change that as well maybe?

Yep, you would probably need to change it to:

"qwen2": [("Qwen/Qwen2-7B", "Habana/qwen"), ("name_of_72B_checkpoint", "Habana/qwen")]

Then, we will also need to add a rule in this big conditional block to make sure the Qwen-72B is executed only for your test, but I can help with that at the end if you don't see how to do it.
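
Purely as an illustration, such a rule could look something like the sketch below; the function name, arguments, and the way the test is identified are hypothetical, not the actual structure of test_examples.py:

# Hypothetical helper: run the 72B checkpoint only for the DeepSpeed SFT test
# it was added for, and skip it everywhere else. Names are assumptions.
def should_skip(model_name, example_name, deepspeed):
    if "Qwen2-72B" in model_name:
        return not (example_name == "sft" and deepspeed)
    return False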

optimum/habana/trl/trainer/sft_trainer.py (outdated review thread, resolved)
@ssarkar2 (Collaborator Author) commented Jul 30, 2024

Trying with Llama-2-7b:

time python3 sft.py --model_name_or_path /mnt/weka/data/llama_inference/Llama-2-7b-hf/ --dataset_name "philschmid/dolly-15k-oai-style" --streaming False --bf16 True --subset '' --output_dir ./model_qwen --num_train_epochs 1 --per_device_train_batch_size 4 --evaluation_strategy "no" --save_strategy "no" --learning_rate 3e-4 --warmup_ratio 0.03 --lr_scheduler_type "cosine" --max_grad_norm 0.3 --logging_steps 1 --do_train --do_eval --use_habana --use_lazy_mode --throughput_warmup_steps 3 --lora_r 4 --lora_alpha=16 --lora_dropout=0.05 --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" --max_seq_length 512 --adam_epsilon 1e-08 --packing False --num_bucket 8 --max_steps 100

Without bucketing:

{'train_runtime': 1424.0213, 'train_samples_per_second': 0.29, 'train_steps_per_second': 0.072, 'train_loss': 1.4798556953668593, 'epoch': 0.03, 'memory_allocated (GB)': 25.82, 'max_memory_allocated (GB)': 29.05, 'total_memory_available (GB)': 94.62}
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [23:44<00:00, 14.24s/it]
***** train metrics *****
  epoch                       =     0.0281
  max_memory_allocated (GB)   =      29.05
  memory_allocated (GB)       =      25.82
  total_flos                  =  5601725GF
  total_memory_available (GB) =      94.62
  train_loss                  =     1.4799
  train_runtime               = 0:23:44.02
  train_samples_per_second    =       0.29
  train_steps_per_second      =      0.072
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 94/94 [00:42<00:00,  2.21it/s]
***** eval metrics *****
  epoch                       =     0.0281
  eval_loss                   =     1.5024
  eval_runtime                = 0:00:42.53
  eval_samples                =        751
  eval_samples_per_second     =     17.203
  eval_steps_per_second       =      2.153
  max_memory_allocated (GB)   =      30.86
  memory_allocated (GB)       =      25.82
  perplexity                  =     4.4926
  total_memory_available (GB) =      94.62

real    24m51.743s

With bucketing (num_bucket=8):

{'train_runtime': 141.7337, 'train_samples_per_second': 5.268, 'train_steps_per_second': 1.317, 'train_loss': 1.5400031226873399, 'epoch': 0.03, 'memory_allocated (GB)': 31.88, 'max_memory_allocated (GB)': 36.03, 'total_memory_available (GB)': 94.62}
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [02:21<00:00,  1.42s/it]
***** train metrics *****
  epoch                       =     0.0281
  max_memory_allocated (GB)   =      36.03
  memory_allocated (GB)       =      31.88
  total_flos                  =  6437270GF
  total_memory_available (GB) =      94.62
  train_loss                  =       1.54
  train_runtime               = 0:02:21.73
  train_samples_per_second    =      5.268
  train_steps_per_second      =      1.317
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 94/94 [00:24<00:00,  3.88it/s]
***** eval metrics *****
  epoch                       =     0.0281
  eval_loss                   =     1.5041
  eval_runtime                = 0:00:24.26
  eval_samples                =        751
  eval_samples_per_second     =     30.233
  eval_steps_per_second       =      3.784
  max_memory_allocated (GB)   =      42.06
  memory_allocated (GB)       =      31.88
  perplexity                  =     4.5003
  total_memory_available (GB) =      94.62

real    3m11.791s

mounikamandava added a commit to emascarenhas/optimum-habana that referenced this pull request Aug 2, 2024
@libinta libinta removed the run-test Run CI for PRs from external contributors label Aug 3, 2024
@libinta libinta added the run-test Run CI for PRs from external contributors label Sep 24, 2024