sasarkar/qwen finetuning bucketing #1130

Open · wants to merge 12 commits into main

Conversation

@ssarkar2 (Collaborator) commented Jul 9, 2024

What does this PR do?

Add bucketing to Qwen training

earlier version: #1128
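
For context, bucketing here means padding each batch up to one of a small, fixed set of sequence lengths instead of the raw batch maximum, so the HPU graph is compiled for only a handful of shapes. A minimal sketch of the idea (not the exact sft_trainer.py change; the evenly spaced bucket boundaries are an assumption):

# Sketch: pad each batch to the nearest of num_buckets fixed lengths so the
# accelerator sees only a small set of shapes and avoids recompiling for
# every new sequence length. Even bucket spacing is an assumption here.
import math

def make_buckets(max_seq_length, num_buckets):
    step = math.ceil(max_seq_length / num_buckets)
    return [min(step * (i + 1), max_seq_length) for i in range(num_buckets)]

def bucketed_length(batch_max_len, buckets):
    # smallest bucket that fits the batch; fall back to the largest
    for b in buckets:
        if batch_max_len <= b:
            return b
    return buckets[-1]

buckets = make_buckets(max_seq_length=512, num_buckets=8)  # [64, 128, ..., 512]
print(bucketed_length(200, buckets))  # 256 -> this batch would be padded to 256 tokens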

results:

time python3 sft.py  --model_name_or_path /root/sasarkar/Qwen2_7B/ --dataset_name "philschmid/dolly-15k-oai-style"  --streaming False  --bf16 True  --subset ''  --output_dir ./model_qwen  --num_train_epochs 1  --per_device_train_batch_size 4  --evaluation_strategy "no"  --save_strategy "no"  --learning_rate 3e-4  --warmup_ratio  0.03  --lr_scheduler_type "cosine"  --max_grad_norm  0.3  --logging_steps 1  --do_train  --do_eval  --use_habana  --use_lazy_mode  --throughput_warmup_steps 3  --lora_r 4  --lora_alpha=16  --lora_dropout=0.05  --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj"  --max_seq_length 512  --adam_epsilon 1e-08 --packing False --num_bucket 8
{'train_runtime': 1218.7741, 'train_samples_per_second': 12.185, 'train_steps_per_second': 3.046, 'train_loss': 1.5069085590719007, 'epoch': 1.0, 'memory_allocated (GB)': 31.57, 'max_memory_allocated (GB)': 39.44, 'total_memory_available (GB)': 94.62}
***** train metrics *****
  epoch                       =         1.0
  max_memory_allocated (GB)   =       39.44
  memory_allocated (GB)       =       31.57
  total_flos                  = 230347401GF
  total_memory_available (GB) =       94.62
  train_loss                  =      1.5069
  train_runtime               =  0:20:18.77
  train_samples_per_second    =      12.185
  train_steps_per_second      =       3.046
100%|██████████| 94/94 [00:22<00:00,  4.13it/s]
***** eval metrics *****
  epoch                       =        1.0
  eval_loss                   =     1.5511
  eval_runtime                = 0:00:22.83
  eval_samples                =        751
  eval_samples_per_second     =     32.141
  eval_steps_per_second       =      4.023
  max_memory_allocated (GB)   =      75.66
  memory_allocated (GB)       =      31.57
  perplexity                  =     4.7165
  total_memory_available (GB) =      94.62

real    21m31.841s
user    35m22.686s
sys     42m57.211s

Numbers without this branch:

{'train_runtime': 8899.1643, 'train_samples_per_second': 1.611, 'train_steps_per_second': 0.403, 'train_loss': 1.5117962875673847, 'epoch': 1.0, 'memory_allocated (GB)': 27.85, 'max_memory_allocated (GB)': 34.25, 'total_memory_available (GB)': 94.62}
100%|██████████| 3565/3565 [2:28:19<00:00,  2.50s/it]
***** train metrics *****
  epoch                       =         1.0
  max_memory_allocated (GB)   =       34.25
  memory_allocated (GB)       =       27.85
  total_flos                  = 184625734GF
  total_memory_available (GB) =       94.62
  train_loss                  =      1.5118
  train_runtime               =  2:28:19.16
  train_samples_per_second    =       1.611
  train_steps_per_second      =       0.403
100%|██████████| 94/94 [00:49<00:00,  1.89it/s]
***** eval metrics *****
  epoch                       =        1.0
  eval_loss                   =     1.5515
  eval_runtime                = 0:00:49.63
  eval_samples                =        751
  eval_samples_per_second     =     14.713
  eval_steps_per_second       =      1.842
  max_memory_allocated (GB)   =      49.32
  memory_allocated (GB)       =      27.85
  perplexity                  =     4.7184
  total_memory_available (GB) =      94.62

real    150m3.097s
user    308m25.621s
sys     106m33.356s

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@ssarkar2 ssarkar2 requested a review from regisss as a code owner July 9, 2024 20:38
@ssarkar2 ssarkar2 changed the title sasarkar/qwen bucketing sasarkar/qwen finetuning bucketing Jul 9, 2024
@ssarkar2 ssarkar2 mentioned this pull request Jul 9, 2024
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@libinta libinta added the run-test Run CI for PRs from external contributors label Jul 16, 2024
@regisss (Collaborator) commented Jul 22, 2024

@ssarkar2 This will only work for Qwen? Or is it also applicable to other models?

@ssarkar2 (Collaborator Author)

@ssarkar2 This will only work for Qwen? Or is it also applicable to other models?

I haven't tried other models, but it should work for them since the change is in the SFT trainer.

@ssarkar2 (Collaborator Author) commented Jul 23, 2024

8x (8-card) command:

time DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python3 ../gaudi_spawn.py --world_size 8 --use_deepspeed sft.py  --model_name_or_path /root/sasarkar/qwen2_7b/  --dataset_name "philschmid/dolly-15k-oai-style"  --streaming False  --bf16 True  --subset "''"  --output_dir ./model_qwen_70b  --num_train_epochs 1  --per_device_train_batch_size 8 --evaluation_strategy "no"  --save_strategy "no"  --learning_rate 3e-4  --warmup_ratio  0.03  --lr_scheduler_type "cosine"  --max_grad_norm  0.3  --logging_steps 1  --do_train  --do_eval  --use_habana  --use_lazy_mode  --throughput_warmup_steps 3  --lora_r 4  --lora_alpha=16  --lora_dropout=0.05  --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" --max_seq_length 512  --adam_epsilon 1e-08 --packing False --num_bucket 8 --max_steps 100

{'train_runtime': 324.0838, 'train_samples_per_second': 30.12, 'train_steps_per_second': 0.471, 'train_loss': 1.5799240708351134, 'epoch': 0.45, 'memory_allocated (GB)': 39.92, 'max_memory_allocated (GB)': 84.57, 'total_memory_available (GB)': 94.62}
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [05:24<00:00, 3.24s/it]
***** train metrics *****
  epoch                       =     0.4484
  max_memory_allocated (GB)   =      84.57
  memory_allocated (GB)       =      39.92
  total_flos                  = 135047763GF
  total_memory_available (GB) =      94.62
  train_loss                  =     1.5799
  train_runtime               = 0:05:24.08
  train_samples_per_second    =      30.12
  train_steps_per_second      =      0.471

***** eval metrics *****
  epoch                       =     0.4484
  eval_loss                   =     1.5758
  eval_runtime                = 0:00:15.60
  eval_samples                =        751
  eval_samples_per_second     =    193.217
  eval_steps_per_second       =      3.111
  max_memory_allocated (GB)   =      84.57
  memory_allocated (GB)       =      39.92
  perplexity                  =     4.8347
  total_memory_available (GB) =      94.62

time to run: 6m40.188s

Without bucketing:
{'train_runtime': 2761.4812, 'train_samples_per_second': 2.327, 'train_steps_per_second': 0.036, 'train_loss': 1.5813338887691497, 'epoch': 0.45, 'memory_allocated (GB)': 39.09, 'max_memory_allocated (GB)': 83.52, 'total_memory_available (GB)': 94.62}
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [46:01<00:00, 27.61s/it]
***** train metrics *****
  epoch                       =     0.4484
  max_memory_allocated (GB)   =      83.52
  memory_allocated (GB)       =      39.09
  total_flos                  = 102272664GF
  total_memory_available (GB) =      94.62
  train_loss                  =     1.5813
  train_runtime               = 0:46:01.48
  train_samples_per_second    =      2.327
  train_steps_per_second      =      0.036
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:34<00:00, 2.85s/it]
***** eval metrics *****
  epoch                       =     0.4484
  eval_loss                   =     1.5763
  eval_runtime                = 0:00:38.30
  eval_samples                =        751
  eval_samples_per_second     =     23.697
  eval_steps_per_second       =      0.382
  max_memory_allocated (GB)   =      83.52
  memory_allocated (GB)       =      39.09
  perplexity                  =     4.8372
  total_memory_available (GB) =      94.62

Run time: 48 min

@ssarkar2 (Collaborator Author)

TODO: make Qwen2-72B run using LoRA.

The command I was using earlier didn't have --deepspeed; maybe that's why?

python3 ../gaudi_spawn.py --use_deepspeed --world_size 8 sft.py \
    --model_name_or_path "/root//mnt/weka/data/Qwen/Qwen2-72B/" \
    --dataset_path "/root/litang/QWEN/qwen2_finetune_bench_scripts/belle_chat_ramdon_10k.json" \
    --bf16 True \
    --output_dir ./model_lora_qwen \
    --num_train_epochs 5 \
    --per_device_train_batch_size 10 \
    --gradient_accumulation_steps 8  \
    --gradient_checkpointing True \
    --evaluation_strategy "no" \
    --save_strategy "no" \
    --learning_rate 3e-4 \
    --warmup_ratio  0.03 \
    --lr_scheduler_type "cosine" \
    --max_grad_norm  0.3 \
    --logging_steps 1 \
    --do_train \
    --do_eval \
    --use_habana \
    --use_peft True \
    --pipelining_fwd_bwd True \
    --use_lazy_mode \
    --throughput_warmup_steps 3 \
    --lora_r 4 \
    --lora_alpha=16 \
    --lora_dropout=0.05 \
    --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" \
    --max_seq_length 2048 \
    --adam_epsilon 3e-4 --use_flash_attention True --deepspeed /root/litang/QWEN/qwen2_finetune_bench_scripts/ds_config.json 2>&1 | tee log8x.txt

@ssarkar2 (Collaborator Author) commented Jul 24, 2024

time python3 ../gaudi_spawn.py --use_deepspeed --world_size 8  sft.py  --model_name_or_path /mnt/weka/data/Qwen/Qwen2-72B/ --dataset_name "philschmid/dolly-15k-oai-style"  --streaming False  --bf16 True  --subset "''"  --output_dir ./model_qwen  --num_train_epochs 1  --per_device_train_batch_size 4  --evaluation_strategy "no"  --save_strategy "no"  --learning_rate 3e-4  --warmup_ratio  0.03  --lr_scheduler_type "cosine"  --max_grad_norm  0.3  --logging_steps 1  --do_train  --do_eval  --use_habana  --use_lazy_mode  --throughput_warmup_steps 3  --lora_r 4  --lora_alpha=16  --lora_dropout=0.05  --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj"  --max_seq_length 512  --adam_epsilon 1e-08 --packing False  --gradient_checkpointing True --pipelining_fwd_bwd True --deepspeed ds_config.json --max_steps 50
{'train_runtime': 485.8911, 'train_samples_per_second': 4.279, 'train_steps_per_second': 0.134, 'train_loss': 1.2488236403465272, 'epoch': 0.11, 'memory_allocated (GB)': 25.32, 'max_memory_allocated (GB)': 83.81, 'total_memory_available (GB)': 94.62}
100%|██████████| 50/50 [08:05<00:00,  9.72s/it]
***** train metrics *****
  epoch                       =     0.1121
  max_memory_allocated (GB)   =      83.81
  memory_allocated (GB)       =      25.32
  total_flos                  =    52339GF
  total_memory_available (GB) =      94.62
  train_loss                  =     1.2488
  train_runtime               = 0:08:05.89
  train_samples_per_second    =      4.279
  train_steps_per_second      =      0.134
100%|██████████| 12/12 [00:33<00:00,  2.83s/it]
***** eval metrics *****
  epoch                       =     0.1121
  eval_loss                   =     1.2832
  eval_runtime                = 0:00:36.53
  eval_samples                =        751
  eval_samples_per_second     =     21.228
  eval_steps_per_second       =      0.342
  max_memory_allocated (GB)   =      83.81
  memory_allocated (GB)       =      27.64
  perplexity                  =     3.6081
  total_memory_available (GB) =      94.62


time: 14m21.839s
{'train_runtime': 342.6403, 'train_samples_per_second': 6.969, 'train_steps_per_second': 0.218, 'train_loss': 1.2489857983589172, 'epoch': 0.11, 'memory_allocated (GB)': 25.9, 'max_memory_allocated (GB)': 93.88, 'total_memory_available (GB)': 94.62}
100%|██████████| 50/50 [05:42<00:00,  6.85s/it]
***** train metrics *****
  epoch                       =     0.1121
  max_memory_allocated (GB)   =      93.88
  memory_allocated (GB)       =       25.9
  total_flos                  =    66837GF
  total_memory_available (GB) =      94.62
  train_loss                  =      1.249
  train_runtime               = 0:05:42.64
  train_samples_per_second    =      6.969
  train_steps_per_second      =      0.218
100%|██████████| 12/12 [00:33<00:00,  2.80s/it]
***** eval metrics *****
  epoch                       =     0.1121
  eval_loss                   =     1.2833
  eval_runtime                = 0:00:36.79
  eval_samples                =        751
  eval_samples_per_second     =     21.827
  eval_steps_per_second       =      0.351
  max_memory_allocated (GB)   =      93.88
  memory_allocated (GB)       =      28.79
  perplexity                  =     3.6086
  total_memory_available (GB) =      94.62

time: 9m3.384s

Collaborator:

We already have some tests for sft.py in https://github.com/huggingface/optimum-habana/blob/main/tests/test_examples.py. Is it possible to integrate this one there?

Collaborator Author:

Some extra command-line args that I have:

--subset
--max_steps (to keep the test short)
--lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj"
--lr_scheduler_type
--use_peft
--packing False
etc.

Not sure how to fit these into test_examples.py.

Collaborator:

You should be able to add any specific extra arguments in the extra_arguments field of the JSON baseline files; here is an example for Qwen:

"extra_arguments": [

Collaborator:

Let me know if that's good enough, otherwise we'll just keep the current version.

Collaborator Author:

@regisss
OK, I think I can almost make test_examples.py work.

I have a variation in the command line:

if "72" in model_name:
        command += [
            "--max_steps",
            "50",
            "--gradient_checkpointing",
            "True",
            "--pipelining_fwd_bwd",
            "True",
        ] 
    else:
        command += ["--max_steps", "100"]

Also, I have a particular DeepSpeed config file for the Qwen2-72B model, which I create as a tmp file, dump, and use:

if "72" in model_name:
            command += ["--deepspeed", fp.name]

Any suggestions on how these might be incorporated into test_examples.py? For the if/else in the command line based on the 7B or 72B model, I could enter the Qwen2-7B settings in Qwen2_7B.json and create a new Qwen2-72B.json for the 72B model. However, MODELS_TO_TEST_MAPPING has the entry "qwen2": [("Qwen/Qwen2-7B", "Habana/qwen")], so then I'd maybe have to change that as well?

Just checking if you have an easy/obvious solution; otherwise I'll stick with the test file I have now.
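
For reference, the temp-file approach described above could look roughly like the sketch below; the config contents are assumptions, since the actual ds_config.json used for Qwen2-72B isn't shown in this thread:

# Hedged sketch: dump a DeepSpeed config to a temporary file and pass it via
# --deepspeed. The ZeRO stage and other fields here are assumptions.
import json
import tempfile

ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

command = ["python3", "sft.py"]  # placeholder for the command the test builds

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as fp:
    json.dump(ds_config, fp)
    fp.flush()
    command += ["--deepspeed", fp.name]  # fp.name is the temp file's path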

Collaborator:

For the DeepSpeed config, you can also store it here and pass an argument like this one:

"--deepspeed tests/configs/deepspeed_zero_3_gaudi1.json"

But that's up to you, using a temporary file is fine too.

For the if/else in the command line based on the 7B or 72B model, I could enter the Qwen2-7B settings in Qwen2_7B.json and create a new Qwen2-72B.json for the 72B model?

Yes, that sounds good!

However, MODELS_TO_TEST_MAPPING has the entry "qwen2": [("Qwen/Qwen2-7B", "Habana/qwen")], so then I'd have to change that as well maybe?

Yep, you would probably need to change it to:

"qwen2": [("Qwen/Qwen2-7B", "Habana/qwen"), ("name_of_72B_checkpoint", "Habana/qwen")]

Then, we will also need to add a rule in this big conditional block to make sure the Qwen-72B is executed only for your test, but I can help with that at the end if you don't see how to do it.
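
Purely as an illustration, such a rule could look something like the sketch below; the function name, arguments, and the way the test is identified are hypothetical, not the actual structure of test_examples.py:

# Hypothetical helper: run the 72B checkpoint only for the DeepSpeed SFT test
# it was added for, and skip it everywhere else. Names are assumptions.
def should_skip(model_name, example_name, deepspeed):
    if "Qwen2-72B" in model_name:
        return not (example_name == "sft" and deepspeed)
    return False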

optimum/habana/trl/trainer/sft_trainer.py (outdated review thread, resolved)
@ssarkar2 (Collaborator Author) commented Jul 30, 2024

Trying with Llama-2-7b:

time python3 sft.py --model_name_or_path /mnt/weka/data/llama_inference/Llama-2-7b-hf/ --dataset_name "philschmid/dolly-15k-oai-style" --streaming False --bf16 True --subset '' --output_dir ./model_qwen --num_train_epochs 1 --per_device_train_batch_size 4 --evaluation_strategy "no" --save_strategy "no" --learning_rate 3e-4 --warmup_ratio 0.03 --lr_scheduler_type "cosine" --max_grad_norm 0.3 --logging_steps 1 --do_train --do_eval --use_habana --use_lazy_mode --throughput_warmup_steps 3 --lora_r 4 --lora_alpha=16 --lora_dropout=0.05 --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" --max_seq_length 512 --adam_epsilon 1e-08 --packing False --num_bucket 8 --max_steps 100

Without bucketing:

{'train_runtime': 1424.0213, 'train_samples_per_second': 0.29, 'train_steps_per_second': 0.072, 'train_loss': 1.4798556953668593, 'epoch': 0.03, 'memory_allocated (GB)': 25.82, 'max_memory_allocated (GB)': 29.05, 'total_memory_available (GB)': 94.62}
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [23:44<00:00, 14.24s/it]
***** train metrics *****
  epoch                       =     0.0281
  max_memory_allocated (GB)   =      29.05
  memory_allocated (GB)       =      25.82
  total_flos                  =  5601725GF
  total_memory_available (GB) =      94.62
  train_loss                  =     1.4799
  train_runtime               = 0:23:44.02
  train_samples_per_second    =       0.29
  train_steps_per_second      =      0.072
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 94/94 [00:42<00:00,  2.21it/s]
***** eval metrics *****
  epoch                       =     0.0281
  eval_loss                   =     1.5024
  eval_runtime                = 0:00:42.53
  eval_samples                =        751
  eval_samples_per_second     =     17.203
  eval_steps_per_second       =      2.153
  max_memory_allocated (GB)   =      30.86
  memory_allocated (GB)       =      25.82
  perplexity                  =     4.4926
  total_memory_available (GB) =      94.62

real    24m51.743s

With bucketing (num_bucket=8):

{'train_runtime': 141.7337, 'train_samples_per_second': 5.268, 'train_steps_per_second': 1.317, 'train_loss': 1.5400031226873399, 'epoch': 0.03, 'memory_allocated (GB)': 31.88, 'max_memory_allocated (GB)': 36.03, 'total_memory_available (GB)': 94.62}
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [02:21<00:00,  1.42s/it]
***** train metrics *****
  epoch                       =     0.0281
  max_memory_allocated (GB)   =      36.03
  memory_allocated (GB)       =      31.88
  total_flos                  =  6437270GF
  total_memory_available (GB) =      94.62
  train_loss                  =       1.54
  train_runtime               = 0:02:21.73
  train_samples_per_second    =      5.268
  train_steps_per_second      =      1.317
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 94/94 [00:24<00:00,  3.88it/s]
***** eval metrics *****
  epoch                       =     0.0281
  eval_loss                   =     1.5041
  eval_runtime                = 0:00:24.26
  eval_samples                =        751
  eval_samples_per_second     =     30.233
  eval_steps_per_second       =      3.784
  max_memory_allocated (GB)   =      42.06
  memory_allocated (GB)       =      31.88
  perplexity                  =     4.5003
  total_memory_available (GB) =      94.62

real    3m11.791s

mounikamandava added a commit to emascarenhas/optimum-habana that referenced this pull request Aug 2, 2024
@libinta libinta removed the run-test Run CI for PRs from external contributors label Aug 3, 2024
@libinta libinta added the run-test Run CI for PRs from external contributors label Sep 24, 2024