
LlaMA-7B + LoRA OOM on a 16GB V100 #53

Open · zhenqincn opened this issue Aug 30, 2023 · 2 comments

@zhenqincn

Dear authors, following the configuration in this repo, I set both per_device_train_batch_size and per_device_eval_batch_size to 1, but running lomo_lora_trainer.py to train LlaMA-7B on a single 16GB V100 runs out of memory (OOM).

The full configuration is as follows:

```yaml
# model
model_name_or_path: 'openlm-research/open_llama_7b'
# data
dataset_name: 'wic'
refresh: false
data_tag: 'base'
train_on_inputs: false
data_max_length: 1024
# training
# trainer
peft_type: 'lora'
lora_only: false
hf_learning_rate: 0.0005
hf_weight_decay: 0
hf_lr_scheduler_type: 'linear'
hf_warmup: 0.05
tag: 'lora-qv-r2-lomo'
output_dir: 'outputs'
overwrite_output_dir: true
deepspeed: 'config/ds_config_lora.json'
do_train: true
do_eval: true
evaluation_strategy: 'epoch'
per_device_train_batch_size: 1
per_device_eval_batch_size: 1
learning_rate: 0.005
weight_decay: 0
num_train_epochs: 10
lr_scheduler_type: 'linear'
warmup: 0.05
clip_grad_norm: 1.0
#clip_grad_value: 1.0
#clip_loss_value: 5.0
log_level: 'info'
logging_steps: 1
# please set `resume_from_checkpoint` to load checkpoints. check `merge_llama_with_lora.py` first.
#resume_from_checkpoint: 'outputs/wic_7B_lora-qv-r2-lomo/output_lr0.005_bs16_warmup0.05_clipnorm1.0/checkpoint-0/merge_weights'
# please set `save_strategy` (`no`, `epoch`, `steps`) and `save_total_limit` (the max amount of checkpoints) to save checkpoints.
save_strategy: 'no'
save_total_limit: 0
seed: 42
#bf16: true
remove_unused_columns: false
load_best_model_at_end: false
metric_for_best_model: 'acc'
optim: 'sgd'
group_by_length: false
#report_to: 'wandb'
dataloader_pin_memory: false
gradient_checkpointing: true
predict_with_generate: false
lora_r: 2
```

Incidentally, with the same configuration but without LoRA, training LlaMA-7B with LOMO on a 16GB V100 used 15933 MB of GPU memory, which seems inconsistent with the results reported in the paper. Is something wrong with my configuration?

@zhenqincn zhenqincn changed the title from "LlaMA-7B OOM on a 16GB V100" to "LlaMA-7B + LoRA OOM on a 16GB V100" on Aug 30, 2023
@KaiLv69 (Collaborator) commented Aug 31, 2023

Hi, when I measured the memory usage of LOMO + LoRA I used a 3090, which has 24GB of memory; a single card is enough there. On a V100 you may need two cards.
The memory figures in the paper were measured with torch.cuda.memory_reserved(), which reports slightly less than monitoring tools such as nvidia-smi; that difference is normal.
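For reference, a minimal sketch of how such an allocator-side measurement could be taken (the probe point after a training step is an assumption, not the exact instrumentation used for the paper):

```python
import torch

# Hedged sketch: read GPU memory from PyTorch's caching allocator,
# as described for the paper's numbers, instead of from nvidia-smi.
def report_gpu_memory(tag: str, device: int = 0) -> None:
    reserved = torch.cuda.memory_reserved(device)    # bytes held by the caching allocator
    allocated = torch.cuda.memory_allocated(device)  # bytes occupied by live tensors
    peak = torch.cuda.max_memory_reserved(device)    # peak reserved since the last reset
    print(f"[{tag}] reserved={reserved / 2**20:.0f} MiB, "
          f"allocated={allocated / 2**20:.0f} MiB, "
          f"peak_reserved={peak / 2**20:.0f} MiB")

# Example: call after an optimizer step, e.g.
#   report_gpu_memory("after step 10")
```

nvidia-smi additionally counts the CUDA context (typically several hundred MB) and any non-PyTorch allocations, which is why it reads higher than torch.cuda.memory_reserved().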

@zhenqincn (Author)

Thank you very much for the explanation.
Table 2 of your paper reports that training a 7B model with LOMO on a single 3090 uses 13.61GB of memory. It seems that adding LoRA (r=2) should not immediately OOM on a 16GB card; could you confirm whether this behavior is expected?
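For context, a rough back-of-envelope estimate (assuming LLaMA-7B's 32 decoder layers and hidden size 4096, with LoRA applied to the q and v projections as the lora-qv-r2 tag suggests) shows the LoRA weights themselves should be tiny:

```python
# Back-of-envelope: LoRA parameter count for LLaMA-7B with r=2 on q/v projections.
# Assumed shapes: 32 decoder layers, hidden size 4096 (standard for LLaMA-7B).
layers, hidden, r, matrices_per_layer = 32, 4096, 2, 2  # q_proj and v_proj

# Each adapted matrix gains A (r x hidden) and B (hidden x r).
params_per_matrix = 2 * r * hidden
total_params = layers * matrices_per_layer * params_per_matrix

print(f"LoRA parameters: {total_params:,}")                     # 1,048,576
print(f"fp16 weight size: {total_params * 2 / 2**20:.1f} MiB")  # 2.0 MiB
```

At roughly 2 MiB of fp16 adapter weights, the extra ~2GB+ needed by the LOMO + LoRA path presumably comes from somewhere other than the adapter weights themselves (e.g. optimizer state or engine overhead).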
