【Hackathon 7th PPSCI No.12】Add amsgrad support to the Adam and AdamW optimizers -part #68079
base: develop
Conversation
… hack7_amsgrad
Your PR has been submitted successfully. Thank you for contributing to this open-source project!
Does the added amsgrad affect the original execution logic and memory footprint? From the PR code it looks like, regardless of whether amsgrad is enabled, an extra mom2_max buffer is allocated compared with the original code without amsgrad, and some redundant variables are introduced.
inline HOSTDEVICE void operator()(size_t i) const {
  // Merge all memory access together.
  T g = grad_[i];
  T mom1 = moment1_[i];
  T mom2 = moment2_[i];
  T mom2_max = moment2_max_[i];
Does this really have to be recorded?
T mom2_max_;
if (amsgrad_) {
  mom2_max_ = std::max(mom2, mom2_max);
  p -= lr * (mom1 / (sqrt(mom2_max_) + epsilon_ * sqrt(1 - beta2_pow)));
} else {
  mom2_max_ = mom2_max;
  p -= lr * (mom1 / (sqrt(mom2) + epsilon_ * sqrt(1 - beta2_pow)));
}

// Write back to global memory
moment1_out_[i] = mom1;
moment2_out_[i] = mom2;
moment2_max_out_[i] = mom2_max_;
Same as above: if amsgrad is not enabled, please do not add any extra variables or related computation logic; keep the original code path unchanged. (A reference sketch of the two update paths follows.)
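For reference, here is a minimal NumPy sketch of the two update paths (my own illustration of the algorithm, not the kernel code; bias correction is omitted). It shows that only the amsgrad branch has to read, update, and write back mom2_max, while the plain Adam path needs neither the extra buffer nor the max computation:

import numpy as np

def adam_step(p, g, mom1, mom2, mom2_max, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, amsgrad=False):
    # One illustrative Adam/AMSGrad step (bias correction omitted).
    mom1 = beta1 * mom1 + (1 - beta1) * g
    mom2 = beta2 * mom2 + (1 - beta2) * g * g
    if amsgrad:
        mom2_max = np.maximum(mom2, mom2_max)          # extra state, amsgrad only
        p = p - lr * mom1 / (np.sqrt(mom2_max) + eps)
    else:
        p = p - lr * mom1 / (np.sqrt(mom2) + eps)      # mom2_max never touched
    return p, mom1, mom2, mom2_max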
Eigen::Map<Eigen::Array<T, 1, Eigen::Dynamic>> moment2_max_out{
    moment2_max_out_, static_cast<Eigen::Index>(numel)};
Same as above: if amsgrad is not enabled, will there still be redundant computation and memory usage related to mom2_max?
inline HOSTDEVICE void adam_update(size_t i, T g) const {
  // The following code is the same as dense
  T mom1 = moment1_[i];
  T mom2 = moment2_[i];
  T mom2_max = moment2_max_[i];
Same as above.
@@ -14,6 +14,7 @@
#pragma once

#include <stdio.h>
What is this header for? Is there any code that depends on it?
I forgot to remove it after debugging, sorry ~
python/paddle/optimizer/adam.py
@@ -117,6 +117,7 @@ class Adam(Optimizer):
    The default value is False.
multi_precision (bool, optional): Whether to use multi-precision during weight updating. Default is false.
use_multi_tensor (bool, optional): Whether to use multi-tensor strategy to update all parameters at once . Default is false.
amsgrad (bool, optional): Whether to use the AMSGrad of this algorithm. Default is false.
python/paddle/optimizer/adamw.py
@@ -104,6 +104,7 @@ class AdamW(Optimizer):
    different semantics with the original Adam algorithm and may lead to different result.
    The default value is False.
multi_precision (bool, optional): Whether to use multi-precision during weight updating. Default is false.
amsgrad (bool, optional): Whether to use the AMSGrad of this algorithm. Default is false.
Same as above.
… hack7_amsgrad
I considered this before. Mainly, there are currently too many places that touch amsgrad, so I wanted to postpone the optimization work ~ I'll try changing it now ~
Also, once the changes are done, you can run a comparison with ResNet50 or another model using fake data as input, to confirm that GPU memory does not change when amsgrad is off, and that the memory increase when it is on roughly matches the parameter size. A minimal sketch of such a check follows.
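For illustration, a rough sketch of such a fake-data comparison (my own sketch, not part of the PR; it assumes a CUDA device and the `amsgrad` keyword added by this PR, and the model choice and step count are arbitrary). Run it once with AMSGRAD = False and once with True, then compare the printed peaks:

import paddle

AMSGRAD = False  # flip to True on a second run and compare the printed peak

model = paddle.vision.models.resnet50()
opt = paddle.optimizer.Adam(parameters=model.parameters(), amsgrad=AMSGRAD)

for _ in range(3):
    x = paddle.randn([8, 3, 224, 224])  # fake data
    loss = model(x).mean()
    loss.backward()
    opt.step()
    opt.clear_grad()

GB = 1024 ** 3
print(f"amsgrad={AMSGRAD}, "
      f"max_memory_allocated: {paddle.device.cuda.max_memory_allocated() / GB:.2f} GB")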
Sorry to inform you that d157301's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
… hack7_amsgrad
We confirmed internally that the coverage check, especially the CPU operator coverage, may fail to detect the covered code because of how the coverage policy works.
Sorry, I forgot to submit this earlier. Our internal testing found that calls of the form _C_ops.adam(w)_(...) are affected; the newly added parameters may need to be exposed at the end of the argument list, otherwise this is an incompatible upgrade.
I already replied with the concrete fix in PaddlePaddle/PLSC#216 (comment); essentially you add the new arguments by position and use … directly.

Also, one more point: if you want to stay compatible with older versions, you can do something like

ADAM_WITH_AMSGRAD = hasattr(paddle.optimizer.Adam, '_moment2_acc_max_str')
...
def foo():
    ...
    if ADAM_WITH_AMSGRAD:
        _ = _C_ops.adam_(..., None, ..., 'amsgrad', False)
    else:
        _ = _C_ops.adam_(...)
    ...

that is, first check whether the installed Adam is the … version, and then choose the matching call.
Sorry to inform you that 1c05064's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
@megemini Our internal tests show that this affects GPU memory on llama and makes memory usage grow. Could you please use this PaddleNLP tutorial to check whether llama-7b's GPU memory is affected? https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm#22- (set …
… hack7_amsgrad
How exactly was it tested? How much larger? We verified this memory question before: if … is not used, … I searched the implementations inside paddlenlp; they all call … I'll pull the latest version, rebuild, and take a look ~ For the test environment in https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm#22- , I'll see whether it can be reproduced on AIStudio; not sure AIStudio can fit it ... ... 🙏🙏🙏
Besides resnet, could you please also verify with llama2-7b? Just check how GPU memory usage changes before and after.
@HydrogenSulfate I've been fairly busy these past couple of days; I just found time to test the GPU memory usage of llama in PaddleNLP ~

Conclusion first: the build with … compiled on AIStudio …

I tested on AIStudio in a dual-GPU environment, with the following command:

python -u -m paddle.distributed.launch --gpus "0,1" run_finetune.py ./config/llama/sft_argument.json

Config file:

{
"model_name_or_path": "meta-llama/Llama-2-7b",
"dataset_name_or_path": "/home/aistudio/llama/data",
"output_dir": "/home/aistudio/llama/checkpoints",
"per_device_train_batch_size": 1,
"gradient_accumulation_steps": 2,
"per_device_eval_batch_size": 8,
"eval_accumulation_steps":16,
"num_train_epochs": 1,
"learning_rate": 3e-05,
"warmup_steps": 30,
"logging_steps": 1,
"evaluation_strategy": "epoch",
"save_strategy": "epoch",
"src_length": 1024,
"max_length": 2048,
"bf16": true,
"fp16_opt_level": "O2",
"do_train": true,
"do_eval": true,
"disable_tqdm": true,
"load_best_model_at_end": true,
"eval_with_do_generation": false,
"metric_for_best_model": "accuracy",
"recompute": false,
"save_total_limit": 1,
"tensor_parallel_degree": 1,
"pipeline_parallel_degree": 1,
"pipeline_parallel_config": "disable_p2p_cache_shape",
"sharding": "stage2",
"zero_padding": false,
"unified_checkpoint": true,
"use_flash_attention": false
}
Only the model name and storage directories were changed here.

After running the command, the log output is as follows:

aistudio@jupyter-942478-8345123:~/PaddleNLP/llm$ python -u -m paddle.distributed.launch --gpus "0,1" run_finetune.py ./config/llama/sft_argument.json
LAUNCH INFO 2024-11-01 06:45:05,596 ----------- Configuration ----------------------
LAUNCH INFO 2024-11-01 06:45:05,596 auto_cluster_config: 0
LAUNCH INFO 2024-11-01 06:45:05,596 auto_parallel_config: None
LAUNCH INFO 2024-11-01 06:45:05,596 auto_tuner_json: None
LAUNCH INFO 2024-11-01 06:45:05,596 devices: 0,1
LAUNCH INFO 2024-11-01 06:45:05,596 elastic_level: -1
LAUNCH INFO 2024-11-01 06:45:05,596 elastic_timeout: 30
LAUNCH INFO 2024-11-01 06:45:05,596 enable_gpu_log: True
LAUNCH INFO 2024-11-01 06:45:05,596 gloo_port: 6767
LAUNCH INFO 2024-11-01 06:45:05,596 host: None
LAUNCH INFO 2024-11-01 06:45:05,596 ips: None
LAUNCH INFO 2024-11-01 06:45:05,596 job_id: default
LAUNCH INFO 2024-11-01 06:45:05,596 legacy: False
LAUNCH INFO 2024-11-01 06:45:05,596 log_dir: log
LAUNCH INFO 2024-11-01 06:45:05,596 log_level: INFO
LAUNCH INFO 2024-11-01 06:45:05,596 log_overwrite: False
LAUNCH INFO 2024-11-01 06:45:05,596 master: None
LAUNCH INFO 2024-11-01 06:45:05,596 max_restart: 3
LAUNCH INFO 2024-11-01 06:45:05,596 nnodes: 1
LAUNCH INFO 2024-11-01 06:45:05,597 nproc_per_node: None
LAUNCH INFO 2024-11-01 06:45:05,597 rank: -1
LAUNCH INFO 2024-11-01 06:45:05,597 run_mode: collective
LAUNCH INFO 2024-11-01 06:45:05,597 server_num: None
LAUNCH INFO 2024-11-01 06:45:05,597 servers:
LAUNCH INFO 2024-11-01 06:45:05,597 sort_ip: False
LAUNCH INFO 2024-11-01 06:45:05,597 start_port: 6070
LAUNCH INFO 2024-11-01 06:45:05,597 trainer_num: None
LAUNCH INFO 2024-11-01 06:45:05,597 trainers:
LAUNCH INFO 2024-11-01 06:45:05,597 training_script: run_finetune.py
LAUNCH INFO 2024-11-01 06:45:05,597 training_script_args: ['./config/llama/sft_argument.json']
LAUNCH INFO 2024-11-01 06:45:05,597 with_gloo: 1
LAUNCH INFO 2024-11-01 06:45:05,597 --------------------------------------------------
LAUNCH INFO 2024-11-01 06:45:05,597 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2024-11-01 06:45:05,612 Run Pod: vvsghr, replicas 2, status ready
LAUNCH INFO 2024-11-01 06:45:05,720 Watching Pod: vvsghr, replicas 2, status running
/home/aistudio/.local/lib/python3.8/site-packages/_distutils_hack/__init__.py:31: UserWarning: Setuptools is replacing distutils. Support for replacing an already imported distutils is deprecated. In the future, this condition will fail. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
warnings.warn(
[2024-11-01 06:45:09,243] [ INFO] distributed_strategy.py:333 - distributed strategy initialized
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
FLAGS(name='FLAGS_enable_pir_in_executor', current_value=True, default_value=False)
=======================================================================
I1101 06:45:09.245688 57051 tcp_utils.cc:181] The server starts to listen on IP_ANY:50236
I1101 06:45:09.245944 57051 tcp_utils.cc:130] Successfully connected to 10.44.3.96:50236
I1101 06:45:09.333040 57051 process_group_nccl.cc:151] ProcessGroupNCCL pg_timeout_ 1800000
I1101 06:45:09.333137 57051 process_group_nccl.cc:152] ProcessGroupNCCL nccl_comm_init_option_ 0
[2024-11-01 06:45:09,333] [ INFO] topology.py:375 - Total 2 pipe comm group(s) create successfully!
W1101 06:45:09.334441 57051 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 12.0, Runtime API Version: 11.8
W1101 06:45:09.335824 57051 gpu_resources.cc:164] device: 0, cuDNN Version: 8.9.
/home/aistudio/.local/lib/python3.8/site-packages/paddle/distributed/communication/group.py:128: UserWarning: Current global rank 0 is not in group _default_pg10
warnings.warn(
[2024-11-01 06:45:11,712] [ INFO] topology.py:375 - Total 2 data comm group(s) create successfully!
[2024-11-01 06:45:11,712] [ INFO] topology.py:375 - Total 2 model comm group(s) create successfully!
I1101 06:45:11.712544 57051 process_group_nccl.cc:151] ProcessGroupNCCL pg_timeout_ 1800000
I1101 06:45:11.712577 57051 process_group_nccl.cc:152] ProcessGroupNCCL nccl_comm_init_option_ 0
[2024-11-01 06:45:11,712] [ INFO] topology.py:375 - Total 1 sharding comm group(s) create successfully!
I1101 06:45:11.712703 57051 process_group_nccl.cc:151] ProcessGroupNCCL pg_timeout_ 1800000
I1101 06:45:11.712713 57051 process_group_nccl.cc:152] ProcessGroupNCCL nccl_comm_init_option_ 0
[2024-11-01 06:45:11,712] [ INFO] topology.py:295 - HybridParallelInfo: rank_id: 0, mp_degree: 1, sharding_degree: 2, pp_degree: 1, dp_degree: 1, sep_degree: 1, mp_group: [0], sharding_group: [0, 1], pp_group: [0], dp_group: [0], sep:group: None, check/clip group: [0, 1]
[2024-11-01 06:45:11,712] [ INFO] - +==============================================================================+
| |
| DistributedStrategy Overview |
| |
+==============================================================================+
| a_sync=True <-> a_sync_configs |
+------------------------------------------------------------------------------+
| k_steps -1 |
| max_merge_var_num 1 |
| send_queue_size 16 |
| independent_recv_thread False |
| min_send_grad_num_before_recv 1 |
| thread_pool_size 1 |
| send_wait_times 1 |
| runtime_split_send_recv False |
| launch_barrier True |
| heter_worker_device_guard cpu |
| lr_decay_steps 10 |
| use_ps_gpu 0 |
| use_gpu_graph 0 |
+==============================================================================+
| Environment Flags, Communication Flags |
+------------------------------------------------------------------------------+
| mode 1 |
| elastic False |
| auto False |
| sync_nccl_allreduce True |
| nccl_comm_num 1 |
| use_hierarchical_allreduce False |
| hierarchical_allreduce_inter_nranks 1 |
| sync_batch_norm False |
| fuse_all_reduce_ops True |
| fuse_grad_size_in_MB 32 |
| fuse_grad_size_in_TFLOPS 50.0 |
| cudnn_exhaustive_search False |
| conv_workspace_size_limit 512 |
| cudnn_batchnorm_spatial_persistent False |
| fp16_allreduce False |
| last_comm_group_size_MB 1.0 |
| find_unused_parameters False |
| without_graph_optimization True |
| fuse_grad_size_in_num 8 |
| calc_comm_same_stream False |
| asp False |
| fuse_grad_merge False |
| semi_auto False |
| adam_d2sum False |
| auto_search False |
| heter_ccl_mode False |
| is_fl_ps_mode False |
| with_coordinator False |
| split_data True |
| downpour_table_param [] |
| fs_client_param |
+==============================================================================+
| Build Strategy |
+------------------------------------------------------------------------------+
| fuse_elewise_add_act_ops False |
| fuse_bn_act_ops False |
| fuse_relu_depthwise_conv False |
| fuse_broadcast_ops False |
| fuse_all_optimizer_ops False |
| enable_inplace False |
| enable_backward_optimizer_op_deps True |
| cache_runtime_context False |
| fuse_bn_add_act_ops True |
| enable_auto_fusion False |
| enable_addto False |
| allow_cuda_graph_capture False |
| reduce_strategy 0 |
| fuse_gemm_epilogue False |
| debug_graphviz_path |
| fused_attention False |
| fused_feedforward False |
| fuse_dot_product_attention False |
| fuse_resunit False |
+==============================================================================+
[2024-11-01 06:45:11,713] [ INFO] - The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
[2024-11-01 06:45:11,713] [ DEBUG] - ============================================================
[2024-11-01 06:45:11,713] [ DEBUG] - Model Configuration Arguments
[2024-11-01 06:45:11,713] [ DEBUG] - paddle commit id : 6a8f1771145117d0e4b4f156f4a5b8deb0c834a7
[2024-11-01 06:45:11,713] [ DEBUG] - paddlenlp commit id : 81f5ab54525d0a2f2acc9217a74d9e028583fa1b.dirty
[2024-11-01 06:45:11,714] [ DEBUG] - aistudio_repo_id : None
[2024-11-01 06:45:11,714] [ DEBUG] - aistudio_repo_license : Apache License 2.0
[2024-11-01 06:45:11,714] [ DEBUG] - aistudio_repo_private : True
[2024-11-01 06:45:11,714] [ DEBUG] - aistudio_token : None
[2024-11-01 06:45:11,714] [ DEBUG] - attention_probs_dropout_prob : 0.1
[2024-11-01 06:45:11,714] [ DEBUG] - continue_training : True
[2024-11-01 06:45:11,714] [ DEBUG] - flash_mask : False
[2024-11-01 06:45:11,714] [ DEBUG] - from_aistudio : False
[2024-11-01 06:45:11,714] [ DEBUG] - fuse_attention_ffn : None
[2024-11-01 06:45:11,714] [ DEBUG] - fuse_attention_qkv : None
[2024-11-01 06:45:11,714] [ DEBUG] - hidden_dropout_prob : 0.1
[2024-11-01 06:45:11,714] [ DEBUG] - lora : False
[2024-11-01 06:45:11,714] [ DEBUG] - lora_path : None
[2024-11-01 06:45:11,714] [ DEBUG] - lora_plus_scale : 1.0
[2024-11-01 06:45:11,714] [ DEBUG] - lora_rank : 8
[2024-11-01 06:45:11,714] [ DEBUG] - model_name_or_path : meta-llama/Llama-2-7b
[2024-11-01 06:45:11,715] [ DEBUG] - neftune : False
[2024-11-01 06:45:11,715] [ DEBUG] - neftune_noise_alpha : 5.0
[2024-11-01 06:45:11,715] [ DEBUG] - num_prefix_tokens : 128
[2024-11-01 06:45:11,715] [ DEBUG] - pissa : False
[2024-11-01 06:45:11,715] [ DEBUG] - prefix_path : None
[2024-11-01 06:45:11,715] [ DEBUG] - prefix_tuning : False
[2024-11-01 06:45:11,715] [ DEBUG] - rslora : False
[2024-11-01 06:45:11,715] [ DEBUG] - save_to_aistudio : False
[2024-11-01 06:45:11,715] [ DEBUG] - tokenizer_name_or_path : None
[2024-11-01 06:45:11,715] [ DEBUG] - use_fast_layer_norm : False
[2024-11-01 06:45:11,715] [ DEBUG] - use_quick_lora : False
[2024-11-01 06:45:11,715] [ DEBUG] - vera : False
[2024-11-01 06:45:11,715] [ DEBUG] - vera_rank : 8
[2024-11-01 06:45:11,715] [ DEBUG] - weight_blocksize : 64
[2024-11-01 06:45:11,715] [ DEBUG] - weight_double_quant : False
[2024-11-01 06:45:11,715] [ DEBUG] - weight_double_quant_block_size: 256
[2024-11-01 06:45:11,716] [ DEBUG] - weight_quantize_algo : None
[2024-11-01 06:45:11,716] [ DEBUG] -
[2024-11-01 06:45:11,716] [ DEBUG] - ============================================================
[2024-11-01 06:45:11,716] [ DEBUG] - Data Configuration Arguments
[2024-11-01 06:45:11,716] [ DEBUG] - paddle commit id : 6a8f1771145117d0e4b4f156f4a5b8deb0c834a7
[2024-11-01 06:45:11,716] [ DEBUG] - paddlenlp commit id : 81f5ab54525d0a2f2acc9217a74d9e028583fa1b.dirty
[2024-11-01 06:45:11,716] [ DEBUG] - chat_template : None
[2024-11-01 06:45:11,716] [ DEBUG] - dataset_name_or_path : /home/aistudio/llama/data
[2024-11-01 06:45:11,716] [ DEBUG] - eval_with_do_generation : False
[2024-11-01 06:45:11,716] [ DEBUG] - greedy_zero_padding : False
[2024-11-01 06:45:11,716] [ DEBUG] - intokens : None
[2024-11-01 06:45:11,716] [ DEBUG] - lazy : False
[2024-11-01 06:45:11,716] [ DEBUG] - max_length : 2048
[2024-11-01 06:45:11,716] [ DEBUG] - pad_to_max_length : False
[2024-11-01 06:45:11,716] [ DEBUG] - pad_to_multiple_of : None
[2024-11-01 06:45:11,716] [ DEBUG] - save_generation_output : False
[2024-11-01 06:45:11,717] [ DEBUG] - src_length : 1024
[2024-11-01 06:45:11,717] [ DEBUG] - task_name : None
[2024-11-01 06:45:11,717] [ DEBUG] - task_name_or_path : None
[2024-11-01 06:45:11,717] [ DEBUG] - zero_padding : False
[2024-11-01 06:45:11,717] [ DEBUG] -
[2024-11-01 06:45:11,717] [ DEBUG] - ============================================================
[2024-11-01 06:45:11,717] [ DEBUG] - Quant Configuration Arguments
[2024-11-01 06:45:11,717] [ DEBUG] - paddle commit id : 6a8f1771145117d0e4b4f156f4a5b8deb0c834a7
[2024-11-01 06:45:11,717] [ DEBUG] - paddlenlp commit id : 81f5ab54525d0a2f2acc9217a74d9e028583fa1b.dirty
[2024-11-01 06:45:11,717] [ DEBUG] - act_quant_method : avg
[2024-11-01 06:45:11,717] [ DEBUG] - auto_clip : False
[2024-11-01 06:45:11,717] [ DEBUG] - autoclip_step : 8
[2024-11-01 06:45:11,717] [ DEBUG] - awq_step : 8
[2024-11-01 06:45:11,717] [ DEBUG] - cachekv_quant_method : avg_headwise
[2024-11-01 06:45:11,717] [ DEBUG] - do_awq : False
[2024-11-01 06:45:11,717] [ DEBUG] - do_gptq : False
[2024-11-01 06:45:11,718] [ DEBUG] - do_ptq : False
[2024-11-01 06:45:11,718] [ DEBUG] - do_qat : False
[2024-11-01 06:45:11,718] [ DEBUG] - do_quant_debug : False
[2024-11-01 06:45:11,718] [ DEBUG] - fp8_type : ['e4m3', 'e4m3']
[2024-11-01 06:45:11,718] [ DEBUG] - gptq_step : 8
[2024-11-01 06:45:11,718] [ DEBUG] - load_quant_model : False
[2024-11-01 06:45:11,718] [ DEBUG] - ptq_step : 32
[2024-11-01 06:45:11,718] [ DEBUG] - quant_type : a8w8
[2024-11-01 06:45:11,718] [ DEBUG] - search_alpha_max : 0.8
[2024-11-01 06:45:11,718] [ DEBUG] - search_alpha_min : 0.2
[2024-11-01 06:45:11,718] [ DEBUG] - search_scale_max : 5.0
[2024-11-01 06:45:11,718] [ DEBUG] - search_scale_min : 1.0
[2024-11-01 06:45:11,718] [ DEBUG] - shift : False
[2024-11-01 06:45:11,718] [ DEBUG] - shift_all_linears : False
[2024-11-01 06:45:11,718] [ DEBUG] - shift_sampler : ema
[2024-11-01 06:45:11,718] [ DEBUG] - shift_step : 32
[2024-11-01 06:45:11,718] [ DEBUG] - skip_list_names : None
[2024-11-01 06:45:11,719] [ DEBUG] - smooth : False
[2024-11-01 06:45:11,719] [ DEBUG] - smooth_all_linears : False
[2024-11-01 06:45:11,719] [ DEBUG] - smooth_k_piece : 3
[2024-11-01 06:45:11,719] [ DEBUG] - smooth_piecewise_search : False
[2024-11-01 06:45:11,719] [ DEBUG] - smooth_sampler : none
[2024-11-01 06:45:11,719] [ DEBUG] - smooth_search_piece : False
[2024-11-01 06:45:11,719] [ DEBUG] - smooth_step : 32
[2024-11-01 06:45:11,719] [ DEBUG] - test_sample : None
[2024-11-01 06:45:11,719] [ DEBUG] - weight_quant_method : abs_max_channel_wise
[2024-11-01 06:45:11,719] [ DEBUG] -
[2024-11-01 06:45:11,719] [ DEBUG] - ============================================================
[2024-11-01 06:45:11,719] [ DEBUG] - Generation Configuration Arguments
[2024-11-01 06:45:11,719] [ DEBUG] - paddle commit id : 6a8f1771145117d0e4b4f156f4a5b8deb0c834a7
[2024-11-01 06:45:11,719] [ DEBUG] - paddlenlp commit id : 81f5ab54525d0a2f2acc9217a74d9e028583fa1b.dirty
[2024-11-01 06:45:11,719] [ DEBUG] - top_k : 1
[2024-11-01 06:45:11,719] [ DEBUG] - top_p : 1.0
[2024-11-01 06:45:11,720] [ DEBUG] -
[2024-11-01 06:45:11,720] [ WARNING] - Process rank: 0, device: gpu, world_size: 2, distributed training: True, 16-bits training: True
[2024-11-01 06:45:11,721] [ INFO] - We are using <class 'paddlenlp.transformers.llama.configuration.LlamaConfig'> to load 'meta-llama/Llama-2-7b'.
[2024-11-01 06:45:11,721] [ INFO] - Loading configuration file /home/aistudio/.paddlenlp/models/meta-llama/Llama-2-7b/config.json
[2024-11-01 06:45:11,723] [ INFO] - Final model config: LlamaConfig {
"alibi": false,
"architectures": [
"LlamaForCausalLM"
],
"bos_token_id": 1,
"dtype": "bfloat16",
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"immediate_clear_past_key_value": false,
"initializer_range": 0.02,
"intermediate_size": 11008,
"long_sequence_init_args": {},
"long_sequence_strategy_name": null,
"long_sequence_strategy_type": null,
"max_position_embeddings": 4096,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 32,
"pad_token_id": 0,
"paddlenlp_version": "3.0.0b2.post20241101",
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_scaling_factor": 1.0,
"rope_scaling_type": null,
"rope_theta": 10000.0,
"seq_length": 2048,
"tensor_parallel_output": false,
"tie_word_embeddings": false,
"use_fast_layer_norm": false,
"use_flash_attention_for_generation": false,
"use_last_token_for_generation": false,
"use_long_sequence_strategies": false,
"vocab_size": 32000
}
[2024-11-01 06:45:11,723] [ INFO] - We are using <class 'paddlenlp.transformers.llama.modeling.LlamaForCausalLM'> to load 'meta-llama/Llama-2-7b'.
[2024-11-01 06:45:11,724] [ INFO] - Loading weights file from cache at /home/aistudio/.paddlenlp/models/meta-llama/Llama-2-7b/model_state.pdparams
The screenshots are as follows (not reproduced here):

Among them, …

Note: the program did not continue any further; PaddleNLP reported an error. It is possible that PaddleNLP itself has a problem and did not read the input data file correctly:

LAUNCH INFO 2024-11-01 06:38:19,474 Pod failed
LAUNCH ERROR 2024-11-01 06:38:19,474 Container failed !!!
Container rank 1 status failed cmd ['/usr/bin/python', '-u', 'run_finetune.py', './config/llama/sft_argument.json'] code 1 log log/workerlog.1
LAUNCH INFO 2024-11-01 06:38:19,474 ------------------------- ERROR LOG DETAIL -------------------------
ing <class 'paddlenlp.transformers.llama.tokenizer.LlamaTokenizer'> to load 'meta-llama/Llama-2-7b'.
Traceback (most recent call last):
File "/home/aistudio/PaddleNLP/paddlenlp/datasets/dataset.py", line 194, in load_dataset
reader_cls = import_main_class(path_or_read_func)
File "/home/aistudio/PaddleNLP/paddlenlp/datasets/dataset.py", line 95, in import_main_class
module = importlib.import_module(module_path)
File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'paddlenlp.datasets./home/aistudio/llama/data'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/aistudio/PaddleNLP/paddlenlp/datasets/dataset.py", line 116, in load_from_hf
hf_datasets = load_hf_dataset(path, name=name, split=splits, **kwargs)
File "/home/aistudio/PaddleNLP/paddlenlp/datasets/dataset.py", line 56, in load_from_ppnlp
return origin_load_dataset(path, trust_remote_code=True, *args, **kwargs)
File "/home/aistudio/.local/lib/python3.8/site-packages/datasets/load.py", line 2132, in load_dataset
builder_instance = load_dataset_builder(
File "/home/aistudio/.local/lib/python3.8/site-packages/datasets/load.py", line 1853, in load_dataset_builder
dataset_module = dataset_module_factory(
File "/home/aistudio/.local/lib/python3.8/site-packages/datasets/load.py", line 1582, in dataset_module_factory
return LocalDatasetModuleFactoryWithoutScript(
File "/home/aistudio/.local/lib/python3.8/site-packages/datasets/load.py", line 840, in get_module
module_name, default_builder_kwargs = infer_module_for_data_files(
File "/home/aistudio/.local/lib/python3.8/site-packages/datasets/load.py", line 601, in infer_module_for_data_files
raise DataFilesNotFoundError("No (supported) data files found" + (f" in {path}" if path else ""))
datasets.exceptions.DataFilesNotFoundError: No (supported) data files found in /home/aistudio/llama/data
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run_finetune.py", line 735, in <module>
main()
File "run_finetune.py", line 313, in main
train_ds = load_dataset(data_args.dataset_name_or_path, splits=["train"])[0]
File "/home/aistudio/PaddleNLP/paddlenlp/datasets/dataset.py", line 196, in load_dataset
datasets = load_from_hf(
File "/home/aistudio/PaddleNLP/paddlenlp/datasets/dataset.py", line 118, in load_from_hf
raise FileNotFoundError("Couldn't find the dataset script for '" + path + "' on PaddleNLP or HuggingFace")
FileNotFoundError: Couldn't find the dataset script for '/home/aistudio/llama/data' on PaddleNLP or HuggingFace
LAUNCH INFO 2024-11-01 06:38:19,875 Exit code -15
aistudio@jupyter-942478-8345123:~/PaddleNLP/llm$ python -u -m paddle.distributed.launch --gpus "0,1" run_finetune.py ./config/llama/sft_argument.json
^CTraceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 185, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/usr/lib/python3.8/runpy.py", line 111, in _get_module_details
__import__(pkg_name)
File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/__init__.py", line 37, in <module>
from .base import core # noqa: F401
File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/base/__init__.py", line 38, in <module>
from . import ( # noqa: F401
File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/base/backward.py", line 28, in <module>
from . import core, framework, log_helper, unique_name
File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/base/framework.py", line 41, in <module>
from .proto import (
File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/base/proto/data_feed_pb2.py", line 5, in <module>
from google.protobuf.internal import builder as _builder
File "/home/aistudio/.local/lib/python3.8/site-packages/google/protobuf/internal/builder.py", line 18, in <module>
from google.protobuf.internal import python_message
File "/home/aistudio/.local/lib/python3.8/site-packages/google/protobuf/internal/python_message.py", line 39, in <module>
from google.protobuf import text_format
File "/home/aistudio/.local/lib/python3.8/site-packages/google/protobuf/text_format.py", line 33, in <module>
from google.protobuf import unknown_fields
File "<frozen importlib._bootstrap>", line 1042, in _handle_fromlist
KeyboardInterrupt
I have put the data file into the directory as instructed. However, this should not affect our GPU memory analysis ~ So far, PaddleNLP is able to use …

For the problem found in the earlier internal tests: what were the test environment, test procedure, and tested version? By how much did GPU memory grow? How was it determined that it was … ?

Also, if we only test resnet or llama, it doesn't mean much by itself; we can't possibly test every model ~ What is the target of the test? Or rather, what is the point of concern? Let's see whether there are any other places that need to be tested and watched separately.

Thanks! ~~~

Update: I just re-downloaded the test data, https://bj.bcebos.com/paddlenlp/datasets/examples/alpaca_demo.gz . The training flow can now start, but GPU memory is not enough (both the official Paddle build and the … build):

[2024-11-01 07:51:51,706] [ INFO] - We are using <class 'paddlenlp.transformers.llama.modeling.LlamaForCausalLM'> to load 'meta-llama/Llama-2-7b'.
[2024-11-01 07:51:51,707] [ INFO] - Loading weights file from cache at /home/aistudio/.paddlenlp/models/meta-llama/Llama-2-7b/model_state.pdparams
[2024-11-01 07:53:42,486] [ INFO] - Loaded weights file from disk, setting weights to model.
[2024-11-01 07:55:05,265] [ INFO] - All model checkpoint weights were used when initializing LlamaForCausalLM.
[2024-11-01 07:55:05,265] [ INFO] - All the weights of LlamaForCausalLM were initialized from the model checkpoint at meta-llama/Llama-2-7b.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[2024-11-01 07:55:05,268] [ INFO] - Loading configuration file /home/aistudio/.paddlenlp/models/meta-llama/Llama-2-7b/generation_config.json
[2024-11-01 07:55:06,067] [ INFO] - We are using <class 'paddlenlp.transformers.llama.tokenizer.LlamaTokenizer'> to load 'meta-llama/Llama-2-7b'.
[2024-11-01 07:55:07,259] [ INFO] - The global seed is set to 42, local seed is set to 44 and random seed is set to 42.
[2024-11-01 07:55:07,521] [ INFO] - Using half precision
[2024-11-01 07:55:07,522] [ DEBUG] - ============================================================
[2024-11-01 07:55:07,522] [ DEBUG] - Training Configuration Arguments
[2024-11-01 07:55:07,523] [ DEBUG] - paddle commit id : cead7f59d4f01bc5aff7a78206ca393f1ace553b
[2024-11-01 07:55:07,523] [ DEBUG] - paddlenlp commit id : 81f5ab54525d0a2f2acc9217a74d9e028583fa1b.dirty
[2024-11-01 07:55:07,523] [ DEBUG] - _no_sync_in_gradient_accumulation: True
[2024-11-01 07:55:07,523] [ DEBUG] - adam_beta1 : 0.9
[2024-11-01 07:55:07,523] [ DEBUG] - adam_beta2 : 0.999
[2024-11-01 07:55:07,523] [ DEBUG] - adam_epsilon : 1e-08
[2024-11-01 07:55:07,523] [ DEBUG] - amp_custom_black_list : None
[2024-11-01 07:55:07,523] [ DEBUG] - amp_custom_white_list : None
[2024-11-01 07:55:07,523] [ DEBUG] - amp_master_grad : False
[2024-11-01 07:55:07,523] [ DEBUG] - auto_parallel_resume_form_hybrid_parallel: False
[2024-11-01 07:55:07,523] [ DEBUG] - autotuner_benchmark : False
[2024-11-01 07:55:07,523] [ DEBUG] - benchmark : False
[2024-11-01 07:55:07,523] [ DEBUG] - bf16 : True
[2024-11-01 07:55:07,523] [ DEBUG] - bf16_full_eval : False
[2024-11-01 07:55:07,524] [ DEBUG] - context_parallel_degree : 1
[2024-11-01 07:55:07,524] [ DEBUG] - current_device : gpu:0
[2024-11-01 07:55:07,524] [ DEBUG] - data_parallel_config :
[2024-11-01 07:55:07,524] [ DEBUG] - data_parallel_degree : 1
[2024-11-01 07:55:07,524] [ DEBUG] - data_parallel_rank : 0
[2024-11-01 07:55:07,524] [ DEBUG] - dataloader_drop_last : False
[2024-11-01 07:55:07,524] [ DEBUG] - dataloader_num_workers : 0
[2024-11-01 07:55:07,524] [ DEBUG] - dataset_rank : 0
[2024-11-01 07:55:07,524] [ DEBUG] - dataset_world_size : 2
[2024-11-01 07:55:07,524] [ DEBUG] - ddp_find_unused_parameters : None
[2024-11-01 07:55:07,524] [ DEBUG] - decay_steps : 0
[2024-11-01 07:55:07,524] [ DEBUG] - device : gpu
[2024-11-01 07:55:07,524] [ DEBUG] - disable_tqdm : True
[2024-11-01 07:55:07,524] [ DEBUG] - distributed_dataloader : False
[2024-11-01 07:55:07,524] [ DEBUG] - do_eval : True
[2024-11-01 07:55:07,525] [ DEBUG] - do_export : False
[2024-11-01 07:55:07,525] [ DEBUG] - do_predict : False
[2024-11-01 07:55:07,525] [ DEBUG] - do_train : True
[2024-11-01 07:55:07,525] [ DEBUG] - enable_auto_parallel : False
[2024-11-01 07:55:07,525] [ DEBUG] - eval_accumulation_steps : 16
[2024-11-01 07:55:07,525] [ DEBUG] - eval_batch_size : 8
[2024-11-01 07:55:07,525] [ DEBUG] - eval_steps : None
[2024-11-01 07:55:07,525] [ DEBUG] - evaluation_strategy : IntervalStrategy.EPOCH
[2024-11-01 07:55:07,525] [ DEBUG] - flatten_param_grads : False
[2024-11-01 07:55:07,525] [ DEBUG] - force_reshard_pp : False
[2024-11-01 07:55:07,525] [ DEBUG] - fp16 : False
[2024-11-01 07:55:07,525] [ DEBUG] - fp16_full_eval : False
[2024-11-01 07:55:07,525] [ DEBUG] - fp16_opt_level : O2
[2024-11-01 07:55:07,525] [ DEBUG] - fuse_sequence_parallel_allreduce: False
[2024-11-01 07:55:07,525] [ DEBUG] - gradient_accumulation_steps : 2
[2024-11-01 07:55:07,525] [ DEBUG] - greater_is_better : True
[2024-11-01 07:55:07,526] [ DEBUG] - hybrid_parallel_topo_order : pp_first
[2024-11-01 07:55:07,526] [ DEBUG] - ignore_data_skip : False
[2024-11-01 07:55:07,526] [ DEBUG] - ignore_load_lr_and_optim : False
[2024-11-01 07:55:07,526] [ DEBUG] - ignore_save_lr_and_optim : False
[2024-11-01 07:55:07,526] [ DEBUG] - label_names : None
[2024-11-01 07:55:07,526] [ DEBUG] - lazy_data_processing : True
[2024-11-01 07:55:07,526] [ DEBUG] - learning_rate : 3e-05
[2024-11-01 07:55:07,526] [ DEBUG] - load_best_model_at_end : True
[2024-11-01 07:55:07,526] [ DEBUG] - load_sharded_model : False
[2024-11-01 07:55:07,526] [ DEBUG] - local_process_index : 0
[2024-11-01 07:55:07,526] [ DEBUG] - local_rank : 0
[2024-11-01 07:55:07,526] [ DEBUG] - log_level : -1
[2024-11-01 07:55:07,526] [ DEBUG] - log_level_replica : -1
[2024-11-01 07:55:07,526] [ DEBUG] - log_on_each_node : True
[2024-11-01 07:55:07,526] [ DEBUG] - logging_dir : /home/aistudio/llama/checkpoints/runs/Nov01_07-51-48_jupyter-942478-8345123
[2024-11-01 07:55:07,526] [ DEBUG] - logging_first_step : False
[2024-11-01 07:55:07,527] [ DEBUG] - logging_steps : 1
[2024-11-01 07:55:07,527] [ DEBUG] - logging_strategy : IntervalStrategy.STEPS
[2024-11-01 07:55:07,527] [ DEBUG] - logical_process_index : 0
[2024-11-01 07:55:07,527] [ DEBUG] - lr_end : 1e-07
[2024-11-01 07:55:07,527] [ DEBUG] - lr_scheduler_type : SchedulerType.LINEAR
[2024-11-01 07:55:07,527] [ DEBUG] - max_evaluate_steps : -1
[2024-11-01 07:55:07,527] [ DEBUG] - max_grad_norm : 1.0
[2024-11-01 07:55:07,527] [ DEBUG] - max_steps : -1
[2024-11-01 07:55:07,527] [ DEBUG] - metric_for_best_model : accuracy
[2024-11-01 07:55:07,527] [ DEBUG] - minimum_eval_times : None
[2024-11-01 07:55:07,527] [ DEBUG] - no_cuda : False
[2024-11-01 07:55:07,527] [ DEBUG] - no_recompute_layers : None
[2024-11-01 07:55:07,527] [ DEBUG] - num_cycles : 0.5
[2024-11-01 07:55:07,527] [ DEBUG] - num_train_epochs : 1.0
[2024-11-01 07:55:07,527] [ DEBUG] - optim : OptimizerNames.ADAMW
[2024-11-01 07:55:07,527] [ DEBUG] - optimizer_name_suffix : shard00
[2024-11-01 07:55:07,528] [ DEBUG] - output_dir : /home/aistudio/llama/checkpoints
[2024-11-01 07:55:07,528] [ DEBUG] - output_signal_dir : /home/aistudio/llama/checkpoints
[2024-11-01 07:55:07,528] [ DEBUG] - overwrite_output_dir : False
[2024-11-01 07:55:07,528] [ DEBUG] - past_index : -1
[2024-11-01 07:55:07,528] [ DEBUG] - per_device_eval_batch_size : 8
[2024-11-01 07:55:07,528] [ DEBUG] - per_device_train_batch_size : 1
[2024-11-01 07:55:07,528] [ DEBUG] - pipeline_parallel_config : disable_p2p_cache_shape
[2024-11-01 07:55:07,528] [ DEBUG] - pipeline_parallel_degree : 1
[2024-11-01 07:55:07,528] [ DEBUG] - pipeline_parallel_rank : 0
[2024-11-01 07:55:07,528] [ DEBUG] - power : 1.0
[2024-11-01 07:55:07,528] [ DEBUG] - pp_recompute_interval : 1
[2024-11-01 07:55:07,528] [ DEBUG] - prediction_loss_only : False
[2024-11-01 07:55:07,528] [ DEBUG] - process_index : 0
[2024-11-01 07:55:07,528] [ DEBUG] - recompute : False
[2024-11-01 07:55:07,528] [ DEBUG] - recompute_granularity : full
[2024-11-01 07:55:07,528] [ DEBUG] - recompute_use_reentrant : False
[2024-11-01 07:55:07,529] [ DEBUG] - release_grads : False
[2024-11-01 07:55:07,529] [ DEBUG] - remove_unused_columns : True
[2024-11-01 07:55:07,529] [ DEBUG] - report_to : ['visualdl']
[2024-11-01 07:55:07,529] [ DEBUG] - resume_from_checkpoint : None
[2024-11-01 07:55:07,529] [ DEBUG] - run_name : /home/aistudio/llama/checkpoints
[2024-11-01 07:55:07,529] [ DEBUG] - save_on_each_node : False
[2024-11-01 07:55:07,529] [ DEBUG] - save_sharded_model : False
[2024-11-01 07:55:07,529] [ DEBUG] - save_steps : 500
[2024-11-01 07:55:07,529] [ DEBUG] - save_strategy : IntervalStrategy.EPOCH
[2024-11-01 07:55:07,529] [ DEBUG] - save_total_limit : 1
[2024-11-01 07:55:07,529] [ DEBUG] - scale_loss : 32768
[2024-11-01 07:55:07,529] [ DEBUG] - seed : 42
[2024-11-01 07:55:07,529] [ DEBUG] - sep_parallel_degree : 1
[2024-11-01 07:55:07,529] [ DEBUG] - sequence_parallel : False
[2024-11-01 07:55:07,529] [ DEBUG] - sequence_parallel_config :
[2024-11-01 07:55:07,529] [ DEBUG] - sharding : [<ShardingOption.SHARD_GRAD_OP: 'stage2'>]
[2024-11-01 07:55:07,530] [ DEBUG] - sharding_comm_buffer_size_MB : -1
[2024-11-01 07:55:07,530] [ DEBUG] - sharding_degree : -1
[2024-11-01 07:55:07,530] [ DEBUG] - sharding_parallel_config :
[2024-11-01 07:55:07,530] [ DEBUG] - sharding_parallel_degree : 2
[2024-11-01 07:55:07,530] [ DEBUG] - sharding_parallel_rank : 0
[2024-11-01 07:55:07,530] [ DEBUG] - should_load_dataset : True
[2024-11-01 07:55:07,530] [ DEBUG] - should_load_sharding_stage1_model: False
[2024-11-01 07:55:07,530] [ DEBUG] - should_log : True
[2024-11-01 07:55:07,530] [ DEBUG] - should_save : True
[2024-11-01 07:55:07,530] [ DEBUG] - should_save_model_state : True
[2024-11-01 07:55:07,530] [ DEBUG] - should_save_sharding_stage1_model: False
[2024-11-01 07:55:07,530] [ DEBUG] - skip_data_intervals : None
[2024-11-01 07:55:07,530] [ DEBUG] - skip_memory_metrics : True
[2024-11-01 07:55:07,530] [ DEBUG] - skip_profile_timer : True
[2024-11-01 07:55:07,530] [ DEBUG] - tensor_parallel_config :
[2024-11-01 07:55:07,530] [ DEBUG] - tensor_parallel_degree : 1
[2024-11-01 07:55:07,530] [ DEBUG] - tensor_parallel_output : False
[2024-11-01 07:55:07,531] [ DEBUG] - tensor_parallel_rank : 0
[2024-11-01 07:55:07,531] [ DEBUG] - to_static : False
[2024-11-01 07:55:07,531] [ DEBUG] - train_batch_size : 1
[2024-11-01 07:55:07,531] [ DEBUG] - unified_checkpoint : True
[2024-11-01 07:55:07,531] [ DEBUG] - unified_checkpoint_config : ['']
[2024-11-01 07:55:07,531] [ DEBUG] - use_async_save : False
[2024-11-01 07:55:07,531] [ DEBUG] - use_expert_parallel : False
[2024-11-01 07:55:07,531] [ DEBUG] - use_flash_attention : False
[2024-11-01 07:55:07,531] [ DEBUG] - use_fused_dropout_add : False
[2024-11-01 07:55:07,531] [ DEBUG] - use_fused_linear : False
[2024-11-01 07:55:07,531] [ DEBUG] - use_fused_rms_norm : False
[2024-11-01 07:55:07,531] [ DEBUG] - use_fused_rope : False
[2024-11-01 07:55:07,531] [ DEBUG] - use_hybrid_parallel : True
[2024-11-01 07:55:07,531] [ DEBUG] - virtual_pp_degree : 1
[2024-11-01 07:55:07,531] [ DEBUG] - wandb_api_key : None
[2024-11-01 07:55:07,531] [ DEBUG] - warmup_ratio : 0.0
[2024-11-01 07:55:07,532] [ DEBUG] - warmup_steps : 30
[2024-11-01 07:55:07,532] [ DEBUG] - weight_decay : 0.0
[2024-11-01 07:55:07,532] [ DEBUG] - weight_name_suffix :
[2024-11-01 07:55:07,532] [ DEBUG] - world_size : 2
[2024-11-01 07:55:07,532] [ DEBUG] -
[2024-11-01 07:55:07,533] [ INFO] - Starting training from resume_from_checkpoint : None
/home/aistudio/.local/lib/python3.8/site-packages/paddle/distributed/communication/group.py:128: UserWarning: Current global rank 0 is not in group _default_pg12
warnings.warn(
WARNING:root:While using ClipGradByGlobalNorm in GroupShardedOptimizerStage2, the grad clip of original optimizer will be changed.
Traceback (most recent call last):
File "run_finetune.py", line 735, in <module>
main()
File "run_finetune.py", line 575, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/aistudio/PaddleNLP/paddlenlp/trainer/trainer.py", line 798, in train
model = self._wrap_model(self.model_wrapped)
File "/home/aistudio/PaddleNLP/paddlenlp/trainer/trainer.py", line 2053, in _wrap_model
model, optimizer, _ = group_sharded_parallel(
File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/distributed/sharding/group_sharded.py", line 156, in group_sharded_parallel
optimizer = GroupShardedOptimizerStage2(
File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/distributed/fleet/meta_parallel/sharding/group_sharded_optimizer_stage2.py", line 240, in __init__
self._update_opt_status()
File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/distributed/fleet/meta_parallel/sharding/group_sharded_optimizer_stage2.py", line 343, in _update_opt_status
self._integration_params()
File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/distributed/fleet/meta_parallel/sharding/group_sharded_optimizer_stage2.py", line 463, in _integration_params
self._generate_master_params(trainable_params)
File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/distributed/fleet/meta_parallel/sharding/group_sharded_optimizer_stage2.py", line 336, in _generate_master_params
master_tensor = paddle.cast(param, Type.fp32.value)
File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/tensor/manipulation.py", line 237, in cast
return _C_ops.cast(x, dtype)
MemoryError:
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 paddle::pybind::eager_api_cast(_object*, _object*, _object*)
1 cast_ad_func(paddle::Tensor const&, phi::DataType)
2 paddle::experimental::cast(paddle::Tensor const&, phi::DataType)
3 void phi::CastKernel<phi::dtype::bfloat16, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DataType, phi::DenseTensor*)
4 void phi::CastCUDAKernelImpl<phi::dtype::bfloat16, float>(phi::GPUContext const&, phi::DenseTensor const&, phi::DataType, phi::DenseTensor*)
5 float* phi::DeviceContext::Alloc<float>(phi::TensorBase*, unsigned long, bool) const
6 phi::DeviceContext::Impl::Alloc(phi::TensorBase*, phi::Place const&, phi::DataType, unsigned long, bool, bool) const
7 phi::DenseTensor::AllocateFrom(phi::Allocator*, phi::DataType, unsigned long, bool)
8 paddle::memory::allocation::Allocator::Allocate(unsigned long)
9 paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
10 paddle::memory::allocation::Allocator::Allocate(unsigned long)
11 paddle::memory::allocation::Allocator::Allocate(unsigned long)
12 paddle::memory::allocation::Allocator::Allocate(unsigned long)
13 paddle::memory::allocation::Allocator::Allocate(unsigned long)
14 paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
15 std::string phi::enforce::GetCompleteTraceBackString<std::string >(std::string&&, char const*, int)
16 common::enforce::GetCurrentTraceBackString[abi:cxx11](bool)
----------------------
Error Message Summary:
----------------------
ResourceExhaustedError:
Out of memory error on GPU 0. Cannot allocate 64.000000MB memory on GPU 0, 15.713623GB memory has been allocated and available memory is only 60.125000MB.
Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model.
(at ../paddle/phi/core/memory/allocation/cuda_allocator.cc:84)
LAUNCH INFO 2024-11-01 07:55:20,920 Pod failed
LAUNCH ERROR 2024-11-01 07:55:20,921 Container failed !!!
Container rank 0 status failed cmd ['/usr/bin/python', '-u', 'run_finetune.py', './config/llama/sft_argument.json'] code 1 log log/workerlog.0
LAUNCH INFO 2024-11-01 07:55:20,921 ------------------------- ERROR LOG DETAIL -------------------------
stributed/fleet/meta_parallel/sharding/group_sharded_optimizer_stage2.py", line 240, in __init__
self._update_opt_status()
File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/distributed/fleet/meta_parallel/sharding/group_sharded_optimizer_stage2.py", line 343, in _update_opt_status
self._integration_params()
File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/distributed/fleet/meta_parallel/sharding/group_sharded_optimizer_stage2.py", line 463, in _integration_params
self._generate_master_params(trainable_params)
File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/distributed/fleet/meta_parallel/sharding/group_sharded_optimizer_stage2.py", line 336, in _generate_master_params
master_tensor = paddle.cast(param, Type.fp32.value)
File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/tensor/manipulation.py", line 237, in cast
return _C_ops.cast(x, dtype)
MemoryError:
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 paddle::pybind::eager_api_cast(_object*, _object*, _object*)
1 cast_ad_func(paddle::Tensor const&, phi::DataType)
2 paddle::experimental::cast(paddle::Tensor const&, phi::DataType)
3 void phi::CastKernel<phi::dtype::bfloat16, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DataType, phi::DenseTensor*)
4 void phi::CastCUDAKernelImpl<phi::dtype::bfloat16, float>(phi::GPUContext const&, phi::DenseTensor const&, phi::DataType, phi::DenseTensor*)
5 float* phi::DeviceContext::Alloc<float>(phi::TensorBase*, unsigned long, bool) const
6 phi::DeviceContext::Impl::Alloc(phi::TensorBase*, phi::Place const&, phi::DataType, unsigned long, bool, bool) const
7 phi::DenseTensor::AllocateFrom(phi::Allocator*, phi::DataType, unsigned long, bool)
8 paddle::memory::allocation::Allocator::Allocate(unsigned long)
9 paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
10 paddle::memory::allocation::Allocator::Allocate(unsigned long)
11 paddle::memory::allocation::Allocator::Allocate(unsigned long)
12 paddle::memory::allocation::Allocator::Allocate(unsigned long)
13 paddle::memory::allocation::Allocator::Allocate(unsigned long)
14 paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
15 std::string phi::enforce::GetCompleteTraceBackString<std::string >(std::string&&, char const*, int)
16 common::enforce::GetCurrentTraceBackString[abi:cxx11](bool)
----------------------
Error Message Summary:
----------------------
ResourceExhaustedError:
Out of memory error on GPU 0. Cannot allocate 64.000000MB memory on GPU 0, 15.713623GB memory has been allocated and available memory is only 60.125000MB.
Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model.
(at ../paddle/phi/core/memory/allocation/cuda_allocator.cc:84)
LAUNCH INFO 2024-11-01 07:55:21,936 Exit code 1
aistudio@jupyter-942478-8345123:~/PaddleNLP/llm$
This is the log from the … build.
@megemini, thanks for the test results. I'll forward them to the relevant developers tomorrow and have them take a look.
Sorry to inform you that 6544a48's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
We tested two setups: stage2 alone and stage2 + resumed training. With stage2 alone, GPU memory is fine; the problem is that stage2 + resumed training runs into OOM. So, assuming the number of training steps is S, could you try the following: after the first checkpoint is saved partway through S, stop the program, then re-run training and load that checkpoint to continue. At that point the GPU memory usage should turn out larger than without this PR. Also, the numbers from nvidia-smi are not accurate; you can print GPU memory after each step's optimizer update (right after optimizer.step()) like this:

def print_memory_state(msg=''):
    """ print_memory_state """
    import time
    import datetime
    timestamp = time.time()
    dt_object = datetime.datetime.fromtimestamp(timestamp)
    GB = 1024.0 * 1024.0 * 1024.0
    memory_allocated = paddle.device.cuda.memory_allocated() / GB
    memory_reserved = paddle.device.cuda.memory_reserved() / GB
    max_memory_allocated = paddle.device.cuda.max_memory_allocated() / GB
    max_memory_reserved = paddle.device.cuda.max_memory_reserved() / GB
    print(f'{dt_object}, {msg}, '
          f'memory_allocated: {memory_allocated:.02f}GB, '
          f'memory_reserved: {memory_reserved:.02f}GB, '
          f'max_memory_allocated: {max_memory_allocated:.02f}GB, '
          f'max_memory_reserved: {max_memory_reserved:.02f}GB')
Then the problem should still be in the distributed part ~ The process includes: …

In other words, loading the original model and training is fine, but: …

… uses extra GPU memory ~ It feels like … I'll also test the GPU memory usage of distributed training on my side ~ Thanks ~
@HydrogenSulfate As mentioned before, the largest environment on AIStudio is dual-GPU 16 GB. I just tried it again and it doesn't work; it OOMs (the official Paddle build does too), and the batch size in sft_argument.json is already set to … Is there anything else we can optimize? Or should we switch to a different model? Otherwise I can't make progress on my side ~

Also, I tested the GPU memory usage of distributed training on a small custom model; the conclusions are given below.
The test code is as follows (based on the doc https://github.com/PaddlePaddle/community/blob/master/pfcc/paddle-code-reading/auto_parallel/paddle_distributed_primer.md#22412-%E5%8A%A8%E6%89%8B-group-sharded%E5%B9%B6%E8%A1%8C%E7%A4%BA%E4%BE%8B%E4%BB%A3%E7%A0%81 , where … ):

# -*- coding: UTF-8 -*-
# 2.2.4.1.2 Hands-on: group sharded parallel example code
import os
import numpy as np
import paddle
# Import the dependencies required for distributed training
from paddle.distributed import fleet, get_rank
from paddle.distributed.sharding import group_sharded_parallel
# Import data loading and saving interfaces
from paddle.io import Dataset, DistributedBatchSampler, DataLoader
base_lr = 0.1  # learning rate
momentum_rate = 0.9  # momentum
l2_decay = 1e-4  # weight decay
epoch = 5  # number of training epochs
batch_num = 100  # number of batches per epoch
batch_size = 32  # training batch size
class_dim = 10
USE_CKPT = False
# Define the data reader
class RandomDataset(Dataset):
def __init__(self, num_samples):
self.num_samples = num_samples
def __getitem__(self, idx):
image = np.random.random([256]).astype('float32')
label = np.random.randint(0, class_dim - 1, (1, )).astype('int64')
return image, label
def __len__(self):
return self.num_samples
# Set up the optimizer
def optimizer_setting(parameter_list=None):
optimizer = paddle.optimizer.AdamW(
learning_rate=base_lr,
weight_decay=l2_decay,
parameters=parameter_list)
return optimizer
def print_memory_state(msg=''):
""" print_memory_state """
import time
import datetime
timestamp = time.time()
dt_object = datetime.datetime.fromtimestamp(timestamp)
GB = 1024.0 * 1024.0 * 1024.0
memory_allocated = paddle.device.cuda.memory_allocated() / GB
memory_reserved = paddle.device.cuda.memory_reserved() / GB
max_memory_allocated = paddle.device.cuda.max_memory_allocated() / GB
max_memory_reserved = paddle.device.cuda.max_memory_reserved() / GB
print(f'{dt_object}, {msg}, '
f'memory_allocated: {memory_allocated:.02f}GB, '
f'memory_reserved: {memory_reserved:.02f}GB, '
f'max_memory_allocated: {max_memory_allocated:.02f}GB, '
f'max_memory_reserved: {max_memory_reserved:.02f}GB')
# Model network
class SimpleNet(paddle.nn.Layer):
def __init__(self, input_size, inner_size, output_size):
super().__init__()
self.linear1 = paddle.nn.Linear(input_size, inner_size)
self.linear2 = paddle.nn.Linear(inner_size, input_size)
self.linear3 = paddle.nn.Linear(input_size, output_size)
self.relu = paddle.nn.ReLU()
def forward(self, x):
x = self.linear1(x)
x = self.linear2(x)
x = self.linear3(x)
x = self.relu(x)
return x
# Define the training function
def train_model():
# Initialize the Fleet environment
fleet.init(is_collective=True)
group = paddle.distributed.new_group([0, 1])
model = SimpleNet(input_size=256, inner_size=102400, output_size=class_dim)
optimizer = optimizer_setting(parameter_list=model.parameters())
# wrap GroupSharded model, optimizer and scaler. level1='os', level2='os_g', level3='p_g_os'
# model, optimizer, scaler = group_sharded_parallel(model, optimizer, level="p_g_os", group=group)
model, optimizer, scaler = group_sharded_parallel(model, optimizer, level="os_g", group=group)
dataset = RandomDataset(batch_num * batch_size)
# Set up the distributed batch sampler for data-parallel training
sampler = DistributedBatchSampler(dataset, rank=get_rank(),
batch_size=batch_size,shuffle=False, drop_last=True)
train_loader = DataLoader(dataset,
batch_sampler=sampler,
num_workers=1)
if USE_CKPT:
model_dict = paddle.load("checkpoints/model.pdparams")
model.set_state_dict(model_dict)
for eop in range(epoch):
model.train()
for batch_id, data in enumerate(train_loader()):
img, label = data
label.stop_gradient = True
out = model(img)
loss = paddle.nn.functional.cross_entropy(input=out, label=label)
avg_loss = paddle.mean(x=loss)
acc_top1 = paddle.metric.accuracy(input=out, label=label, k=1)
acc_top5 = paddle.metric.accuracy(input=out, label=label, k=5)
avg_loss.backward()
optimizer.step()
print_memory_state()
model.clear_gradients()
if batch_id % 5 == 0:
print("[Epoch %d, batch %d] loss: %.5f, acc1: %.5f, acc5: %.5f" % (eop, batch_id, avg_loss, acc_top1, acc_top5))
# Save the Layer parameters
paddle.save(model.state_dict(), "checkpoints/model.pdparams")
# Start training
if __name__ == '__main__':
train_model()
print('>>> USE_CKPT:', USE_CKPT)
print('>>> commit:', paddle.version.commit)
The different settings were tested by modifying … . The run command is:

> python -m paddle.distributed.launch --gpus=0,1 --log_dir logs test_0.py

Output from the last epoch of each run:
Official Paddle build:
[Epoch 4, batch 45] loss: 2.30259, acc1: 0.15625, acc5: 0.59375
2024-11-05 13:14:10.537434, , memory_allocated: 0.52GB, memory_reserved: 0.68GB, max_memory_allocated: 0.67GB, max_memory_reserved: 0.68GB
2024-11-05 13:14:10.562654, , memory_allocated: 0.52GB, memory_reserved: 0.68GB, max_memory_allocated: 0.67GB, max_memory_reserved: 0.68GB
2024-11-05 13:14:10.587575, , memory_allocated: 0.52GB, memory_reserved: 0.68GB, max_memory_allocated: 0.67GB, max_memory_reserved: 0.68GB
2024-11-05 13:14:10.612507, , memory_allocated: 0.52GB, memory_reserved: 0.68GB, max_memory_allocated: 0.67GB, max_memory_reserved: 0.68GB
>>> USE_CKPT: False
>>> commit: 11d1f4835f5afce78c0e9882f144877b3c4a9aac
amsgrad build (self-compiled, amsgrad argument not used):
[Epoch 4, batch 45] loss: 2.30259, acc1: 0.06250, acc5: 0.65625
2024-11-05 13:16:29.810763, , memory_allocated: 0.52GB, memory_reserved: 0.68GB, max_memory_allocated: 0.67GB, max_memory_reserved: 0.68GB
2024-11-05 13:16:29.835822, , memory_allocated: 0.52GB, memory_reserved: 0.68GB, max_memory_allocated: 0.67GB, max_memory_reserved: 0.68GB
2024-11-05 13:16:29.860752, , memory_allocated: 0.52GB, memory_reserved: 0.68GB, max_memory_allocated: 0.67GB, max_memory_reserved: 0.68GB
2024-11-05 13:16:29.885698, , memory_allocated: 0.52GB, memory_reserved: 0.68GB, max_memory_allocated: 0.67GB, max_memory_reserved: 0.68GB
>>> USE_CKPT: False
>>> commit: 6a8f1771145117d0e4b4f156f4a5b8deb0c834a7
Official Paddle build:
[Epoch 4, batch 45] loss: 2.30259, acc1: 0.15625, acc5: 0.62500
2024-11-05 13:14:39.615797, , memory_allocated: 0.72GB, memory_reserved: 0.88GB, max_memory_allocated: 0.87GB, max_memory_reserved: 0.88GB
2024-11-05 13:14:39.641230, , memory_allocated: 0.72GB, memory_reserved: 0.88GB, max_memory_allocated: 0.87GB, max_memory_reserved: 0.88GB
2024-11-05 13:14:39.666193, , memory_allocated: 0.72GB, memory_reserved: 0.88GB, max_memory_allocated: 0.87GB, max_memory_reserved: 0.88GB
2024-11-05 13:14:39.691418, , memory_allocated: 0.72GB, memory_reserved: 0.88GB, max_memory_allocated: 0.87GB, max_memory_reserved: 0.88GB
>>> USE_CKPT: True
>>> commit: 11d1f4835f5afce78c0e9882f144877b3c4a9aac
amsgrad build (self-compiled, amsgrad argument not used):
[Epoch 4, batch 45] loss: 2.30259, acc1: 0.06250, acc5: 0.71875
2024-11-05 13:17:21.994855, , memory_allocated: 0.72GB, memory_reserved: 0.88GB, max_memory_allocated: 0.87GB, max_memory_reserved: 0.88GB
2024-11-05 13:17:22.019996, , memory_allocated: 0.72GB, memory_reserved: 0.88GB, max_memory_allocated: 0.87GB, max_memory_reserved: 0.88GB
2024-11-05 13:17:22.044928, , memory_allocated: 0.72GB, memory_reserved: 0.88GB, max_memory_allocated: 0.87GB, max_memory_reserved: 0.88GB
2024-11-05 13:17:22.069807, , memory_allocated: 0.72GB, memory_reserved: 0.88GB, max_memory_allocated: 0.87GB, max_memory_reserved: 0.88GB
>>> USE_CKPT: True
>>> commit: 6a8f1771145117d0e4b4f156f4a5b8deb0c834a7
As you can see, GPU memory usage grows after loading the checkpoint (or am I calling things incorrectly here?), but the official Paddle build and the amsgrad build show exactly the same memory usage ~

Possible follow-up test approaches: …

Please help take a look ~ Thanks!!!
@megemini OK, I'll test llama myself on our machines tomorrow. The problem is indeed quite strange.
@megemini The memory increase after loading the ckpt may be because paddle.load does not support map_location='cpu', so the loaded parameters end up on the GPU.
Thanks!!! The distributed logic here is fairly complex. My main worry is that optional optimizer parameters were not involved before, so there may be places that need special handling ... ...
af27337
PR Category
User Experience
PR Types
New features
Description
【Hackathon 7th No.12】Add amsgrad support to the Adam and AdamW optimizers
Related:
Comparing against pytorch locally, the results of the two are consistent:
Comparison code
Output:
Update 20240908
Completed locally: related tests.
Distributed test items still need to be verified in the CI environment.
Other test items still need to be verified in the CI environment.
In addition, for the xpu amsgrad variant: since the underlying xpu interfaces do not support it yet, only the related input and output parameter lists are modified here.
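For context, a minimal usage sketch of the new option (my own sketch, assuming the amsgrad keyword is exposed exactly as documented in the docstring diffs above):

import paddle

linear = paddle.nn.Linear(10, 10)
opt = paddle.optimizer.Adam(
    learning_rate=1e-3,
    parameters=linear.parameters(),
    amsgrad=True,  # assumed new keyword from this PR: keep the running max of the second moment
)

x = paddle.randn([4, 10])
loss = linear(x).mean()
loss.backward()
opt.step()
opt.clear_grad()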