【Hackathon 7th PPSCI No.12】Add AMSGrad support to the Adam and AdamW optimizers -part #68079

Open · wants to merge 45 commits into develop

Conversation

@megemini (Contributor) commented Sep 8, 2024

PR Category

User Experience

PR Types

New features

Description

【Hackathon 7th No.12】Add AMSGrad support to the Adam and AdamW optimizers
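
A minimal usage sketch of the new flag (assuming a build from this PR's branch, where both optimizers accept amsgrad):

import paddle

linear = paddle.nn.Linear(10, 1)
opt = paddle.optimizer.Adam(parameters=linear.parameters(),
                            learning_rate=0.1,
                            amsgrad=True)  # new flag added by this PR
loss = linear(paddle.randn([4, 10])).mean()
loss.backward()
opt.step()
opt.clear_grad()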

Related:

Compared locally against PyTorch; the results from the two frameworks match:

Comparison script:
import numpy as np

import torch
import paddle


def func(t, x):
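    # Objective with a rare large-gradient step (once every 101 iterations),
    # reminiscent of the Reddi et al. (2018) setting where AMSGrad and plain
    # Adam behave differently.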
    if t % 101 == 1:
        return 1010 * x
    else:
        return -10 * x


np.random.seed(2024)
data = np.array(0).astype("float64")
epoch = 500
lr = 0.1

for amsgrad in [True, False]:
    for opt_name, opt_torch, opt_paddle in [
        ["Adam", torch.optim.Adam, paddle.optimizer.Adam],
        ["AdamW", torch.optim.AdamW, paddle.optimizer.AdamW],
    ]:
        for torch_device, paddle_device in [["cpu", "cpu"], ["cuda", "gpu"]]:
            print(f"------ optimizer is : {opt_name} ; compare : {paddle_device}------")
            print(f"------ pytorch ------")
            x = torch.tensor(data, device=torch.device(torch_device))
            x.requires_grad = True

            optimizer = opt_torch([x], lr=lr, amsgrad=amsgrad)
            for i in range(epoch):
                y = func(i, x)
                optimizer.zero_grad()
                y.backward()
                optimizer.step()

            if torch_device == "cuda":
                x_torch = x.cpu().detach().numpy()
                y_torch = y.cpu().detach().numpy()
            else:
                x_torch = x.detach().numpy()
                y_torch = y.detach().numpy()

            print(f"------ paddle ------")
            paddle.set_device(paddle_device)
            x = paddle.to_tensor(data)
            x.stop_gradient = False

            optimizer = opt_paddle(parameters=[x], learning_rate=lr, amsgrad=amsgrad)
            for i in range(epoch):
                y = func(i, x)
                optimizer.clear_grad()
                y.backward()
                optimizer.step()

            x_paddle = x.numpy()
            y_paddle = y.numpy()

            np.testing.assert_allclose(x_torch, x_paddle, atol=1e-06, rtol=1e-06)
            print(x_torch, x_paddle)
            print(y_torch, y_paddle)
            print(f"------- compare finish ---------")

Output:

------ optimizer is : Adam ; compare : cpu------
------ pytorch ------
------ paddle ------
0.382819332566745 0.3828193325667452
-3.7319234136114865 -3.7319234136114887
------- compare finish ---------
------ optimizer is : Adam ; compare : gpu------
------ pytorch ------
------ paddle ------
0.3828193325667449 0.38281933256674533
-3.7319234136114856 -3.73192341361149
------- compare finish ---------
------ optimizer is : AdamW ; compare : cpu------
------ pytorch ------
------ paddle ------
0.38940724227589385 0.389407242265435
-3.801604114817793 -3.8016041146280424
------- compare finish ---------
------ optimizer is : AdamW ; compare : gpu------
------ pytorch ------
------ paddle ------
0.38940724227589385 0.3894072422654346
-3.801604114817793 -3.801604114628038
------- compare finish ---------
------ optimizer is : Adam ; compare : cpu------
------ pytorch ------
------ paddle ------
0.47233193956960806 0.47233193956960845
-4.62253146676283 -4.622531466762833
------- compare finish ---------
------ optimizer is : Adam ; compare : gpu------
------ pytorch ------
------ paddle ------
0.472331939569608 0.4723319395696082
-4.62253146676283 -4.6225314667628306
------- compare finish ---------
------ optimizer is : AdamW ; compare : cpu------
------ pytorch ------
------ paddle ------
0.462192080569021 0.46219208087997216
-4.525658535292251 -4.525658538303618
------- compare finish ---------
------ optimizer is : AdamW ; compare : gpu------
------ pytorch ------
------ paddle ------
0.46219208056902106 0.46219208087997266
-4.525658535292251 -4.525658538303623
------- compare finish ---------

Update 20240908

  • Completed locally: the tests related to

    • test_adam_op.py
    • test_adamw_op.py
    • test_merged_adam_op.py
    • test_fused_adam_op.py

  • The distributed test jobs still need to be verified in the CI environment

  • The other test jobs still need to be verified in the CI environment

Also, for the xpu amsgrad variant: since the underlying xpu interface does not support it yet, only the relevant input/output parameter lists were changed there.


paddle-bot bot commented Sep 8, 2024

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@paddle-bot bot added the contributor (External developers) label Sep 8, 2024
@PaddlePaddle locked and limited conversation to collaborators Sep 9, 2024
@PaddlePaddle unlocked this conversation Sep 9, 2024
@HydrogenSulfate (Contributor) left a comment

Does the added amsgrad affect the original execution logic and memory footprint? From the look of the PR code, an extra mom2_max buffer is allocated whether or not amsgrad is enabled, compared with the original code without amsgrad, and some redundant variables are introduced.


inline HOSTDEVICE void operator()(size_t i) const {
  // Merge all memory access together.
  T g = grad_[i];
  T mom1 = moment1_[i];
  T mom2 = moment2_[i];
  T mom2_max = moment2_max_[i];

Does this have to be recorded?

Comment on lines 236 to 248
T mom2_max_;
if (amsgrad_) {
  mom2_max_ = std::max(mom2, mom2_max);
  p -= lr * (mom1 / (sqrt(mom2_max_) + epsilon_ * sqrt(1 - beta2_pow)));
} else {
  mom2_max_ = mom2_max;
  p -= lr * (mom1 / (sqrt(mom2) + epsilon_ * sqrt(1 - beta2_pow)));
}

// Write back to global memory
moment1_out_[i] = mom1;
moment2_out_[i] = mom2;
moment2_max_out_[i] = mom2_max_;

Same as above: if amsgrad is not enabled, please do not add any extra variables or related computation; just keep the original code path unchanged.

Comment on lines 326 to 327
Eigen::Map<Eigen::Array<T, 1, Eigen::Dynamic>> moment2_max_out{
    moment2_max_out_, static_cast<Eigen::Index>(numel)};

Same as above: when amsgrad is disabled, is there redundant computation and memory usage related to mom2_max?


inline HOSTDEVICE void adam_update(size_t i, T g) const {
  // The following code is the same as dense
  T mom1 = moment1_[i];
  T mom2 = moment2_[i];
  T mom2_max = moment2_max_[i];

Same as above.

@@ -14,6 +14,7 @@

#pragma once

#include <stdio.h>

What is this header for? Is there code that depends on it?

Contributor Author

I forgot to remove it after debugging, sorry ~

@@ -117,6 +117,7 @@ class Adam(Optimizer):
The default value is False.
multi_precision (bool, optional): Whether to use multi-precision during weight updating. Default is false.
use_multi_tensor (bool, optional): Whether to use multi-tensor strategy to update all parameters at once . Default is false.
amsgrad (bool, optional): Whether to use the AMSGrad variant of this algorithm. Default is false.

@@ -104,6 +104,7 @@ class AdamW(Optimizer):
different semantics with the original Adam algorithm and may lead to different result.
The default value is False.
multi_precision (bool, optional): Whether to use multi-precision during weight updating. Default is false.
amsgrad (bool, optional): Whether to use the AMSGrad variant of this algorithm. Default is false.

Same as above.

@megemini (Contributor Author) commented Sep 9, 2024

> Does the added amsgrad affect the original execution logic and memory footprint? From the look of the PR code, an extra mom2_max buffer is allocated whether or not amsgrad is enabled, compared with the original code without amsgrad, and some redundant variables are introduced.

I considered this before. The main reason is that amsgrad currently touches too many places, so I wanted to defer the optimization work ~

Let me try changing it now ~

@HydrogenSulfate (Contributor) commented Sep 9, 2024

> Does the added amsgrad affect the original execution logic and memory footprint? From the look of the PR code, an extra mom2_max buffer is allocated whether or not amsgrad is enabled, compared with the original code without amsgrad, and some redundant variables are introduced.

> I considered this before. The main reason is that amsgrad currently touches too many places, so I wanted to defer the optimization work ~

> Let me try changing it now ~

  1. The impact here is significant. Optimizers generally track parameter state element-wise, so every optimizer statistic stores as many entries as the model has parameters, and momentum-based optimizers like Adam(W) store even more. During training, the top three consumers of GPU memory are intermediate activations, optimizer state, and model parameters; without this optimization, CV and NLP models that previously fit in 16 GB could easily OOM, to say nothing of large models with billions of parameters (a rough estimate is sketched after this list).

  2. The computation logic itself looks largely fine; the current unoptimized version is useful for quickly verifying correctness, but the final version must account for this basic yet necessary optimization.
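
To make the estimate in point 1 concrete, a back-of-the-envelope Python sketch (an illustration, not a measurement; assumes one extra fp32 moment2_max entry per parameter element):

def extra_amsgrad_bytes(num_params: int, bytes_per_elem: int = 4) -> int:
    # amsgrad adds one moment2_max accumulator entry per parameter element
    return num_params * bytes_per_elem

print(extra_amsgrad_bytes(25_000_000) / 2**20, "MiB")     # ResNet50-scale: ~95 MiB
print(extra_amsgrad_bytes(7_000_000_000) / 2**30, "GiB")  # 7B-parameter LLM: ~26 GiB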

@HydrogenSulfate (Contributor)

Also, once the changes are done, you could run ResNet50 or another model with fake data as input and compare: confirm that GPU memory is unchanged with amsgrad off, and that with it on the increase roughly matches the parameter count (a sketch of such a check follows).
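
A sketch of such a check, assuming a build from this PR's branch for the amsgrad flag; paddle.device.cuda.max_memory_allocated is the public query helper, and each setting is best run in a fresh process for a clean comparison:

import paddle
from paddle.vision.models import resnet50

def peak_memory_with(amsgrad: bool) -> int:
    # A few optimizer steps on fake data, then report peak GPU allocation.
    model = resnet50()
    opt = paddle.optimizer.Adam(parameters=model.parameters(),
                                amsgrad=amsgrad)  # flag added by this PR
    x = paddle.randn([8, 3, 224, 224])
    for _ in range(3):
        loss = model(x).mean()
        loss.backward()
        opt.step()
        opt.clear_grad()
    return paddle.device.cuda.max_memory_allocated()

print(peak_memory_with(amsgrad=False))   # expected: unchanged vs. develop
# print(peak_memory_with(amsgrad=True))  # expected: +~4 bytes per fp32 param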


paddle-ci-bot bot commented Oct 20, 2024

Sorry to inform you that d157301's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@HydrogenSulfate

Update 20241010

  • Fixed adamw.cc so the operator is invoked via adamw_attr_t
  • Added old IR tests for adam

The first issue was an omission from the earlier change to adamw.cc, which left adamw.cc unreachable by the tests. It is now fixed, and local runs invoke the test cases correctly.

The second item adds old IR tests to test_adam_op.py; the file currently has no old IR tests at all, so they are supplemented here.

On the other CI coverage issues mentioned earlier:

  1. adam_kernel.cc

Paddle/paddle/phi/kernels/cpu/adam_kernel.cc was previously reported as not fully covered. This is confirmed to be a CI problem: after adding std::cout to this file and testing, the output items are visible:

[screenshot]

[screenshot]

Here is the modified file, with only the std::cout output added:

adam_kernel.cc
Judging from the output log, all the std::cout output is present, and the code CI reports as uncovered executes sequentially right alongside it, so in theory it cannot be unreached ~

  2. adam.py

[screenshot]

The uncovered code reported here is because the optimizer currently supports only dynamic-graph invocation

[screenshot]

so these two branches cannot be reached, and it seems no test case can be added for them at the moment ~

Everything else that can be covered has been covered; let's see what CI says ~

@HydrogenSulfate

We confirmed internally: due to a policy issue, the coverage tooling (CPU operator coverage in particular) may fail to detect covered code.

phlrain previously approved these changes Oct 21, 2024
@HydrogenSulfate (Contributor) left a comment

Sorry, I forgot to submit this earlier. Our internal testing found that calls of the form _C_ops.adam(w).(...) are affected; the newly added parameters probably need to be exposed at the end of the argument list, otherwise this is an incompatible upgrade.

@megemini (Contributor Author) commented Oct 24, 2024

> Sorry, I forgot to submit this earlier. Our internal testing found that calls of the form _C_ops.adam(w).(...) are affected; the newly added parameters probably need to be exposed at the end of the argument list, otherwise this is an incompatible upgrade.

I already described the concrete fix in PaddlePaddle/PLSC#216 (comment): essentially, just pass None at the corresponding positions ~

Calling _legacy_C_ops.adamw directly already returns a different number of values, which by itself already counts as an incompatible upgrade, so the input arguments can simply be updated along with it ~ Why would changing the number of return values be acceptable but changing the input parameters not???

Also, regarding the two new parameters amsgrad and moment2_max: amsgrad is an attr and has already been placed at the end of the parameter list, while moment2_max is a Tensor, and as I recall Tensor inputs must come before attrs, right? Besides, a Tensor cannot be given a default value in ops.yaml, so it cannot be moved to the end of the parameter list either ... ...

@HydrogenSulfate

@megemini (Contributor Author)

One more note: if you want to stay compatible with the old _C_ops interface, you can add a branch, e.g. for Adam:

ADAM_WITH_AMSGRAD = hasattr(paddle.optimizer.Adam, '_moment2_acc_max_str')

...

def foo():
    ...

    if ADAM_WITH_AMSGRAD:
        _ = _C_ops.adam_(..., None, ..., 'amsgrad', False)
    else:
        _ = _C_ops.adam_(...)

...

First check ADAM_WITH_AMSGRAD: if '_moment2_acc_max_str' exists, it is the new interface, so take the first branch and include None, amsgrad, and so on in the arguments ~ If not, it is the old interface and the original logic stays unchanged ~


paddle-ci-bot bot commented Oct 28, 2024

Sorry to inform you that 1c05064's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@HydrogenSulfate (Contributor)

@megemini Our internal testing on llama shows an impact on GPU memory: usage goes up. Could you use this PaddleNLP tutorial to check whether llama-7b's GPU memory usage is affected? https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm#22- (change the model in sft_argument.json to llama-7b)

@megemini (Contributor Author) commented Oct 28, 2024

> @megemini Our internal testing on llama shows an impact on GPU memory: usage goes up. Could you use this PaddleNLP tutorial to check whether llama-7b's GPU memory usage is affected? https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm#22- (change the model in sft_argument.json to llama-7b)

How exactly was it tested, and by how much did it grow? We verified the memory question before: with amsgrad unused there is no impact, and if there is an impact, the size of the increase should tell us whether we introduced it ~ Environment factors such as the CUDA version can also affect memory usage; we ran into that before as well ~

I searched PaddleNLP's implementations; they all call through the optimizer.AdamW style API (one gpt-3 model, PaddleNLP/slm/model_zoo/gpt-3/ppfleetx/optims/optimizer.py, uses _C_ops.adamw_; that one will need updating later). There is no amsgrad argument by default, and when optimizer.AdamW is used without amsgrad, the related state is all initialized as None; once the operator receives None, none of it gets initialized. PaddlePaddle/PLSC#216 (comment) also shows the uninitialized state ~ An uninitialized Tensor should not increase GPU memory, right?

I'll pull the latest version, rebuild, and take a look ~ I'll try to reproduce the test environment from https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm#22- on AIStudio; not sure AIStudio has room for it ... ... 🙏🙏🙏

@HydrogenSulfate (Contributor) commented Oct 28, 2024

> @megemini Our internal testing on llama shows an impact on GPU memory: usage goes up. Could you use this PaddleNLP tutorial to check whether llama-7b's GPU memory usage is affected? https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm#22- (change the model in sft_argument.json to llama-7b)

> How exactly was it tested, and by how much did it grow? We verified the memory question before: with amsgrad unused there is no impact, and if there is an impact, the size of the increase should tell us whether we introduced it ~ Environment factors such as the CUDA version can also affect memory usage; we ran into that before as well ~

> I searched PaddleNLP's implementations; they all call through the optimizer.AdamW style API (one gpt-3 model, PaddleNLP/slm/model_zoo/gpt-3/ppfleetx/optims/optimizer.py, uses _C_ops.adamw_; that one will need updating later). There is no amsgrad argument by default, and when optimizer.AdamW is used without amsgrad, the related state is all initialized as None; once the operator receives None, none of it gets initialized. PaddlePaddle/PLSC#216 (comment) also shows the uninitialized state ~ An uninitialized Tensor should not increase GPU memory, right?

> I'll pull the latest version, rebuild, and take a look ~ I'll try to reproduce the test environment from https://github.com/PaddlePaddle/PaddleNLP/tree/develop/llm#22- on AIStudio; not sure AIStudio has room for it ... ... 🙏🙏🙏

Besides resnet, could you please also verify with llama2-7b? Just looking at the before/after GPU memory usage is enough.

@megemini (Contributor Author) commented Nov 1, 2024

@HydrogenSulfate Things have been busy these past couple of days; I just found time to test llama's GPU memory usage in PaddleNLP ~

Conclusion first: the Paddle build with amsgrad compiled on AIStudio uses less GPU memory than the released development build paddlepaddle-gpu 3.0.0.dev20241030.

On AIStudio I tested in a dual-GPU environment with the following command:

python -u  -m paddle.distributed.launch --gpus "0,1" run_finetune.py ./config/llama/sft_argument.json

The sft_argument.json configuration file is as follows

{
    "model_name_or_path": "meta-llama/Llama-2-7b",
    "dataset_name_or_path": "/home/aistudio/llama/data",
    "output_dir": "/home/aistudio/llama/checkpoints",
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 2,
    "per_device_eval_batch_size": 8,
    "eval_accumulation_steps":16,
    "num_train_epochs": 1,
    "learning_rate": 3e-05,
    "warmup_steps": 30,
    "logging_steps": 1,
    "evaluation_strategy": "epoch",
    "save_strategy": "epoch",
    "src_length": 1024,
    "max_length": 2048,
    "bf16": true,
    "fp16_opt_level": "O2",
    "do_train": true,
    "do_eval": true,
    "disable_tqdm": true,
    "load_best_model_at_end": true,
    "eval_with_do_generation": false,
    "metric_for_best_model": "accuracy",
    "recompute": false,
    "save_total_limit": 1,
    "tensor_parallel_degree": 1,
    "pipeline_parallel_degree": 1,
    "pipeline_parallel_config": "disable_p2p_cache_shape",
    "sharding": "stage2",
    "zero_padding": false,
    "unified_checkpoint": true,
    "use_flash_attention": false
}

Only the model name and the storage directories were changed here.

After running the command, the log output was:

aistudio@jupyter-942478-8345123:~/PaddleNLP/llm$ python -u  -m paddle.distributed.launch --gpus "0,1" run_finetune.py ./config/llama/sft_argument.json
LAUNCH INFO 2024-11-01 06:45:05,596 -----------  Configuration  ----------------------
LAUNCH INFO 2024-11-01 06:45:05,596 auto_cluster_config: 0
LAUNCH INFO 2024-11-01 06:45:05,596 auto_parallel_config: None
LAUNCH INFO 2024-11-01 06:45:05,596 auto_tuner_json: None
LAUNCH INFO 2024-11-01 06:45:05,596 devices: 0,1
LAUNCH INFO 2024-11-01 06:45:05,596 elastic_level: -1
LAUNCH INFO 2024-11-01 06:45:05,596 elastic_timeout: 30
LAUNCH INFO 2024-11-01 06:45:05,596 enable_gpu_log: True
LAUNCH INFO 2024-11-01 06:45:05,596 gloo_port: 6767
LAUNCH INFO 2024-11-01 06:45:05,596 host: None
LAUNCH INFO 2024-11-01 06:45:05,596 ips: None
LAUNCH INFO 2024-11-01 06:45:05,596 job_id: default
LAUNCH INFO 2024-11-01 06:45:05,596 legacy: False
LAUNCH INFO 2024-11-01 06:45:05,596 log_dir: log
LAUNCH INFO 2024-11-01 06:45:05,596 log_level: INFO
LAUNCH INFO 2024-11-01 06:45:05,596 log_overwrite: False
LAUNCH INFO 2024-11-01 06:45:05,596 master: None
LAUNCH INFO 2024-11-01 06:45:05,596 max_restart: 3
LAUNCH INFO 2024-11-01 06:45:05,596 nnodes: 1
LAUNCH INFO 2024-11-01 06:45:05,597 nproc_per_node: None
LAUNCH INFO 2024-11-01 06:45:05,597 rank: -1
LAUNCH INFO 2024-11-01 06:45:05,597 run_mode: collective
LAUNCH INFO 2024-11-01 06:45:05,597 server_num: None
LAUNCH INFO 2024-11-01 06:45:05,597 servers: 
LAUNCH INFO 2024-11-01 06:45:05,597 sort_ip: False
LAUNCH INFO 2024-11-01 06:45:05,597 start_port: 6070
LAUNCH INFO 2024-11-01 06:45:05,597 trainer_num: None
LAUNCH INFO 2024-11-01 06:45:05,597 trainers: 
LAUNCH INFO 2024-11-01 06:45:05,597 training_script: run_finetune.py
LAUNCH INFO 2024-11-01 06:45:05,597 training_script_args: ['./config/llama/sft_argument.json']
LAUNCH INFO 2024-11-01 06:45:05,597 with_gloo: 1
LAUNCH INFO 2024-11-01 06:45:05,597 --------------------------------------------------
LAUNCH INFO 2024-11-01 06:45:05,597 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2024-11-01 06:45:05,612 Run Pod: vvsghr, replicas 2, status ready
LAUNCH INFO 2024-11-01 06:45:05,720 Watching Pod: vvsghr, replicas 2, status running
/home/aistudio/.local/lib/python3.8/site-packages/_distutils_hack/__init__.py:31: UserWarning: Setuptools is replacing distutils. Support for replacing an already imported distutils is deprecated. In the future, this condition will fail. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml
  warnings.warn(
[2024-11-01 06:45:09,243] [    INFO] distributed_strategy.py:333 - distributed strategy initialized
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
FLAGS(name='FLAGS_enable_pir_in_executor', current_value=True, default_value=False)
=======================================================================
I1101 06:45:09.245688 57051 tcp_utils.cc:181] The server starts to listen on IP_ANY:50236
I1101 06:45:09.245944 57051 tcp_utils.cc:130] Successfully connected to 10.44.3.96:50236
I1101 06:45:09.333040 57051 process_group_nccl.cc:151] ProcessGroupNCCL pg_timeout_ 1800000
I1101 06:45:09.333137 57051 process_group_nccl.cc:152] ProcessGroupNCCL nccl_comm_init_option_ 0
[2024-11-01 06:45:09,333] [    INFO] topology.py:375 - Total 2 pipe comm group(s) create successfully!
W1101 06:45:09.334441 57051 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 12.0, Runtime API Version: 11.8
W1101 06:45:09.335824 57051 gpu_resources.cc:164] device: 0, cuDNN Version: 8.9.
/home/aistudio/.local/lib/python3.8/site-packages/paddle/distributed/communication/group.py:128: UserWarning: Current global rank 0 is not in group _default_pg10
  warnings.warn(
[2024-11-01 06:45:11,712] [    INFO] topology.py:375 - Total 2 data comm group(s) create successfully!
[2024-11-01 06:45:11,712] [    INFO] topology.py:375 - Total 2 model comm group(s) create successfully!
I1101 06:45:11.712544 57051 process_group_nccl.cc:151] ProcessGroupNCCL pg_timeout_ 1800000
I1101 06:45:11.712577 57051 process_group_nccl.cc:152] ProcessGroupNCCL nccl_comm_init_option_ 0
[2024-11-01 06:45:11,712] [    INFO] topology.py:375 - Total 1 sharding comm group(s) create successfully!
I1101 06:45:11.712703 57051 process_group_nccl.cc:151] ProcessGroupNCCL pg_timeout_ 1800000
I1101 06:45:11.712713 57051 process_group_nccl.cc:152] ProcessGroupNCCL nccl_comm_init_option_ 0
[2024-11-01 06:45:11,712] [    INFO] topology.py:295 - HybridParallelInfo: rank_id: 0, mp_degree: 1, sharding_degree: 2, pp_degree: 1, dp_degree: 1, sep_degree: 1, mp_group: [0],  sharding_group: [0, 1], pp_group: [0], dp_group: [0], sep:group: None, check/clip group: [0, 1]
[2024-11-01 06:45:11,712] [    INFO] -     +==============================================================================+
    |                                                                              |
    |                         DistributedStrategy Overview                         |
    |                                                                              |
    +==============================================================================+
    |                        a_sync=True <-> a_sync_configs                        |
    +------------------------------------------------------------------------------+
    |                               k_steps                    -1                  |
    |                     max_merge_var_num                    1                   |
    |                       send_queue_size                    16                  |
    |               independent_recv_thread                  False                 |
    |         min_send_grad_num_before_recv                    1                   |
    |                      thread_pool_size                    1                   |
    |                       send_wait_times                    1                   |
    |               runtime_split_send_recv                  False                 |
    |                        launch_barrier                   True                 |
    |             heter_worker_device_guard                   cpu                  |
    |                        lr_decay_steps                    10                  |
    |                            use_ps_gpu                    0                   |
    |                         use_gpu_graph                    0                   |
    +==============================================================================+
    |                    Environment Flags, Communication Flags                    |
    +------------------------------------------------------------------------------+
    |                                  mode                    1                   |
    |                               elastic                  False                 |
    |                                  auto                  False                 |
    |                   sync_nccl_allreduce                   True                 |
    |                         nccl_comm_num                    1                   |
    |            use_hierarchical_allreduce                  False                 |
    |   hierarchical_allreduce_inter_nranks                    1                   |
    |                       sync_batch_norm                  False                 |
    |                   fuse_all_reduce_ops                   True                 |
    |                  fuse_grad_size_in_MB                    32                  |
    |              fuse_grad_size_in_TFLOPS                   50.0                 |
    |               cudnn_exhaustive_search                  False                 |
    |             conv_workspace_size_limit                   512                  |
    |    cudnn_batchnorm_spatial_persistent                  False                 |
    |                        fp16_allreduce                  False                 |
    |               last_comm_group_size_MB                   1.0                  |
    |                find_unused_parameters                  False                 |
    |            without_graph_optimization                   True                 |
    |                 fuse_grad_size_in_num                    8                   |
    |                 calc_comm_same_stream                  False                 |
    |                                   asp                  False                 |
    |                       fuse_grad_merge                  False                 |
    |                             semi_auto                  False                 |
    |                            adam_d2sum                  False                 |
    |                           auto_search                  False                 |
    |                        heter_ccl_mode                  False                 |
    |                         is_fl_ps_mode                  False                 |
    |                      with_coordinator                  False                 |
    |                            split_data                   True                 |
    |                  downpour_table_param                    []                  |
    |                       fs_client_param                                        |
    +==============================================================================+
    |                                Build Strategy                                |
    +------------------------------------------------------------------------------+
    |              fuse_elewise_add_act_ops                  False                 |
    |                       fuse_bn_act_ops                  False                 |
    |              fuse_relu_depthwise_conv                  False                 |
    |                    fuse_broadcast_ops                  False                 |
    |                fuse_all_optimizer_ops                  False                 |
    |                        enable_inplace                  False                 |
    |     enable_backward_optimizer_op_deps                   True                 |
    |                 cache_runtime_context                  False                 |
    |                   fuse_bn_add_act_ops                   True                 |
    |                    enable_auto_fusion                  False                 |
    |                          enable_addto                  False                 |
    |              allow_cuda_graph_capture                  False                 |
    |                       reduce_strategy                    0                   |
    |                    fuse_gemm_epilogue                  False                 |
    |                   debug_graphviz_path                                        |
    |                       fused_attention                  False                 |
    |                     fused_feedforward                  False                 |
    |            fuse_dot_product_attention                  False                 |
    |                          fuse_resunit                  False                 |
    +==============================================================================+

[2024-11-01 06:45:11,713] [    INFO] - The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
[2024-11-01 06:45:11,713] [   DEBUG] - ============================================================
[2024-11-01 06:45:11,713] [   DEBUG] -      Model Configuration Arguments      
[2024-11-01 06:45:11,713] [   DEBUG] - paddle commit id              : 6a8f1771145117d0e4b4f156f4a5b8deb0c834a7
[2024-11-01 06:45:11,713] [   DEBUG] - paddlenlp commit id           : 81f5ab54525d0a2f2acc9217a74d9e028583fa1b.dirty
[2024-11-01 06:45:11,714] [   DEBUG] - aistudio_repo_id              : None
[2024-11-01 06:45:11,714] [   DEBUG] - aistudio_repo_license         : Apache License 2.0
[2024-11-01 06:45:11,714] [   DEBUG] - aistudio_repo_private         : True
[2024-11-01 06:45:11,714] [   DEBUG] - aistudio_token                : None
[2024-11-01 06:45:11,714] [   DEBUG] - attention_probs_dropout_prob  : 0.1
[2024-11-01 06:45:11,714] [   DEBUG] - continue_training             : True
[2024-11-01 06:45:11,714] [   DEBUG] - flash_mask                    : False
[2024-11-01 06:45:11,714] [   DEBUG] - from_aistudio                 : False
[2024-11-01 06:45:11,714] [   DEBUG] - fuse_attention_ffn            : None
[2024-11-01 06:45:11,714] [   DEBUG] - fuse_attention_qkv            : None
[2024-11-01 06:45:11,714] [   DEBUG] - hidden_dropout_prob           : 0.1
[2024-11-01 06:45:11,714] [   DEBUG] - lora                          : False
[2024-11-01 06:45:11,714] [   DEBUG] - lora_path                     : None
[2024-11-01 06:45:11,714] [   DEBUG] - lora_plus_scale               : 1.0
[2024-11-01 06:45:11,714] [   DEBUG] - lora_rank                     : 8
[2024-11-01 06:45:11,714] [   DEBUG] - model_name_or_path            : meta-llama/Llama-2-7b
[2024-11-01 06:45:11,715] [   DEBUG] - neftune                       : False
[2024-11-01 06:45:11,715] [   DEBUG] - neftune_noise_alpha           : 5.0
[2024-11-01 06:45:11,715] [   DEBUG] - num_prefix_tokens             : 128
[2024-11-01 06:45:11,715] [   DEBUG] - pissa                         : False
[2024-11-01 06:45:11,715] [   DEBUG] - prefix_path                   : None
[2024-11-01 06:45:11,715] [   DEBUG] - prefix_tuning                 : False
[2024-11-01 06:45:11,715] [   DEBUG] - rslora                        : False
[2024-11-01 06:45:11,715] [   DEBUG] - save_to_aistudio              : False
[2024-11-01 06:45:11,715] [   DEBUG] - tokenizer_name_or_path        : None
[2024-11-01 06:45:11,715] [   DEBUG] - use_fast_layer_norm           : False
[2024-11-01 06:45:11,715] [   DEBUG] - use_quick_lora                : False
[2024-11-01 06:45:11,715] [   DEBUG] - vera                          : False
[2024-11-01 06:45:11,715] [   DEBUG] - vera_rank                     : 8
[2024-11-01 06:45:11,715] [   DEBUG] - weight_blocksize              : 64
[2024-11-01 06:45:11,715] [   DEBUG] - weight_double_quant           : False
[2024-11-01 06:45:11,715] [   DEBUG] - weight_double_quant_block_size: 256
[2024-11-01 06:45:11,716] [   DEBUG] - weight_quantize_algo          : None
[2024-11-01 06:45:11,716] [   DEBUG] - 
[2024-11-01 06:45:11,716] [   DEBUG] - ============================================================
[2024-11-01 06:45:11,716] [   DEBUG] -       Data Configuration Arguments      
[2024-11-01 06:45:11,716] [   DEBUG] - paddle commit id              : 6a8f1771145117d0e4b4f156f4a5b8deb0c834a7
[2024-11-01 06:45:11,716] [   DEBUG] - paddlenlp commit id           : 81f5ab54525d0a2f2acc9217a74d9e028583fa1b.dirty
[2024-11-01 06:45:11,716] [   DEBUG] - chat_template                 : None
[2024-11-01 06:45:11,716] [   DEBUG] - dataset_name_or_path          : /home/aistudio/llama/data
[2024-11-01 06:45:11,716] [   DEBUG] - eval_with_do_generation       : False
[2024-11-01 06:45:11,716] [   DEBUG] - greedy_zero_padding           : False
[2024-11-01 06:45:11,716] [   DEBUG] - intokens                      : None
[2024-11-01 06:45:11,716] [   DEBUG] - lazy                          : False
[2024-11-01 06:45:11,716] [   DEBUG] - max_length                    : 2048
[2024-11-01 06:45:11,716] [   DEBUG] - pad_to_max_length             : False
[2024-11-01 06:45:11,716] [   DEBUG] - pad_to_multiple_of            : None
[2024-11-01 06:45:11,716] [   DEBUG] - save_generation_output        : False
[2024-11-01 06:45:11,717] [   DEBUG] - src_length                    : 1024
[2024-11-01 06:45:11,717] [   DEBUG] - task_name                     : None
[2024-11-01 06:45:11,717] [   DEBUG] - task_name_or_path             : None
[2024-11-01 06:45:11,717] [   DEBUG] - zero_padding                  : False
[2024-11-01 06:45:11,717] [   DEBUG] - 
[2024-11-01 06:45:11,717] [   DEBUG] - ============================================================
[2024-11-01 06:45:11,717] [   DEBUG] -      Quant Configuration Arguments      
[2024-11-01 06:45:11,717] [   DEBUG] - paddle commit id              : 6a8f1771145117d0e4b4f156f4a5b8deb0c834a7
[2024-11-01 06:45:11,717] [   DEBUG] - paddlenlp commit id           : 81f5ab54525d0a2f2acc9217a74d9e028583fa1b.dirty
[2024-11-01 06:45:11,717] [   DEBUG] - act_quant_method              : avg
[2024-11-01 06:45:11,717] [   DEBUG] - auto_clip                     : False
[2024-11-01 06:45:11,717] [   DEBUG] - autoclip_step                 : 8
[2024-11-01 06:45:11,717] [   DEBUG] - awq_step                      : 8
[2024-11-01 06:45:11,717] [   DEBUG] - cachekv_quant_method          : avg_headwise
[2024-11-01 06:45:11,717] [   DEBUG] - do_awq                        : False
[2024-11-01 06:45:11,717] [   DEBUG] - do_gptq                       : False
[2024-11-01 06:45:11,718] [   DEBUG] - do_ptq                        : False
[2024-11-01 06:45:11,718] [   DEBUG] - do_qat                        : False
[2024-11-01 06:45:11,718] [   DEBUG] - do_quant_debug                : False
[2024-11-01 06:45:11,718] [   DEBUG] - fp8_type                      : ['e4m3', 'e4m3']
[2024-11-01 06:45:11,718] [   DEBUG] - gptq_step                     : 8
[2024-11-01 06:45:11,718] [   DEBUG] - load_quant_model              : False
[2024-11-01 06:45:11,718] [   DEBUG] - ptq_step                      : 32
[2024-11-01 06:45:11,718] [   DEBUG] - quant_type                    : a8w8
[2024-11-01 06:45:11,718] [   DEBUG] - search_alpha_max              : 0.8
[2024-11-01 06:45:11,718] [   DEBUG] - search_alpha_min              : 0.2
[2024-11-01 06:45:11,718] [   DEBUG] - search_scale_max              : 5.0
[2024-11-01 06:45:11,718] [   DEBUG] - search_scale_min              : 1.0
[2024-11-01 06:45:11,718] [   DEBUG] - shift                         : False
[2024-11-01 06:45:11,718] [   DEBUG] - shift_all_linears             : False
[2024-11-01 06:45:11,718] [   DEBUG] - shift_sampler                 : ema
[2024-11-01 06:45:11,718] [   DEBUG] - shift_step                    : 32
[2024-11-01 06:45:11,718] [   DEBUG] - skip_list_names               : None
[2024-11-01 06:45:11,719] [   DEBUG] - smooth                        : False
[2024-11-01 06:45:11,719] [   DEBUG] - smooth_all_linears            : False
[2024-11-01 06:45:11,719] [   DEBUG] - smooth_k_piece                : 3
[2024-11-01 06:45:11,719] [   DEBUG] - smooth_piecewise_search       : False
[2024-11-01 06:45:11,719] [   DEBUG] - smooth_sampler                : none
[2024-11-01 06:45:11,719] [   DEBUG] - smooth_search_piece           : False
[2024-11-01 06:45:11,719] [   DEBUG] - smooth_step                   : 32
[2024-11-01 06:45:11,719] [   DEBUG] - test_sample                   : None
[2024-11-01 06:45:11,719] [   DEBUG] - weight_quant_method           : abs_max_channel_wise
[2024-11-01 06:45:11,719] [   DEBUG] - 
[2024-11-01 06:45:11,719] [   DEBUG] - ============================================================
[2024-11-01 06:45:11,719] [   DEBUG] -    Generation Configuration Arguments   
[2024-11-01 06:45:11,719] [   DEBUG] - paddle commit id              : 6a8f1771145117d0e4b4f156f4a5b8deb0c834a7
[2024-11-01 06:45:11,719] [   DEBUG] - paddlenlp commit id           : 81f5ab54525d0a2f2acc9217a74d9e028583fa1b.dirty
[2024-11-01 06:45:11,719] [   DEBUG] - top_k                         : 1
[2024-11-01 06:45:11,719] [   DEBUG] - top_p                         : 1.0
[2024-11-01 06:45:11,720] [   DEBUG] - 
[2024-11-01 06:45:11,720] [ WARNING] - Process rank: 0, device: gpu, world_size: 2, distributed training: True, 16-bits training: True
[2024-11-01 06:45:11,721] [    INFO] - We are using <class 'paddlenlp.transformers.llama.configuration.LlamaConfig'> to load 'meta-llama/Llama-2-7b'.
[2024-11-01 06:45:11,721] [    INFO] - Loading configuration file /home/aistudio/.paddlenlp/models/meta-llama/Llama-2-7b/config.json
[2024-11-01 06:45:11,723] [    INFO] - Final model config: LlamaConfig {
  "alibi": false,
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "dtype": "bfloat16",
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "immediate_clear_past_key_value": false,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "long_sequence_init_args": {},
  "long_sequence_strategy_name": null,
  "long_sequence_strategy_type": null,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "paddlenlp_version": "3.0.0b2.post20241101",
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_scaling_factor": 1.0,
  "rope_scaling_type": null,
  "rope_theta": 10000.0,
  "seq_length": 2048,
  "tensor_parallel_output": false,
  "tie_word_embeddings": false,
  "use_fast_layer_norm": false,
  "use_flash_attention_for_generation": false,
  "use_last_token_for_generation": false,
  "use_long_sequence_strategies": false,
  "vocab_size": 32000
}

[2024-11-01 06:45:11,723] [    INFO] - We are using <class 'paddlenlp.transformers.llama.modeling.LlamaForCausalLM'> to load 'meta-llama/Llama-2-7b'.
[2024-11-01 06:45:11,724] [    INFO] - Loading weights file from cache at /home/aistudio/.paddlenlp/models/meta-llama/Llama-2-7b/model_state.pdparams

Screenshots below:

  • Right after the model starts, paddlepaddle-gpu 3.0.0.dev20241030 uses 960 MiB of GPU memory

[screenshot]

The amsgrad build of Paddle uses 946 MiB

[screenshot]

  • After the model loads, paddlepaddle-gpu 3.0.0.dev20241030 uses 14354 MiB

[screenshot]

The amsgrad build of Paddle uses 14342 MiB

[screenshot]

Here, paddlepaddle-gpu 0.0.0 is the locally compiled build with amsgrad

[screenshot]

Note: the program did not get any further; PaddleNLP raised an error, possibly a PaddleNLP issue where the input data files were not read correctly

LAUNCH INFO 2024-11-01 06:38:19,474 Pod failed
LAUNCH ERROR 2024-11-01 06:38:19,474 Container failed !!!
Container rank 1 status failed cmd ['/usr/bin/python', '-u', 'run_finetune.py', './config/llama/sft_argument.json'] code 1 log log/workerlog.1
LAUNCH INFO 2024-11-01 06:38:19,474 ------------------------- ERROR LOG DETAIL -------------------------
ing <class 'paddlenlp.transformers.llama.tokenizer.LlamaTokenizer'> to load 'meta-llama/Llama-2-7b'.
Traceback (most recent call last):
  File "/home/aistudio/PaddleNLP/paddlenlp/datasets/dataset.py", line 194, in load_dataset
    reader_cls = import_main_class(path_or_read_func)
  File "/home/aistudio/PaddleNLP/paddlenlp/datasets/dataset.py", line 95, in import_main_class
    module = importlib.import_module(module_path)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'paddlenlp.datasets./home/aistudio/llama/data'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/aistudio/PaddleNLP/paddlenlp/datasets/dataset.py", line 116, in load_from_hf
    hf_datasets = load_hf_dataset(path, name=name, split=splits, **kwargs)
  File "/home/aistudio/PaddleNLP/paddlenlp/datasets/dataset.py", line 56, in load_from_ppnlp
    return origin_load_dataset(path, trust_remote_code=True, *args, **kwargs)
  File "/home/aistudio/.local/lib/python3.8/site-packages/datasets/load.py", line 2132, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/aistudio/.local/lib/python3.8/site-packages/datasets/load.py", line 1853, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/home/aistudio/.local/lib/python3.8/site-packages/datasets/load.py", line 1582, in dataset_module_factory
    return LocalDatasetModuleFactoryWithoutScript(
  File "/home/aistudio/.local/lib/python3.8/site-packages/datasets/load.py", line 840, in get_module
    module_name, default_builder_kwargs = infer_module_for_data_files(
  File "/home/aistudio/.local/lib/python3.8/site-packages/datasets/load.py", line 601, in infer_module_for_data_files
    raise DataFilesNotFoundError("No (supported) data files found" + (f" in {path}" if path else ""))
datasets.exceptions.DataFilesNotFoundError: No (supported) data files found in /home/aistudio/llama/data

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_finetune.py", line 735, in <module>
    main()
  File "run_finetune.py", line 313, in main
    train_ds = load_dataset(data_args.dataset_name_or_path, splits=["train"])[0]
  File "/home/aistudio/PaddleNLP/paddlenlp/datasets/dataset.py", line 196, in load_dataset
    datasets = load_from_hf(
  File "/home/aistudio/PaddleNLP/paddlenlp/datasets/dataset.py", line 118, in load_from_hf
    raise FileNotFoundError("Couldn't find the dataset script for '" + path + "' on PaddleNLP or HuggingFace")
FileNotFoundError: Couldn't find the dataset script for '/home/aistudio/llama/data' on PaddleNLP or HuggingFace
LAUNCH INFO 2024-11-01 06:38:19,875 Exit code -15
aistudio@jupyter-942478-8345123:~/PaddleNLP/llm$ python -u  -m paddle.distributed.launch --gpus "0,1" run_finetune.py ./config/llama/sft_argument.json


^CTraceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 185, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/lib/python3.8/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/__init__.py", line 37, in <module>
    from .base import core  # noqa: F401
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/base/__init__.py", line 38, in <module>
    from . import (  # noqa: F401
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/base/backward.py", line 28, in <module>
    from . import core, framework, log_helper, unique_name
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/base/framework.py", line 41, in <module>
    from .proto import (
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/base/proto/data_feed_pb2.py", line 5, in <module>
    from google.protobuf.internal import builder as _builder
  File "/home/aistudio/.local/lib/python3.8/site-packages/google/protobuf/internal/builder.py", line 18, in <module>
    from google.protobuf.internal import python_message
  File "/home/aistudio/.local/lib/python3.8/site-packages/google/protobuf/internal/python_message.py", line 39, in <module>
    from google.protobuf import text_format
  File "/home/aistudio/.local/lib/python3.8/site-packages/google/protobuf/text_format.py", line 33, in <module>
    from google.protobuf import unknown_fields
  File "<frozen importlib._bootstrap>", line 1042, in _handle_fromlist
KeyboardInterrupt

I had already placed the data files in the directory as instructed

[screenshot]

Still, this should not affect our memory analysis ~

So far, PaddleNLP can load the model normally with the amsgrad build of Paddle, and no abnormal GPU memory growth has been observed ~

For the issue found in your earlier internal testing: what were the test environment, procedure, and versions? By how much did GPU memory grow? How was it determined that amsgrad caused it?

Also, testing only resnet or llama by itself doesn't mean much; we can't possibly test every model ~ What is the target of the testing, or rather, what is the point of concern? Is there anywhere else that needs separate, focused testing?

Thanks! ~~~


Update:

I just re-downloaded the test data (https://bj.bcebos.com/paddlenlp/datasets/examples/alpaca_demo.gz) and could enter the training flow, but GPU memory is insufficient (the official Paddle build and the amsgrad build behave identically), so training could not continue

[2024-11-01 07:51:51,706] [    INFO] - We are using <class 'paddlenlp.transformers.llama.modeling.LlamaForCausalLM'> to load 'meta-llama/Llama-2-7b'.
[2024-11-01 07:51:51,707] [    INFO] - Loading weights file from cache at /home/aistudio/.paddlenlp/models/meta-llama/Llama-2-7b/model_state.pdparams
[2024-11-01 07:53:42,486] [    INFO] - Loaded weights file from disk, setting weights to model.
[2024-11-01 07:55:05,265] [    INFO] - All model checkpoint weights were used when initializing LlamaForCausalLM.

[2024-11-01 07:55:05,265] [    INFO] - All the weights of LlamaForCausalLM were initialized from the model checkpoint at meta-llama/Llama-2-7b.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[2024-11-01 07:55:05,268] [    INFO] - Loading configuration file /home/aistudio/.paddlenlp/models/meta-llama/Llama-2-7b/generation_config.json
[2024-11-01 07:55:06,067] [    INFO] - We are using <class 'paddlenlp.transformers.llama.tokenizer.LlamaTokenizer'> to load 'meta-llama/Llama-2-7b'.
[2024-11-01 07:55:07,259] [    INFO] - The global seed is set to 42, local seed is set to 44 and random seed is set to 42.
[2024-11-01 07:55:07,521] [    INFO] - Using half precision
[2024-11-01 07:55:07,522] [   DEBUG] - ============================================================
[2024-11-01 07:55:07,522] [   DEBUG] -     Training Configuration Arguments    
[2024-11-01 07:55:07,523] [   DEBUG] - paddle commit id              : cead7f59d4f01bc5aff7a78206ca393f1ace553b
[2024-11-01 07:55:07,523] [   DEBUG] - paddlenlp commit id           : 81f5ab54525d0a2f2acc9217a74d9e028583fa1b.dirty
[2024-11-01 07:55:07,523] [   DEBUG] - _no_sync_in_gradient_accumulation: True
[2024-11-01 07:55:07,523] [   DEBUG] - adam_beta1                    : 0.9
[2024-11-01 07:55:07,523] [   DEBUG] - adam_beta2                    : 0.999
[2024-11-01 07:55:07,523] [   DEBUG] - adam_epsilon                  : 1e-08
[2024-11-01 07:55:07,523] [   DEBUG] - amp_custom_black_list         : None
[2024-11-01 07:55:07,523] [   DEBUG] - amp_custom_white_list         : None
[2024-11-01 07:55:07,523] [   DEBUG] - amp_master_grad               : False
[2024-11-01 07:55:07,523] [   DEBUG] - auto_parallel_resume_form_hybrid_parallel: False
[2024-11-01 07:55:07,523] [   DEBUG] - autotuner_benchmark           : False
[2024-11-01 07:55:07,523] [   DEBUG] - benchmark                     : False
[2024-11-01 07:55:07,523] [   DEBUG] - bf16                          : True
[2024-11-01 07:55:07,523] [   DEBUG] - bf16_full_eval                : False
[2024-11-01 07:55:07,524] [   DEBUG] - context_parallel_degree       : 1
[2024-11-01 07:55:07,524] [   DEBUG] - current_device                : gpu:0
[2024-11-01 07:55:07,524] [   DEBUG] - data_parallel_config          : 
[2024-11-01 07:55:07,524] [   DEBUG] - data_parallel_degree          : 1
[2024-11-01 07:55:07,524] [   DEBUG] - data_parallel_rank            : 0
[2024-11-01 07:55:07,524] [   DEBUG] - dataloader_drop_last          : False
[2024-11-01 07:55:07,524] [   DEBUG] - dataloader_num_workers        : 0
[2024-11-01 07:55:07,524] [   DEBUG] - dataset_rank                  : 0
[2024-11-01 07:55:07,524] [   DEBUG] - dataset_world_size            : 2
[2024-11-01 07:55:07,524] [   DEBUG] - ddp_find_unused_parameters    : None
[2024-11-01 07:55:07,524] [   DEBUG] - decay_steps                   : 0
[2024-11-01 07:55:07,524] [   DEBUG] - device                        : gpu
[2024-11-01 07:55:07,524] [   DEBUG] - disable_tqdm                  : True
[2024-11-01 07:55:07,524] [   DEBUG] - distributed_dataloader        : False
[2024-11-01 07:55:07,524] [   DEBUG] - do_eval                       : True
[2024-11-01 07:55:07,525] [   DEBUG] - do_export                     : False
[2024-11-01 07:55:07,525] [   DEBUG] - do_predict                    : False
[2024-11-01 07:55:07,525] [   DEBUG] - do_train                      : True
[2024-11-01 07:55:07,525] [   DEBUG] - enable_auto_parallel          : False
[2024-11-01 07:55:07,525] [   DEBUG] - eval_accumulation_steps       : 16
[2024-11-01 07:55:07,525] [   DEBUG] - eval_batch_size               : 8
[2024-11-01 07:55:07,525] [   DEBUG] - eval_steps                    : None
[2024-11-01 07:55:07,525] [   DEBUG] - evaluation_strategy           : IntervalStrategy.EPOCH
[2024-11-01 07:55:07,525] [   DEBUG] - flatten_param_grads           : False
[2024-11-01 07:55:07,525] [   DEBUG] - force_reshard_pp              : False
[2024-11-01 07:55:07,525] [   DEBUG] - fp16                          : False
[2024-11-01 07:55:07,525] [   DEBUG] - fp16_full_eval                : False
[2024-11-01 07:55:07,525] [   DEBUG] - fp16_opt_level                : O2
[2024-11-01 07:55:07,525] [   DEBUG] - fuse_sequence_parallel_allreduce: False
[2024-11-01 07:55:07,525] [   DEBUG] - gradient_accumulation_steps   : 2
[2024-11-01 07:55:07,525] [   DEBUG] - greater_is_better             : True
[2024-11-01 07:55:07,526] [   DEBUG] - hybrid_parallel_topo_order    : pp_first
[2024-11-01 07:55:07,526] [   DEBUG] - ignore_data_skip              : False
[2024-11-01 07:55:07,526] [   DEBUG] - ignore_load_lr_and_optim      : False
[2024-11-01 07:55:07,526] [   DEBUG] - ignore_save_lr_and_optim      : False
[2024-11-01 07:55:07,526] [   DEBUG] - label_names                   : None
[2024-11-01 07:55:07,526] [   DEBUG] - lazy_data_processing          : True
[2024-11-01 07:55:07,526] [   DEBUG] - learning_rate                 : 3e-05
[2024-11-01 07:55:07,526] [   DEBUG] - load_best_model_at_end        : True
[2024-11-01 07:55:07,526] [   DEBUG] - load_sharded_model            : False
[2024-11-01 07:55:07,526] [   DEBUG] - local_process_index           : 0
[2024-11-01 07:55:07,526] [   DEBUG] - local_rank                    : 0
[2024-11-01 07:55:07,526] [   DEBUG] - log_level                     : -1
[2024-11-01 07:55:07,526] [   DEBUG] - log_level_replica             : -1
[2024-11-01 07:55:07,526] [   DEBUG] - log_on_each_node              : True
[2024-11-01 07:55:07,526] [   DEBUG] - logging_dir                   : /home/aistudio/llama/checkpoints/runs/Nov01_07-51-48_jupyter-942478-8345123
[2024-11-01 07:55:07,526] [   DEBUG] - logging_first_step            : False
[2024-11-01 07:55:07,527] [   DEBUG] - logging_steps                 : 1
[2024-11-01 07:55:07,527] [   DEBUG] - logging_strategy              : IntervalStrategy.STEPS
[2024-11-01 07:55:07,527] [   DEBUG] - logical_process_index         : 0
[2024-11-01 07:55:07,527] [   DEBUG] - lr_end                        : 1e-07
[2024-11-01 07:55:07,527] [   DEBUG] - lr_scheduler_type             : SchedulerType.LINEAR
[2024-11-01 07:55:07,527] [   DEBUG] - max_evaluate_steps            : -1
[2024-11-01 07:55:07,527] [   DEBUG] - max_grad_norm                 : 1.0
[2024-11-01 07:55:07,527] [   DEBUG] - max_steps                     : -1
[2024-11-01 07:55:07,527] [   DEBUG] - metric_for_best_model         : accuracy
[2024-11-01 07:55:07,527] [   DEBUG] - minimum_eval_times            : None
[2024-11-01 07:55:07,527] [   DEBUG] - no_cuda                       : False
[2024-11-01 07:55:07,527] [   DEBUG] - no_recompute_layers           : None
[2024-11-01 07:55:07,527] [   DEBUG] - num_cycles                    : 0.5
[2024-11-01 07:55:07,527] [   DEBUG] - num_train_epochs              : 1.0
[2024-11-01 07:55:07,527] [   DEBUG] - optim                         : OptimizerNames.ADAMW
[2024-11-01 07:55:07,527] [   DEBUG] - optimizer_name_suffix         : shard00
[2024-11-01 07:55:07,528] [   DEBUG] - output_dir                    : /home/aistudio/llama/checkpoints
[2024-11-01 07:55:07,528] [   DEBUG] - output_signal_dir             : /home/aistudio/llama/checkpoints
[2024-11-01 07:55:07,528] [   DEBUG] - overwrite_output_dir          : False
[2024-11-01 07:55:07,528] [   DEBUG] - past_index                    : -1
[2024-11-01 07:55:07,528] [   DEBUG] - per_device_eval_batch_size    : 8
[2024-11-01 07:55:07,528] [   DEBUG] - per_device_train_batch_size   : 1
[2024-11-01 07:55:07,528] [   DEBUG] - pipeline_parallel_config      : disable_p2p_cache_shape
[2024-11-01 07:55:07,528] [   DEBUG] - pipeline_parallel_degree      : 1
[2024-11-01 07:55:07,528] [   DEBUG] - pipeline_parallel_rank        : 0
[2024-11-01 07:55:07,528] [   DEBUG] - power                         : 1.0
[2024-11-01 07:55:07,528] [   DEBUG] - pp_recompute_interval         : 1
[2024-11-01 07:55:07,528] [   DEBUG] - prediction_loss_only          : False
[2024-11-01 07:55:07,528] [   DEBUG] - process_index                 : 0
[2024-11-01 07:55:07,528] [   DEBUG] - recompute                     : False
[2024-11-01 07:55:07,528] [   DEBUG] - recompute_granularity         : full
[2024-11-01 07:55:07,528] [   DEBUG] - recompute_use_reentrant       : False
[2024-11-01 07:55:07,529] [   DEBUG] - release_grads                 : False
[2024-11-01 07:55:07,529] [   DEBUG] - remove_unused_columns         : True
[2024-11-01 07:55:07,529] [   DEBUG] - report_to                     : ['visualdl']
[2024-11-01 07:55:07,529] [   DEBUG] - resume_from_checkpoint        : None
[2024-11-01 07:55:07,529] [   DEBUG] - run_name                      : /home/aistudio/llama/checkpoints
[2024-11-01 07:55:07,529] [   DEBUG] - save_on_each_node             : False
[2024-11-01 07:55:07,529] [   DEBUG] - save_sharded_model            : False
[2024-11-01 07:55:07,529] [   DEBUG] - save_steps                    : 500
[2024-11-01 07:55:07,529] [   DEBUG] - save_strategy                 : IntervalStrategy.EPOCH
[2024-11-01 07:55:07,529] [   DEBUG] - save_total_limit              : 1
[2024-11-01 07:55:07,529] [   DEBUG] - scale_loss                    : 32768
[2024-11-01 07:55:07,529] [   DEBUG] - seed                          : 42
[2024-11-01 07:55:07,529] [   DEBUG] - sep_parallel_degree           : 1
[2024-11-01 07:55:07,529] [   DEBUG] - sequence_parallel             : False
[2024-11-01 07:55:07,529] [   DEBUG] - sequence_parallel_config      : 
[2024-11-01 07:55:07,529] [   DEBUG] - sharding                      : [<ShardingOption.SHARD_GRAD_OP: 'stage2'>]
[2024-11-01 07:55:07,530] [   DEBUG] - sharding_comm_buffer_size_MB  : -1
[2024-11-01 07:55:07,530] [   DEBUG] - sharding_degree               : -1
[2024-11-01 07:55:07,530] [   DEBUG] - sharding_parallel_config      : 
[2024-11-01 07:55:07,530] [   DEBUG] - sharding_parallel_degree      : 2
[2024-11-01 07:55:07,530] [   DEBUG] - sharding_parallel_rank        : 0
[2024-11-01 07:55:07,530] [   DEBUG] - should_load_dataset           : True
[2024-11-01 07:55:07,530] [   DEBUG] - should_load_sharding_stage1_model: False
[2024-11-01 07:55:07,530] [   DEBUG] - should_log                    : True
[2024-11-01 07:55:07,530] [   DEBUG] - should_save                   : True
[2024-11-01 07:55:07,530] [   DEBUG] - should_save_model_state       : True
[2024-11-01 07:55:07,530] [   DEBUG] - should_save_sharding_stage1_model: False
[2024-11-01 07:55:07,530] [   DEBUG] - skip_data_intervals           : None
[2024-11-01 07:55:07,530] [   DEBUG] - skip_memory_metrics           : True
[2024-11-01 07:55:07,530] [   DEBUG] - skip_profile_timer            : True
[2024-11-01 07:55:07,530] [   DEBUG] - tensor_parallel_config        : 
[2024-11-01 07:55:07,530] [   DEBUG] - tensor_parallel_degree        : 1
[2024-11-01 07:55:07,530] [   DEBUG] - tensor_parallel_output        : False
[2024-11-01 07:55:07,531] [   DEBUG] - tensor_parallel_rank          : 0
[2024-11-01 07:55:07,531] [   DEBUG] - to_static                     : False
[2024-11-01 07:55:07,531] [   DEBUG] - train_batch_size              : 1
[2024-11-01 07:55:07,531] [   DEBUG] - unified_checkpoint            : True
[2024-11-01 07:55:07,531] [   DEBUG] - unified_checkpoint_config     : ['']
[2024-11-01 07:55:07,531] [   DEBUG] - use_async_save                : False
[2024-11-01 07:55:07,531] [   DEBUG] - use_expert_parallel           : False
[2024-11-01 07:55:07,531] [   DEBUG] - use_flash_attention           : False
[2024-11-01 07:55:07,531] [   DEBUG] - use_fused_dropout_add         : False
[2024-11-01 07:55:07,531] [   DEBUG] - use_fused_linear              : False
[2024-11-01 07:55:07,531] [   DEBUG] - use_fused_rms_norm            : False
[2024-11-01 07:55:07,531] [   DEBUG] - use_fused_rope                : False
[2024-11-01 07:55:07,531] [   DEBUG] - use_hybrid_parallel           : True
[2024-11-01 07:55:07,531] [   DEBUG] - virtual_pp_degree             : 1
[2024-11-01 07:55:07,531] [   DEBUG] - wandb_api_key                 : None
[2024-11-01 07:55:07,531] [   DEBUG] - warmup_ratio                  : 0.0
[2024-11-01 07:55:07,532] [   DEBUG] - warmup_steps                  : 30
[2024-11-01 07:55:07,532] [   DEBUG] - weight_decay                  : 0.0
[2024-11-01 07:55:07,532] [   DEBUG] - weight_name_suffix            : 
[2024-11-01 07:55:07,532] [   DEBUG] - world_size                    : 2
[2024-11-01 07:55:07,532] [   DEBUG] - 
[2024-11-01 07:55:07,533] [    INFO] - Starting training from resume_from_checkpoint : None
/home/aistudio/.local/lib/python3.8/site-packages/paddle/distributed/communication/group.py:128: UserWarning: Current global rank 0 is not in group _default_pg12
  warnings.warn(
WARNING:root:While using ClipGradByGlobalNorm in GroupShardedOptimizerStage2, the grad clip of original optimizer will be changed.
Traceback (most recent call last):
  File "run_finetune.py", line 735, in <module>
    main()
  File "run_finetune.py", line 575, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/aistudio/PaddleNLP/paddlenlp/trainer/trainer.py", line 798, in train
    model = self._wrap_model(self.model_wrapped)
  File "/home/aistudio/PaddleNLP/paddlenlp/trainer/trainer.py", line 2053, in _wrap_model
    model, optimizer, _ = group_sharded_parallel(
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/distributed/sharding/group_sharded.py", line 156, in group_sharded_parallel
    optimizer = GroupShardedOptimizerStage2(
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/distributed/fleet/meta_parallel/sharding/group_sharded_optimizer_stage2.py", line 240, in __init__
    self._update_opt_status()
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/distributed/fleet/meta_parallel/sharding/group_sharded_optimizer_stage2.py", line 343, in _update_opt_status
    self._integration_params()
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/distributed/fleet/meta_parallel/sharding/group_sharded_optimizer_stage2.py", line 463, in _integration_params
    self._generate_master_params(trainable_params)
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/distributed/fleet/meta_parallel/sharding/group_sharded_optimizer_stage2.py", line 336, in _generate_master_params
    master_tensor = paddle.cast(param, Type.fp32.value)
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/tensor/manipulation.py", line 237, in cast
    return _C_ops.cast(x, dtype)
MemoryError: 

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::pybind::eager_api_cast(_object*, _object*, _object*)
1   cast_ad_func(paddle::Tensor const&, phi::DataType)
2   paddle::experimental::cast(paddle::Tensor const&, phi::DataType)
3   void phi::CastKernel<phi::dtype::bfloat16, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DataType, phi::DenseTensor*)
4   void phi::CastCUDAKernelImpl<phi::dtype::bfloat16, float>(phi::GPUContext const&, phi::DenseTensor const&, phi::DataType, phi::DenseTensor*)
5   float* phi::DeviceContext::Alloc<float>(phi::TensorBase*, unsigned long, bool) const
6   phi::DeviceContext::Impl::Alloc(phi::TensorBase*, phi::Place const&, phi::DataType, unsigned long, bool, bool) const
7   phi::DenseTensor::AllocateFrom(phi::Allocator*, phi::DataType, unsigned long, bool)
8   paddle::memory::allocation::Allocator::Allocate(unsigned long)
9   paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
10  paddle::memory::allocation::Allocator::Allocate(unsigned long)
11  paddle::memory::allocation::Allocator::Allocate(unsigned long)
12  paddle::memory::allocation::Allocator::Allocate(unsigned long)
13  paddle::memory::allocation::Allocator::Allocate(unsigned long)
14  paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
15  std::string phi::enforce::GetCompleteTraceBackString<std::string >(std::string&&, char const*, int)
16  common::enforce::GetCurrentTraceBackString[abi:cxx11](bool)

----------------------
Error Message Summary:
----------------------
ResourceExhaustedError: 

Out of memory error on GPU 0. Cannot allocate 64.000000MB memory on GPU 0, 15.713623GB memory has been allocated and available memory is only 60.125000MB.

Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model. 
 (at ../paddle/phi/core/memory/allocation/cuda_allocator.cc:84)

LAUNCH INFO 2024-11-01 07:55:20,920 Pod failed
LAUNCH ERROR 2024-11-01 07:55:20,921 Container failed !!!
Container rank 0 status failed cmd ['/usr/bin/python', '-u', 'run_finetune.py', './config/llama/sft_argument.json'] code 1 log log/workerlog.0
LAUNCH INFO 2024-11-01 07:55:20,921 ------------------------- ERROR LOG DETAIL -------------------------
stributed/fleet/meta_parallel/sharding/group_sharded_optimizer_stage2.py", line 240, in __init__
    self._update_opt_status()
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/distributed/fleet/meta_parallel/sharding/group_sharded_optimizer_stage2.py", line 343, in _update_opt_status
    self._integration_params()
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/distributed/fleet/meta_parallel/sharding/group_sharded_optimizer_stage2.py", line 463, in _integration_params
    self._generate_master_params(trainable_params)
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/distributed/fleet/meta_parallel/sharding/group_sharded_optimizer_stage2.py", line 336, in _generate_master_params
    master_tensor = paddle.cast(param, Type.fp32.value)
  File "/home/aistudio/.local/lib/python3.8/site-packages/paddle/tensor/manipulation.py", line 237, in cast
    return _C_ops.cast(x, dtype)
MemoryError: 

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::pybind::eager_api_cast(_object*, _object*, _object*)
1   cast_ad_func(paddle::Tensor const&, phi::DataType)
2   paddle::experimental::cast(paddle::Tensor const&, phi::DataType)
3   void phi::CastKernel<phi::dtype::bfloat16, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DataType, phi::DenseTensor*)
4   void phi::CastCUDAKernelImpl<phi::dtype::bfloat16, float>(phi::GPUContext const&, phi::DenseTensor const&, phi::DataType, phi::DenseTensor*)
5   float* phi::DeviceContext::Alloc<float>(phi::TensorBase*, unsigned long, bool) const
6   phi::DeviceContext::Impl::Alloc(phi::TensorBase*, phi::Place const&, phi::DataType, unsigned long, bool, bool) const
7   phi::DenseTensor::AllocateFrom(phi::Allocator*, phi::DataType, unsigned long, bool)
8   paddle::memory::allocation::Allocator::Allocate(unsigned long)
9   paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
10  paddle::memory::allocation::Allocator::Allocate(unsigned long)
11  paddle::memory::allocation::Allocator::Allocate(unsigned long)
12  paddle::memory::allocation::Allocator::Allocate(unsigned long)
13  paddle::memory::allocation::Allocator::Allocate(unsigned long)
14  paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
15  std::string phi::enforce::GetCompleteTraceBackString<std::string >(std::string&&, char const*, int)
16  common::enforce::GetCurrentTraceBackString[abi:cxx11](bool)

----------------------
Error Message Summary:
----------------------
ResourceExhaustedError: 

Out of memory error on GPU 0. Cannot allocate 64.000000MB memory on GPU 0, 15.713623GB memory has been allocated and available memory is only 60.125000MB.

Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model. 
 (at ../paddle/phi/core/memory/allocation/cuda_allocator.cc:84)

LAUNCH INFO 2024-11-01 07:55:21,936 Exit code 1
aistudio@jupyter-942478-8345123:~/PaddleNLP/llm$ 

This is the log from paddlepaddle-gpu 3.0.0.dev20241030 ~ The largest AIStudio environment is dual-card 16G, so there is no way for me to keep testing 🫠

@HydrogenSulfate
Contributor

@megemini, thanks for the test conclusions. I'll pass them along to the relevant developers tomorrow.


paddle-ci-bot bot commented Nov 5, 2024

Sorry to inform you that 6544a48's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@HydrogenSulfate
Contributor

Quoting the conclusion of the test report above: the amsgrad build of Paddle compiled on AIStudio uses less GPU memory than the released development build paddlepaddle-gpu 3.0.0.dev20241030.

We tested two setups: plain stage2 training, and stage2 plus resumed training. Plain stage2 training shows no GPU-memory problem; the problem is that stage2 plus resumed training hits OOM. So, assuming the total number of training steps is S, could you try the following: partway through S, right after the first checkpoint is saved, kill the program, then rerun training and load that checkpoint to continue. At that point the GPU-memory usage should come out larger than without this PR. Also, nvidia-smi's numbers are not accurate; you can print the GPU memory after each step's optimizer update (after optimizer.step()) like this:

def print_memory_state(msg=''):
    """Print Paddle's GPU memory statistics (more reliable than nvidia-smi)."""

    import time
    import datetime

    timestamp = time.time()
    dt_object = datetime.datetime.fromtimestamp(timestamp)

    GB = 1024.0 * 1024.0 * 1024.0
    # Read the framework allocator counters instead of querying the driver.
    memory_allocated = paddle.device.cuda.memory_allocated() / GB
    memory_reserved = paddle.device.cuda.memory_reserved() / GB
    max_memory_allocated = paddle.device.cuda.max_memory_allocated() / GB
    max_memory_reserved = paddle.device.cuda.max_memory_reserved() / GB
    print(f'{dt_object}, {msg}, '
        f'memory_allocated: {memory_allocated:.02f}GB, '
        f'memory_reserved: {memory_reserved:.02f}GB, '
        f'max_memory_allocated: {max_memory_allocated:.02f}GB, '
        f'max_memory_reserved: {max_memory_reserved:.02f}GB')
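
A minimal sketch of where the call would sit in a training loop (model, train_dataloader and optimizer are placeholder names, not taken from the PR):

for step, batch in enumerate(train_dataloader):
    loss = model(*batch)
    loss.backward()
    optimizer.step()
    optimizer.clear_grad()
    # Log allocator statistics right after the optimizer update, so any
    # per-step growth (e.g. only after resuming from a checkpoint) shows up.
    print_memory_state(msg=f'step {step}')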

@megemini
Contributor Author

megemini commented Nov 5, 2024

The OOM appears with stage2 + resumed training

Then the problem is most likely still on the distributed side ~

This process involves:

  1. saving the parameters
  2. loading the parameters
  3. resuming training from the loaded parameters

In other words, loading and continuing to train the original model is fine, but:

  1. in distributed mode
  2. after the new optimizer state has been saved and reloaded

extra GPU memory gets used ~

I suspect Paddle/python/paddle/distributed/auto_parallel/api.py could be where things go wrong ~ Do the optional Tensors in the optimizer need dedicated handling? Or could you help ask the relevant devs about this?
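
Something along these lines is what I have in mind (purely a hypothetical sketch, not the actual sharding code; the accumulators dict and shard_and_place are made-up placeholders):

# Hypothetical sketch: optional accumulators (e.g. moment2_max, which only
# exists when amsgrad=True) would have to be skipped when absent, rather
# than assumed to always be present.
for acc_name in ['moment1', 'moment2', 'moment2_max', 'beta1_pow_acc', 'beta2_pow_acc']:
    acc = accumulators.get(acc_name)
    if acc is None:
        continue  # optional accumulator not created for this config
    shard_and_place(acc)  # placeholder for the real sharding/placement logic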

I'll also test the GPU memory usage in the distributed case ~

Thanks ~

@megemini
Contributor Author

megemini commented Nov 5, 2024

@HydrogenSulfate as mentioned before, the largest environment on AIStudio is dual-GPU 16G. I just tried again and it still doesn't work, it OOMs (the official Paddle build does too), even though batch size is already set to 1 in sft_argument.json ~

Is there any other way to reduce memory? Or should we switch to another model? Otherwise I can't make progress on my side ~

Also, I tested the GPU memory usage of distributed training on a small custom model. The conclusions are:

  1. Whether training from scratch or training after loading a checkpoint, memory usage matches the official Paddle build.
  2. When training after loading a checkpoint, memory usage grows, for both the amsgrad build and the official Paddle build, and by the same amount.

The test code is below (following the doc https://github.com/PaddlePaddle/community/blob/master/pfcc/paddle-code-reading/auto_parallel/paddle_distributed_primer.md#22412-%E5%8A%A8%E6%89%8B-group-sharded%E5%B9%B6%E8%A1%8C%E7%A4%BA%E4%BE%8B%E4%BB%A3%E7%A0%81 , with level="os_g"):

# -*- coding: UTF-8 -*-
# 2.2.4.1.2 Hands-on: group sharded parallel example code

import os
import numpy as np
import paddle
# Import the dependencies required for distributed training
from paddle.distributed import fleet, get_rank
from paddle.distributed.sharding import group_sharded_parallel
# Import the data loading and saving interfaces
from paddle.io import Dataset, DistributedBatchSampler, DataLoader

base_lr = 0.1   # learning rate
momentum_rate = 0.9 # momentum
l2_decay = 1e-4 # weight decay

epoch = 5  # number of training epochs
batch_num = 100 # number of batches per epoch
batch_size = 32 # training batch size
class_dim = 10

USE_CKPT = False

# Set up the data reader
class RandomDataset(Dataset):
    def __init__(self, num_samples):
        self.num_samples = num_samples

    def __getitem__(self, idx):
        image = np.random.random([256]).astype('float32')
        label = np.random.randint(0, class_dim - 1, (1, )).astype('int64')
        return image, label

    def __len__(self):
        return self.num_samples

# Set up the optimizer
def optimizer_setting(parameter_list=None):
    optimizer = paddle.optimizer.AdamW(
        learning_rate=base_lr,
        weight_decay=l2_decay,
        parameters=parameter_list)
    return optimizer

def print_memory_state(msg=''):
    """ print_memory_state """

    import time 
    import datetime
     
    timestamp = time.time()  
    dt_object = datetime.datetime.fromtimestamp(timestamp)  

    GB = 1024.0 * 1024.0 * 1024.0
    memory_allocated = paddle.device.cuda.memory_allocated() / GB 
    memory_reserved = paddle.device.cuda.memory_reserved() / GB 
    max_memory_allocated = paddle.device.cuda.max_memory_allocated() / GB 
    max_memory_reserved = paddle.device.cuda.max_memory_reserved() / GB 
    print(f'{dt_object}, {msg}, '
        f'memory_allocated: {memory_allocated:.02f}GB, '
        f'memory_reserved: {memory_reserved:.02f}GB, '
        f'max_memory_allocated: {max_memory_allocated:.02f}GB, '
        f'max_memory_reserved: {max_memory_reserved:.02f}GB')


# Model network
class SimpleNet(paddle.nn.Layer):
    def __init__(self, input_size, inner_size, output_size):
        super().__init__()
        self.linear1 = paddle.nn.Linear(input_size, inner_size)
        self.linear2 = paddle.nn.Linear(inner_size, input_size)
        self.linear3 = paddle.nn.Linear(input_size, output_size)
        self.relu = paddle.nn.ReLU()

    def forward(self, x):
        x = self.linear1(x)
        x = self.linear2(x)
        x = self.linear3(x)
        x = self.relu(x)
        return x


# Set up the training function
def train_model():
    # Initialize the Fleet environment
    fleet.init(is_collective=True)
    group = paddle.distributed.new_group([0, 1])

    model = SimpleNet(input_size=256, inner_size=102400, output_size=class_dim)
    optimizer = optimizer_setting(parameter_list=model.parameters())

    # wrap GroupSharded model, optimizer and scaler. level1='os', level2='os_g', level3='p_g_os'
    # model, optimizer, scaler = group_sharded_parallel(model, optimizer, level="p_g_os", group=group)
    model, optimizer, scaler = group_sharded_parallel(model, optimizer, level="os_g", group=group)
    
    dataset = RandomDataset(batch_num * batch_size)
    # Set up the distributed batch sampler for data-parallel training
    sampler = DistributedBatchSampler(dataset, rank=get_rank(),
                                      batch_size=batch_size,shuffle=False, drop_last=True)
    train_loader = DataLoader(dataset,
                            batch_sampler=sampler,
                            num_workers=1)

    if USE_CKPT:
        model_dict = paddle.load("checkpoints/model.pdparams")
        model.set_state_dict(model_dict)

    for eop in range(epoch):
        model.train()

        for batch_id, data in enumerate(train_loader()):
            img, label = data
            label.stop_gradient = True

            out = model(img)
            loss = paddle.nn.functional.cross_entropy(input=out, label=label)
            avg_loss = paddle.mean(x=loss)
            acc_top1 = paddle.metric.accuracy(input=out, label=label, k=1)
            acc_top5 = paddle.metric.accuracy(input=out, label=label, k=5)

            avg_loss.backward()
            optimizer.step()

            print_memory_state()


            model.clear_gradients()

            if batch_id % 5 == 0:
                print("[Epoch %d, batch %d] loss: %.5f, acc1: %.5f, acc5: %.5f" % (eop, batch_id, avg_loss, acc_top1, acc_top5))

    # Save the Layer parameters
    paddle.save(model.state_dict(), "checkpoints/model.pdparams")

# Launch training
if __name__ == '__main__':
    train_model()

    print('>>> USE_CKPT:', USE_CKPT)
    print('>>> commit:', paddle.version.commit)

Toggle USE_CKPT to control whether the checkpoint is loaded ~

The launch command is:

> python -m paddle.distributed.launch --gpus=0,1 --log_dir logs test_0.py

Output excerpt from the last epoch:

  1. Without loading a checkpoint

Official Paddle build

[Epoch 4, batch 45] loss: 2.30259, acc1: 0.15625, acc5: 0.59375
2024-11-05 13:14:10.537434, , memory_allocated: 0.52GB, memory_reserved: 0.68GB, max_memory_allocated: 0.67GB, max_memory_reserved: 0.68GB
2024-11-05 13:14:10.562654, , memory_allocated: 0.52GB, memory_reserved: 0.68GB, max_memory_allocated: 0.67GB, max_memory_reserved: 0.68GB
2024-11-05 13:14:10.587575, , memory_allocated: 0.52GB, memory_reserved: 0.68GB, max_memory_allocated: 0.67GB, max_memory_reserved: 0.68GB
2024-11-05 13:14:10.612507, , memory_allocated: 0.52GB, memory_reserved: 0.68GB, max_memory_allocated: 0.67GB, max_memory_reserved: 0.68GB
>>> USE_CKPT: False
>>> commit: 11d1f4835f5afce78c0e9882f144877b3c4a9aac

amsgrad build (self-compiled, without using the amsgrad argument)

[Epoch 4, batch 45] loss: 2.30259, acc1: 0.06250, acc5: 0.65625
2024-11-05 13:16:29.810763, , memory_allocated: 0.52GB, memory_reserved: 0.68GB, max_memory_allocated: 0.67GB, max_memory_reserved: 0.68GB
2024-11-05 13:16:29.835822, , memory_allocated: 0.52GB, memory_reserved: 0.68GB, max_memory_allocated: 0.67GB, max_memory_reserved: 0.68GB
2024-11-05 13:16:29.860752, , memory_allocated: 0.52GB, memory_reserved: 0.68GB, max_memory_allocated: 0.67GB, max_memory_reserved: 0.68GB
2024-11-05 13:16:29.885698, , memory_allocated: 0.52GB, memory_reserved: 0.68GB, max_memory_allocated: 0.67GB, max_memory_reserved: 0.68GB
>>> USE_CKPT: False
>>> commit: 6a8f1771145117d0e4b4f156f4a5b8deb0c834a7
  2. After loading a checkpoint

Official Paddle build

[Epoch 4, batch 45] loss: 2.30259, acc1: 0.15625, acc5: 0.62500
2024-11-05 13:14:39.615797, , memory_allocated: 0.72GB, memory_reserved: 0.88GB, max_memory_allocated: 0.87GB, max_memory_reserved: 0.88GB
2024-11-05 13:14:39.641230, , memory_allocated: 0.72GB, memory_reserved: 0.88GB, max_memory_allocated: 0.87GB, max_memory_reserved: 0.88GB
2024-11-05 13:14:39.666193, , memory_allocated: 0.72GB, memory_reserved: 0.88GB, max_memory_allocated: 0.87GB, max_memory_reserved: 0.88GB
2024-11-05 13:14:39.691418, , memory_allocated: 0.72GB, memory_reserved: 0.88GB, max_memory_allocated: 0.87GB, max_memory_reserved: 0.88GB
>>> USE_CKPT: True
>>> commit: 11d1f4835f5afce78c0e9882f144877b3c4a9aac

amsgrad build (self-compiled, without using the amsgrad argument)

[Epoch 4, batch 45] loss: 2.30259, acc1: 0.06250, acc5: 0.71875
2024-11-05 13:17:21.994855, , memory_allocated: 0.72GB, memory_reserved: 0.88GB, max_memory_allocated: 0.87GB, max_memory_reserved: 0.88GB
2024-11-05 13:17:22.019996, , memory_allocated: 0.72GB, memory_reserved: 0.88GB, max_memory_allocated: 0.87GB, max_memory_reserved: 0.88GB
2024-11-05 13:17:22.044928, , memory_allocated: 0.72GB, memory_reserved: 0.88GB, max_memory_allocated: 0.87GB, max_memory_reserved: 0.88GB
2024-11-05 13:17:22.069807, , memory_allocated: 0.72GB, memory_reserved: 0.88GB, max_memory_allocated: 0.87GB, max_memory_reserved: 0.88GB
>>> USE_CKPT: True
>>> commit: 6a8f1771145117d0e4b4f156f4a5b8deb0c834a7

As you can see, after loading the checkpoint the memory usage grows (am I calling something the wrong way here?), but the official Paddle build and the amsgrad build use exactly the same amount ~
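
One guess on my side (not verified): after set_state_dict the loaded copy may simply stay referenced, so explicitly releasing it could be worth a try:

# Hedged sketch: drop the loaded copy once it has been assigned to the model.
model_dict = paddle.load('checkpoints/model.pdparams')
model.set_state_dict(model_dict)
del model_dict                    # release the extra reference
paddle.device.cuda.empty_cache()  # return cached blocks to the allocator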

Possible follow-up tests:

  • Modify the test code above so it is closer to how PaddleNLP uses it
  • Try a different model instead of llama

Please help take a look ~ Thanks!!!

@HydrogenSulfate
Contributor

@megemini OK, tomorrow I'll test llama myself on our machines. The problem is indeed quite odd

@HydrogenSulfate
Contributor

@megemini The memory growth after loading the ckpt may be because paddle.load does not support map_location='cpu', so the loaded parameters end up on the GPU
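
If that is the cause, loading the checkpoint into host memory first might avoid it, e.g. (a sketch; return_numpy=True makes paddle.load return numpy arrays on the CPU, so the data only reaches the GPU when set_state_dict copies it into the existing parameters):

# Sketch: keep the checkpoint on the CPU until set_state_dict copies it in.
model_dict = paddle.load('checkpoints/model.pdparams', return_numpy=True)
model.set_state_dict(model_dict)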

@megemini
Contributor Author

megemini commented Nov 5, 2024

@megemini OK, tomorrow I'll test llama myself on our machines. The problem is indeed quite odd

Thanks!!!

The distributed logic here is fairly complex. My main concern is that optional optimizer parameters were never involved before, so there may be places that need special handling ... ...

@megemini megemini dismissed stale reviews from phlrain, heavengate, and zyfncg via af27337 November 8, 2024 05:39