Stage 2 fine-tuning error #16

Open
syf-fgnb opened this issue Mar 26, 2024 · 2 comments

@syf-fgnb

Following the documentation, I set up the second fine-tuning stage and ran sh scripts_asmv2/stage2-finetune.sh directly, which failed with the error below:

ValueError: Looks like distributed multinode run but MASTER_ADDR env not set, please try exporting rank 0's hostname as MASTER_ADDR

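I then changed the command to torchrun --master_port=xxxxx, but that failed with a CUDA out of memory error, even though I had already set the batch size to 1. The environment is A100 + DeepSpeed ZeRO-2. What is going wrong here?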

@Weiyun1025
Collaborator

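Hi, our launch scripts are started with the srun command of the Slurm system. At runtime the code checks Slurm's environment variables and uses them to set MASTER_ADDR and related environment variables automatically. If you launch under a different system, you need to rewrite the script so that it launches with torchrun.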
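
To illustrate the kind of mapping described above, here is a minimal sketch of deriving the variables that torch.distributed's env:// initialization reads from Slurm's per-task variables. It is not the project's actual code: the function names `first_slurm_host` and `torch_env_from_slurm` and the default port 29500 are assumptions made for this example, and it assumes `scontrol` is available on the node.

```python
import subprocess


def first_slurm_host(nodelist: str) -> str:
    """Resolve the first hostname in a Slurm node list via `scontrol`."""
    out = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.split()[0]


def torch_env_from_slurm(env, resolve_host=first_slurm_host, port="29500"):
    """Map Slurm task variables to RANK, WORLD_SIZE, LOCAL_RANK, MASTER_*.

    Returns an empty dict when the process was not started by srun,
    which is the situation that leads to the MASTER_ADDR error.
    """
    if "SLURM_PROCID" not in env:
        return {}
    return {
        "RANK": env["SLURM_PROCID"],          # global task index
        "WORLD_SIZE": env["SLURM_NTASKS"],    # total number of tasks
        "LOCAL_RANK": env["SLURM_LOCALID"],   # task index on this node
        # Keep an explicitly exported MASTER_ADDR; otherwise use the first node.
        "MASTER_ADDR": env.get("MASTER_ADDR") or resolve_host(env["SLURM_NODELIST"]),
        "MASTER_PORT": env.get("MASTER_PORT", port),
    }
```

Calling `os.environ.update(torch_env_from_slurm(os.environ))` before initializing the process group would apply the mapping. Outside Slurm it changes nothing, so the variables must come from the launcher (for example torchrun) instead.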

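The CUDA out of memory error is most likely because the multi-GPU job did not actually start. Please check whether you have set options such as --nnodes=xxx and --nproc-per-node=xxx.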
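
One way to check whether the multi-GPU job really started is to inspect the variables torchrun exports to each worker (RANK, LOCAL_RANK, WORLD_SIZE). The sketch below is a hypothetical helper, not part of the repository; `launch_summary` is a name invented for this example. A world size of 1 would be consistent with the out-of-memory symptom, because DeepSpeed ZeRO-2 partitions optimizer states and gradients across data-parallel ranks, and with a single rank nothing is partitioned.

```python
def launch_summary(env):
    """Summarize the distributed launch from launcher-provided variables.

    torchrun sets WORLD_SIZE, RANK and LOCAL_RANK for every worker; when
    they are missing, this process is treated as a single-process run.
    """
    world_size = int(env.get("WORLD_SIZE", "1"))
    return {
        "world_size": world_size,
        "rank": int(env.get("RANK", "0")),
        "local_rank": int(env.get("LOCAL_RANK", "0")),
        "distributed": world_size > 1,
    }
```

Printing `launch_summary(os.environ)` at the top of the training entry point shows what each worker sees. On a single node with 8 GPUs, a launch of the form `torchrun --nnodes=1 --nproc-per-node=8 --master_port=xxxxx <training script and its arguments>` should report a world size of 8. The node and GPU counts here are illustrative assumptions, not values from the project.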

@cullinan1998

Hi, did you manage to solve this?
