Stage 2 fine-tuning error #16

Open

syf-fgnb opened this issue Mar 26, 2024 · 2 comments

syf-fgnb commented Mar 26, 2024

Following the documentation, I was preparing to run the second fine-tuning stage. Running sh scripts_asmv2/stage2-finetune.sh directly produced the following error:

ValueError: Looks like distributed multinode run but MASTER_ADDR env not set, please try exporting rank 0's hostname as MASTER_ADDR

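I then changed the command to torchrun --master_port=xxxxx, but that produced a CUDA out of memory error (even though I had already set the batch size to 1). My environment is A100 + DeepSpeed ZeRO-2. What is going on here?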

Collaborator

Weiyun1025 commented May 8, 2024

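Hello, our launch scripts are built around the srun command of the SLURM system. At runtime the code checks SLURM's environment variables and sets MASTER_ADDR and the related environment variables automatically. If you launch with a different system, you need to rewrite the script so that it launches with torchrun instead.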
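To make the SLURM-based setup described above concrete, here is a minimal sketch of how a training entry point could map SLURM variables onto the ones torch.distributed reads. This is an illustration only, not this repository's actual code: the function names `first_hostname` and `slurm_to_torch_env` and the default port 29500 are assumptions made for the example.

```python
import os
import subprocess


def first_hostname(nodelist):
    """Resolve the first host of a SLURM node list with `scontrol show hostnames`."""
    out = subprocess.check_output(
        ["scontrol", "show", "hostnames", nodelist], text=True
    )
    return out.split()[0]


def slurm_to_torch_env(env, resolve=first_hostname, port="29500"):
    """Hypothetical helper: derive torch.distributed variables from SLURM ones."""
    if "SLURM_PROCID" not in env:
        # Not started by srun, so there is nothing to derive. This is the
        # situation in which MASTER_ADDR stays unset.
        return {}
    return {
        "MASTER_ADDR": resolve(env["SLURM_NODELIST"]),
        "MASTER_PORT": env.get("MASTER_PORT", port),  # assumed default port
        "RANK": str(int(env["SLURM_PROCID"])),
        "LOCAL_RANK": str(int(env.get("SLURM_LOCALID", 0))),
        "WORLD_SIZE": str(int(env["SLURM_NTASKS"])),
    }


if __name__ == "__main__":
    os.environ.update(slurm_to_torch_env(os.environ))
```

Outside SLURM the helper returns an empty mapping, which is consistent with the MASTER_ADDR error reported above: none of the SLURM variables exist, so nothing gets exported. torchrun sets these variables itself.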

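The CUDA out of memory error is most likely because the multi-GPU job did not actually start. Please check whether you set --nnodes=xxx, --nproc-per-node=xxx, and the other related options.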
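One way to check whether the multi-GPU launch actually took effect is to print the variables torchrun exports to each worker (RANK, LOCAL_RANK, WORLD_SIZE) at the top of the training script. The sketch below is a hypothetical diagnostic, not part of this repository; `describe_launch` is a name chosen for the example.

```python
import os


def describe_launch(env):
    """Summarise the launch from the variables torchrun exports to each worker."""
    world_size = int(env.get("WORLD_SIZE", "1"))
    rank = int(env.get("RANK", "0"))
    local_rank = int(env.get("LOCAL_RANK", "0"))
    if world_size <= 1:
        # With a single process, ZeRO-2 has no other ranks across which to
        # partition optimizer states and gradients.
        return "single process: ZeRO-2 cannot partition optimizer states or gradients"
    return f"rank {rank}/{world_size}, local rank {local_rank}"


if __name__ == "__main__":
    print(describe_launch(os.environ))
```

If every worker reports a single process, the job is running on one GPU regardless of batch size, which would match the out-of-memory symptom described in this thread.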

cullinan1998 commented Jan 9, 2025

Hello, did you manage to solve this?
