I followed the documentation to prepare the second finetuning stage and ran `sh scripts_asmv2/stage2-finetune.sh` directly, which raised the following error:
ValueError: Looks like distributed multinode run but MASTER_ADDR env not set, please try exporting rank 0's hostname as MASTER_ADDR
I then changed the command to `torchrun --master_port=xxxxx`, but it failed with a CUDA out of memory error (even after setting the batch size to 1). The environment is A100 + DeepSpeed ZeRO-2. What is going wrong here?
Hi, our launch scripts are started with the `srun` command on a Slurm cluster. At runtime the code reads Slurm environment variables to automatically set `MASTER_ADDR` and related variables. If you are launching on a different system, you need to rewrite the script to launch with `torchrun` instead.
The CUDA out of memory error is most likely because the multi-GPU job was not launched successfully; please check that you have set `--nnodes=xxx`, `--nproc-per-node=xxx`, and the other launch arguments.
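For reference, a single-node `torchrun` launch along the lines suggested above might look like the following sketch. The GPU count, port, and training entry point here are assumptions for illustration, not the repo's actual values; substitute the script and arguments invoked inside `stage2-finetune.sh`:

```shell
# Hypothetical single-node launch (assumed values, not the repo's script).
export MASTER_ADDR=localhost   # rank 0's hostname; localhost on a single node
export MASTER_PORT=29500       # any free port

torchrun \
  --nnodes=1 \
  --nproc-per-node=8 \         # set to the number of GPUs on the node
  --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  path/to/train.py \           # replace with the entry point from stage2-finetune.sh
  --deepspeed path/to/zero2.json   # keep the original script's remaining arguments
```

With `--nproc-per-node` set correctly, DeepSpeed ZeRO-2 can shard optimizer states and gradients across all GPUs; launching on a single process keeps everything on one GPU, which matches the out-of-memory symptom described above.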
Hi, did you manage to solve this?