`idist.initialize` fails in Slurm when using `--ntasks-per-gpu` #3259
Comments
Also, would it be possible not to warn when only … (see `ignite/distributed/comp_models/native.py`, lines 607 to 614 in `34a707e`)?
Thanks for reporting the issue @nowtryz ! Let me see what can be done here.
Is there no env var corresponding to this argument? By the way, why is it necessary to set it, and what's the typical value, 1?
Yes, this is unfortunate. IIRC, we rely on …
Hi,
From what I see, there is only one input environment variable and no output one. Yes.
I would simply use …
Following https://slurm.schedmd.com/sbatch.html, @nowtryz can you provide the full traceback to get the exact error message? If I understand the problem correctly, each process sees a single GPU rather than all GPUs, so …
As for the fix, IMO there are two things to be done here:
```python
if torch.cuda.is_available():
    # Fall back to device 0 when the local rank exceeds the number of visible
    # devices (e.g. when --ntasks-per-gpu exposes a single GPU per task).
    lrank = self._local_rank if self._local_rank < torch.cuda.device_count() else 0
    torch.cuda.set_device(lrank)
```
@nowtryz would you like to help solve this and check if this fix works?
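To make the failure mode concrete: under `--ntasks-per-gpu`, Slurm typically exposes a single GPU per task while `SLURM_LOCALID` still enumerates the tasks on the node, so any task with a non-zero local id trips over `torch.cuda.set_device`. A minimal diagnostic sketch (my reading of the situation, not part of Ignite; it only assumes the standard `SLURM_LOCALID` / `CUDA_VISIBLE_DEVICES` variables):

```python
import os

import torch

# Under `srun --ntasks-per-gpu=1 ...`, each task typically sees one GPU only,
# while SLURM_LOCALID still enumerates the tasks per node (0, 1, 2, ...).
local_rank = int(os.environ.get("SLURM_LOCALID", "0"))
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>")
n_devices = torch.cuda.device_count()
print(f"SLURM_LOCALID={local_rank} CUDA_VISIBLE_DEVICES={visible} device_count={n_devices}")

if torch.cuda.is_available():
    # The current behaviour fails whenever local_rank >= n_devices
    # (e.g. torch.cuda.set_device(3) with a single visible device).
    # The clamp proposed above falls back to device 0 in that case.
    device_id = local_rank if local_rank < n_devices else 0
    torch.cuda.set_device(device_id)
```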
Hi @vfdev-5,
Sure, I will get the traceback ASAP
It can be tricky to verify and get a correct DDP configuration when we mix Slurm env vars with PyTorch DDP env vars (e.g. …).
What do you mean exactly here, which environment? Here is where we translate Slurm vars into PyTorch env vars: `ignite/distributed/comp_models/native.py`, lines 554 to 639 in `aa3e3e1`.
Maybe we could relax this part: `ignite/distributed/comp_models/native.py`, lines 565 to 571 in `aa3e3e1`, and use `MASTER_PORT`, `MASTER_ADDR`, `RANK`, `LOCAL_RANK`, `WORLD_SIZE` from the env if the user has provided them. The problem here could be verifying that there is no inconsistency between the Slurm env vars and the user-provided PyTorch env vars...
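As a rough sketch of the consistency check mentioned above (only an illustration of the idea, not Ignite's actual code; it compares the standard Slurm variables with the usual PyTorch DDP ones):

```python
import os

def check_slurm_vs_ddp_env() -> None:
    """Illustrative check: if the user already exported PyTorch DDP env vars,
    verify they do not contradict what Slurm reports, instead of refusing them."""
    # (Slurm variable, corresponding PyTorch DDP variable)
    pairs = [
        ("SLURM_PROCID", "RANK"),
        ("SLURM_LOCALID", "LOCAL_RANK"),
        ("SLURM_NTASKS", "WORLD_SIZE"),
    ]
    for slurm_var, ddp_var in pairs:
        slurm_val = os.environ.get(slurm_var)
        ddp_val = os.environ.get(ddp_var)
        if slurm_val is not None and ddp_val is not None and slurm_val != ddp_val:
            raise RuntimeError(
                f"Inconsistent configuration: {ddp_var}={ddp_val} but {slurm_var}={slurm_val}"
            )
    # MASTER_ADDR / MASTER_PORT have no single Slurm counterpart,
    # so user-provided values could simply be accepted as-is.
```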
FYI, some time ago @sdesrozis wrote these notes on how to use Ignite with Slurm: https://github.com/sdesrozis/why-ignite/tree/main/basics/2_slurm
🐛 Bug description
When launching a Slurm step with multiple tasks and assigning GPUs with the `--ntasks-per-gpu` flag instead of `--ntasks-per-node` (as it seems was intended), Ignite uses the `SLURM_LOCALID` environment variable as the local rank and uses it as the device id, even though `--ntasks-per-gpu` already binds each MPI process to a GPU. This causes the call `torch.cuda.set_device(self._local_rank)` to fail.
To reproduce:
```bash
srun --ntasks-per-gpu=1 --nodes=2 --gpus-per-node=4 python -c "import ignite.distributed as idist; idist.initialize(backend='nccl')"
```
Which produces the following output:
Intended behaviour:
Either
--ntasks-per-gpu
flag, which does not seem to be possibleidist.set_local_rank()
, which is never considered whenSLURM_JOB_ID
is detectedEnvironment
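Purely as an illustration of the second option above (not Ignite's actual implementation; the helper name and the resolution order are assumptions), honouring an explicitly set local rank before falling back to `SLURM_LOCALID` could look like:

```python
import os
from typing import Optional

def resolve_local_rank(user_local_rank: Optional[int] = None) -> int:
    """Hypothetical resolution order: prefer an explicitly provided local rank
    (e.g. one set via idist.set_local_rank()), then LOCAL_RANK, and only then
    fall back to Slurm's per-node task id."""
    if user_local_rank is not None:
        return user_local_rank
    if "LOCAL_RANK" in os.environ:
        return int(os.environ["LOCAL_RANK"])
    # Under --ntasks-per-gpu this value can exceed the number of visible
    # devices, which is exactly the failure described in this issue.
    return int(os.environ.get("SLURM_LOCALID", "0"))
```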
Environment
- How you installed Ignite (`conda`, `pip`, source): `pip`