`idist.initialize` fails in Slurm when using `--ntasks-per-gpu` #3259
Comments
Also, would it be possible not to warn when only … (see `ignite/distributed/comp_models/native.py`, lines 607 to 614 in `34a707e`)?
Thanks for reporting the issue @nowtryz ! Let me see what can be done here.
Is there no env var corresponding to this argument? By the way, why is it necessary to set it, and what's the typical value, 1?
Yes, this is unfortunate. IIRC, we rely on …
Hi,
From what I see, there is only one input environment variable and no output one. Yes.
I would simply use …
Following https://slurm.schedmd.com/sbatch.html, @nowtryz can you provide the full traceback to get the exact error message? If I understand the problem correctly, each process sees a single GPU rather than all GPUs, so …
As for the fix, IMO there are two things to be done here:
```python
if torch.cuda.is_available():
    # Fall back to device 0 when the local rank exceeds the number of visible
    # devices (e.g. when --ntasks-per-gpu exposes a single GPU per task).
    lrank = self._local_rank if self._local_rank < torch.cuda.device_count() else 0
    torch.cuda.set_device(lrank)
```
@nowtryz would you like to help solve this and check if this fix works?
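To make the failure mode concrete: under `--ntasks-per-gpu`, Slurm typically exposes a single GPU per task while `SLURM_LOCALID` still enumerates the tasks on the node, so any task with a non-zero local id trips over `torch.cuda.set_device`. A minimal diagnostic sketch (my reading of the situation, not part of Ignite; it only assumes the standard `SLURM_LOCALID` / `CUDA_VISIBLE_DEVICES` variables):

```python
import os

import torch

# Under `srun --ntasks-per-gpu=1 ...`, each task typically sees one GPU only,
# while SLURM_LOCALID still enumerates the tasks per node (0, 1, 2, ...).
local_rank = int(os.environ.get("SLURM_LOCALID", "0"))
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>")
n_devices = torch.cuda.device_count()
print(f"SLURM_LOCALID={local_rank} CUDA_VISIBLE_DEVICES={visible} device_count={n_devices}")

if torch.cuda.is_available():
    # The current behaviour fails whenever local_rank >= n_devices
    # (e.g. torch.cuda.set_device(3) with a single visible device).
    # The clamp proposed above falls back to device 0 in that case.
    device_id = local_rank if local_rank < n_devices else 0
    torch.cuda.set_device(device_id)
```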
Hi @vfdev-5,
Sure, I will get the traceback ASAP
It can be tricky to verify and get a correct DDP configuration when we mix Slurm env vars with PyTorch DDP env vars (e.g. …).
What do you mean exactly here, which environment? Here is where we translate Slurm vars into PyTorch env vars: `ignite/distributed/comp_models/native.py`, lines 554 to 639 in `aa3e3e1`.
Maybe we could relax this part: `ignite/distributed/comp_models/native.py`, lines 565 to 571 in `aa3e3e1`, and use `MASTER_PORT`, `MASTER_ADDR`, `RANK`, `LOCAL_RANK`, `WORLD_SIZE` from the env if the user has provided them. The problem here could be verifying that there is no inconsistency between the Slurm env vars and the user-provided PyTorch env vars...
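As a rough sketch of the consistency check mentioned above (only an illustration of the idea, not Ignite's actual code; it compares the standard Slurm variables with the usual PyTorch DDP ones):

```python
import os

def check_slurm_vs_ddp_env() -> None:
    """Illustrative check: if the user already exported PyTorch DDP env vars,
    verify they do not contradict what Slurm reports, instead of refusing them."""
    # (Slurm variable, corresponding PyTorch DDP variable)
    pairs = [
        ("SLURM_PROCID", "RANK"),
        ("SLURM_LOCALID", "LOCAL_RANK"),
        ("SLURM_NTASKS", "WORLD_SIZE"),
    ]
    for slurm_var, ddp_var in pairs:
        slurm_val = os.environ.get(slurm_var)
        ddp_val = os.environ.get(ddp_var)
        if slurm_val is not None and ddp_val is not None and slurm_val != ddp_val:
            raise RuntimeError(
                f"Inconsistent configuration: {ddp_var}={ddp_val} but {slurm_var}={slurm_val}"
            )
    # MASTER_ADDR / MASTER_PORT have no single Slurm counterpart,
    # so user-provided values could simply be accepted as-is.
```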
FYI, some time ago @sdesrozis wrote these notes on how to use Ignite with Slurm: https://github.com/sdesrozis/why-ignite/tree/main/basics/2_slurm
🐛 Bug description
When launching a Slurm step with multiple tasks and assigning GPUs with the `--ntasks-per-gpu` flag instead of `--ntasks-per-node` (as it seems was intended), Ignite uses the `SLURM_LOCALID` environment variable as the local rank and uses it as the device id, even though `--ntasks-per-gpu` already binds each MPI process to a GPU. This causes the call `torch.cuda.set_device(self._local_rank)` to fail.
To reproduce:
```bash
srun --ntasks-per-gpu=1 --nodes=2 --gpus-per-node=4 python -c "import ignite.distributed as idist; idist.initialize(backend='nccl')"
```
Which produces the following output:
Intended behaviour:
Either
--ntasks-per-gpu
flag, which does not seem to be possibleidist.set_local_rank()
, which is never considered whenSLURM_JOB_ID
is detectedEnvironment
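Purely as an illustration of the second option above (not Ignite's actual implementation; the helper name and the resolution order are assumptions), honouring an explicitly set local rank before falling back to `SLURM_LOCALID` could look like:

```python
import os
from typing import Optional

def resolve_local_rank(user_local_rank: Optional[int] = None) -> int:
    """Hypothetical resolution order: prefer an explicitly provided local rank
    (e.g. one set via idist.set_local_rank()), then LOCAL_RANK, and only then
    fall back to Slurm's per-node task id."""
    if user_local_rank is not None:
        return user_local_rank
    if "LOCAL_RANK" in os.environ:
        return int(os.environ["LOCAL_RANK"])
    # Under --ntasks-per-gpu this value can exceed the number of visible
    # devices, which is exactly the failure described in this issue.
    return int(os.environ.get("SLURM_LOCALID", "0"))
```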
Environment
- How you installed Ignite (`conda`, `pip`, source): `pip`