Record job and DB ID in slurm job #40

Open
janosh opened this issue Jun 16, 2024 · 2 comments
Comments

@janosh (Collaborator) commented Jun 16, 2024

sbatch has a handy --comment flag that allows adding metadata to jobs:

sbatch -h | grep comment
      --comment=name          arbitrary comment

I'd like to use this to connect my slurm jobs back to jobs in the database, either via their jobflow UUID or their database ID. I see there's no explicit support for comment in SlurmIO:

class SlurmIO(BaseSchedulerIO):
    header_template: str = """
#SBATCH --partition=$${partition}
#SBATCH --job-name=$${job_name}
#SBATCH --nodes=$${nodes}
#SBATCH --ntasks=$${ntasks}
#SBATCH --ntasks-per-node=$${ntasks_per_node}
#SBATCH --cpus-per-task=$${cpus_per_task}
#SBATCH --mem=$${mem}
#SBATCH --mem-per-cpu=$${mem_per_cpu}
#SBATCH --hint=$${hint}
#SBATCH --time=$${time}
#SBATCH --exclude=$${exclude_nodes}
#SBATCH --account=$${account}
#SBATCH --mail-user=$${mail_user}
#SBATCH --mail-type=$${mail_type}
#SBATCH --constraint=$${constraint}
#SBATCH --gres=$${gres}
#SBATCH --requeue=$${requeue}
#SBATCH --nodelist=$${nodelist}
#SBATCH --propagate=$${propagate}
#SBATCH --licenses=$${licenses}
#SBATCH --output=$${qout_path}
#SBATCH --error=$${qerr_path}
#SBATCH --qos=$${qos}
#SBATCH --priority=$${priority}
#SBATCH --array=$${array}
#SBATCH --exclusive=$${exclusive}
$${qverbatim}"""

Maybe qverbatim is meant as an escape hatch for situations like this? I didn't find any docs on it, and I'm also not sure how to set qverbatim. This raises:

QResources(
    processes=16, job_name="name", qverbatim="test"
).as_dict()
>>> TypeError: QResources.__init__() got an unexpected keyword argument 'qverbatim'

Maybe like this? I haven't tried it yet, but either way it would be good to document how to pass qverbatim:

QResources(
    processes=16, job_name="name", scheduler_kwargs={"qverbatim": "test"}
).as_dict()
@davidwaroquiers (Member)

Hi @janosh

Thanks for the question. Regarding qverbatim, indeed, it should be used as in your second example:

qr = QResources(
    processes=16,
    job_name="name",
    scheduler_kwargs={"qverbatim": '#SBATCH --comment="this is my comment"'}
)
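
For instance, to record a database ID in the comment (a minimal sketch; db_id is a hypothetical variable holding whatever identifier you want to attach):

# db_id is whatever identifier you want to record in the slurm comment
db_id = 12345
qr = QResources(
    processes=16,
    job_name="name",
    scheduler_kwargs={"qverbatim": f'#SBATCH --comment="db_id={db_id}"'},
)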

QResources is indeed meant to be used as a common object for specifying resources. The difficulty is that not all DRMs provide the same functionality. This is why there is scheduler_kwargs, which allows you to pass anything you want yourself (through qverbatim in slurm).

Now, if comment were a common option across DRMs (currently qtoolkit only supports PBS and slurm), we could add it to QResources. I'm not sure PBS has an equivalent.

Now, maybe there is a different question to be raised about why you want the DB ID in the comment, and maybe there is a different way to do that (perhaps in jobflow-remote)? Open to discussing it if needed.

Pinging @gpetretto in case I said something wrong here :-)

@janosh (Collaborator, Author) commented Jun 24, 2024

Thanks @davidwaroquiers, that's very helpful!

I wanted the DB ID in the comment to be able to map from the output of squeue, which only shows slurm IDs, back to the corresponding DB entries. I asked this question before realizing that scontrol show job <slurm-id> shows the run_dir, which in turn contains the job's UUID, which I can use to get the corresponding DB entry. So like you said, there is a solution that doesn't require SBATCH --comment.
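
A minimal sketch of that lookup (assuming the run_dir shows up in the WorkDir= field of scontrol's output and that the path contains the job UUID; both are assumptions about my setup rather than guaranteed behavior):

import re
import subprocess


def run_dir_from_slurm_id(slurm_id: str) -> str | None:
    """Read the working directory slurm recorded for a job."""
    out = subprocess.run(
        ["scontrol", "show", "job", slurm_id],
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(r"WorkDir=(\S+)", out)
    return match.group(1) if match else None


def uuid_from_run_dir(run_dir: str) -> str | None:
    """Pull a UUID-shaped token out of the run_dir path."""
    match = re.search(
        r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", run_dir
    )
    return match.group(0) if match else None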

I'll still be using --comment to record things like formula, n_sites and volume as job metadata, since it makes debugging jobs that take an inordinate amount of time easier. Currently, mapping from the slurm ID to the run_dir or DB entry to see whether it's just a large system or whether the calculation uses bad settings and stalled for some reason is a slow process that I have to repeat for dozens of calcs.
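
A rough sketch of what I have in mind (the metadata values here are made up for illustration):

# hypothetical per-structure metadata to make squeue output more informative
metadata = {"formula": "Fe2O3", "n_sites": 30, "volume": 301.2}
comment = " ".join(f"{key}={val}" for key, val in metadata.items())

qr = QResources(
    processes=16,
    job_name="name",
    scheduler_kwargs={"qverbatim": f'#SBATCH --comment="{comment}"'},
)
# squeue -o "%i %j %k" then shows job ID, job name and the comment side by side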
