Record job and DB ID in slurm job #40

Open
janosh opened this issue Jun 16, 2024 · 2 comments
Comments

@janosh (Collaborator) commented Jun 16, 2024

sbatch has a handy --comment flag that allows adding metadata to jobs:

sbatch -h | grep comment
      --comment=name          arbitrary comment

I'd like to use this to connect my slurm jobs back to jobs in the database, either via their jobflow UUID or their database ID. I see there's no explicit support for comment in SlurmIO:

class SlurmIO(BaseSchedulerIO):
    header_template: str = """
#SBATCH --partition=$${partition}
#SBATCH --job-name=$${job_name}
#SBATCH --nodes=$${nodes}
#SBATCH --ntasks=$${ntasks}
#SBATCH --ntasks-per-node=$${ntasks_per_node}
#SBATCH --cpus-per-task=$${cpus_per_task}
#SBATCH --mem=$${mem}
#SBATCH --mem-per-cpu=$${mem_per_cpu}
#SBATCH --hint=$${hint}
#SBATCH --time=$${time}
#SBATCH --exclude=$${exclude_nodes}
#SBATCH --account=$${account}
#SBATCH --mail-user=$${mail_user}
#SBATCH --mail-type=$${mail_type}
#SBATCH --constraint=$${constraint}
#SBATCH --gres=$${gres}
#SBATCH --requeue=$${requeue}
#SBATCH --nodelist=$${nodelist}
#SBATCH --propagate=$${propagate}
#SBATCH --licenses=$${licenses}
#SBATCH --output=$${qout_path}
#SBATCH --error=$${qerr_path}
#SBATCH --qos=$${qos}
#SBATCH --priority=$${priority}
#SBATCH --array=$${array}
#SBATCH --exclusive=$${exclusive}
$${qverbatim}"""

Maybe qverbatim is meant as an escape hatch for situations like this? I didn't find any docs on it, and I'm also not sure how to set qverbatim. This raises:

QResources(
    processes=16, job_name="name", qverbatim="test"
).as_dict()
>>> TypeError: QResources.__init__() got an unexpected keyword argument 'qverbatim'

Maybe like this? I haven't tried it yet, but either way it would be good to document how to pass qverbatim:

QResources(
    processes=16, job_name="name", scheduler_kwargs={"qverbatim": "test"}
).as_dict()
@davidwaroquiers (Member)

Hi @janosh

Thanks for the question. Regarding qverbatim, indeed, it should be used as in your second example:

qr = QResources(
    processes=16,
    job_name="name",
    scheduler_kwargs={"qverbatim": '#SBATCH --comment="this is my comment"'}
)
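
For instance, to record a database ID in the comment (a minimal sketch; db_id is a hypothetical variable holding whatever identifier you want to attach):

# db_id is whatever identifier you want to record in the slurm comment
db_id = 12345
qr = QResources(
    processes=16,
    job_name="name",
    scheduler_kwargs={"qverbatim": f'#SBATCH --comment="db_id={db_id}"'},
)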

QResources is indeed meant to be used as a common object for specifying resources. The difficulty is that not all DRMs provide the same functionality. This is why there is scheduler_kwargs, which allows you to pass anything you want yourself (through qverbatim in slurm).

Now, if comment were a common option across DRMs (currently qtoolkit only supports PBS and slurm), we could add it to QResources. I'm not sure PBS has an equivalent.

Now, maybe there is a different question to be raised about why you want the DB ID in the comment, and maybe there is a different way to do that (perhaps in jobflow-remote)? Open to discussing it if needed.

Pinging @gpetretto in case I said something wrong here :-)

@janosh (Collaborator, Author) commented Jun 24, 2024

Thanks @davidwaroquiers, that's very helpful!

I wanted the DB ID in the comment to be able to map from the output of squeue, which only shows slurm IDs, back to the corresponding DB entries. I asked this question before realizing that scontrol show job <slurm-id> shows the run_dir, which in turn contains the job's UUID, which I can use to get the corresponding DB entry. So like you said, there is a solution that doesn't require SBATCH --comment.
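
A minimal sketch of that lookup (assuming the run_dir shows up in the WorkDir= field of scontrol's output and that the path contains the job UUID; both are assumptions about my setup rather than guaranteed behavior):

import re
import subprocess


def run_dir_from_slurm_id(slurm_id: str) -> str | None:
    """Read the working directory slurm recorded for a job."""
    out = subprocess.run(
        ["scontrol", "show", "job", slurm_id],
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(r"WorkDir=(\S+)", out)
    return match.group(1) if match else None


def uuid_from_run_dir(run_dir: str) -> str | None:
    """Pull a UUID-shaped token out of the run_dir path."""
    match = re.search(
        r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", run_dir
    )
    return match.group(0) if match else None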

I'll still be using --comment to record things like formula, n_sites and volume as job metadata, since it makes debugging jobs that take an inordinate amount of time easier. Currently, mapping from the slurm ID to the run_dir or DB entry to see whether it's just a large system or whether the calculation uses bad settings and stalled for some reason is a slow process that I have to repeat for dozens of calcs.
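
A rough sketch of what I have in mind (the metadata values here are made up for illustration):

# hypothetical per-structure metadata to make squeue output more informative
metadata = {"formula": "Fe2O3", "n_sites": 30, "volume": 301.2}
comment = " ".join(f"{key}={val}" for key, val in metadata.items())

qr = QResources(
    processes=16,
    job_name="name",
    scheduler_kwargs={"qverbatim": f'#SBATCH --comment="{comment}"'},
)
# squeue -o "%i %j %k" then shows job ID, job name and the comment side by side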
