tracking issue: standby/preemptible jobs #5739

grondo · 2024-02-14T17:29:18Z

Preemptible jobs:
AKA 'standby' qos / queue. Allow users to submit jobs that can be killed automatically by the system instance if another job needs the resources.

In some offline discussion, it was proposed that we could add a preemptible (or similar) job submission flag for this purpose. Drawbacks to this approach:

flags are currently not shared with the scheduler
flags are not displayed in flux jobs output
flags cannot be updated from the command line

Most of those can be easily overcome if a submission flag is the correct approach.

Alternate solutions include:

A jobspec attribute. In this case, an initial solution could be implemented in the scheduler alone: if a job has the preemptible attribute and the scheduler needs its resources for a higher-priority job, the scheduler could raise a job exception.
As noted above, "standby" is often implemented as a queue. To do this in Flux might require some work on the queues interface, since this implies overlapping queues. The benefit of using a queue is that queue limits (and user/bank access) can be applied.
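
To make the scheduler-only variant concrete, here is a rough sketch of how victim selection might look. This is not Flux or flux-sched code: the `Job` class, the `preemptible` field, and `select_victims` are all hypothetical names used only for illustration, and resources are reduced to a simple node count.

```python
from dataclasses import dataclass


@dataclass
class Job:
    jobid: int
    priority: int
    nnodes: int
    preemptible: bool = False  # hypothetical attribute, not an existing Flux field


def select_victims(running, incoming, free_nodes):
    """Pick running preemptible jobs to cancel so that `incoming` fits.

    Returns an empty list if no preemption is needed, or if preempting every
    eligible job would still not free enough nodes.
    """
    needed = incoming.nnodes - free_nodes
    if needed <= 0:
        return []
    # Only preemptible jobs with lower priority than the incoming job qualify.
    candidates = sorted(
        (j for j in running if j.preemptible and j.priority < incoming.priority),
        key=lambda j: j.priority,
    )
    victims = []
    for job in candidates:
        victims.append(job)
        needed -= job.nnodes
        if needed <= 0:
            return victims
    return []  # not enough preemptible resources, so preempt nothing
```

In a real scheduler, each selected job would then receive a job exception, as described above. Choosing the lowest-priority victims first is just one possible policy.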


ryanday36 · 2024-02-14T19:12:46Z

I think that a submission flag would work as long as the drawbacks that you noted could be overcome. Generally, we allow 'standby' jobs to be exempt from other queue limits and allow all users to access them. So we would also want the preemptible flag to be visible to the priority plugin, so that it can avoid counting those jobs against queue limits. I think that would provide the same benefits as the queue implementation, at least for how we use standby / preemption.
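
As a sketch of what "not counting those jobs against queue limits" could mean in practice: the functions below are hypothetical and do not reflect the actual priority plugin interface. Jobs are plain dicts with an assumed `preemptible` key, and the limit is a simple maximum number of running jobs.

```python
def jobs_counted_against_limit(active_jobs):
    """Count only the jobs that are NOT marked preemptible (hypothetical key)."""
    return sum(1 for job in active_jobs if not job.get("preemptible", False))


def may_run(active_jobs, new_job, max_running):
    """Decide whether `new_job` is allowed under a max-running-jobs limit.

    Preemptible jobs are exempt from the limit and never consume it.
    """
    if new_job.get("preemptible", False):
        return True
    return jobs_counted_against_limit(active_jobs) < max_running
```

With this policy, a user who is already at their limit could still submit standby work, and that standby work would not block their later regular jobs.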

That said, there are a number of use cases that can be solved by overlapping queues (exempt / expedite, whole cluster DATs), so that could be considered a benefit of that approach. Exempt / expedite could probably all be done through accounting / the priority plugin. We should probably talk more about DATs where we want to be able to let a user run on all nodes on a cluster that we've split into multiple queues.

grondo · 2024-08-15T14:12:51Z

This idea was discussed again in a meeting recently. The preemptible flag still seems to be the solution of choice, but this will require an update to the resource acquisition protocol. I've opened flux-framework/rfc#423.

ryanday36 · 2024-10-07T21:01:21Z

Over in the flux team on Teams, one of the users on Tuolumne had an interesting idea around standby / preemption, which would be to allow users to specify a minimum duration for their jobs:

However, I got to thinking that a minimum time in addition to a maximum time could create a more powerful mechanism than standby. If you wanted a Slurm-like standby, you would set your job's minimum time to 0. But if you wanted to actually get something done while also letting other jobs in after you'd made some progress, setting a minimum time of an hour or so might be a reasonable compromise.
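
A minimal sketch of the minimum-duration idea, assuming a hypothetical per-job `min_duration` in seconds (no such jobspec field exists today, as far as this thread indicates). A job becomes eligible for preemption only once it has run for at least that long, so `min_duration = 0` reproduces plain standby behavior.

```python
def is_preemptible_now(start_time, min_duration, now):
    """Return True once the job has run for at least `min_duration` seconds.

    All arguments are in seconds; `min_duration` is a hypothetical job
    attribute. A value of 0 means the job may be preempted at any time.
    """
    return (now - start_time) >= min_duration


# A job started at t=1000 with a one-hour guaranteed minimum.
print(is_preemptible_now(start_time=1000, min_duration=3600, now=2000))
print(is_preemptible_now(start_time=1000, min_duration=3600, now=4600))
```

The first call reports that the job is still protected and the second that it has become fair game. The scheduler would still enforce the job's maximum duration separately.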

ryanday36 mentioned this issue Jul 30, 2024

Consider running a separate qmanager for each queue flux-framework/flux-sched#1258

Open

grondo mentioned this issue Aug 15, 2024

RFC 27: consider adding flags to alloc request flux-framework/rfc#423

Open
