RFC 27: consider adding flags to alloc request #423

grondo · 2024-08-15T14:09:13Z

In flux-framework/flux-core#5739 there is a proposal to add a flag marking preemptible/standby jobs. A scheduler can then use this flag to determine if a job may be canceled to make way for a higher priority job. However, job flags are currently not shared with the scheduler, which would have to read the submit event to get them.

We should consider adding flags to the alloc request (which contains most of the rest of the submit event context).

garlick · 2024-11-07T16:25:02Z

In #5378, @ryanday36 mentioned an alternative to a standby flag: allow a duration range in jobspec, where a standby job would specify a min < max duration and the job would be eligible for preemption after the min duration expires. For traditional standby, min = 0, but it would also allow a job to request a minimum amount of work to get done before preemption. It seems like we discussed this and decided it was a better idea than the flag, but I can't find anything to cite here.

A duration range would require an update to

Shall we change this issue from flag to duration range?
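
To make the duration-range idea concrete, here is a minimal sketch of the eligibility rule described above. The function name and the use of `None` for jobs that did not ask for standby are assumptions made for illustration; nothing here is an existing jobspec or scheduler API.

```python
from typing import Optional


def eligible_for_preemption(elapsed: float, min_duration: Optional[float]) -> bool:
    """Return True if a running job may be preempted.

    elapsed:      seconds the job has been running
    min_duration: lower bound of the (hypothetical) duration range, or
                  None if the job did not ask to be a standby job
    """
    if min_duration is None:
        return False  # ordinary job: never preemptible
    return elapsed >= min_duration


# Traditional standby: min = 0, so the job is preemptible immediately.
print(eligible_for_preemption(0.0, 0.0))      # True
# A job that asked for 600s of guaranteed work is still protected at 300s.
print(eligible_for_preemption(300.0, 600.0))  # False
```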

grondo · 2024-11-07T16:35:20Z

Yes, that's a good idea. I wonder if it would be easier to add an optional minimum duration instead of turning duration into a range though...

grondo · 2024-11-07T16:39:25Z

BTW, I just recalled that when we discussed this proposal of a minimum duration one possible solution was to still use the preemptible flag, but to have the job-manager post the flag to the eventlog as soon as minimum runtime expired. We should explore if that makes sense, or if we leave it up to the scheduler to look for any minimum runtime and implement preemption internally without need of any kind of flags.

garlick · 2024-11-12T15:15:56Z

> I wonder if it would be easier to add an optional minimum duration instead of turning duration into a range though...

Yeah, good thought. I think we would amend v1 jobspec, and if we did the min/max thing (the way count is defined in RFC 14), we couldn't roll back to an earlier release without newer jobs in the queue potentially containing invalid jobspec.

Maybe call it preemptible-after?

> I just recalled that when we discussed this proposal of a minimum duration one possible solution was to still use the preemptible flag, but to have the job-manager post the flag to the eventlog as soon as minimum runtime expired. We should explore if that makes sense, or if we leave it up to the scheduler to look for any minimum runtime and implement preemption internally without need of any kind of flags.

Yes I like the idea of setting the flag to centralize processing of this time stuff. If exceeding preemptible-after causes a set-flags event to be posted to the job eventlog, then job-list can easily track the flag as well.

RFC 27 would not only need flags added to the alloc request and hello response, but also a new RPC for setting flags on existing jobs.
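
As a rough illustration of the two pieces mentioned above, the sketch below adds a `flags` key to an alloc request payload and builds a `set-flags` eventlog entry. Everything in it is an assumption: the `flags` key, its encoding as a list of flag names, the sample request values, and the helper names are all hypothetical, and the timestamp/name/context entry layout is assumed rather than quoted from the RFCs.

```python
import copy
import time
from typing import List, Optional


def alloc_request_with_flags(alloc_request: dict, flags: List[str]) -> dict:
    """Return a copy of an alloc request payload with a 'flags' key added.

    The 'flags' key is hypothetical; it is the addition under discussion.
    """
    request = copy.deepcopy(alloc_request)
    request["flags"] = sorted(set(flags))
    return request


def set_flags_event(flags: List[str], timestamp: Optional[float] = None) -> dict:
    """Build an eventlog-style entry announcing newly set flags (assumed layout)."""
    return {
        "timestamp": time.time() if timestamp is None else timestamp,
        "name": "set-flags",
        "context": {"flags": sorted(set(flags))},
    }


# Sample request with made-up values.
request = {"id": 1, "priority": 16, "userid": 1000, "t_submit": 0.0, "jobspec": {}}
print(alloc_request_with_flags(request, ["preemptible"])["flags"])  # ['preemptible']
print(set_flags_event(["preemptible"], timestamp=60.0)["name"])     # set-flags
```

The request is copied rather than modified so that a caller holding the original payload does not see the extra key.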

grondo · 2024-11-12T15:54:53Z

> Maybe preemptible-after?

Good suggestion!

garlick · 2024-11-12T15:55:05Z

Well, thinking about this a bit more, it seems like the scheduler will want to incorporate future preemptibility into its "plan", so I'm not sure how useful it is to know when a job becomes preemptible in real time? Maybe it would be better to just have the scheduler raise a fatal exception on a job when it reaches that point in its schedule and skip the notification?

grondo · 2024-11-12T16:13:38Z

Yeah, that's a good thought. We should get the opinion of Fluxion developers here, since the original plan was to use a flag all along.

I imagine it is much easier to support scheduling of preemptible jobs that would have preemptible-after=0, than those with a minimum runtime. If a nontrivial amount of effort will be required to support preemptible-after, then it might be best to complete this work in stages: Support only preemptible-after=0 at first (essentially treating it as a flag), then later add the ability to extend this to nonzero values.

Either way it doesn't seem like the flag is useful, so we can probably close this issue or change it to add the preemptible-after attribute, which is only supported by Fluxion.
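
The staged approach could look roughly like the sketch below, where stage one accepts only `preemptible-after=0` and rejects anything else. Placing the attribute under `attributes.system` in jobspec is an assumption made for this example, as is the function name.

```python
# Hypothetical attribute name, taken from the suggestion in this thread.
ATTR = "preemptible-after"


def stage_one_preemptible(jobspec: dict) -> bool:
    """Return True if the job is preemptible under stage-one rules.

    Raises ValueError for a nonzero value, which stage one does not support.
    """
    # Assumed location: attributes.system.preemptible-after
    system = jobspec.get("attributes", {}).get("system", {})
    value = system.get(ATTR)
    if value is None:
        return False  # attribute absent: ordinary job
    if value != 0:
        raise ValueError(f"{ATTR}={value} is not supported yet, only 0 is")
    return True  # behaves exactly like a flag


print(stage_one_preemptible({"attributes": {"system": {ATTR: 0}}}))  # True
print(stage_one_preemptible({"attributes": {"system": {}}}))         # False
```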

garlick · 2024-11-12T16:31:21Z

> Either way it doesn't seem like the flag is useful, so we can probably close this issue or change it to add the preemptible-after attribute, which is only supported by Fluxion.

Before we discard that idea completely we might want to try a little prototype of preemptible-after with sched-simple. At minimum, we may find that some libschedutil support falls out of it that will save fluxion developers some work. Like I dunno, a wake up timer when running jobs become preemptible since otherwise the scheduler might not have any other event to wake up on?
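
One way to picture the wake-up timer idea is a helper that computes the next time any running job becomes preemptible. This is only a sketch: the function name and the `(t_start, preemptible_after)` tuple layout are invented for illustration and are not libschedutil interfaces.

```python
from typing import Iterable, Optional, Tuple


def next_preemptible_wakeup(
    running: Iterable[Tuple[float, Optional[float]]], now: float
) -> Optional[float]:
    """Return the earliest future time at which a running job becomes preemptible.

    running: (t_start, preemptible_after) pairs; preemptible_after is None
             for jobs that never become preemptible
    Returns None if no timer is needed.
    """
    times = [
        t_start + after
        for t_start, after in running
        if after is not None and t_start + after > now
    ]
    return min(times, default=None)


running = [(0.0, 600.0), (100.0, 0.0), (50.0, None)]
print(next_preemptible_wakeup(running, now=200.0))  # 600.0
print(next_preemptible_wakeup(running, now=700.0))  # None
```

Jobs that are already preemptible are skipped because no future wake-up is needed for them.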

grondo · 2024-11-12T16:50:16Z

Does libschedutil examine jobspec? In any event, an experiment with sched-simple is a good idea.

Since sched-simple is not a planning scheduler, preemptible jobs would not have much effect if they are submitted behind a pending job. However, it could be used to test that submission of a high priority job kills off any existing preemptible jobs if that would allow the job to run. 🤷
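
The experiment described above amounts to a rule like the following sketch: cancel preemptible jobs only when doing so actually lets the high priority job run. Resources are counted in whole nodes for simplicity, and the function name and `(jobid, nnodes)` layout are assumptions, not sched-simple code.

```python
from typing import List, Tuple


def select_victims(
    needed: int, free: int, preemptible: List[Tuple[int, int]]
) -> List[int]:
    """Pick preemptible jobs to cancel so a job needing `needed` nodes can run.

    preemptible: (jobid, nnodes) pairs for running preemptible jobs
    Returns job IDs to cancel, or [] if the job already fits or cannot be
    made to fit even by cancelling every preemptible job.
    """
    if free >= needed:
        return []
    if free + sum(n for _, n in preemptible) < needed:
        return []  # preemption would not help, so cancel nothing
    victims = []
    # Largest jobs first, to keep the number of cancelled jobs small.
    for jobid, nnodes in sorted(preemptible, key=lambda j: j[1], reverse=True):
        victims.append(jobid)
        free += nnodes
        if free >= needed:
            break
    return victims


jobs = [(10, 1), (11, 2), (12, 1)]
print(select_victims(needed=4, free=1, preemptible=jobs))  # [11, 10]
print(select_victims(needed=8, free=1, preemptible=jobs))  # []
```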

grondo mentioned this issue Aug 15, 2024

tracking issue: standby/preemptible jobs flux-framework/flux-core#5739

Open
