UDP: define minimally required additional job options #545
Comments
Quick note: Should probably be aligned with #471 |
about udf-dependency-archives: I think it even makes sense to attach this kind of info directly to the UDF, instead of indirectly associating it through the UDP that contains the UDF. This is kind of related to the idea being brainstormed at Open-EO/openeo-processes#374 to upgrade the current "just a string" Another path could be to use the
The current interpretation is that the context is passed directly as argument to the entrypoint function of the UDF, but in a more relaxed interpretation it could also serve to define extra runtime settings like The nice things about using |
@soxofaan indeed, but we still need to configure the other job options as well. |
+1 on adding the UDF options to the run_udf call. Adding job options to processes seems to mix concerns. The process could in principle also be executed in other modes, what happens then? What if I load the process into the Web Editor and change the extent to reasonably small or utterly large and execute it then? So my thinking is that adding such options to a process is not in line with the initial vision for openEO, especially when it's for CPU/mem consumption, which was always meant to be abstracted away. I think if job options are important and you want a job to be executed as a job, you need to actually share the job metadata, not just the process, so pretty much a partial body for the POST /jobs request with Thinking about it a bit more now, could job options also be provided as a process? For example:
Just a spontaneous idea. Still somewhat mixing concerns, but doesn't need a spec change and is more visible to users. Thoughts? |
This feels a bit too procedural/stateful to me and as such conflicts with the openEO concept of expressing your workflow as a graph of linked processing nodes. How would these |
That's a good question and there's no obvious solution yet. Having unconnected nodes is probably a bit of a hassle... On the other hand, adding job metadata to the process is also not very clean as pointed out above. So maybe it's really sharing jobs (i.e. job metadata) instead of processes in this case? Similar things could appear for web services, where you also wouldn't want to add the metadata for the service creation to the process, but instead probably share web service metadata. |
Any more thoughts/discussions? |
Because you asked 😄 Another consideration against something like |
I agree that adding it as a process graph is not ideal, so let's drop that idea. |
I think there are quite some hurdles with this:
|
No, I meant to share what you send to POST /jobs, so just the JSON.... |
Well yes, that's basically what we want to do, but instead of a concrete batch job with job options, we want it for a deployment-agnostic parameterized (remote) process with recommended/minimum job options. And to make this a bit more concrete: for batch jobs you have
While UDPs and by consequence remote process definitions follow this format:
Note that with batch jobs we put job options at the top level, but UDPs start a level deeper, so the best we can do is to put it one level deeper |
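To make the nesting difference above concrete, here is a minimal sketch of the two shapes as Python dicts. The `job_options` key in the job body, the option name and value inside it, and the toy process graph are assumptions for illustration only; `minimal_job_options` is the name used in the example linked from the issue description, not an agreed convention.

```python
import json

# Toy process graph shared by both shapes (purely illustrative).
process_graph = {
    "add1": {"process_id": "add", "arguments": {"x": 3, "y": 5}, "result": True}
}

# Batch job creation body (POST /jobs): the process is nested under "process",
# so extra options can sit at the top level next to "title".
# ASSUMPTION: the "job_options" key and its content are hypothetical here.
job_body = {
    "title": "Example job",
    "process": {"process_graph": process_graph},
    "job_options": {"executor-memory": "4G"},
}

# UDP / remote process definition: the document *is* the process object,
# so options can only live next to "process_graph", i.e. one level deeper
# than in the job body.
# ASSUMPTION: "minimal_job_options" is a placeholder field name.
udp = {
    "id": "example_udp",
    "process_graph": process_graph,
    "parameters": [],
    "minimal_job_options": {"executor-memory": "4G"},
}

if __name__ == "__main__":
    print(json.dumps(job_body, indent=2))
    print(json.dumps(udp, indent=2))
```

In both cases the options end up as siblings of something different: of `process` in the job body, and of `process_graph` in the UDP.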
Thanks for the examples. The title of the issue "define minimally required additional job options" is different to what you propose with "recommended/minimum job options"... so what do you actually want?
Job options are usually backend specific, so that doesn't go together for me with "deployment-agnostic".
Nothing prevents sharing a Job Creation JSON instead of a UDP... If we really want to make such options (memory, CPU, etc.) a thing in openEO, they should be standardized across backends. |
I think @jdries and I mean the same with "minimally required" and "recommended/minimum" job options: job options that one is recommended to use, unless they know very well what they are doing.
I intentionally used "deployment-agnostic" instead of "backend-agnostic", because the job options used in the examples here would apply to any backend deployment powered by the geopyspark-driver, so not just the "VITO" backend, but also CDSE (and consequently the CDSE federation too), other project-specific deployments we internally have, future deployments leveraging EOEPCA building blocks, etc.
Kind of disagree: we want to build on the remote process definition extension, which follows the UDP schemas. Going for the job creation schema instead would cause incompatibility issues with remote process definition related tooling and support.
I understand that this conflicts a bit with the general openEO vision, but note that most users or use cases do not have to worry about explicitly setting job options. This is just a pragmatic approach for us in more advanced and heavy use cases.
Agree, and I think this feature request actually is in line with that: without deciding about the actual job option fields (which would lead us too far here), we can already find a convention about their location in the relevant schemas and JSON structures. |
Another thought about this. These job options are usually necessary for use cases with UDFs that are quite heavy in some sense (e.g. consuming large chunks of data, or doing non-trivial ML operations). Unlike with pre-defined openEO processes, back-ends cannot automatically estimate the resource allocation needed to support the UDFs, so we need some kind of user-provided indicators here to get this working in practice. |
I've put this on the PSC agenda. This request is not really in line with the original vision of openEO (keep this complexity away from the user + interoperability), so I want to hear what the PSC thinks about it. |
Some UDPs depend on very specific job options to run successfully. For instance:
Here, udf-dependency-archives is really mandatory.
As for the other options, these can usually be considered as lower bounds. (I don't immediately know of a case where an upper bound would be relevant.)
Here's an example, where I called it 'minimal_job_options':
https://github.com/ESA-APEx/apex_algorithms/blob/dd81a53463e5c913e09329dc02832e8db5a6350e/openeo_udp/worldcereal_inference.json#L1416
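One possible reading of "mandatory" versus "lower bound" is sketched below: a client combines the options declared in the UDP with whatever the user passes. This is only an illustration under stated assumptions, not an existing client or back-end feature. The function name `merge_job_options`, the numeric option `memory_gb`, the `log_level` option, and the archive URLs are all hypothetical; only `udf-dependency-archives` comes from the text above.

```python
def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def merge_job_options(minimal, user):
    """Combine UDP-declared minimal job options with user-supplied ones.

    Assumed policy (hypothetical):
    - list options (e.g. dependency archives): the UDP entries are always
      kept, user entries are appended if not already present;
    - numeric options: treated as lower bounds, the larger value wins;
    - anything else: the user value is used as-is.
    """
    merged = dict(minimal)
    for key, value in user.items():
        floor = minimal.get(key)
        if isinstance(floor, list) and isinstance(value, list):
            merged[key] = floor + [v for v in value if v not in floor]
        elif _is_number(floor) and _is_number(value):
            merged[key] = max(floor, value)
        else:
            merged[key] = value
    return merged


# Hypothetical values, for illustration only.
minimal = {
    "udf-dependency-archives": ["https://example.com/deps.zip"],
    "memory_gb": 4,
}
user = {
    "udf-dependency-archives": ["https://example.com/extra.zip"],
    "memory_gb": 2,
    "log_level": "debug",
}
merged = merge_job_options(minimal, user)
```

With this policy a user can raise a resource setting or add archives, but cannot drop below what the UDP author declared.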