Subsequent interaction (starting the jobs, polling their status, requesting the result assets, ...)
can be done through the "master" `job` object created above, in the same way as with normal batch jobs.
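
For example, with the openEO Python client this follow-up interaction could look as follows (a minimal sketch; it assumes `job` is the "master" job object created above):

```python
# Minimal sketch of the follow-up interaction (assumes `job` from above).
job.start_and_wait()                # start the batch job and poll until it finishes
results = job.get_results()         # fetch result metadata, including the asset URLs
results.download_files("./output")  # download all result assets to a local folder
```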

### Validity of signed URLs in batch job results

Batch job results are accessible to the user through signed URLs stored in the result assets. Within the platform,
these URLs are valid for 7 days (their expiry time). During those 7 days, the results of a batch job can be accessed
by anyone who has the URL. Each time a user requests the results from the job endpoint (`GET /jobs/{job_id}/results`),
freshly signed URLs (again valid for 7 days) are created for the result assets.
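
For example, with the openEO Python client you can re-request the result metadata at any time to obtain freshly signed URLs (a sketch; it assumes an authenticated `connection` and an existing `job_id`):

```python
# Re-requesting the results calls GET /jobs/{job_id}/results under the hood,
# returning assets with freshly signed URLs (valid for another 7 days).
job = connection.job(job_id)
for asset in job.get_results().get_assets():
    print(asset.name, asset.href)  # `href` is the signed URL of the asset
```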

### Customizing batch job resources on Terrascope

Jobs running on the (Terrascope) cluster are assigned a default amount of CPU and memory resources. This
may not always be enough for your job, for instance when using UDFs. For very large jobs, you may also want
to tune your resource settings to optimize for cost.

The example below shows how to start a job with all options set to their default values. Note that these
defaults are subject to change by the back-end whenever needed.

```python
job_options = {
    "executor-memory": "2G",
    "executor-memoryOverhead": "3G",
    "executor-cores": 2,
    "task-cpus": 1,
    "executor-request-cores": "400m",
    "max-executors": "100",
    "driver-memory": "8G",
    "driver-memoryOverhead": "2G",
    "driver-cores": 5,
}
cube.execute_batch(job_options=job_options)
```

This is a short overview of the various options:

- `executor-memory`: memory assigned to your workers (executors), for the JVM that executes most predefined processes
- `executor-memoryOverhead`: memory assigned on top of the JVM, for instance to run UDFs
- `executor-cores`: number of CPUs per worker (executor). The number of parallel tasks per executor is `executor-cores/task-cpus`
- `task-cpus`: CPUs assigned to a single task. UDFs using libraries like TensorFlow can benefit from further parallelization at the level of individual tasks.
- `executor-request-cores`: this setting is only relevant for Kubernetes-based back-ends; it allows overcommitting CPU
- `max-executors`: the maximum number of workers assigned to your job. The maximum number of parallel tasks is `max-executors*executor-cores/task-cpus`; a worked example follows this list. Increasing this can inflate your costs without necessarily improving performance!
- `driver-memory`: memory assigned to the Spark 'driver' JVM that controls the execution of your batch job
- `driver-memoryOverhead`: memory assigned to the Spark 'driver' on top of JVM memory, for instance for Python processes.
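
As a back-of-the-envelope illustration of the formulas above, here is what the default settings imply (illustrative only; actual scheduling is up to the back-end):

```python
# Rough parallelism and memory estimate for the default job options above.
executor_cores = 2          # "executor-cores"
task_cpus = 1               # "task-cpus"
max_executors = 100         # "max-executors"
executor_memory_gb = 2 + 3  # "executor-memory" + "executor-memoryOverhead"

tasks_per_executor = executor_cores // task_cpus         # 2 parallel tasks per worker
max_parallel_tasks = max_executors * tasks_per_executor  # at most 200 parallel tasks
peak_memory_gb = max_executors * executor_memory_gb      # up to 500 GB across all workers
```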

#### Learning more

The topic of resource optimization is complex, and we only give a short summary here. The goal of openEO is to hide most of these
details from the user, but we realize that advanced users sometimes want a bit more insight, so in the spirit of being open, we give some hints.

To learn more about these options, we point to the back-end code that handles them:

https://github.com/Open-EO/openeo-geopyspark-driver/blob/faf5d5364a82e870e42efd2a8aee9742f305da9f/openeogeotrellis/backend.py#L1213

Most memory related options are translated to Apache Spark configuration settings, which are documented here:

https://spark.apache.org/docs/3.3.1/configuration.html#application-properties
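
As an indicative illustration (an assumption based on the Spark documentation, not an authoritative list; see the back-end code linked above for the actual translation), the job options roughly correspond to these Spark properties:

```python
# Indicative mapping of job options to Spark properties (assumption, not exhaustive).
JOB_OPTION_TO_SPARK_PROPERTY = {
    "executor-memory": "spark.executor.memory",
    "executor-memoryOverhead": "spark.executor.memoryOverhead",
    "executor-cores": "spark.executor.cores",
    "task-cpus": "spark.task.cpus",
    "executor-request-cores": "spark.kubernetes.executor.request.cores",  # Kubernetes back-ends
    "max-executors": "spark.dynamicAllocation.maxExecutors",
    "driver-memory": "spark.driver.memory",
    "driver-memoryOverhead": "spark.driver.memoryOverhead",
    "driver-cores": "spark.driver.cores",
}
```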

### Batch job results on Sentinel Hub

If you are processing data and the underlying back-end is Sentinel Hub, the output extent of your batch job results is currently larger than your input extent, because Sentinel Hub processes whole tiles (this may change in the future, at which point the data will be cropped to your input extent).