Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update MLM JSON schema to enforce mlm:input band checks as intended #39

Merged
merged 10 commits into from
Sep 27, 2024
26 changes: 22 additions & 4 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,19 +8,37 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased](https://github.com/crim-ca/mlm-extension/tree/main)

### Added
- n/a
- Add `raster:bands` required property `name` for describing `mlm:input` bands
(see [README - Bands and Statistics](README.md#bands-and-statistics) for details).
- Add README warnings about new extension `eo` and `raster` versions.

### Changed
- n/a
- Split `ModelBands` and `AnyBandsRef` definitions in the JSON schema to allow them to be referenced individually.
- Move `AnyBandsRef` definition explicitly to STAC Item JSON schema, rather than implicitly inferred via `mlm:input`.
- Modified the JSON schema to use a `if` check of the `type` (STAC Item or Collection) prior to validating further
properties. This allows some validators (e.g. `pystac`) to better report the *real* error that causes the schema
to fail, rather than reporting the first mismatching `type` case with a poor error description to debug the issue.

### Deprecated
- n/a

### Removed
- n/a
- Removed `$comment` entries from the JSON schema that are considered as invalid by some parsers.
- When `mlm:input` objects do **NOT** define band references (i.e.: `bands: []` is used), the JSON schema will not
fail if an Asset with the `mlm:model` role contains a band definition. This is to allow MLM model definitions to
simultaneously use some inputs with `bands` reference names while others do not.

### Fixed
- n/a
- Band checks against [`eo`](https://github.com/stac-extensions/eo), [`raster`](https://github.com/stac-extensions/eo)
or STAC Core 1.1 [`bands`](https://github.com/radiantearth/stac-spec/blob/master/commons/common-metadata.md#bands)
when a `mlm:input` references names in `bands` are now properly validated.
- Fix the examples using `raster:bands` incorrectly defined in STAC Item properties.
The correct use is for them to be defined under the STAC Asset using the `mlm:model` role.
- Fix the [EuroSAT ResNet pydantic example](./stac_model/examples.py) that incorrectly referenced some `bands`
in its `mlm:input` definition without providing any definition of those bands. The `eo:bands` properties have
been added to the corresponding `model` Asset using
the [`pystac.extensions.eo`](https://github.com/stac-utils/pystac/blob/main/pystac/extensions/eo.py) utilities.
- Fix various STAC Asset definitions erroneously employing `mlm:model` role instead of the intended `mlm:source_code`.

## [v1.2.0](https://github.com/crim-ca/mlm-extension/tree/v1.2.0)

Expand Down
34 changes: 24 additions & 10 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@ make install-dev
make pre-commit-install
```

## PR submittion
## PR submission

Before submitting your code please do the following steps:

Expand All @@ -41,7 +41,7 @@ make lint-all
make test
```

5. Upload your changes to your fork, then make a PR from there to the main repo:
6. Upload your changes to your fork, then make a PR from there to the main repo:

```bash
git checkout -b your-branch
Expand All @@ -53,11 +53,15 @@ git push -u origin your-branch

## Building and releasing

> :warning: <br>
<!-- lint disable no-undefined-references -->

> [!WARNING]
> There are multiple types of releases for this repository: <br>
> 1. Release for MLM specification (usually, this should include one for `stac-model` as well to support it)
> 2. Release for `stac-model` only

<!-- lint enable no-undefined-references -->

### Building a new version of MLM specification

- Checkout to the `main` branch by making sure the CI passed all previous tests.
Expand All @@ -69,9 +73,14 @@ git push -u origin your-branch
- Make a commit to `GitHub` and push the corresponding auto-generated `v{MAJOR}.{MINOR}.{PATCH}` tag.
- Validate that the CI validated everything once again.
- Create a `GitHub release` with the created tag.
> :warning: <br>
> - Ensure the "Set as the latest release" option is selected :heavy_check_mark:.
> - Ensure the diff ranges from the previous MLM version, and not an intermediate `stac-model` release.

<!-- lint disable no-undefined-references -->

> [!WARNING]
> - Ensure the "Set as the latest release" option is selected :heavy_check_mark:.
> - Ensure the diff ranges from the previous MLM version, and not an intermediate `stac-model` release.

<!-- lint enable no-undefined-references -->

### Building a new version of `stac-model`

Expand All @@ -83,17 +92,22 @@ git push -u origin your-branch
- Checkout to `main` branch that contais the freshly created merge commit.
- Push the tag `stac-model-v{MAJOR}.{MINOR}.{PATCH}`. The CI should auto-publish it to PyPI.
- Create a `GitHub release`
> :warning: <br>
> - Ensure the "Set as the latest release" option is deselected :x:.
> - Ensure the diff ranges from the previous release of `stac-model`, not an intermediate MLM release.

<!-- lint disable no-undefined-references -->

> [!WARNING]
> - Ensure the "Set as the latest release" option is deselected :x:.
> - Ensure the diff ranges from the previous release of `stac-model`, not an intermediate MLM release.

<!-- lint enable no-undefined-references -->

## Other help

You can contribute by spreading a word about this library.
It would also be a huge contribution to write
a short article on how you are using this project.
You can also share how the ML Model extension does or does
not serve your needs with us in the Github Discussions or raise
not serve your needs with us in the GitHub Discussions or raise
Issues for bugs.

[poetry-install]: https://github.com/python-poetry/install.python-poetry.org
Expand Down
39 changes: 20 additions & 19 deletions Makefile
Original file line number Diff line number Diff line change
@@ -1,7 +1,8 @@
#* Variables
SHELL := /usr/bin/env bash
PYTHON := python
SHELL ?= /usr/bin/env bash
PYTHON ?= python
PYTHONPATH := `pwd`
POETRY ?= poetry

#* Poetry
.PHONY: poetry-install
Expand All @@ -14,36 +15,36 @@ poetry-remove:

.PHONY: poetry-plugins
poetry-plugins:
poetry self add poetry-plugin-up
$(POETRY) self add poetry-plugin-up

.PHONY: poetry-env
poetry-env:
poetry config virtualenvs.in-project true
$(POETRY) config virtualenvs.in-project true

.PHONY: publish
publish:
poetry publish --build
$(POETRY) publish --build

#* Installation
.PHONY: install
install: poetry-env
poetry lock -n && poetry export --without-hashes > requirements-lock.txt
poetry install -n
$(POETRY) lock -n && poetry export --without-hashes > requirements-lock.txt
$(POETRY) install -n
-poetry run mypy --install-types --non-interactive ./

.PHONY: install-dev
install-dev: poetry-env install
poetry install -n --with dev
$(POETRY) install -n --with dev

.PHONY: pre-commit-install
pre-commit-install:
poetry run pre-commit install
$(POETRY) run pre-commit install


#* Formatters
.PHONY: codestyle
codestyle:
poetry run ruff format --config=pyproject.toml stac_model tests
$(POETRY) run ruff format --config=pyproject.toml stac_model tests

.PHONY: format
format: codestyle
Expand All @@ -61,29 +62,29 @@ check-all: check

.PHONY: mypy
mypy:
poetry run mypy --config-file pyproject.toml ./
$(POETRY) run mypy --config-file pyproject.toml ./

.PHONY: check-mypy
check-mypy: mypy

.PHONY: check-safety
check-safety:
poetry check
poetry run safety check --full-report
poetry run bandit -ll --recursive stac_model tests
$(POETRY) check
$(POETRY) run safety check --full-report
$(POETRY) run bandit -ll --recursive stac_model tests

.PHONY: lint
lint:
poetry run ruff --config=pyproject.toml ./
poetry run pydocstyle --count --config=pyproject.toml ./
poetry run pydoclint --config=pyproject.toml ./
$(POETRY) run ruff --config=pyproject.toml ./
$(POETRY) run pydocstyle --count --config=pyproject.toml ./
$(POETRY) run pydoclint --config=pyproject.toml ./

.PHONY: check-lint
check-lint: lint

.PHONY: format-lint
format-lint:
poetry run ruff --config=pyproject.toml --fix ./
$(POETRY) run ruff --config=pyproject.toml --fix ./

.PHONY: install-npm
install-npm:
Expand Down Expand Up @@ -113,7 +114,7 @@ lint-all: lint mypy check-safety check-markdown

.PHONY: update-dev-deps
update-dev-deps:
poetry up --only=dev-dependencies --latest
$(POETRY) up --only=dev-dependencies --latest

#* Cleaning
.PHONY: pycache-remove
Expand Down
63 changes: 48 additions & 15 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -224,14 +224,18 @@ It is recommended to define `accelerator` with one of the following values:
- `intel-ipex-gpu` for models optimized with IPEX for Intel GPUs
- `macos-arm` for models trained on Apple Silicon

> :warning: <br>
<!-- lint disable no-undefined-references -->

> [!WARNING]
> If `mlm:accelerator = amd64`, this explicitly indicates that the model does not (and will not try to) use any
> accelerator, even if some are available from the runtime environment. This is to be distinguished from
> the value `mlm:accelerator = null`, which means that the model *could* make use of some accelerators if provided,
> but is not constrained by any specific one. To improve comprehension by users, it is recommended that any model
> using `mlm:accelerator = amd64` also set explicitly `mlm:accelerator_constrained = true` to illustrate that the
> model **WILL NOT** use accelerators, although the hardware resolution should be identical nonetheless.

<!-- lint enable no-undefined-references -->

When `mlm:accelerator = null` is employed, the value of `mlm:accelerator_constrained` can be ignored, since even if
set to `true`, there would be no `accelerator` to contain against. To avoid confusion, it is suggested to set the
`mlm:accelerator_constrained = false` or omit the field entirely in this case.
Expand Down Expand Up @@ -265,20 +269,37 @@ representing bands information, including notably the `nodata` value,
the `data_type` (see also [Data Type Enum](#data-type-enum)),
and [Common Band Names][stac-band-names].

> :information_source: <br>
<!-- lint disable no-undefined-references -->

> [!WARNING]
> Only versions `v1.x` of `eo` and `raster` are supported to provide `mlm:input` band references.
> Versions `2.x` of those extensions rely on the [STAC 1.1 - Band Object][stac-1.1-band] instead.
> If those versions are desired, consider migrating your MLM definition to use [STAC 1.1 - Band Object][stac-1.1-band]
> as well for referencing `mlm:input` with band names.

> [!NOTE]
> Due to how the schema for [`eo:bands`][stac-eo-band] is defined, it is not sufficient to *only* provide
> the `eo:bands` property at the STAC Item level. The schema validation of the EO extension explicitly looks
> for a corresponding set of bands under an Asset, and if none is found, it disallows `eo:bands` in the Item properties.
> Therefore, `eo:bands` should either be specified *only* under the Asset containing the `mlm:model` role
> (see [Model Asset](#model-asset)), or define them *both* under the Asset and Item properties. If the second
> approach is selected, it is recommended that the `eo:bands` under the Asset contains only the `name` or the
> `common_name` property, such that all other details about the bands are defined at the Item level.
> An example of such representation is provided in
> [examples/item_eo_bands_summarized.json](examples/item_eo_bands_summarized.json).
> <br><br>
> For an example where `eo:bands` are entirely defined in the Asset on their own, please refer to
> [examples/item_eo_bands.json](examples/item_eo_bands.json) instead.
> <br><br>
> For more details, refer to [stac-extensions/eo#12](https://github.com/stac-extensions/eo/issues/12).
> <br>
> For an example, please refer to [examples/item_eo_bands.json](examples/item_eo_bands.json).
> Notably in this example, the `assets.weights.eo:bands` property provides the `name` to fulfill the Asset requirement,
> while all additional band details are provided in `properties.eo:bands`.

> [!NOTE]
> When using `raster:bands`, and additional `name` parameter **MUST** be provided for each band. This parameter
> is not defined in `raster` extension itself, but is permitted. This addition is required to ensure
> that `mlm:input` bands referenced by name can be associated to their respective `raster:bands` definitions.

<!-- lint enable no-undefined-references -->

Only bands used as input to the model should be included in the MLM `bands` field.
To avoid duplicating the information, MLM only uses the `name` of whichever "Band Object" is defined in the STAC Item.
Expand All @@ -294,12 +315,12 @@ to normalize all bands, rather than normalizing the values over a single product
applied differently for distinct [Model Input](#model-input-object) definitions, in order to adjust for intrinsic
properties of the model.

[stac-1.1-band]: https://github.com/radiantearth/stac-spec/pull/1254
[stac-1.1-stats]: https://github.com/radiantearth/stac-spec/blob/bands/item-spec/common-metadata.md#statistics-object
[stac-eo-band]: https://github.com/stac-extensions/eo?tab=readme-ov-file#band-object
[stac-raster-band]: https://github.com/stac-extensions/raster?tab=readme-ov-file#raster-band-object
[stac-raster-stats]: https://github.com/stac-extensions/raster?tab=readme-ov-file#statistics-object
[stac-band-names]: https://github.com/stac-extensions/eo?tab=readme-ov-file#common-band-names
[stac-1.1-band]: https://github.com/radiantearth/stac-spec/blob/v1.1.0/commons/common-metadata.md#bands
[stac-1.1-stats]: https://github.com/radiantearth/stac-spec/blob/v1.1.0/commons/common-metadata.md#statistics-object
[stac-eo-band]: https://github.com/stac-extensions/eo/tree/v1.1.0#band-object
[stac-raster-band]: https://github.com/stac-extensions/raster/tree/v1.1.0#raster-band-object
[stac-raster-stats]: https://github.com/stac-extensions/raster/tree/v1.1.0#statistics-object
[stac-band-names]: https://github.com/stac-extensions/eo#common-band-names

#### Model Band Object

Expand All @@ -309,10 +330,14 @@ properties of the model.
| format | string | The type of expression that is specified in the `expression` property. |
| expression | \* | An expression compliant with the `format` specified. The expression can be applied to any data type and depends on the `format` given. |

> :information_source: <br>
<!-- lint disable no-undefined-references -->

> [!NOTE]
> Although `format` and `expression` are not required in this context, they are mutually dependent on each other. <br>
> See also [Processing Expression](#processing-expression) for more details and examples.

<!-- lint enable no-undefined-references -->

The `format` and `expression` properties can serve multiple purpose.

1. Applying a band-specific pre-processing step,
Expand Down Expand Up @@ -441,14 +466,18 @@ the following formats are recommended as alternative scripts and function refere
| `docker` | string | An URI with image and tag to a Docker. | `ghcr.io/NAMESPACE/IMAGE_NAME:latest` |
| `uri` | string | An URI to some binary or script. | `{"href": "https://raw.githubusercontent.com/ORG/REPO/TAG/package/cli.py", "type": "text/x-python"}` |

> :information_source: <br>
<!-- lint disable no-undefined-references -->

> [!NOTE]
> Above definitions are only indicative, and more can be added as desired with even more custom definitions.
> It is left as an implementation detail for users to resolve how these expressions should be handled at runtime.

> :warning: <br>
> [!WARNING]
> See also discussion regarding additional processing expressions:
> [stac-extensions/processing#31](https://github.com/stac-extensions/processing/issues/31)

<!-- lint enable no-undefined-references -->

[stac-proc-expr]: https://github.com/stac-extensions/processing#expression-object

### Model Output Object
Expand Down Expand Up @@ -543,10 +572,14 @@ In order to provide more context, the following roles are also recommended were
| mlm:model | `model` | Required role for [Model Asset](#model-asset). |
| mlm:source_code | `code` | Required role for [Model Asset](#source-code-asset). |

> :information_source: <br>
<!-- lint disable no-undefined-references -->

> [!NOTE]
> (*) These roles are offered as direct conversions from the previous extension
> that provided [ML-Model Asset Roles][ml-model-asset-roles] to provide easier upgrade to the MLM extension.

<!-- lint enable no-undefined-references -->

[ml-model-asset-roles]: https://github.com/stac-extensions/ml-model?tab=readme-ov-file#asset-objects

### Model Asset
Expand Down
6 changes: 5 additions & 1 deletion README_DLM_LEGACY.md
Original file line number Diff line number Diff line change
@@ -1,9 +1,13 @@
# Deep Learning Model (DLM) Extension

> :information_source: <br>
<!-- lint disable no-undefined-references -->

> [!NOTE]
> This is legacy documentation references of Deep Learning Model extension
> preceding the current Machine Learning Model (MLM) extension.

<!-- lint enable no-undefined-references -->

Check the original [Technical Report](https://github.com/crim-ca/CCCOT03/raw/main/CCCOT03_Rapport%20Final_FINAL_EN.pdf).

![Image Description](https://i.imgur.com/cVAg5sA.png)
Loading
Loading