0003-test-train-validation-split.md

3. Test-train-validation split

Date: 2020-08-10

Status

Accepted

Context

During model training it is important to have a consistent split between training, testing and validation data.

  • Training subset is used to tune model weights.
  • Test subset is used to monitor training progress and for hyper-parameter tuning.
  • Validation subset is used to judge overall model performance.

Best practice dictates that these datasets must not overlap. Which items are selected for each subset will affect model performance, so the selection should be captured in the ml-aoi catalog.

In the context of a STAC catalog there are multiple ways to express the data split. This ADR explores the available options and their consequences.

Split by Collection

The split could be expressed by creating a separate collection for each set. This is a flexible approach. However, grouping these collections into one cohesive training set would have to be done by convention, for instance by a prefix on the collection id. Additionally, these collections could not be easily visualized together. Most (all?) existing STAC viewers are focused on browsing or viewing one collection at a time.

The convention for associating the training, testing and validation sets would also have to be propagated into downstream tooling. Further, it would be easy to include a single item in both the training and testing sets without realizing it. For these reasons this is not a good choice.
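The overlap risk described above can be illustrated with a minimal sketch. The collection ids, the `aoi-` prefix and the item ids below are hypothetical, and plain dictionaries stand in for real STAC collections. Nothing in the collection-based layout prevents `item-003` from appearing twice, so downstream tooling would need a check like this one.

```python
from collections import defaultdict

# Hypothetical collections grouped by an "aoi-" prefix convention.
# Each value lists the ids of the items the collection contains.
collections = {
    "aoi-train": ["item-001", "item-002", "item-003"],
    "aoi-test": ["item-003", "item-004"],
    "aoi-validate": ["item-005"],
}


def find_shared_items(collections):
    """Return {item_id: [collection ids]} for items found in more than one collection."""
    membership = defaultdict(set)
    for collection_id, item_ids in collections.items():
        for item_id in item_ids:
            membership[item_id].add(collection_id)
    return {
        item_id: sorted(ids)
        for item_id, ids in membership.items()
        if len(ids) > 1
    }


shared = find_shared_items(collections)
print(shared)  # {'item-003': ['aoi-test', 'aoi-train']}
```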

Split by Link property

The top-most ml-aoi collection has to link to each item or child catalog. These links could carry an additional property that designates the split. This approach keeps all the items within the same collection, which is good.

However, when the catalog is ingested into a STAC API this link property is often lost and is not easily queried. Thus the split set membership would not be visible through the STAC API, which is bad. For that reason this is not a good choice.

Split by Item property

Each item could have an extension-specific property (e.g. ml-aoi:split) that designates set membership. This approach addresses the shortcomings of the previous methods.

This property can be easily searched for after the item is ingested into a STAC API. Because the property holds a single value, it is not possible to include a single item in multiple sets. The collection can still be viewed by tools that do not understand the ml-aoi extension.
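As a rough illustration of this option, the sketch below uses plain dictionaries shaped like STAC Items, reduced to the fields that matter here. The item ids and the split values `train`, `test` and `validate` are assumptions made for the example; this ADR only names the `ml-aoi:split` property, not its allowed values.

```python
from collections import Counter

# Simplified, hypothetical items kept in a single collection.
# Only the fields relevant to the split are shown.
items = [
    {"type": "Feature", "id": "item-001", "properties": {"ml-aoi:split": "train"}},
    {"type": "Feature", "id": "item-002", "properties": {"ml-aoi:split": "train"}},
    {"type": "Feature", "id": "item-003", "properties": {"ml-aoi:split": "test"}},
    {"type": "Feature", "id": "item-004", "properties": {"ml-aoi:split": "validate"}},
]


def items_in_split(items, split):
    """Return the ids of items whose ml-aoi:split property equals `split`."""
    return [
        item["id"]
        for item in items
        if item["properties"].get("ml-aoi:split") == split
    ]


# Each item carries exactly one value, so the per-split counts sum to len(items).
counts = Counter(item["properties"].get("ml-aoi:split") for item in items)

print(items_in_split(items, "train"))  # ['item-001', 'item-002']
print(dict(counts))  # {'train': 2, 'test': 1, 'validate': 1}
```

A tool that ignores the extension still sees four ordinary items; the extra property is just another entry in `properties`.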

Decision

The test, train, validation split should be handled by the ml-aoi:split Item property. Keeping all items, regardless of role, grouped in a single collection provides the best integration with other STAC tools. The expected use case is visual inspection of items on a single map, with role membership used to color the footprint polygons.
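The map use case could look something like the following sketch, which turns items into GeoJSON features with a per-role fill color. The color palette, the split values, the sample geometry and the `fill` styling property are all assumptions for illustration; whether a given viewer honors `fill` depends on the viewer.

```python
# Hypothetical palette; roles without an entry fall back to grey.
SPLIT_COLORS = {"train": "#1f77b4", "test": "#ff7f0e", "validate": "#2ca02c"}
FALLBACK_COLOR = "#7f7f7f"


def to_styled_feature(item):
    """Build a GeoJSON feature whose fill color reflects the item's split role."""
    split = item["properties"].get("ml-aoi:split")
    return {
        "type": "Feature",
        "id": item["id"],
        "geometry": item["geometry"],
        "properties": {
            "split": split,
            "fill": SPLIT_COLORS.get(split, FALLBACK_COLOR),
        },
    }


# A made-up square footprint shared by both sample items.
square = {
    "type": "Polygon",
    "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]],
}
sample_items = [
    {"id": "item-001", "geometry": square, "properties": {"ml-aoi:split": "train"}},
    {"id": "item-002", "geometry": square, "properties": {}},
]

feature_collection = {
    "type": "FeatureCollection",
    "features": [to_styled_feature(item) for item in sample_items],
}
```

Because all items live in one collection, a single pass over that collection is enough to build the whole map layer.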

Consequences

Future ml-aoi catalogs should include the ml-aoi:split property on their items.