Regression example of Clay (#285)
- Add datamodule & model class for biomasters regression example
- Add notebooks to show the inference for biomasters & chesapeake bay
srmsoumya authored Jul 1, 2024
1 parent 58af79f commit 7a82f5c
Showing 14 changed files with 1,760 additions and 6 deletions.
5 changes: 4 additions & 1 deletion .pre-commit-config.yaml
@@ -15,7 +15,10 @@ repos:
     hooks:
       - id: ruff # Run the linter
         args: [ --fix ]
-        types_or: [ python, pyi, jupyter ]
+        types_or: [ python, pyi ]
+      - id: ruff # Run the linter for Jupyter notebooks with the PLR0913 rule ignored
+        args: [ --fix, --ignore=PLR0913 ]
+        types: [ jupyter ]
       - id: ruff-format # Run the formatter
         types_or: [ python, pyi, jupyter ]
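
For context on the hook change above: Ruff's PLR0913 rule (too many arguments) flags function definitions with more than five parameters by default. The sketch below shows the kind of notebook helper that would trip the rule. The function name and parameters are hypothetical and do not come from the notebooks in this commit.

```python
import inspect


# Hypothetical notebook helper (not from this commit). It takes six
# parameters, one more than Ruff's default PLR0913 limit of five, so the
# linter would report it unless the rule is ignored for notebooks.
def describe_plot(values, title, cmap="viridis", vmin=None, vmax=None, figsize=(4, 4)):
    """Collect display settings for a plot of `values`."""
    lower = min(values) if vmin is None else vmin
    upper = max(values) if vmax is None else vmax
    return {"title": title, "cmap": cmap, "range": (lower, upper), "figsize": figsize}


n_params = len(inspect.signature(describe_plot).parameters)
print(n_params)  # number of parameters Ruff would count
settings = describe_plot([1, 2, 3], "agbm")
print(settings["range"])
```

Plotting helpers in notebooks often grow long keyword lists, which is a plausible reason for relaxing the rule there while keeping it for library code.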

2 changes: 1 addition & 1 deletion classify.py
@@ -19,7 +19,7 @@
 # %%
 def cli_main():
     """
-    Command-line inteface to run ClayMAE with ClayDataModule.
+    Command-line inteface to run Clasifier model with EuroSATDataModule.
     """
     cli = LightningCLI(EuroSATClassifier, EuroSATDataModule)
     return cli
63 changes: 63 additions & 0 deletions configs/regression_biomasters.yaml
@@ -0,0 +1,63 @@
# lightning.pytorch==2.1.2
seed_everything: 42
data:
  metadata_path: configs/metadata.yaml
  batch_size: 10
  num_workers: 8
  train_chip_dir: data/biomasters/train_cube
  train_label_dir: data/biomasters/train_agbm
  val_chip_dir: data/biomasters/test_cube
  val_label_dir: data/biomasters/test_agbm
model:
  ckpt_path: checkpoints/clay-v1-base.ckpt
  lr: 1e-3
  wd: 0.05
  b1: 0.9
  b2: 0.95
  feature_maps:
    - 2
    - 5
    - 7
    - 9
    - 11
trainer:
  accelerator: auto
  strategy: ddp
  devices: auto
  num_nodes: 1
  precision: bf16-mixed
  log_every_n_steps: 5
  max_epochs: 100
  default_root_dir: checkpoints/regression
  fast_dev_run: False
  num_sanity_val_steps: 0
  # limit_train_batches: 0.25
  # limit_val_batches: 0.25
  accumulate_grad_batches: 4
  logger:
    - class_path: lightning.pytorch.loggers.WandbLogger
      init_args:
        entity: developmentseed
        project: clay-regression
        log_model: false
  callbacks:
    - class_path: lightning.pytorch.callbacks.ModelCheckpoint
      init_args:
        dirpath: checkpoints/regression
        auto_insert_metric_name: False
        filename: biomasters_epoch-{epoch:02d}_val-score-{val/score:.3f}
        monitor: val/score
        mode: min
        save_last: False
        save_top_k: 2
        save_weights_only: True
        verbose: True
    - class_path: lightning.pytorch.callbacks.LearningRateMonitor
      init_args:
        logging_interval: step
    - class_path: src.callbacks.LayerwiseFinetuning
      init_args:
        phase: 10
        train_bn: True
  plugins:
    - class_path: lightning.pytorch.plugins.io.AsyncCheckpointIO
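
A note on reading the config above: with `batch_size: 10` and `accumulate_grad_batches: 4`, each optimizer step sees gradients from 40 samples per device, and under `strategy: ddp` that number scales with the device count. The arithmetic can be sketched as below. The helper name is hypothetical, and the four-GPU case is only an illustration, since `devices: auto` resolves at runtime.

```python
# Hypothetical helper (not part of the repository) for the arithmetic implied
# by the config: samples contributing to one optimizer step under DDP.
def effective_batch_size(batch_size, accumulate_grad_batches, devices=1, num_nodes=1):
    return batch_size * accumulate_grad_batches * devices * num_nodes


single_gpu = effective_batch_size(batch_size=10, accumulate_grad_batches=4)
four_gpus = effective_batch_size(batch_size=10, accumulate_grad_batches=4, devices=4)
print(single_gpu, four_gpus)
```

This matters when comparing runs across machines: the same config yields a different effective batch size, and possibly a different suitable learning rate, depending on how many devices `auto` picks up.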
177 changes: 177 additions & 0 deletions finetune/regression/README.md
@@ -0,0 +1,177 @@
## Download data
The data comes as a multi-file zip archive and can be downloaded from the
[BioMassters](https://huggingface.co/datasets/nascetti-a/BioMassters/)
Hugging Face repository. Grab a coffee: the download is about 250GB.

The next step is to unzip the training data. Because the data comes in a
multi-file zip archive, it needs to be unzipped with a tool that can handle
that format. 7z works quite well in this case. Grab another coffee, this
will take a while.

```bash
sudo apt install p7zip-full
```
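
Before starting the extraction, it may be worth checking that `7z` is on the `PATH` and that the target disk has enough room. The sketch below is only a suggestion and not part of the repository: the byte counts are copied from the `Size:` lines of the 7z logs in the following sections, and the `preflight` helper is a hypothetical name.

```python
import shutil

# Uncompressed sizes in bytes, copied from the "Size:" lines of the 7z logs
# shown in the extraction sections of this README.
EXTRACTED_BYTES = {
    "train_features": 231_859_243_932,
    "train_agbm": 2_280_706_098,
    "test_features": 78_334_396_224,
    "test_agbm": 727_862_586,
}


def preflight(target_dir="."):
    """Report whether 7z is installed and whether target_dir has enough free space."""
    needed = sum(EXTRACTED_BYTES.values())
    free = shutil.disk_usage(target_dir).free
    return {
        "seven_zip": shutil.which("7z"),  # None if 7z is not installed
        "needed_bytes": needed,
        "free_bytes": free,
        "enough_space": free >= needed,
    }


report = preflight(".")
print(f"7z binary: {report['seven_zip']}")
print(f"needed: {report['needed_bytes'] / 2**30:.1f} GiB, "
      f"free: {report['free_bytes'] / 2**30:.1f} GiB")
```

Note that this only counts the extracted files. If the downloaded archives stay on the same disk, their size has to be added on top.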

### Extract train features


```bash
7z e -o/home/tam/Desktop/biomasters/train_features/ /datadisk/biomasters/raw/train_features.zip
```

The output should look something like this:

```
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,16 CPUs Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz (A0652),ASM,AES-NI)
Scanning the drive for archives:
1 file, 10247884383 bytes (9774 MiB)
Extracting archive: /datadisk/biomasters/raw/train_features.zip
--
Path = /datadisk/biomasters/raw/train_features.zip
Type = zip
Physical Size = 10247884383
Embedded Stub Size = 4
64-bit = +
Total Physical Size = 149834321503
Multivolume = +
Volume Index = 13
Volumes = 14
Everything is Ok
Folders: 1
Files: 189078
Size: 231859243932
Compressed: 149834321503
```

### Extract train AGBM

```bash
7z e -o/home/tam/Desktop/biomasters/train_agbm/ /datadisk/biomasters/raw/train_agbm.zip
```

The output should look something like this:

```
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,16 CPUs Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz (A0652),ASM,AES-NI)
Scanning the drive for archives:
1 file, 575973495 bytes (550 MiB)
Extracting archive: /datadisk/biomasters/raw/train_agbm.zip
--
Path = /datadisk/biomasters/raw/train_agbm.zip
Type = zip
Physical Size = 575973495
Everything is Ok
Folders: 1
Files: 8689
Size: 2280706098
Compressed: 575973495
```

### Extract test features

```bash
7z e -o/home/tam/Desktop/biomasters/test_features/ /datadisk/biomasters/raw/test_features_splits.zip
```

The output should look something like this:

```
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,16 CPUs Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz (A0652),ASM,AES-NI)
Scanning the drive for archives:
1 file, 6912625480 bytes (6593 MiB)
Extracting archive: /datadisk/biomasters/raw/test_features_splits.zip
--
Path = /datadisk/biomasters/raw/test_features_splits.zip
Type = zip
Physical Size = 6912625480
Embedded Stub Size = 4
64-bit = +
Total Physical Size = 49862298440
Multivolume = +
Volume Index = 4
Volumes = 5
Everything is Ok
Folders: 1
Files: 63348
Size: 78334396224
Compressed: 49862298440
```

### Extract test AGBM

```bash
7z e -o/home/tam/Desktop/biomasters/test_agbm/ /datadisk/biomasters/raw/test_agbm.tar
```

The output should look something like this:

```
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,16 CPUs Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz (A0652),ASM,AES-NI)
Scanning the drive for archives:
1 file, 729766400 bytes (696 MiB)
Extracting archive: /datadisk/biomasters/raw/test_agbm.tar
--
Path = /datadisk/biomasters/raw/test_agbm.tar
Type = tar
Physical Size = 729766400
Headers Size = 1421312
Code Page = UTF-8
Everything is Ok
Folders: 1
Files: 2773
Size: 727862586
Compressed: 729766400
```
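
The four extraction commands above follow the same pattern, so they can also be scripted. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, the archive names and directories are the ones used in the commands above, and `dry_run=True` only builds the commands without calling `7z`.

```python
import subprocess
from pathlib import PurePosixPath

# Archive file name for each output folder, as used in the commands above.
ARCHIVES = {
    "train_features": "train_features.zip",
    "train_agbm": "train_agbm.zip",
    "test_features": "test_features_splits.zip",
    "test_agbm": "test_agbm.tar",
}


def build_extract_command(archive, out_dir):
    # "e" extracts every file into one flat directory; 7z expects the output
    # directory to be glued directly to the -o switch, without a space.
    return ["7z", "e", f"-o{out_dir}", str(archive)]


def extract_all(raw_dir, out_root, dry_run=True):
    """Build (and optionally run) one 7z command per archive."""
    commands = []
    for name, archive in ARCHIVES.items():
        out_dir = f"{PurePosixPath(out_root) / name}/"
        cmd = build_extract_command(PurePosixPath(raw_dir) / archive, out_dir)
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises if 7z reports an error
    return commands


commands = extract_all("/datadisk/biomasters/raw", "/home/tam/Desktop/biomasters")
for cmd in commands:
    print(" ".join(cmd))
```

Passing `dry_run=False` would run the extractions one after another and stop at the first failure.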

## Prepare data

This step takes the average of all time steps available for each tile.
The Sentinel-2 time steps are incomplete: not all months are provided
for all tiles. In addition, the Clay model does not take time series as
input, so aggregating over time is a simplification, but an acceptable
one for the purpose of this example.

**In addition, we skip one orbit because it is nodata most of the time.**
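
The averaging described above can be sketched as follows. This is only an illustration of the idea and not the code in `preprocess_data.py`: the function name, the `(bands, height, width)` array layout, and the use of `None` for a missing month are all assumptions, and the orbit filtering is not shown.

```python
import numpy as np


def temporal_mean(time_steps):
    """Average the available time steps of one tile.

    `time_steps` is a list with one (bands, height, width) array per month,
    or None where a month is missing for this tile (assumed convention).
    """
    available = [np.asarray(step, dtype=np.float32) for step in time_steps if step is not None]
    if not available:
        raise ValueError("no time steps available for this tile")
    # Stack along a new time axis, then collapse that axis with the mean.
    return np.stack(available, axis=0).mean(axis=0)


january = np.full((2, 4, 4), 1.0)
march = np.full((2, 4, 4), 3.0)
cube = temporal_mean([january, None, march])  # February is missing
print(cube.shape, float(cube[0, 0, 0]))
```

Because missing months are dropped before averaging, tiles with different numbers of available time steps still produce cubes of the same shape.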


### Prepare training features

```bash
python finetune/regression/preprocess_data.py \
--features=/home/tam/Desktop/biomasters/train_features/ \
--cubes=/home/tam/Desktop/biomasters/train_cubes/ \
--processes=12 \
--sample=1 \
--overwrite
```

### Prepare test features

```bash
python finetune/regression/preprocess_data.py \
--features=/home/tam/Desktop/biomasters/test_features/ \
--cubes=/home/tam/Desktop/biomasters/test_cubes/ \
--processes=12 \
--sample=1 \
--overwrite
```