
Nbdev for Clay Documentation, Clay SDK and Clay notebooks all at once #102

Closed · wants to merge 44 commits

Conversation

brunosan (Member) commented:

I've started working on documentation, and decided to start over using nbdev.

It's a bit convoluted and I'm in over my head coding here, but it seems highly attractive to do all of this at once without extra work:

  • Explore the model with notebooks anyone can run.
  • All functions and Classes created can be pip installed, making it a small SDK for Clay.
  • Notebooks help generate the documentation website for all functions, with links to the code.
  • Documentation and sample code tests run on CI.

@weiji14 your work is still in the history, and we can revert to that. I've migrated your text to this PR.

It's early, but I wanted to give a heads-up in case others have experience with this.

A GitHub Action is already rendering a draft site at https://clay-foundation.github.io/model/

I've also focused on the v0 release notes, but we're still missing a lot of information. WIP.


brunosan and others added 30 commits September 29, 2023 14:08
* Initial conda environment and binder links

Add conda dependency specification file and getting-started instructions in the main README.md. The conda environment.yml is paired with a conda-lock.yml lockfile for full reproducibility. The main README.md contains quickstart buttons for Binder/Planetary Computer/SageMaker Studio Lab, and steps for installation and usage locally.

* ➕ Add zarr

An implementation of chunked, compressed, N-dimensional arrays for Python! Repo at https://github.com/zarr-developers/zarr-python
* ➕ Add jsonargparse[signatures]

Parsing of command line options, yaml/jsonnet config files and/or environment variables based on argparse! Also adding the signatures extras (which includes typeshed-client).

* 🌱 Setup LightningCLI trainer script

Setting up the command-line interface to run Lightning. Created a placeholder BaseDataModule and BaseLitModule to hold the data pipeline and model architecture respectively under the src/ folder. Documented in the main README.md how to run the LightningCLI commands, and also created a src/README.md documenting the Python modules in that folder.
* 🙈 Add .gitignore file

* 👷 Setup GitHub Actions Continuous Integration tests

Running tests on Ubuntu-22.04 and Python 3.11 only for now. Add a parametrized test to ensure that `python trainer.py fit --print_config=skip_null` works (as well as the validate/test subcommands). Tests are run using `python -m pytest src/tests/`.
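For illustration, a minimal sketch of such a parametrized test, assuming a subprocess-based invocation (the actual test in src/tests/ may drive LightningCLI differently):

```python
import subprocess

import pytest


# Hypothetical test; mirrors the CLI commands quoted above.
@pytest.mark.parametrize("subcommand", ["fit", "validate", "test"])
def test_print_config(subcommand: str):
    result = subprocess.run(
        args=["python", "trainer.py", subcommand, "--print_config=skip_null"],
        capture_output=True,
    )
    assert result.returncode == 0
```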
* 🔧 Add pre-commit config with pre-commit-hooks and ruff

Adding a .pre-commit-config.yaml file with some pre-commit hooks and the ruff linter/formatter. The ruff linter is configured to do autofix, and will run on python scripts and jupyter notebooks.

* 🔧 Configure ruff rules with pyproject.toml file

Enforce certain ruff formatting and lint rules such as UNIX-style line-endings, pycodestyle, pyflakes, isort, numpy, pylint and pyupgrade.

* 🚨 Fix F841 by returning cli variable

Fix `F841: Local variable `cli` is assigned to but never used` on trainer.py.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 🔧 Configure pre-commit.ci to run autoupdates on quarterly basis

Setting up pre-commit.ci to only run updates quarterly instead of the weekly default. Also explicitly stating that Pull Requests will be autofixed.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Geographic pandas extensions!
* combining data arrays into multi-sensor data cube

* implement cql2-json filters for S2 and S1

* script to generate merged datacube

* wip function for calculating sentinel 1 scene with max coverage of bbox

* formatting

* move script and remove notebook

* use geom instead of geodataframe for initial aoi

* use args in funcs

* use CENTROID for geom in cql2 query

* Use mosaic method, set singular time dimension based on Sentinel 2

* add configurable args for cloud cover percentage and nodata percentage

* use epsg code derived from Sentinel-2 properties, filter by best cloud-free conditions and orbit state

* remove extra filter

* map s2 for best image using datetime to id, set s2 bands as unique vars, mosaic s1 on time

* assign S2 time as dataset variable

* remove orbit filter

* wrap example in main

* move script to subdir

* use cloud variable

* use cloud variable

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* 🚨 Fix E501 Line too long

Wrapping docstrings in scripts/datacube.py to under 88 characters.

* ♻️ Refactor best_nodata and best_clouds into single sort function

Fixes F841 Local variable `best_nodata` is assigned to but never used. Only the best_clouds variable was used, and best_nodata was omitted, but both should be used. Doing this in a single pandas sort_values function.
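As a minimal sketch (column names assumed from the Sentinel-2 STAC properties mentioned in nearby commits), the combined sort looks roughly like:

```python
import pandas as pd

# Hypothetical STAC item properties; the real columns come from the
# Sentinel-2 items handled in scripts/datacube.py.
df = pd.DataFrame(
    {
        "eo:cloud_cover": [12.3, 0.5, 4.1],
        "s2:nodata_pixel_percentage": [0.0, 20.0, 5.0],
    }
)

# One sort_values call covers both criteria: least cloudy first,
# ties broken by least nodata.
best = df.sort_values(
    by=["eo:cloud_cover", "s2:nodata_pixel_percentage"], ascending=True
).iloc[0]
```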

* 🚑 Quickfix with getting the STAC item with a specific datetime

Patch cc99ae4

* 🏷️ Rename variables to ds_ (xr.Dataset) or da (xr.DataArray)

Using ds_ prefix for xr.Dataset objects, and da_ prefix for xr.DataArray objects.

* 🔧 Set pylint max-args to 6

Increase from the default value of 5 to 6.

* 🗑️ Replace .get_all_items() with .item_collection()

Fixes `FutureWarning: get_all_items() is deprecated, use item_collection() instead`.

* 🔥 Remove sorting by nodata and just sort by least cloud cover

No need to sort by `s2:nodata_pixel_percentage` anymore, just get the Sentinel-2 STAC item with the least cloud cover.

* 📝 More DataArray to Dataset renames

Missed a few more da_ to ds_ renames, following from 2af24be
* Add landcover based sampling scripts

Closes #28

* Drop duplicates, fix typo, uncomment compute_stats function.

* Fix comment that was out of sync with code
replace placeholder with our name and date
* Initial tile module that generates 256x256 tiled xarray datasets from the larger scene-level datacube

* update comments

* update comments

* add docstrings and initial cloud and nodata filter

* more efficient cloud and nodata filter

* example script to run the datacube and tiler modules

* adjust cloud filter

* return valid region of datacube pre-tiling

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix datacube processor (#43)

* Fix bugs introduced in PR #43

* some cosmetic updates

* add a catch for sampled dates which don't have S1 scenes within the +/- 3 day surrounding interval

* lower bad pixel percentage

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Wiesmann <yellowcap@users.noreply.github.com>
* 🚨 Fix line-length, boolean comparison and import errors

Fix linter errors:
- E501 Line too long
- E712 Comparison to `False` should be `cond is False` or `if not cond:`
- E402 Module level import not at top of file

* ✏️ Remove sys.path.append line
* ⬆️ Bump conda-lock from 2.4.2 to 2.5.1

Bumps [conda-lock](https://github.com/conda/conda-lock) from 2.4.2 to 2.5.1.
- [Release notes](https://github.com/conda/conda-lock/releases)
- [Commits](conda/conda-lock@v2.4.2...v2.5.1)

* ➕ Add fiona

Fiona reads and writes spatial data files!

* ➕ Add h5netcdf

Pythonic interface to netCDF4 via h5py!
* 📌 Pin to Pytorch 2.0 and CUDA 11.2

Somehow using the `--with-cuda=11.8` flag in conda-lock didn't work as expected to get the CUDA-built Pytorch instead of the CPU version. Temporarily downgrading from Pytorch 2.1 to 2.0 and CUDA 11.8 to 11.2, to make it possible to install torchvision=0.15.2 from conda-forge later.

* 🚧 Initial Vision Transformer architecture with MAE decoder

Initializing the neural network architecture layers, specifically a Vision Transformer (ViT) B/32 backbone and a Masked Autoencoder (MAE) decoder. Using Lightly for the MAE setup, with the ViT backbone from torchvision. Setup is mostly adapted from https://github.com/lightly-ai/lightly/blob/v1.4.21/examples/pytorch_lightning/mae.py

* ➕ Add transformers

State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow!

* 🏗️ Switch from torchvision to transformers ViTMAE

Changing from lightly/torchvision's ViTMAE implementation to HuggingFace transformers's ViTMAE. This allows us to configure the number of input channels to a number other than 3 (e.g. 12). However, transformer's ViTMAE is an all-in-one class rather than an Encoder/Decoder split (though there's a way to access either once the class is instantiated). Allowed for configuring the masking_ratio instead of the decoder_dim size, and removed the MSE loss because it is implemented in the ViTMAE class already.
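A minimal sketch of that setup, assuming transformers' ViTMAE API (the exact hyperparameters used in model_vit.py may differ):

```python
from transformers import ViTMAEConfig, ViTMAEForPreTraining

# Hypothetical hyperparameters; num_channels can now exceed the usual 3.
config = ViTMAEConfig(
    image_size=256,
    patch_size=32,
    num_channels=12,
    mask_ratio=0.75,  # configured instead of the decoder_dim size
)
model = ViTMAEForPreTraining(config)

# The all-in-one class still exposes its parts after instantiation:
encoder = model.vit
decoder = model.decoder
```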

* 👔 Implement forward pass and training_step

Run input images through the encoder and decoder, and compute the pixel reconstruction loss from training the Masked Autoencoder.

* ✅ Add unit test for MAELitModule

Ensure that running one training step on a mini-batch works. Created a random torch Dataset that generates tensors of shape (12, 256, 256) until there is real data to train on.

* 📌 Pin to CUDA 11.8

No need to pin to CUDA 11.2 since not using torchvision anymore. Patches 06535cd

* 🗃️ Increase input channels from 12 to 13

The datacube has 13 channels, namely 10 from Sentinel-2's 10m and 20m resolution bands, 2 from Sentinel-1's VV and VH, and 1 from the Copernicus DEM.

* 🐛 Remove hardcoded batch_size in assert statements

Use a variable self.B instead of hardcoding 32 as the batch_size in the assert statements checking the tensor shape, so that the last mini-batch with a size less than 32 can be seen by the model.

* 🚚 Rename to model_vit.py and ViTLitModule

Rename MAELitModule to ViTLitModule, and model.py to model_vit.py, since we might be trying out different neural network model architectures later.
Merging for testing on batch.

* Integrate tiler and s3 upload to data pipeline

* Remove unused file
⬆️ Bump pytorch from 2.0.0 to 2.1.0, CUDA from 11.8 to 12.0

Bumps [torch](https://github.com/pytorch/pytorch) from 2.0.0 to 2.1.0.
- [Release notes](https://github.com/pytorch/pytorch/releases)
- [Changelog](https://github.com/pytorch/pytorch/blob/main/RELEASE.md)
- [Commits](pytorch/pytorch@v2.0.0...v2.1.0)

Also changing from the CUDA 11.8 build to the CUDA 12.0 build
* ➕ Add torchdata

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries!

* ♻️ Refactor test_model_vit to use datapipe fixture

Decoupling the neural network model's unit test from the LightningDataModule by implementing a standalone datapipe fixture instead.

* ✨ Implement GeoTIFFDataPipeModule

Create a LightningDataModule to load GeoTIFF files. Uses torchdata to create the data pipeline. Using the FileLister DataPipe to iterate over *.tif files in the data/ folder, and do a random 80/20 split for the training and validation set. The GeoTIFF files are read into numpy.ndarrays using rasterio, and converted to torch.Tensors with the default collate function. Using rasterio instead of rioxarray to avoid an extra layer of overhead in the data loading.
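A condensed sketch of that pipeline, assuming torchdata's functional datapipe API (the real GeoTIFFDataPipeModule adds batching, collation, and more metadata):

```python
import numpy as np
import rasterio
import torch
import torchdata.datapipes.iter


def read_geotiff(filepath: str) -> torch.Tensor:
    # Read a GeoTIFF into a numpy.ndarray with rasterio, then to a Tensor.
    with rasterio.open(filepath) as dataset:
        array: np.ndarray = dataset.read()
    return torch.as_tensor(data=array.astype(dtype="float32"))


# Iterate over *.tif files and do a random 80/20 train/validation split.
dp = torchdata.datapipes.iter.FileLister(root="data/", masks="*.tif")
dp_train, dp_val = dp.random_split(
    weights={"train": 0.8, "validation": 0.2},
    seed=42,
    total_length=len(list(dp)),
)
dp_train = dp_train.map(fn=read_geotiff)
```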

* 🧵 Allow configuring num_workers in DataLoader

Enable setting the number of subprocesses used for data loading. Default to 8 for now, but can be configured on LightningCLI using `python trainer.py fit --data.num_workers=8`.

* 📌 Install torchdata=0.7.1 from conda-forge instead of PyPI

Contains a build of torchdata that is pre-compiled with the correct AWSSDK extension, and won't result in errors like `ValueError: curlCode: 77, Problem with the SSL CA cert (path? access rights?)`.

* 🔧 Allow configuring data path containing the GeoTIFF files

Enable setting the path to the folder containing the GeoTIFF data files. Defaults to data/ for now, but can be configured on LightningCLI using `python trainer.py fit --data.data_path=data/56HKH`. Also setting the recursive=True flag to allow for files in nested directories.

* ✅ Add unit test for GeoTIFFDataModule

Ensure that loading one mini-batch of data from a data folder works. Created two temporary random GeoTIFF files containing arrays of shape (3, 256, 256) in a fixture for the test.
* 🔧 Configure ModelCheckpoint callback

Save model weights to a properly named checkpoint file like vit_epoch-09_train_loss-3250218.25.ckpt, stored in the checkpoints/ folder by default. More configuration can be done through LightningCLI, see `python trainer.py fit --trainer.callbacks.help=ModelCheckpoint`.

* ⚡ Add AsyncCheckpointIO plugin to trainer

Enable the experimental plugin that saves checkpoint files asynchronously in a thread. See https://lightning.ai/docs/pytorch/2.1.0/api/lightning.pytorch.plugins.io.AsyncCheckpointIO.html.
➕ Add rioxarray

Rasterio xarray extension! Repo at https://github.com/corteva/rioxarray
* 🍻 Generate embeddings via prediction loop

Implement the embedding generator in the LightningModule's predict_step. The embeddings are tensor arrays that are saved to a .npy file in the data/embeddings/ folder. Input data is retrieved from the predict_dataloader, which is currently using the validation datapipe rather than a dedicated datapipe. Have documented how to generate the embedding output file using LightningCLI on the main README.md file. Also added a unit test to ensure that saving and loading from an embedding_0.npy file works.

* 🐛 Disable masking of patches on predict_step

Previously, 75% of the patches, or 48 out of a total of 64 were masked out, leaving 16 patches plus 1 cls_token = 17 sequences. Disabling the mask gives 64 + 1 cls_token = 65 sequences. Moved some assert statements with a fixed sequence_length dim from the forward function to the training_step. Also updated the unit test to ensure output embeddings have a shape like (batch_size, 65, 768).

* ♻️ Refactor LightningDataModule to not do random split on predict

Refactoring the setup method in the LightningDataModule to not do a random split on the predict stage. I.e. just do the GeoTIFF to torch.Tensor conversion directly, followed by batching and collating.

* ✅ Test predict stage in geotiffdatamodule

Need to explicitly pass an argument to stage in the test_geotiffdatapipemodule unit test. Testing both the fit and predict stages.

* 👔 Ensure that embeddings have no NaN values

Make sure that the generated embeddings do not have NaN values in them.

* 🗃️ Take mean of the embeddings along sequence_length dim

Instead of saving embeddings of shape (1, 65, 768), save out embeddings of shape (1, 768) instead. Done by taking the mean along the sequence_length dim, except for the cls_token part (first index in the 65).
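In code, that reduction is roughly the following (shapes taken from the commit messages above):

```python
import torch

# (batch, 1 cls_token + 64 patches, embedding dim), per the commits above
embeddings = torch.randn(1, 65, 768)

# Skip the cls_token at index 0, then average over the patch dimension.
mean_embeddings = embeddings[:, 1:, :].mean(dim=1)
assert mean_embeddings.shape == (1, 768)
```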
* Add bucket as argument to cli

* Improve efficiency of datacube

Keep S2 in Uint16 as long as possible, subset using indexing instead of sel

* Simplify print statements

* Add and document batch setup

* Add sample as geopackage

Geojson was too big for the linter to be happy

* Small edit on README
* 🗃️ Let LightningDataModule return spatiotemporal metadata

Making the LightningDataModule return not only the image, but also spatiotemporal metadata such as the bounding box, coordinate reference system, and date. The bbox and crs are in the raster image's native UTM projection for now, while the date is just a YYYY-MM-DD formatted string. Unit tests have been updated to ensure that the extra metadata is passed through.

* ♻️ Refactor test_geotiffdatapipemodule to use parametrization

Reduce duplicate code by using pytest.mark.parametrize, looping over fit and predict stages.

* 📝 Document returned outputs from _array_to_torch function

Improved the docstring of the _array_to_torch function, mentioning the input parameters (filepath) and the contents of the output dictionary (image, bbox, crs, date). Also updated the type hint of the function.

* 🚚 Rename crs to epsg

Since we're storing the EPSG integer and not the CRS representation.
* Remove default for subset

It is easy to forget this, and then run with a subset without actually intending to subset.

* Improve file name

Closes #69

- Zero padding for counter
- v before version number
- Underscores instead of hyphen separators
- Drop hyphens from date stamp

* Bump version to 02

* Make mgrs sample file external

Closes #71

* Add date to raster metadata

Closes #70

* Improve print statement
* Create model-license.md

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Rename model-license.md to LICENSE-MODEL.md

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
… and DEM (#60)

* check for no data on a tile level in sentinel 1 vv and vh, sentinel 2 and DEM

* adjust to run consecutively instead of all together, prevents unnecessary calculations

* adjust per nodata type in other bands

* Simplify nodata check by converting to loop

---------

Co-authored-by: Daniel Wiesmann <daniel@wiesmann.pt>
* Improve date handling for data pipeline

If no match is found for a year, others are being tried until
a match is found or all years have been tested

* Increase tile size to 512x512 pixels.

Closes #78

* Increase dates per location to 3

Closes #79

* Prevent printing s3 sync upload progress logs

* Move counter above cloud filter to ensure index consistency

This way, the tile IDs in the file names should be consistent across dates.

* Fix typo in comment

* Update batch run setup to new bucket name
weiji14 and others added 13 commits December 8, 2023 13:36
* ✨ Save embeddings with spatiotemporal metadata to GeoParquet

Storing the vector embeddings alongside some spatial bounding box and datetime information in a tabular GeoParquet format, instead of an npy file! Using geopandas to create a GeoDataFrame with three columns - date, embeddings, geometry. The date is stored in Arrow's date32 format, embeddings are in FixedShapeTensorArray, and geometry is in WKB. Have updated the unit test's sample fixture data with the extra spatiotemporal data, and tested that the saved GeoParquet file can be loaded back.
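A rough sketch of building and saving such a table with geopandas (column names from the commit message; the exact Arrow dtypes — date32, FixedShapeTensorArray, WKB geometry — are handled by pyarrow on write, with a plain list column standing in for the tensor array here):

```python
import datetime

import geopandas as gpd
import shapely.geometry

gdf = gpd.GeoDataFrame(
    data={
        "date": [datetime.date(2021, 1, 1)],
        "embeddings": [[0.1] * 768],  # one vector embedding per row
    },
    geometry=[shapely.geometry.box(minx=0.0, miny=0.0, maxx=1.0, maxy=1.0)],
    crs="OGC:CRS84",
)
gdf.to_parquet(path="embeddings.gpq")  # geometry stored as WKB
```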

* 📝 Document how embeddings are generated and saved to geoparquet

Improve the docstring of predict_step in the LightningModule on how the embeddings are generated, and then saved to a GeoParquet file with the spatiotemporal metadata. Included some ASCII art and a markdown table of what the tabular data looks like.

* 📝 Mention in main README.md that embeddings are saved to geoparquet

Document that the embeddings are stored with spatiotemporal metadata as a GeoParquet file. Increased batch size from 1 to 1024.

* 🎨 Update type hint of batch inputs, and add some inline comments

Should have updated the type hints in #66, but might as well do it here. Also adding some more inline comments and fixed a typo.
* 🔧 Increase image_size from 256 to 512, patch_size from 32 to 64

Increase the chip image size from 256 to 512 pixels, and the patch size from 32 to 64 pixels. Updated the unit test and an assert statement, and fixed a typo.

* 👽 Get YYYY-MM-DD from GeoTIFF tag instead of filename

Obtaining the YYYY-MM-DD date from the GeoTIFF's tag metadata, instead of parsing it from the filename, thanks to the change at 426aa06/#72.

* ✨ Allow GeoTIFFDataModule to get GeoTIFF data from an s3 bucket

New feature to allow passing in a URL to an s3 bucket, and loading the GeoTIFF data from there directly. Added a unit test that checks that this works to list a GeoTIFF file from s3://copernicus-dem-30m/. Also improved the docstring and type hint of the setup() function's 'stage' parameter.

* 🐛 Add sharding filter before loading GeoTIFF data to torch.Tensor

Need to do this so that the data loading is distributed to the workers, otherwise each worker is doing duplicated work. Also set num_workers to 1 in test_geotiffdatapipemodule to get a consistent result.

* 🙈 Gitignore checkpoints in nested folders

Ensure that *.ckpt files in sub-folders are ignored too.

* ⚡ Set float32 matmul precision to medium

Prevents messages like `You are using a CUDA device ('NVIDIA A10G') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance.`
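The corresponding one-liner, presumably set once before training starts:

```python
import torch

# Trade off float32 matmul precision for performance on Tensor Core GPUs.
torch.set_float32_matmul_precision("medium")
```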

* 📝 Mention in main README.md that data_path can be an s3 bucket

Just casually documenting in the main README.md on how one can directly generate embeddings from GeoTIFF files stored in an s3 bucket instead of locally.
…88)

* ➕ Add wandb

A CLI and library for interacting with the Weights and Biases API!

* 🔊 Log Masked Autoencoder reconstructions to WandB

Created a custom callback function to log visualizations of the input and output images to the Masked Autoencoder. Only showing the RGB bands of Sentinel-2 for now. A sample of 6 image pairs (original + reconstructed, so 12 in total) is uploaded to Weights and Biases.

Example LightningCLI command: `python trainer.py fit --trainer.max_epochs=20 --data.data_path=data/32VLM --trainer.logger=WandbLogger --trainer.logger.project=clay --trainer.logger.save_dir=checkpoints --trainer.callbacks+=LogMAEReconstructedImage`.

* ➕ Add scikit-image

Image processing in Python!

* 📸 Apply histogram equalization to RGB images

Enhance low contrast images by applying a histogram equalization stretching algorithm on the RGB images, instead of dividing by a magic number like 6000.
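A minimal sketch with scikit-image (chip shape and value range are assumptions):

```python
import numpy as np
import skimage.exposure

# Hypothetical raw Sentinel-2 RGB chip; equalize into [0, 1] instead of
# dividing by a fixed scale factor like 6000.
rgb = np.random.randint(low=0, high=10_000, size=(512, 512, 3)).astype("uint16")
stretched = skimage.exposure.equalize_hist(image=rgb)  # float64 in [0, 1]
```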

* 🔧 Increase default sample size from 6 to 8

More samples to look at! Also only running einsum conversion on as many samples as needed rather than the whole batch, and handling cases where num_samples may be more than the batch_size.

* 🧑‍💻 Make wandb a somewhat optional dependency

Allows for `from src.callback_wandb import LogMAEReconstruction` to run, even without wandb being installed. Helpful if someone doesn't want to install wandb for whatever reason.

* ✅ Add unit test for LogMAEReconstruction

Testing that the LogMAEReconstruction callback works to save a set of images to WandB. Testing this in offline mode only, with checks that artifacts are saved locally, and that the wandb images have the correct caption and format.

* 🐛 Compare expected folders using set instead of list

Order of the folders could change, so using set instead of list.

* 🧪 Prevent WandB logger from saving logs to local drive for now

Setting WANDB_MODE="disabled", so no files are logged to disk, though the wandb.Image(s) are still created. See if this helps to resolve the exit code 255 issue on GitHub Actions.

* 📝 Fix a typo and improve docstring

Minor changes to the docstring of the on_validation_batch_end method, and a typo fix.
…dings (#47)

* Add modified ViT to encode latlon, time, channels & position embeddings

* Add MAE for modified ViT

* Add docstrings & fix issue with complex indexing

* Fix the comments on loss computation

* Add datamodule & trainer to run an epoch of training

* Normalize data before feeding to the model

* Add fixed sincos embedding for position & bands

* Add logging & ckpt options

* Fix the order of coords from lat,lon to lon,lat

* Add clay tiny,small,medium,large model versions

* Remove hardcoded patch size in LogIntermediatePredictions callback

Retrieve the patch size value from the model architecture, rather than hardcoding as 32. Also ensure that the input image shape is the same as the predicted image from the decoder.

* Run clay small on image size 512 for 10 epochs with grad_acc

* Make the clay construction configurable

* Return the data path to reference for vector embeddings

* Remove duplicate dataset.py & geovit.py

* 🔀 Merge srm_trainer.py into trainer.py

Have one entrypoint to run the model using Lightning CLI. Switched model from VitLitModule to CLAYModule, and datamodule from GeoTIFFDataPipeModule to ClayDataModule. Temporarily disabling the logging and monitoring callbacks for now.

* 🔀 Combine clay.py and model.py into model_clay

Putting the CLAYModule (LightningModule) together with the CLAY torch.nn.Module in a single model_clay.py file. Have mentioned in src/README.md that model_clay.py is the one with custom spatiotemporal encoders, while the previous model_vit.py contains the vanilla Vision Transformer implementation.

* ➕ Add matplotlib-base

Publication quality figures in Python!

* 🚚 Move ClayDataset and ClayDataModule into datamodule.py

Putting the DataLoader code in one file - datamodule.py. The regular torch Dataset classes are placed on top of the existing torchdata-based functions/classes.

* 🚚 Move LogIntermediatePredictions callback into callbacks_wandb

Moving the LogIntermediatePredictions callback class from callbacks.py into callbacks_wandb.py.

* ♻️ Get WandB logger properly using a shared function

Getting the WandbLogger directly from the trainer, rather than having to pass it through __init__. Adapted from https://github.com/ashleve/lightning-hydra-template/blob/334601c0326a50ff301fbd76057b36408cf97ffa/src/callbacks/wandb_callbacks.py#L16C1-L34C6

* 🚨 Wrap docstring and fix too-many-arguments lint error

Converted docstrings from numpydoc style which uses less horizontal space but more vertical space. Also added a noqa comment for three instances of `PLR0913 Too many arguments in function definition`.

---------

Co-authored-by: SRM <soumya@developmentseed.org>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Wei Ji <23487320+weiji14@users.noreply.github.com>
* ➕ Add jupyter-book

Build a book with Jupyter Notebooks and Sphinx!

* 📝 Initialize Jupyter Book

Starting with a minimally modified Jupyter Book initialized with `jupyter-book create docs/`. Changed the `_config.yml` to use a proper title and the Clay logo. Included a Binder launch button and a footer with CC-BY-4.0 license.

Deleted the sample notebooks.ipynb and markdown-notebooks.md files, and excluded the book/requirements.txt (dependencies will be installed from environment.yml). Put in a placeholder installation page for now in the Table of Contents.

* 🚀 Deploy Jupyter Book to GitHub Pages via GitHub Actions

Continuous Integration workflow to build the Jupyter Book's html pages and publish it online to GitHub Pages. Based on https://jupyterbook.org/en/stable/publish/gh-pages.html#automatically-host-your-book-with-github-actions, but modernized to use GitHub Actions based publishing source, see https://github.blog/changelog/2022-07-27-github-pages-custom-github-actions-workflows-beta

* 📝 Add 'About Clay' section with links to GitHub and LinkedIn pages

Add external links to Clay's GitHub organization page, and LinkedIn.

* 🔍 Add badges to main README.md

Add badges pointing to the Jupyter Book page, and for the deploy-book.yml/test.yml GitHub Action statuses below the title in the main README.md page. Also modified the description into something more compelling.
…url (#86)

* 🗃️ Store source_url of GeoTIFF to GeoParquet file

Passing the URL or path of the GeoTIFF file through the datapipe, and into the model's prediction loop. The geopandas.GeoDataFrame now has an extra 'source_url' string column, and this is saved to the GeoParquet file too.

* 🚚 Save one GeoParquet file for each unique MGRS tile

For each MGRS code (e.g. 12ABC), save a GeoParquet file with a name formatted like `{MGRS:5}_v{VERSION:2}.gpq`, e.g. 12ABC_v01.gpq. Have updated the unit test to check that rows with different MGRS codes are saved to different files.

* ⚡ Save GeoParquet file with ZSTD compression

Using ZStandard compression instead of Parquet's default Snappy compression. Should result in slightly smaller filesizes, and slightly faster data transfer and compression (especially over the network). Also changed an assert statement to an if-then-raise instead.
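A minimal sketch of the compression switch (geopandas passes the keyword through to pyarrow; the single-row table here is a hypothetical stand-in for the embeddings GeoDataFrame):

```python
import geopandas as gpd
import shapely.geometry

gdf = gpd.GeoDataFrame(
    data={"mgrs": ["12ABC"]},
    geometry=[shapely.geometry.Point(0.0, 0.0)],
    crs="OGC:CRS84",
)
# ZStandard instead of Parquet's default Snappy compression.
gdf.to_parquet(path="12ABC_v01.gpq", compression="zstd")
```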

* ♻️ Predict with multiple workers and gather results to save

Speed up embedding generation by enabling multiple workers to fetch and load mini-batches of GeoTIFF files independently, and run the prediction. The prediction or generated embeddings from each worker (a geopandas.GeoDataFrame) is then concatenated together row-wise, before getting passed to the GeoParquet output script. This is done via LightningModule's `on_predict_epoch_end` hook. Also documented these new processing steps in the docstring.
…dule (#91)

* 🔧 Standardize on a data_dir parameter with a str type

The GeoTIFFDataModule was using data_path:str, while ClayDataModule was using data_dir:Path. Standardize both to be data_dir:str instead. Some parts of this commit are adapted from 1009697.

Also placed all the ClayDataModule's setup logic under `if stage=='fit'`, to reduce diff when predict step is implemented later.

* 🎨 Get YYYY-MM-DD from GeoTIFF tag rather than filename

More robust way of obtaining the Sentinel-2 imagery's acquisition date. Also returning the date in the datacube now.

* 🎨 Simplify lonlat centroid calculation and return UTM bbox/epsg

Can use rasterio's built-in lnglat() method to get the geographic center of the chip, instead of calculating it manually. See https://rasterio.readthedocs.io/en/latest/api/rasterio._base.html#rasterio._base.DatasetBase.lnglat Also returning the original UTM bounding box and EPSG code in the datacube.
* ✨ Allow ClayDataModule to get GeoTIFF data from an s3 bucket

Allow passing in a URL to an s3 bucket, and loading the GeoTIFF data from there directly. Using the same torchdata based code for the s3 pathway as with commit f288eb8 in #85.

* 🚚 Rename datacube's path key to source_url

Using the same 'source_url' key in the returned datacube dictionary for both ClayDataModule and GeoTIFFDataPipeModule

* 🚑 Use try-except to get absolute chip_path or fallback to str

The getattr doesn't actually work properly, since we need to call chip_path.absolute() with brackets. Using a good ol' try-except statement instead, with the fallback being just the plain chip_path str (for s3 URLs).

* ✨ Implement predict_dataloader for ClayDataModule

Similar to the train/val dataloaders, but shuffling and pin_memory are both disabled.

* ✅ Add parametrized test for checking ClayDataModule

Ensure that outputs of both ClayDataModule and GeoTIFFDataPipeModule are the same-ish. Needed to make the split_ratio in ClayDataModule configurable, and check sorted list outputs instead of unsorted outputs for determinism. Fixed some hardcoded tensor shapes/dtypes, and dictionary keys too. Removed the nan_to_num casting of the image pixels in ClayDataModule so that int16 dtype inputs are accepted.

* 📝 Edit docstrings in test_datamodule.py to be more generic

Not just testing one, but two different LightningDataModules now!

* 🔧 Add GDAL environment variables that might help with s3 loading

Setting GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR and GDAL_HTTP_MERGE_CONSECUTIVE_RANGES=YES is supposed to improve GDAL performance when reading Cloud-Optimized GeoTIFFs. See https://gdal.org/user/configoptions.html.
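In code, these are plain environment variables, set before GDAL opens any remote file:

```python
import os

# Improve GDAL performance when reading Cloud-Optimized GeoTIFFs over s3.
os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"
os.environ["GDAL_HTTP_MERGE_CONSECUTIVE_RANGES"] = "YES"
```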
…#95)

* ♻️ Better handle pos and band encodings across multi-devices

Move the pos_encoding and band_encoding layers to the correct device in a way that allows Lightning to do multi-gpu properly. The reported loss is now synced or reduced/averaged across multiple devices too. Partially cherry-picked from 1a40f56

Co-Authored-By: SRM <soumya@developmentseed.org>

* ♻️ Compute num_masked_patches dynamically based on mask_ratio

So that the masking can be turned off during prediction using `self.model.encoder.mask_ratio = 0`, where self is an instance of CLAYModule. The num_masked_patches integer value is now calculated on-the-fly by multiplying mask_ratio with num_patches.

* 🎨 Register pos_encoding and band_encoding properly on device

Since the pos_encoding and band_encoding tensors are declared in the __init__ method, we'll need to register them so that they are moved to the correct device by Lightning during the forward call. See https://lightning.ai/docs/pytorch/2.1.0/starter/converting.html#remove-any-cuda-or-to-device-calls
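A minimal sketch of the pattern, with a hypothetical tensor shape:

```python
import torch


class Encoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        pos_encoding = torch.zeros(1, 64, 768)  # hypothetical shape
        # A buffer is moved between devices along with the module by
        # Lightning, but is not treated as a trainable parameter.
        self.register_buffer(name="pos_encoding", tensor=pos_encoding)
```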

---------

Co-authored-by: SRM <soumya@developmentseed.org>
* 📝 Document how to generate vector embeddings

Step by step instructions on how to produce embeddings from the pretrained model. From checking that one has permissions to get the GeoTIFF files, to downloading of the model checkpoint, and running the model prediction to get the GeoParquet output. Also gave a tip on what a suitable VM instance would be like.

* 📝 Document details of how the mean embeddings were computed

Extra technical details on how the raw (B, 1538, 768) embeddings are turned into (B, 768) shaped embeddings by taking the mean along the spatial patches.

* 📝 Document format of the GeoParquet table and how to read it

Useful details about the filename convention and table schema of the embeddings stored in GeoParquet format, and some sample GeoPandas code showing how to read a *.gpq file. Also linking to some guides and resources from the Cloud Native Geospatial Foundation.
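A short sketch of reading such a file back (filename follows the convention above; columns per the earlier commits):

```python
import geopandas as gpd

# Hypothetical file from the {MGRS:5}_v{VERSION:2}.gpq naming convention.
gdf = gpd.read_parquet(path="12ABC_v01.gpq")
print(gdf[["source_url", "date", "embeddings", "geometry"]].head())
```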

* ✏️ Typo embedding -> embeddings

Never sure whether it's singular or plural.
* 📝 Document how to finetune pretrained model on downstream task

Explaining how the pre-trained model can be finetuned after attaching a head to the network. Written by Lilly.

---------

Co-authored-by: Lilly Thomas <lilly@developmentseed.org>
📝 Document how the benchmark dataset labels were prepared

Mention why we decided to use Cloud to Street for the initial benchmark dataset, and how the imagery and label data was processed to fit into the Clay Foundation model. Written by Lilly.

Co-authored-by: Lilly Thomas <lilly@developmentseed.org>
brunosan self-assigned this on Dec 26, 2023
weiji14 marked this pull request as a draft on December 26, 2023
@weiji14 (Contributor) left a comment:

@brunosan, we do have experience with Quarto (what nbdev is using behind the scenes), but what you are doing here can actually be achieved with Jupyter Book as well (see https://jupyterbook.org/en/stable/content/execute.html). That said, even if we decide to go with nbdev/Quarto, this is a big PR, and really it should be split into several parts. See comments below.

@weiji14 (Contributor) commented on Dec 27, 2023:

  • All functions and Classes created can be pip installed, making it a small SDK for Clay.

setup.py is being superseded by pyproject.toml (see e.g. https://snarky.ca/what-the-heck-is-pyproject-toml/), and we really should migrate to that instead.

Also, the way you are structuring the code to make it pip installable is not ideal, because everything is in the docs/ folder. The code should be packaged from the src/ folder instead.

brunosan (Member, Author):

AFAIK nbdev does not use pyproject.toml; there's an open PR, but it seems abandoned and unmerged. If we use nbdev, we might need to go with setup.py.

@weiji14 (Contributor):

These HTML files should be rendered on the fly, and not committed to git!!

brunosan (Member, Author):

Agree. I'm trying to fix it; it's probably best to abandon this PR and start fresh with the same content, not adding [most] HTML files. I still think in some cases we might need to.

@weiji14 (Contributor):

Same with these .js (and .css) files, they should not be in git history.

brunosan (Member, Author):

Agree.

@weiji14 (Contributor):

Release notes (and the roadmap) don't need to be in a Jupyter Notebook; both Quarto and Jupyter Book allow for Markdown - https://quarto.org/docs/output-formats/gfm.html

brunosan (Member, Author):

I do like the option of providing both the HTML for easy viewing and the actual notebook, so users can fork from it directly for their own work.

@weiji14 (Contributor):

These PNG images should also be generated on the fly from the notebook, rather than committed directly to git history.

brunosan (Member, Author):

Yes in most cases, but sometimes, especially for the "Release notes", some of the calculations are quite heavy, so I think we should indeed commit those outputs in those cases.

I.e. most outputs are rendered on the fly, but some heavy calculations are not. nbdev allows for that, by just adding a cell at the top with:

---
skip_exec: true
---

On clayground_db.py (outdated):
@weiji14 (Contributor):

Question: Is the vector database playground (a.k.a clayground) code meant to be in this Clay-foundation/model repo too? I know we demo-ed this last week as a proof of concept, but wasn't @Clay-foundation/ode supposed to work on this in a separate repo?

brunosan (Member, Author):

Correct, I'll remove it.

This is fully Soumya's work; I'm just pushing it here for simplicity, to share for the moment.

squash fix

nbdev init

added basic read

test rendered

Start release doc

rendered docs

better name for lib

WIP v0 release docs

rendered output

ignore local sync

old docs fully ported

WIP

generated output

generated output

output generated

test new nbdev

generated output

test custom deploy of nbdev docs

add jobs

no description

no inputs

add geopandas

add tqdm

more deps

small cleanup

docs are inside /docs

weird relatives paths

WIP

rel path

trying new workflow script

write permission

bad params

dark color top

I think I can ignore locally generated static content

dark

rename core to model

renamed

GH pages setting

new workflow

path

path

checkout workflow

paths

../

actions

action

docs/_docs/

ignore rendered output

ignore rendered output

stub

clearer

move release to roadmap, fix outputs

add image

Revert "Add minimal version of clayground"

This reverts commit 6e2aac2.

Remove all files from docs/docs

leave root unchanged
@brunosan (Member, Author):

Too many errors on this branch. Abandoning it and starting over.
