
Nbdev for Clay Documentation, Clay SDK and Clay notebooks all at once #102

Closed · wants to merge 44 commits

Conversation

brunosan (Member) commented:

I've started working on documentation, and decided to start over using nbdev.

It's a bit convoluted and I'm in over my head coding here, but it seems highly attractive to do all of this at once without extra work:

  • Explore the model with notebooks anyone can run.
  • All functions and Classes created can be pip installed, making it a small SDK for Clay.
  • Notebooks help generate the documentation website for all functions, with links to the code.
  • Documentation and sample code tests run on CI.

@weiji14 your work is still in the history, and we can revert to that. I've migrated your text to this PR.

It's early, but I wanted to give a heads-up in case others have experience with this.

A GitHub Action is already rendering a draft site at https://clay-foundation.github.io/model/

I've also focused on the v0 release notes, but we're still missing a lot of information. WIP.


brunosan and others added 30 commits September 29, 2023 14:08
* Initial conda environment and binder links

Add conda dependency specification file and getting-started instructions in the main README.md. The conda environment.yml is paired with a conda-lock.yml lockfile for full reproducibility. The main README.md contains quickstart buttons for Binder/Planetary Computer/SageMaker Studio Lab, and steps for installation and usage locally.

* ➕ Add zarr

An implementation of chunked, compressed, N-dimensional arrays for Python! Repo at https://github.com/zarr-developers/zarr-python
* ➕ Add jsonargparse[signatures]

Parsing of command line options, yaml/jsonnet config files and/or environment variables based on argparse! Also adding the signatures extras (which includes typeshed-client).

* 🌱 Setup LightningCLI trainer script

Setting up the command-line interface to run Lightning. Created a placeholder BaseDataModule and BaseLitModule to hold the data pipeline and model architecture respectively under the src/ folder. Documented in the main README.md how to run the LightningCLI commands, and also created a src/README.md documenting the Python modules in that folder.
* 🙈 Add .gitignore file

* 👷 Setup GitHub Actions Continuous Integration tests

Running tests on Ubuntu-22.04 and Python 3.11 only for now. Add a parametrized test to ensure that `python trainer.py fit --print_config=skip_null` works (as well as the validate/test subcommands). Tests are run using `python -m pytest src/tests/`.
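For illustration, a minimal sketch of such a parametrized test, assuming a subprocess-based invocation (the actual test in src/tests/ may drive LightningCLI differently):

```python
import subprocess

import pytest


# Hypothetical test; mirrors the CLI commands quoted above.
@pytest.mark.parametrize("subcommand", ["fit", "validate", "test"])
def test_print_config(subcommand: str):
    result = subprocess.run(
        args=["python", "trainer.py", subcommand, "--print_config=skip_null"],
        capture_output=True,
    )
    assert result.returncode == 0
```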
* 🔧 Add pre-commit config with pre-commit-hooks and ruff

Adding a .pre-commit-config.yaml file with some pre-commit hooks and the ruff linter/formatter. The ruff linter is configured to do autofix, and will run on python scripts and jupyter notebooks.

* 🔧 Configure ruff rules with pyproject.toml file

Enforce certain ruff formatting and lint rules such as UNIX-style line-endings, pycodestyle, pyflakes, isort, numpy, pylint and pyupgrade.

* 🚨 Fix F841 by returning cli variable

Fix `F841: Local variable `cli` is assigned to but never used` on trainer.py.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 🔧 Configure pre-commit.ci to run autoupdates on quarterly basis

Setting up pre-commit.ci to only run updates quarterly instead of the weekly default. Also explicitly stating that Pull Requests will be autofixed.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Geographic pandas extensions!
* combining data arrays into multi-sensor data cube

* implement cql2-json filters for S2 and S1

* script to generate merged datacube

* wip function for calculating sentinel 1 scene with max coverage of bbox

* formatting

* move script and remove notebook

* use geom instead of geodataframe for initial aoi

* use args in funcs

* use CENTROID for geom in cql2 query

* Use mosaic method, set singular time dimension based on Sentinel 2

* add configurable args for cloud cover percentage and nodata percentage

* use epsg code derived from Sentinel-2 properties, filter by best cloud-free conditions and orbit state

* remove extra filter

* map s2 for best image using datetime to id, set s2 bands as unique vars, mosaic s1 on time

* assign S2 time as dataset variable

* remove orbit filter

* wrap example in main

* move script to subdir

* use cloud variable

* use cloud variable

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* 🚨 Fix E501 Line too long

Wrapping docstrings in scripts/datacube.py to under 88 characters.

* ♻️ Refactor best_nodata and best_clouds into single sort function

Fixes F841 Local variable `best_nodata` is assigned to but never used. Only the best_clouds variable was used, and best_nodata was omitted, but both should be used. Doing this in a single pandas sort_values function.
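As a minimal sketch (column names assumed from the Sentinel-2 STAC properties mentioned in nearby commits), the combined sort looks roughly like:

```python
import pandas as pd

# Hypothetical STAC item properties; the real columns come from the
# Sentinel-2 items handled in scripts/datacube.py.
df = pd.DataFrame(
    {
        "eo:cloud_cover": [12.3, 0.5, 4.1],
        "s2:nodata_pixel_percentage": [0.0, 20.0, 5.0],
    }
)

# One sort_values call covers both criteria: least cloudy first,
# ties broken by least nodata.
best = df.sort_values(
    by=["eo:cloud_cover", "s2:nodata_pixel_percentage"], ascending=True
).iloc[0]
```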

* 🚑 Quickfix with getting the STAC item with a specific datetime

Patch cc99ae4

* 🏷️ Rename variables to ds_ (xr.Dataset) or da (xr.DataArray)

Using ds_ prefix for xr.Dataset objects, and da_ prefix for xr.DataArray objects.

* 🔧 Set pylint max-args to 6

Increase from the default value of 5 to 6.

* 🗑️ Replace .get_all_items() with .item_collection()

Fixes `FutureWarning: get_all_items() is deprecated, use item_collection() instead`.

* 🔥 Remove sorting by nodata and just sort by least cloud cover

No need to sort by `s2:nodata_pixel_percentage` anymore, just get the Sentinel-2 STAC item with the least cloud cover.

* 📝 More DataArray to Dataset renames

Missed a few more da_ to ds_ renames, following from 2af24be
* Add landcover based sampling scripts

Closes #28

* Drop duplicates, fix typo, uncomment compute_stats function.

* Fix comment that was out of sync with code
replace placeholder with our name and date
* Initial tile module that generates 256x256 tiled xarray datasets from the larger scene-level datacube

* update comments

* update comments

* add docstrings and initial cloud and nodata filter

* more efficient cloud and nodata filter

* example script to run the datacube and tiler modules

* adjust cloud filter

* return valid region of datacube pre-tiling

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix datacube processor (#43)

* Fix bugs introduced in PR #43

* some cosmetic updates

* add a catch for sampled dates which don't have S1 scenes within the +/- 3 day surrounding interval

* lower bad pixel percentage

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Wiesmann <yellowcap@users.noreply.github.com>
* 🚨 Fix line-length, boolean comparison and import errors

Fix linter errors:
- E501 Line too long
- E712 Comparison to `False` should be `cond is False` or `if not cond:`
- E402 Module level import not at top of file

* ✏️ Remove sys.path.append line
* ⬆️ Bump conda-lock from 2.4.2 to 2.5.1

Bumps [conda-lock](https://github.com/conda/conda-lock) from 2.4.2 to 2.5.1.
- [Release notes](https://github.com/conda/conda-lock/releases)
- [Commits](conda/conda-lock@v2.4.2...v2.5.1)

* ➕ Add fiona

Fiona reads and writes spatial data files!

* ➕ Add h5netcdf

Pythonic interface to netCDF4 via h5py!
* 📌 Pin to Pytorch 2.0 and CUDA 11.2

Somehow using the `--with-cuda=11.8` flag in conda-lock didn't work as expected to get the CUDA-built Pytorch instead of the CPU version. Temporarily downgrading from Pytorch 2.1 to 2.0 and CUDA 11.8 to 11.2, to make it possible to install torchvision=0.15.2 from conda-forge later.

* 🚧 Initial Vision Transformer architecture with MAE decoder

Initializing the neural network architecture layers, specifically a Vision Transformer (ViT) B/32 backbone and a Masked Autoencoder (MAE) decoder. Using Lightly for the MAE setup, with the ViT backbone from torchvision. Setup is mostly adapted from https://github.com/lightly-ai/lightly/blob/v1.4.21/examples/pytorch_lightning/mae.py

* ➕ Add transformers

State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow!

* 🏗️ Switch from torchvision to transformers ViTMAE

Changing from lightly/torchvision's ViTMAE implementation to HuggingFace transformers's ViTMAE. This allows us to configure the number of input channels to a number other than 3 (e.g. 12). However, transformer's ViTMAE is an all-in-one class rather than an Encoder/Decoder split (though there's a way to access either once the class is instantiated). Allowed for configuring the masking_ratio instead of the decoder_dim size, and removed the MSE loss because it is implemented in the ViTMAE class already.
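A minimal sketch of that setup, assuming transformers' ViTMAE API (the exact hyperparameters used in model_vit.py may differ):

```python
from transformers import ViTMAEConfig, ViTMAEForPreTraining

# Hypothetical hyperparameters; num_channels can now exceed the usual 3.
config = ViTMAEConfig(
    image_size=256,
    patch_size=32,
    num_channels=12,
    mask_ratio=0.75,  # configured instead of the decoder_dim size
)
model = ViTMAEForPreTraining(config)

# The all-in-one class still exposes its parts after instantiation:
encoder = model.vit
decoder = model.decoder
```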

* 👔 Implement forward pass and training_step

Run input images through the encoder and decoder, and compute the pixel reconstruction loss from training the Masked Autoencoder.

* ✅ Add unit test for MAELitModule

Ensure that running one training step on a mini-batch works. Created a random torch Dataset that generates tensors of shape (12, 256, 256) until there is real data to train on.

* 📌 Pin to CUDA 11.8

No need to pin to CUDA 11.2 since not using torchvision anymore. Patches 06535cd

* 🗃️ Increase input channels from 12 to 13

The datacube has 13 channels, namely 10 from Sentinel-2's 10m and 20m resolution bands, 2 from Sentinel-1's VV and VH, and 1 from the Copernicus DEM.

* 🐛 Remove hardcoded batch_size in assert statements

Use a variable self.B instead of hardcoding 32 as the batch_size in the assert statements checking the tensor shape, so that the last mini-batch with a size less than 32 can be seen by the model.

* 🚚 Rename to model_vit.py and ViTLitModule

Rename MAELitModule to ViTLitModule, and model.py to model_vit.py, since we might be trying out different neural network model architectures later.
Merging for testing on batch.

* Integrate tiler and s3 upload to data pipeline

* Remove unused file
⬆️ Bump pytorch from 2.0.0 to 2.1.0, CUDA from 11.8 to 12.0

Bumps [torch](https://github.com/pytorch/pytorch) from 2.0.0 to 2.1.0.
- [Release notes](https://github.com/pytorch/pytorch/releases)
- [Changelog](https://github.com/pytorch/pytorch/blob/main/RELEASE.md)
- [Commits](pytorch/pytorch@v2.0.0...v2.1.0)

Also changing from the CUDA 11.8 build to the CUDA 12.0 build
* ➕ Add torchdata

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries!

* ♻️ Refactor test_model_vit to use datapipe fixture

Decoupling the neural network model's unit test from the LightningDataModule by implementing a standalone datapipe fixture instead.

* ✨ Implement GeoTIFFDataPipeModule

Create a LightningDataModule to load GeoTIFF files. Uses torchdata to create the data pipeline. Using the FileLister DataPipe to iterate over *.tif files in the data/ folder, and do a random 80/20 split for the training and validation set. The GeoTIFF files are read into numpy.ndarrays using rasterio, and converted to torch.Tensors with the default collate function. Using rasterio instead of rioxarray to avoid an extra layer of overhead in the data loading.
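A condensed sketch of that pipeline, assuming torchdata's functional datapipe API (the real GeoTIFFDataPipeModule adds batching, collation, and more metadata):

```python
import numpy as np
import rasterio
import torch
import torchdata.datapipes.iter


def read_geotiff(filepath: str) -> torch.Tensor:
    # Read a GeoTIFF into a numpy.ndarray with rasterio, then to a Tensor.
    with rasterio.open(filepath) as dataset:
        array: np.ndarray = dataset.read()
    return torch.as_tensor(data=array.astype(dtype="float32"))


# Iterate over *.tif files and do a random 80/20 train/validation split.
dp = torchdata.datapipes.iter.FileLister(root="data/", masks="*.tif")
dp_train, dp_val = dp.random_split(
    weights={"train": 0.8, "validation": 0.2},
    seed=42,
    total_length=len(list(dp)),
)
dp_train = dp_train.map(fn=read_geotiff)
```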

* 🧵 Allow configuring num_workers in DataLoader

Enable setting the number of subprocesses used for data loading. Default to 8 for now, but can be configured on LightningCLI using `python trainer.py fit --data.num_workers=8`.

* 📌 Install torchdata=0.7.1 from conda-forge instead of PyPI

Contains a build of torchdata that is pre-compiled with the correct AWSSDK extension, and won't result in errors like `ValueError: curlCode: 77, Problem with the SSL CA cert (path? access rights?)`.

* 🔧 Allow configuring data path containing the GeoTIFF files

Enable setting the path to the folder containing the GeoTIFF data files. Defaults to data/ for now, but can be configured on LightningCLI using `python trainer.py fit --data.data_path=data/56HKH`. Also setting the recursive=True flag to allow for files in nested directories.

* ✅ Add unit test for GeoTIFFDataModule

Ensure that loading one mini-batch of data from a data folder works. Created two temporary random GeoTIFF files containing arrays of shape (3, 256, 256) in a fixture for the test.
* 🔧 Configure ModelCheckpoint callback

Save model weights to a properly named checkpoint file like vit_epoch-09_train_loss-3250218.25.ckpt, stored in the checkpoints/ folder by default. More configuration can be done through LightningCLI, see `python trainer.py fit --trainer.callbacks.help=ModelCheckpoint`.

* ⚡ Add AsyncCheckpointIO plugin to trainer

Enable the experimental plugin that saves checkpoint files asynchronously in a thread. See https://lightning.ai/docs/pytorch/2.1.0/api/lightning.pytorch.plugins.io.AsyncCheckpointIO.html.
➕ Add rioxarray

Rasterio xarray extension! Repo at https://github.com/corteva/rioxarray
* 🍻 Generate embeddings via prediction loop

Implement the embedding generator in the LightningModule's predict_step. The embeddings are tensor arrays that are saved to a .npy file in the data/embeddings/ folder. Input data is retrieved from the predict_dataloader, which is currently using the validation datapipe rather than a dedicated datapipe. Have documented how to generate the embedding output file using LightningCLI on the main README.md file. Also added a unit test to ensure that saving and loading from an embedding_0.npy file works.

* 🐛 Disable masking of patches on predict_step

Previously, 75% of the patches, or 48 out of a total of 64 were masked out, leaving 16 patches plus 1 cls_token = 17 sequences. Disabling the mask gives 64 + 1 cls_token = 65 sequences. Moved some assert statements with a fixed sequence_length dim from the forward function to the training_step. Also updated the unit test to ensure output embeddings have a shape like (batch_size, 65, 768).

* ♻️ Refactor LightningDataModule to not do random split on predict

Refactoring the setup method in the LightningDataModule to not do a random split on the predict stage. I.e. just do the GeoTIFF to torch.Tensor conversion directly, followed by batching and collating.

* ✅ Test predict stage in geotiffdatamodule

Need to explicitly pass an argument to stage in the test_geotiffdatapipemodule unit test. Testing both the fit and predict stages.

* 👔 Ensure that embeddings have no NaN values

Make sure that the generated embeddings do not have NaN values in them.

* 🗃️ Take mean of the embeddings along sequence_length dim

Instead of saving embeddings of shape (1, 65, 768), save out embeddings of shape (1, 768) instead. Done by taking the mean along the sequence_length dim, except for the cls_token part (first index in the 65).
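In code, that reduction is roughly the following (shapes taken from the commit messages above):

```python
import torch

# (batch, 1 cls_token + 64 patches, embedding dim), per the commits above
embeddings = torch.randn(1, 65, 768)

# Skip the cls_token at index 0, then average over the patch dimension.
mean_embeddings = embeddings[:, 1:, :].mean(dim=1)
assert mean_embeddings.shape == (1, 768)
```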
* Add bucket as argument to cli

* Improve efficiency of datacube

Keep S2 in Uint16 as long as possible, subset using indexing instead of sel

* Simplify print statements

* Add and document batch setup

* Add sample as geopackage

Geojson was too big for the linter to be happy

* Small edit on README
* 🗃️ Let LightningDataModule return spatiotemporal metadata

Making the LightningDataModule return not only the image, but also spatiotemporal metadata such as the bounding box, coordinate reference system, and date. The bbox and crs are in the raster image's native UTM projection for now, while the date is just a YYYY-MM-DD formatted string. Unit tests have been updated to ensure that the extra metadata is passed through.

* ♻️ Refactor test_geotiffdatapipemodule to use parametrization

Reduce duplicate code by using pytest.mark.parametrize, looping over fit and predict stages.

* 📝 Document returned outputs from _array_to_torch function

Improved the docstring of the _array_to_torch function, mentioning the input parameters (filepath) and the contents of the output dictionary (image, bbox, crs, date). Also updated the type hint of the function.

* 🚚 Rename crs to epsg

Since we're storing the EPSG integer and not the CRS representation.
* Remove default for subset

It is easy to forget this, and then run with a subset without actually intending to subset.

* Improve file name

Closes #69

- Zero padding for counter
- v before version number
- Underscores instead of hyphen separators
- Drop hyphens from date stamp

* Bump version to 02

* Make mgrs sample file external

Closes #71

* Add date to raster metadata

Closes #70

* Improve print statement
* Create model-license.md

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Rename model-license.md to LICENSE-MODEL.md

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
… and DEM (#60)

* check for no data on a tile level in sentinel 1 vv and vh, sentinel 2 and DEM

* adjust to run consecutively instead of all together, prevents unnecessary calculations

* adjust per nodata type in other bands

* Simplify nodata check by converting to loop

---------

Co-authored-by: Daniel Wiesmann <daniel@wiesmann.pt>
* Improve date handling for data pipeline

If no match is found for a year, others are being tried until
a match is found or all years have been tested

* Increase tile size to 512x512 pixels.

Closes #78

* Increase dates per location to 3

Closes #79

* Prevent printing s3 sync upload progress logs

* Move counter above cloud filter to ensure index consistency

This way, the tile IDs in the file names should be consistent across dates.

* Fix typo in comment

* Update batch run setup to new bucket name
weiji14 and others added 13 commits December 8, 2023 13:36
* ✨ Save embeddings with spatiotemporal metadata to GeoParquet

Storing the vector embeddings alongside some spatial bounding box and datetime information in a tabular GeoParquet format, instead of an npy file! Using geopandas to create a GeoDataFrame with three columns - date, embeddings, geometry. The date is stored in Arrow's date32 format, embeddings are in FixedShapeTensorArray, and geometry is in WKB. Have updated the unit test's sample fixture data with the extra spatiotemporal data, and tested that the saved GeoParquet file can be loaded back.
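A rough sketch of building and saving such a table with geopandas (column names from the commit message; the exact Arrow dtypes — date32, FixedShapeTensorArray, WKB geometry — are handled by pyarrow on write, with a plain list column standing in for the tensor array here):

```python
import datetime

import geopandas as gpd
import shapely.geometry

gdf = gpd.GeoDataFrame(
    data={
        "date": [datetime.date(2021, 1, 1)],
        "embeddings": [[0.1] * 768],  # one vector embedding per row
    },
    geometry=[shapely.geometry.box(minx=0.0, miny=0.0, maxx=1.0, maxy=1.0)],
    crs="OGC:CRS84",
)
gdf.to_parquet(path="embeddings.gpq")  # geometry stored as WKB
```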

* 📝 Document how embeddings are generated and saved to geoparquet

Improve the docstring of predict_step in the LightningModule on how the embeddings are generated, and then saved to a GeoParquet file with the spatiotemporal metadata. Included some ASCII art and a markdown table of what the tabular data looks like.

* 📝 Mention in main README.md that embeddings are saved to geoparquet

Document that the embeddings are stored with spatiotemporal metadata as a GeoParquet file. Increased batch size from 1 to 1024.

* 🎨 Update type hint of batch inputs, and add some inline comments

Should have updated the type hints in #66, but might as well do it here. Also adding some more inline comments and fixed a typo.
* 🔧 Increase image_size from 256 to 512, patch_size from 32 to 64

Increase the chip image size from 256 to 512 pixels, and the patch size from 32 to 64 pixels. Updated the unit test and an assert statement, and fixed a typo.

* 👽 Get YYYY-MM-DD from GeoTIFF tag instead of filename

Obtaining the YYYY-MM-DD date from the GeoTIFF's tag metadata, instead of parsing it from the filename, thanks to the change at 426aa06/#72.

* ✨ Allow GeoTIFFDataModule to get GeoTIFF data from an s3 bucket

New feature to allow passing in a URL to an s3 bucket, and loading the GeoTIFF data from there directly. Added a unit test that checks that this works to list a GeoTIFF file from s3://copernicus-dem-30m/. Also improved the docstring and type hint of the setup() function's 'stage' parameter.

* 🐛 Add sharding filter before loading GeoTIFF data to torch.Tensor

Need to do this so that the data loading is distributed to the workers, otherwise each worker is doing duplicated work. Also set num_workers to 1 in test_geotiffdatapipemodule to get a consistent result.

* 🙈 Gitignore checkpoints in nested folders

Ensure that *.ckpt files in sub-folders are ignored too.

* ⚡ Set float32 matmul precision to medium

Prevents messages like `You are using a CUDA device ('NVIDIA A10G') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance.`
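The corresponding one-liner, presumably set once before training starts:

```python
import torch

# Trade off float32 matmul precision for performance on Tensor Core GPUs.
torch.set_float32_matmul_precision("medium")
```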

* 📝 Mention in main README.md that data_path can be an s3 bucket

Just casually documenting in the main README.md on how one can directly generate embeddings from GeoTIFF files stored in an s3 bucket instead of locally.
…88)

* ➕ Add wandb

A CLI and library for interacting with the Weights and Biases API!

* 🔊 Log Masked Autoencoder reconstructions to WandB

Created a custom callback function to log visualizations of the input and output images to the Masked Autoencoder. Only showing the RGB bands of Sentinel-2 for now. A sample of 6 image pairs (original + reconstructed, so 12 in total) is uploaded to Weights and Biases.

Example LightningCLI command: `python trainer.py fit --trainer.max_epochs=20 --data.data_path=data/32VLM --trainer.logger=WandbLogger --trainer.logger.project=clay --trainer.logger.save_dir=checkpoints --trainer.callbacks+=LogMAEReconstructedImage`.

* ➕ Add scikit-image

Image processing in Python!

* 📸 Apply histogram equalization to RGB images

Enhance low contrast images by applying a histogram equalization stretching algorithm on the RGB images, instead of dividing by a magic number like 6000.
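A minimal sketch with scikit-image (chip shape and value range are assumptions):

```python
import numpy as np
import skimage.exposure

# Hypothetical raw Sentinel-2 RGB chip; equalize into [0, 1] instead of
# dividing by a fixed scale factor like 6000.
rgb = np.random.randint(low=0, high=10_000, size=(512, 512, 3)).astype("uint16")
stretched = skimage.exposure.equalize_hist(image=rgb)  # float64 in [0, 1]
```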

* 🔧 Increase default sample size from 6 to 8

More samples to look at! Also only running einsum conversion on as many samples as needed rather than the whole batch, and handling cases where num_samples may be more than the batch_size.

* 🧑‍💻 Make wandb a somewhat optional dependency

Allows for `from src.callback_wandb import LogMAEReconstruction` to run, even without wandb being installed. Helpful if someone doesn't want to install wandb for whatever reason.

* ✅ Add unit test for LogMAEReconstruction

Testing that the LogMAEReconstruction callback works to save a set of images to WandB. Testing this in offline mode only, with checks that artifacts are saved locally, and that the wandb images have the correct caption and format.

* 🐛 Compare expected folders using set instead of list

Order of the folders could change, so using set instead of list.

* 🧪 Prevent WandB logger from saving logs to local drive for now

Setting WANDB_MODE="disabled", so no files are logged to disk, though the wandb.Image(s) are still created. See if this helps to resolve the exit code 255 issue on GitHub Actions.

* 📝 Fix a typo and improve docstring

Minor changes to the docstring of the on_validation_batch_end method, and a typo fix.
…dings (#47)

* Add modified ViT to encode latlon, time, channels & position embeddings

* Add MAE for modified ViT

* Add docstrings & fix issue with complex indexing

* Fix the comments on loss computation

* Add datamodule & trainer to run an epoch of training

* Normalize data before feeding to the model

* Add fixed sincos embedding for position & bands

* Add logging & ckpt options

* Fix the order of coords from lat,lon to lon,lat

* Add clay tiny,small,medium,large model versions

* Remove hardcoded patch size in LogIntermediatePredictions callback

Retrieve the patch size value from the model architecture, rather than hardcoding as 32. Also ensure that the input image shape is the same as the predicted image from the decoder.

* Run clay small on image size 512 for 10 epochs with grad_acc

* Make the clay construction configurable

* Return the data path to reference for vector embeddings

* Remove duplicate dataset.py & geovit.py

* 🔀 Merge srm_trainer.py into trainer.py

Have one entrypoint to run the model using Lightning CLI. Switched model from VitLitModule to CLAYModule, and datamodule from GeoTIFFDataPipeModule to ClayDataModule. Temporarily disabling the logging and monitoring callbacks for now.

* 🔀 Combine clay.py and model.py into model_clay

Putting the CLAYModule (LightningModule) together with the CLAY torch.nn.Module in a single model_clay.py file. Have mentioned in src/README.md that model_clay.py is the one with custom spatiotemporal encoders, while the previous model_vit.py contains the vanilla Vision Transformer implementation.

* ➕ Add matplotlib-base

Publication quality figures in Python!

* 🚚 Move ClayDataset and ClayDataModule into datamodule.py

Putting the DataLoader code in one file - datamodule.py. The regular torch Dataset classes are placed on top of the existing torchdata-based functions/classes.

* 🚚 Move LogIntermediatePredictions callback into callbacks_wandb

Moving the LogIntermediatePredictions callback class from callbacks.py into callbacks_wandb.py.

* ♻️ Get WandB logger properly using a shared function

Getting the WandbLogger directly from the trainer, rather than having to pass it through __init__. Adapted from https://github.com/ashleve/lightning-hydra-template/blob/334601c0326a50ff301fbd76057b36408cf97ffa/src/callbacks/wandb_callbacks.py#L16C1-L34C6

* 🚨 Wrap docstring and fix too-many-arguments lint error

Converted docstrings from numpydoc style which uses less horizontal space but more vertical space. Also added a noqa comment for three instances of `PLR0913 Too many arguments in function definition`.

---------

Co-authored-by: SRM <soumya@developmentseed.org>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Wei Ji <23487320+weiji14@users.noreply.github.com>
* ➕ Add jupyter-book

Build a book with Jupyter Notebooks and Sphinx!

* 📝 Initialize Jupyter Book

Starting with a minimally modified Jupyter Book initialized with `jupyter-book create docs/`. Changed the `_config.yml` to use a proper title and the Clay logo. Included a Binder launch button and a footer with CC-BY-4.0 license.

Deleted the sample notebooks.ipynb and markdown-notebooks.md files, and excluded the book/requirements.txt (dependencies will be installed from environment.yml). Put in a placeholder installation page for now in the Table of Contents.

* 🚀 Deploy Jupyter Book to GitHub Pages via GitHub Actions

Continuous Integration workflow to build the Jupyter Book's html pages and publish it online to GitHub Pages. Based on https://jupyterbook.org/en/stable/publish/gh-pages.html#automatically-host-your-book-with-github-actions, but modernized to use GitHub Actions based publishing source, see https://github.blog/changelog/2022-07-27-github-pages-custom-github-actions-workflows-beta

* 📝 Add 'About Clay' section with links to GitHub and LinkedIn pages

Add external links to Clay's GitHub organization page, and LinkedIn.

* 🔍 Add badges to main README.md

Add badges pointing to the Jupyter Book page, and for the deploy-book.yml/test.yml GitHub Action statuses below the title in the main README.md page. Also modified the description into something more compelling.
…url (#86)

* 🗃️ Store source_url of GeoTIFF to GeoParquet file

Passing the URL or path of the GeoTIFF file through the datapipe, and into the model's prediction loop. The geopandas.GeoDataFrame now has an extra 'source_url' string column, and this is saved to the GeoParquet file too.

* 🚚 Save one GeoParquet file for each unique MGRS tile

For each MGRS code (e.g. 12ABC), save a GeoParquet file with a name formatted like `{MGRS:5}_v{VERSION:2}.gpq`, e.g. 12ABC_v01.gpq. Have updated the unit test to check that rows with different MGRS codes are saved to different files.

* ⚡ Save GeoParquet file with ZSTD compression

Using ZStandard compression instead of Parquet's default Snappy compression. Should result in slightly smaller filesizes, and slightly faster data transfer and compression (especially over the network). Also changed an assert statement to an if-then-raise instead.
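A minimal sketch of the compression switch (geopandas passes the keyword through to pyarrow; the single-row table here is a hypothetical stand-in for the embeddings GeoDataFrame):

```python
import geopandas as gpd
import shapely.geometry

gdf = gpd.GeoDataFrame(
    data={"mgrs": ["12ABC"]},
    geometry=[shapely.geometry.Point(0.0, 0.0)],
    crs="OGC:CRS84",
)
# ZStandard instead of Parquet's default Snappy compression.
gdf.to_parquet(path="12ABC_v01.gpq", compression="zstd")
```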

* ♻️ Predict with multiple workers and gather results to save

Speed up embedding generation by enabling multiple workers to fetch and load mini-batches of GeoTIFF files independently, and run the prediction. The prediction or generated embeddings from each worker (a geopandas.GeoDataFrame) is then concatenated together row-wise, before getting passed to the GeoParquet output script. This is done via LightningModule's `on_predict_epoch_end` hook. Also documented these new processing steps in the docstring.
…dule (#91)

* 🔧 Standardize on a data_dir parameter with a str type

The GeoTIFFDataModule was using data_path:str, while ClayDataModule was using data_dir:Path. Standardize both to be data_dir:str instead. Some parts of this commit are adapted from 1009697.

Also placed all the ClayDataModule's setup logic under `if stage=='fit'`, to reduce diff when predict step is implemented later.

* 🎨 Get YYYY-MM-DD from GeoTIFF tag rather than filename

More robust way of obtaining the Sentinel-2 imagery's acquisition date. Also returning the date in the datacube now.

* 🎨 Simplify lonlat centroid calculation and return UTM bbox/epsg

Can use rasterio's built-in lnglat() method to get the geographic center of the chip, instead of calculating it manually. See https://rasterio.readthedocs.io/en/latest/api/rasterio._base.html#rasterio._base.DatasetBase.lnglat Also returning the original UTM bounding box and EPSG code in the datacube.
* ✨ Allow ClayDataModule to get GeoTIFF data from an s3 bucket

Allow passing in a URL to an s3 bucket, and loading the GeoTIFF data from there directly. Using the same torchdata based code for the s3 pathway as with commit f288eb8 in #85.

* 🚚 Rename datacube's path key to source_url

Using the same 'source_url' key in the returned datacube dictionary for both ClayDataModule and GeoTIFFDataPipeModule

* 🚑 Use try-except to get absolute chip_path or fallback to str

The getattr doesn't actually work properly, since we need to call chip_path.absolute() with brackets. Using a good ol' try-except statement instead, with the fallback being just the plain chip_path str (for s3 URLs).

* ✨ Implement predict_dataloader for ClayDataModule

Similar to the train/val dataloaders, but shuffling and pin_memory are both disabled.

* ✅ Add parametrized test for checking ClayDataModule

Ensure that outputs of both ClayDataModule and GeoTIFFDataPipeModule are the same-ish. Needed to make the split_ratio in ClayDataModule configurable, and check sorted list outputs instead of unsorted outputs for determinism. Fixed some hardcoded tensor shapes/dtypes, and dictionary keys too. Removed the nan_to_num casting of the image pixels in ClayDataModule so that int16 dtype inputs are accepted.

* 📝 Edit docstrings in test_datamodule.py to be more generic

Not just testing one, but two different LightningDataModules now!

* 🔧 Add GDAL environment variables that might help with s3 loading

Setting GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR and GDAL_HTTP_MERGE_CONSECUTIVE_RANGES=YES is supposed to improve GDAL performance when reading Cloud-Optimized GeoTIFFs. See https://gdal.org/user/configoptions.html.
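In code, these are plain environment variables, set before GDAL opens any remote file:

```python
import os

# Improve GDAL performance when reading Cloud-Optimized GeoTIFFs over s3.
os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"
os.environ["GDAL_HTTP_MERGE_CONSECUTIVE_RANGES"] = "YES"
```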
…#95)

* ♻️ Better handle pos and band encodings across multi-devices

Move the pos_encoding and band_encoding layers to the correct device in a way that allows Lightning to do multi-gpu properly. The reported loss is now synced or reduced/averaged across multiple devices too. Partially cherry-picked from 1a40f56

Co-Authored-By: SRM <soumya@developmentseed.org>

* ♻️ Compute num_masked_patches dynamically based on mask_ratio

So that the masking can be turned off during prediction using `self.model.encoder.mask_ratio = 0`, where self is an instance of CLAYModule. The num_masked_patches integer value is now calculated on-the-fly by multiplying mask_ratio with num_patches.

* 🎨 Register pos_encoding and band_encoding properly on device

Since the pos_encoding and band_encoding tensors are declared in the __init__ method, we'll need to register them so that they are moved to the correct device by Lightning during the forward call. See https://lightning.ai/docs/pytorch/2.1.0/starter/converting.html#remove-any-cuda-or-to-device-calls
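A minimal sketch of the pattern, with a hypothetical tensor shape:

```python
import torch


class Encoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        pos_encoding = torch.zeros(1, 64, 768)  # hypothetical shape
        # A buffer is moved between devices along with the module by
        # Lightning, but is not treated as a trainable parameter.
        self.register_buffer(name="pos_encoding", tensor=pos_encoding)
```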

---------

Co-authored-by: SRM <soumya@developmentseed.org>
* 📝 Document how to generate vector embeddings

Step by step instructions on how to produce embeddings from the pretrained model. From checking that one has permissions to get the GeoTIFF files, to downloading of the model checkpoint, and running the model prediction to get the GeoParquet output. Also gave a tip on what a suitable VM instance would be like.

* 📝 Document details of how the mean embeddings were computed

Extra technical details on how the raw (B, 1538, 768) embeddings are turned into (B, 768) shaped embeddings by taking the mean along the spatial patches.

* 📝 Document format of the GeoParquet table and how to read it

Useful details about the filename convention and table schema of the embeddings stored in GeoParquet format, and some sample GeoPandas code showing how to read a *.gpq file. Also linking to some guides and resources from the Cloud Native Geospatial Foundation.
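A short sketch of reading such a file back (filename follows the convention above; columns per the earlier commits):

```python
import geopandas as gpd

# Hypothetical file from the {MGRS:5}_v{VERSION:2}.gpq naming convention.
gdf = gpd.read_parquet(path="12ABC_v01.gpq")
print(gdf[["source_url", "date", "embeddings", "geometry"]].head())
```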

* ✏️ Typo embedding -> embeddings

Never sure whether it's singular or plural.
* 📝 Document how to finetune pretrained model on downstream task

Explaining how the pre-trained model can be finetuned after attaching a head to the network. Written by Lilly.

---------

Co-authored-by: Lilly Thomas <lilly@developmentseed.org>
📝 Document how the benchmark dataset labels were prepared

Mention why we decided to use Cloud to Street for the initial benchmark dataset, and how the imagery and label data was processed to fit into the Clay Foundation model. Written by Lilly.

Co-authored-by: Lilly Thomas <lilly@developmentseed.org>
brunosan self-assigned this on Dec 26, 2023
weiji14 marked this pull request as a draft on December 26, 2023
@weiji14 (Contributor) left a comment:

@brunosan, we do have experience with Quarto (what nbdev is using behind the scenes), but what you are doing here can actually be achieved with Jupyter Book as well (see https://jupyterbook.org/en/stable/content/execute.html). That said, even if we decide to go with nbdev/Quarto, this is a big PR, and really it should be split into several parts. See comments below.

@weiji14 (Contributor) commented on Dec 27, 2023:

  • All functions and Classes created can be pip installed, making it a small SDK for Clay.

setup.py is being superseded by pyproject.toml (see e.g. https://snarky.ca/what-the-heck-is-pyproject-toml/), and we really should migrate to that instead.

Also, the way you are structuring the code to make it pip installable is not ideal, because everything is in the docs/ folder. The code should be packaged from the src/ folder instead.

brunosan (Member, Author):

AFAIK nbdev does not use pyproject.toml; there's an open PR, but it seems abandoned and unmerged. If we use nbdev, we might need to go with setup.py.

@weiji14 (Contributor):

These HTML files should be rendered on the fly, and not committed to git!!

brunosan (Member, Author):

Agree. I'm trying to fix it; it's probably best to abandon this PR and start fresh with the same content, not adding [most] HTML files. I still think in some cases we might need to.

@weiji14 (Contributor):

Same with these .js (and .css) files, they should not be in git history.

brunosan (Member, Author):

Agree.

@weiji14 (Contributor):

Release notes (and the roadmap) don't need to be in a Jupyter Notebook; both Quarto and Jupyter Book allow for Markdown - https://quarto.org/docs/output-formats/gfm.html

brunosan (Member, Author):

I do like the option of providing both the HTML for easy viewing and the actual notebook, so users can fork from it directly for their own work.

@weiji14 (Contributor):

These PNG images should also be generated on the fly from the notebook, rather than committed directly to git history.

brunosan (Member, Author):

Yes in most cases, but sometimes, especially for the "Release notes", some of the calculations are quite heavy, so I think we should indeed commit those outputs in those cases.

I.e. most outputs are rendered on the fly, but some heavy calculations are not. nbdev allows for that, by just adding a cell at the top with:

---
skip_exec: true
---

On clayground_db.py (outdated):
@weiji14 (Contributor):

Question: Is the vector database playground (a.k.a clayground) code meant to be in this Clay-foundation/model repo too? I know we demo-ed this last week as a proof of concept, but wasn't @Clay-foundation/ode supposed to work on this in a separate repo?

brunosan (Member, Author):

Correct, I'll remove it.

This is fully Soumya's work; I'm just pushing it here for simplicity, to share for the moment.

squash fix

nbdev init

added basic read

test rendered

Start release doc

rendered docs

better name for lib

WIP v0 release docs

rendered output

ignore local sync

old docs fully ported

WIP

generated output

generated output

output generated

test new nbdev

generated output

test custom deploy of nbdev docs

add jobs

no description

no inputs

add geopandas

add tqdm

more deps

small cleanup

docs are inside /docs

weird relatives paths

WIP

rel path

trying new workflow script

write permission

bad params

dark color top

I think I can ignore locally generated static content

dark

rename core to model

renamed

GH pages setting

new workflow

path

path

checkout workflow

paths

../

actions

action

docs/_docs/

ignore rendered output

ignore rendered output

stub

clearer

move release to roadmap, fix outputs

add image

Revert "Add minimal version of clayground"

This reverts commit 6e2aac2.

Remove all files from docs/docs

leave root unchanged
@brunosan (Member, Author):

Too many errors on this branch. Abandoning it and starting over.
