Nbdev for Clay Documentation, Clay SDK and Clay notebooks all at once #102

Closed · wants to merge 44 commits

Commits on Sep 29, 2023

  1. Initial commit

    brunosan authored Sep 29, 2023 (06f6097)
  2. Update README.md

    brunosan authored Sep 29, 2023 (d791e4d)

Commits on Oct 27, 2023

  1. Update README.md

    brunosan authored Oct 27, 2023 (c8b1995)
  2. Update README.md

    brunosan authored Oct 27, 2023 (2eb5b2a)

Commits on Nov 8, 2023

  1. Initial conda environment and binder links (#15)

    * Initial conda environment and binder links
    
    Add conda dependency specification file and getting started instructions in the main README.md. The conda environment.yml is paired with a conda-lock.yml lockfile for full reproducibility. The main README.md contains quickstart buttons for Binder/Planetary Computer/SageMaker Studio Lab, and steps for local installation and usage.
    
    * ➕ Add zarr
    
    An implementation of chunked, compressed, N-dimensional arrays for Python! Repo at https://github.com/zarr-developers/zarr-python
    weiji14 authored Nov 8, 2023 (331ed5e)

Commits on Nov 9, 2023

  1. Setup LightningCLI trainer script (#24)

    * ➕ Add jsonargparse[signatures]
    
    Parsing of command line options, yaml/jsonnet config files and/or environment variables based on argparse! Also adding the signatures extras (which includes typeshed-client).
    
    * 🌱 Setup LightningCLI trainer script
    
    Setting up the command-line interface to run Lightning. Created a placeholder BaseDataModule and BaseLitModule to hold the data pipeline and model architecture respectively under the src/ folder. Documented in the main README.md how to run the LightningCLI commands, and created a src/README.md describing the Python modules in that folder.
    weiji14 authored Nov 9, 2023 (bb44d43)
  2. Setup GitHub Actions Continuous Integration tests (#25)

    * 🙈 Add .gitignore file
    
    * 👷 Setup GitHub Actions Continuous Integration tests
    
    Running tests on Ubuntu-22.04 and Python 3.11 only for now. Add a parametrized test to ensure that `python trainer.py fit --print_config=skip_null` works (as well as the validate/test subcommands). Tests are run using `python -m pytest src/tests/`.
    weiji14 authored Nov 9, 2023 (1c6de6a)
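    The two commits above describe, but don't show, the LightningCLI entrypoint and the parametrized smoke test. A minimal sketch of what such a trainer.py could look like is below; the class names BaseLitModule/BaseDataModule come from the commit message, everything else is an assumption rather than the repository's actual code.

    ```python
    # trainer.py -- hedged sketch of a LightningCLI entrypoint with placeholder modules.
    import lightning as L
    from lightning.pytorch.cli import LightningCLI


    class BaseLitModule(L.LightningModule):
        """Placeholder for the model architecture."""


    class BaseDataModule(L.LightningDataModule):
        """Placeholder for the data pipeline."""


    def cli_main():
        # Exposes the fit/validate/test/predict subcommands, e.g.
        # `python trainer.py fit --print_config=skip_null`
        return LightningCLI(model_class=BaseLitModule, datamodule_class=BaseDataModule)


    if __name__ == "__main__":
        cli_main()
    ```

    A corresponding parametrized test (hypothetical src/tests/test_trainer.py), run with `python -m pytest src/tests/`, might look like:

    ```python
    import subprocess

    import pytest


    @pytest.mark.parametrize("subcommand", ["fit", "validate", "test"])
    def test_print_config(subcommand):
        """Check that `python trainer.py <subcommand> --print_config=skip_null` runs."""
        result = subprocess.run(
            ["python", "trainer.py", subcommand, "--print_config=skip_null"],
            capture_output=True,
            text=True,
        )
        assert result.returncode == 0, result.stderr
    ```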

Commits on Nov 10, 2023

  1. Add pre-commit hooks with ruff formatter/linter rules (#26)

    * 🔧 Add pre-commit config with pre-commit-hooks and ruff
    
    Adding a .pre-commit-config.yaml file with some pre-commit hooks and the ruff linter/formatter. The ruff linter is configured to do autofix, and will run on python scripts and jupyter notebooks.
    
    * 🔧 Configure ruff rules with pyproject.toml file
    
    Enforce certain ruff formatting and lint rules such as UNIX-style line-endings, pycodestyle, pyflakes, isort, numpy, pylint and pyupgrade.
    
    * 🚨 Fix F841 by returning cli variable
    
    Fix `F841 Local variable 'cli' is assigned to but never used` in trainer.py.
    
    * [pre-commit.ci] auto fixes from pre-commit.com hooks
    
    for more information, see https://pre-commit.ci
    
    * 🔧 Configure pre-commit.ci to run autoupdates on quarterly basis
    
    Setting up pre-commit.ci to only run updates quarterly instead of the weekly default. Also explicitly stating that Pull Requests will be autofixed.
    
    ---------
    
    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    weiji14 and pre-commit-ci[bot] authored Nov 10, 2023 (4642c1e)

Commits on Nov 15, 2023

  1. Add geopandas-base (#34)

    Geographic pandas extensions!
    weiji14 authored Nov 15, 2023 (cdf2c68)
  2. Datacube (#27)

    * combining data arrays into multi-sensor data cube
    
    * implement cql2-json filters for S2 and S1
    
    * script to generate merged datacube
    
    * wip function for calculating sentinel 1 scene with max coverage of bbox
    
    * formatting
    
    * move script and remove notebook
    
    * use geom instead of geodataframe for initial aoi
    
    * use args in funcs
    
    * use CENTROID for geom in cql2 query
    
    * Use mosaic method, set singular time dimension based on Sentinel 2
    
    * add configurable args for cloud cover percentage and nodata percentage
    
    * use epsg code derived from Sentinel-2 properties, filter by best cloud-free conditions and orbit state
    
    * remove extra filter
    
    * map s2 for best image using datetime to id, set s2 bands as unique vars, mosaic s1 on time
    
    * assign S2 time as dataset variable
    
    * remove orbit filter
    
    * wrap example in main
    
    * move script to subdir
    
    * use cloud variable
    
    * use cloud variable
    
    * [pre-commit.ci] auto fixes from pre-commit.com hooks
    
    for more information, see https://pre-commit.ci
    
    ---------
    
    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    lillythomas and pre-commit-ci[bot] authored Nov 15, 2023 (1d1c013)
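    As a rough illustration of the CQL2-JSON filtering mentioned above (not the actual scripts/datacube.py code), a Sentinel-2 STAC search might look like the sketch below; the Planetary Computer endpoint, point coordinates and cloud-cover threshold are assumptions for the example.

    ```python
    import pystac_client

    catalog = pystac_client.Client.open(
        "https://planetarycomputer.microsoft.com/api/stac/v1"
    )

    # CQL2-JSON filter: Sentinel-2 L2A items intersecting a point, <= 20% clouds.
    cql2_filter = {
        "op": "and",
        "args": [
            {
                "op": "s_intersects",
                "args": [
                    {"property": "geometry"},
                    {"type": "Point", "coordinates": [174.76, -36.85]},
                ],
            },
            {"op": "<=", "args": [{"property": "eo:cloud_cover"}, 20]},
        ],
    }
    search = catalog.search(
        collections=["sentinel-2-l2a"], filter_lang="cql2-json", filter=cql2_filter
    )
    items = search.item_collection()  # replaces the deprecated get_all_items()
    print(len(items), "Sentinel-2 items found")
    ```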

Commits on Nov 16, 2023

  1. Fix lint errors in datacube script (#36)

    * 🚨 Fix E501 Line too long
    
    Wrapping docstrings in scripts/datacube.py to under 88 characters.
    
    * ♻️ Refactor best_nodata and best_clouds into single sort function
    
    Fixes F841 Local variable `best_nodata` is assigned to but never used. Only the best_clouds variable was used, and best_nodata was omitted, but both should be used. Doing this in a single pandas sort_values function.
    
    * 🚑 Quickfix with getting the STAC item with a specific datetime
    
    Patch cc99ae4
    
    * 🏷️ Rename variables to ds_ (xr.Dataset) or da (xr.DataArray)
    
    Using ds_ prefix for xr.Dataset objects, and da_ prefix for xr.DataArray objects.
    
    * 🔧 Set pylint max-args to 6
    
    Increase from the default value of 5 to 6.
    
    * 🗑️ Replace .get_all_items() with .item_collection()
    
    Fixes `FutureWarning: get_all_items() is deprecated, use item_collection() instead`.
    
    * 🔥 Remove sorting by nodata and just sort by least cloud cover
    
    No need to sort by `s2:nodata_pixel_percentage` anymore, just get the Sentinel-2 STAC item with the least cloud cover.
    
    * 📝 More DataArray to Dataset renames
    
    Missed a few more da_ to ds_ renames, following from 2af24be
    weiji14 authored Nov 16, 2023 (a1257d0)
  2. Landcover based sampling strategy. (#29)

    * Add landcover based sampling scripts
    
    Closes #28
    
    * Drop duplicates, fix typo, uncomment compute_stats function.
    
    * Fix comment that was out of sync with code
    yellowcap authored Nov 16, 2023 (8def26e)

Commits on Nov 17, 2023

  1. Small change: replace placeholder with our name and date (#42)

    replace placeholder with our name and date
    brunosan authored Nov 17, 2023 (ef4846d)
  2. Tiler module (#41)

    * Initial tile module that generates 256x256 tiled xarray datasets from the larger scene-level datacube
    
    * update comments
    
    * update comments
    
    * add docstrings and initial cloud and nodata filter
    
    * more efficient cloud and nodata filter
    
    * example script to run the datacube and tiler modules
    
    * adjust cloud filter
    
    * return valid region of datacube pre-tiling
    
    * [pre-commit.ci] auto fixes from pre-commit.com hooks
    
    for more information, see https://pre-commit.ci
    
    * Fix datacube processor (#43)
    
    * Fix bugs introduced in PR #43
    
    * some cosmetic updates
    
    * add a catch for sampled dates which don't have S1 scenes within the +/- 3 day surrounding interval
    
    * lower bad pixel percentage
    
    * [pre-commit.ci] auto fixes from pre-commit.com hooks
    
    for more information, see https://pre-commit.ci
    
    ---------
    
    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    Co-authored-by: Daniel Wiesmann <yellowcap@users.noreply.github.com>
    3 people authored Nov 17, 2023 (3451849)
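    For context on the tiler described above, a bare-bones version of chipping a scene-level datacube into 256x256 tiles with xarray could look like this (a sketch only; the real tile module also applies the cloud/nodata filters and writes the chips out):

    ```python
    import xarray as xr


    def tile_dataset(ds: xr.Dataset, size: int = 256):
        """Yield non-overlapping size x size tiles, dropping partial edge tiles."""
        for y0 in range(0, ds.sizes["y"] - size + 1, size):
            for x0 in range(0, ds.sizes["x"] - size + 1, size):
                yield ds.isel(y=slice(y0, y0 + size), x=slice(x0, x0 + size))
    ```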

Commits on Nov 21, 2023

  1. Fix line-length, boolean comparison and import errors (#45)

    * 🚨 Fix line-length, boolean comparison and import errors
    
    Fix linter errors:
    - E501 Line too long
    - E712 Comparison to `False` should be `cond is False` or `if not cond:`
    - E402 Module level import not at top of file
    
    * ✏️ Remove sys.path.append line
    weiji14 authored Nov 21, 2023 (1a859af)
  2. Bump conda-lock to 2.5.1, add fiona and h5netcdf (#46)

    * ⬆️ Bump conda-lock from 2.4.2 to 2.5.1
    
    Bumps [conda-lock](https://github.com/conda/conda-lock) from 2.4.2 to 2.5.1.
    - [Release notes](https://github.com/conda/conda-lock/releases)
    - [Commits](conda/conda-lock@v2.4.2...v2.5.1)
    
    * ➕ Add fiona
    
    Fiona reads and writes spatial data files!
    
    * ➕ Add h5netcdf
    
    Pythonic interface to netCDF4 via h5py!
    weiji14 authored Nov 21, 2023 (ddf90b7)
  3. Initial Vision Transformer architecture with MAE decoder (#37)

    * 📌 Pin to Pytorch 2.0 and CUDA 11.2
    
    Somehow using the `--with-cuda=11.8` flag in conda-lock didn't work as expected to get the CUDA-built Pytorch instead of the CPU version. Temporarily downgrading from Pytorch 2.1 to 2.0 and CUDA 11.8 to 11.2, to make it possible to install torchvision=0.15.2 from conda-forge later.
    
    * 🚧 Initial Vision Transformer architecture with MAE decoder
    
    Initializing the neural network architecture layers, specifically a Vision Transformer (ViT) B/32 backbone and a Masked Autoencoder (MAE) decoder. Using Lightly for the MAE setup, with the ViT backbone from torchvision. Setup is mostly adapted from https://github.com/lightly-ai/lightly/blob/v1.4.21/examples/pytorch_lightning/mae.py
    
    * ➕ Add transformers
    
    State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow!
    
    * 🏗️ Switch from torchvision to transformers ViTMAE
    
    Changing from lightly/torchvision's ViTMAE implementation to HuggingFace transformers' ViTMAE. This allows us to configure the number of input channels to a number other than 3 (e.g. 12). However, transformers' ViTMAE is an all-in-one class rather than an Encoder/Decoder split (though there's a way to access either once the class is instantiated). Allowed for configuring the masking_ratio instead of the decoder_dim size, and removed the MSE loss because it is implemented in the ViTMAE class already.
    
    * 👔 Implement forward pass and training_step
    
    Run input images through the encoder and decoder, and compute the pixel reconstruction loss from training the Masked Autoencoder.
    
    * ✅ Add unit test for MAELitModule
    
    Ensure that running one training step on a mini-batch works. Created a random torch Dataset that generates tensors of shape (12, 256, 256) until there is real data to train on.
    
    * 📌 Pin to CUDA 11.8
    
    No need to pin to CUDA 11.2 since not using torchvision anymore. Patches 06535cd
    
    * 🗃️ Increase input channels from 12 to 13
    
    The datacube has 13 channels, namely 10 from Sentinel-2's 10m and 20m resolution bands, 2 from Sentinel-1's VV and VH, and 1 from the Copernicus DEM.
    
    * 🐛 Remove hardcoded batch_size in assert statements
    
    Use a variable self.B instead of hardcoding 32 as the batch_size in the assert statements checking the tensor shape, so that the last mini-batch with a size less than 32 can be seen by the model.
    
    * 🚚 Rename to model_vit.py and ViTLitModule
    
    Rename MAELitModule to ViTLitModule, and model.py to model_vit.py, since we might be trying out different neural network model architectures later.
    weiji14 authored Nov 21, 2023 (cdd900b)
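    A hedged sketch of the HuggingFace transformers ViTMAE setup described above, with 13 input channels and a configurable mask ratio (the exact hyperparameters in the repository may differ):

    ```python
    import torch
    from transformers import ViTMAEConfig, ViTMAEForPreTraining

    config = ViTMAEConfig(
        image_size=256,   # chip size at this point in the history (raised to 512 later)
        patch_size=32,    # raised to 64 later
        num_channels=13,  # 10x Sentinel-2 bands + Sentinel-1 VV/VH + Copernicus DEM
        mask_ratio=0.75,  # fraction of patches masked during pretraining
    )
    model = ViTMAEForPreTraining(config)

    # One forward pass on a random mini-batch; the pixel reconstruction (MSE)
    # loss is computed inside ViTMAEForPreTraining itself.
    pixel_values = torch.randn(2, 13, 256, 256)
    outputs = model(pixel_values=pixel_values)
    print(outputs.loss)
    ```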

Commits on Nov 22, 2023

  1. Ready for batch (#44)

    Merging for testing on batch.
    
    * Integrate tiler and s3 upload to data pipeline
    
    * Remove unused file
    yellowcap authored Nov 22, 2023 (e03b030)

Commits on Nov 24, 2023

  1. Bump pytorch from 2.0.0 to 2.1.0, CUDA from 11.8 to 12.0 (#51)

    ⬆️ Bump pytorch from 2.0.0 to 2.1.0, CUDA from 11.8 to 12.0
    
    Bumps [torch](https://github.com/pytorch/pytorch) from 2.0.0 to 2.1.0.
    - [Release notes](https://github.com/pytorch/pytorch/releases)
    - [Changelog](https://github.com/pytorch/pytorch/blob/main/RELEASE.md)
    - [Commits](pytorch/pytorch@v2.0.0...v2.1.0)
    
    Also changing from the CUDA 11.8 build to the CUDA 12.0 build
    weiji14 authored Nov 24, 2023 (c8970c6)

Commits on Nov 28, 2023

  1. LightningDataModule to load GeoTIFF files (#52)

    * ➕ Add torchdata
    
    A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries!
    
    * ♻️ Refactor test_model_vit to use datapipe fixture
    
    Decoupling the neural network model's unit test from the LightningDataModule by implementing a standalone datapipe fixture instead.
    
    * ✨ Implement GeoTIFFDataPipeModule
    
    Create a LightningDataModule to load GeoTIFF files. Uses torchdata to create the data pipeline. Using the FileLister DataPipe to iterate over *.tif files in the data/ folder, and do a random 80/20 split for the training and validation set. The GeoTIFF files are read into numpy.ndarrays using rasterio, and converted to torch.Tensors with the default collate function. Using rasterio instead of rioxarray to reduce an extra layer of overhead in the data loading.
    
    * 🧵 Allow configuring num_workers in DataLoader
    
    Enable setting the number of subprocesses used for data loading. Default to 8 for now, but can be configured on LightningCLI using `python trainer.py fit --data.num_workers=8`.
    
    * 📌 Install torchdata=0.7.1 from conda-forge instead of PyPI
    
    Contains a build of torchdata that is pre-compiled with the correct AWSSDK extension, and won't result in errors like `ValueError: curlCode: 77, Problem with the SSL CA cert (path? access rights?)`.
    
    * 🔧 Allow configuring data path containing the GeoTIFF files
    
    Enable setting the path to the folder containing the GeoTIFF data files. Defaults to data/ for now, but can be configured on LightningCLI using `python trainer.py fit --data.data_path=data/56HKH`. Also setting the recursive=True flag to allow for files in nested directories.
    
    * ✅ Add unit test for GeoTIFFDataModule
    
    Ensure that loading one mini-batch of data from a data folder works. Created two temporary random GeoTIFF files containing arrays of shape (3, 256, 256) in a fixture for the test.
    weiji14 authored Nov 28, 2023 (7935fd2)
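    The datapipe described above could be sketched roughly as follows (assuming torchdata 0.7.x; the actual GeoTIFFDataPipeModule wraps this in a LightningDataModule with the 80/20 train/validation split, and the sharding filter was only added in a later commit):

    ```python
    import numpy as np
    import rasterio
    import torch
    import torchdata.datapipes.iter as dp


    def read_geotiff(filepath: str) -> torch.Tensor:
        """Read a GeoTIFF into a torch.Tensor of shape (channels, height, width)."""
        with rasterio.open(filepath) as src:
            array: np.ndarray = src.read()
        return torch.as_tensor(data=array.astype(np.float32))


    datapipe = (
        dp.FileLister(root="data/", masks="*.tif", recursive=True)
        .sharding_filter()  # distribute files across DataLoader workers
        .map(fn=read_geotiff)
        .batch(batch_size=32)
        .collate()
    )
    dataloader = torch.utils.data.DataLoader(dataset=datapipe, batch_size=None)
    ```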

Commits on Nov 29, 2023

  1. Configure model checkpointing (#55)

    * 🔧 Configure ModelCheckpoint callback
    
    Save model weights to a properly named checkpoint file like vit_epoch-09_train_loss-3250218.25.ckpt, stored in the checkpoints/ folder by default. More configuration can be done through LightningCLI, see `python trainer.py fit --trainer.callbacks.help=ModelCheckpoint`.
    
    * ⚡ Add AsyncCheckpointIO plugin to trainer
    
    Enable the experimental plugin that saves checkpoint files asynchronously in a thread. See https://lightning.ai/docs/pytorch/2.1.0/api/lightning.pytorch.plugins.io.AsyncCheckpointIO.html.
    weiji14 authored Nov 29, 2023 (90f0c6f)
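    In code, the checkpoint setup described above corresponds roughly to the sketch below (the real configuration is passed through LightningCLI; the monitored metric name and filename pattern here are assumptions):

    ```python
    import lightning as L
    from lightning.pytorch.callbacks import ModelCheckpoint
    from lightning.pytorch.plugins.io import AsyncCheckpointIO

    checkpoint_callback = ModelCheckpoint(
        dirpath="checkpoints/",
        filename="vit_epoch-{epoch:02d}_train_loss-{train/loss:.2f}",
        monitor="train/loss",
        auto_insert_metric_name=False,  # keep the custom filename pattern as-is
        save_last=True,
    )

    trainer = L.Trainer(
        callbacks=[checkpoint_callback],
        plugins=[AsyncCheckpointIO()],  # write checkpoints asynchronously in a thread
    )
    ```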

Commits on Nov 30, 2023

  1. Create CODE_OF_CONDUCT.md (#53)

    Using the Contributor Covenant Code of Conduct from https://www.contributor-covenant.org/version/2/1/code_of_conduct.html
    brunosan authored Nov 30, 2023 (750f14e)

Commits on Dec 1, 2023

  1. Add rioxarray (#59)

    ➕ Add rioxarray
    
    Rasterio xarray extension! Repo at https://github.com/corteva/rioxarray
    weiji14 authored Dec 1, 2023 (6f50653)

Commits on Dec 4, 2023

  1. Generate embeddings via prediction loop (#56)

    * 🍻 Generate embeddings via prediction loop
    
    Implement the embedding generator in the LightningModule's predict_step. The embeddings are tensor arrays that are saved to a .npy file in the data/embeddings/ folder. Input data is retrieved from the predict_dataloader, which is currently using the validation datapipe rather than a dedicated datapipe. Have documented how to generate the embedding output file using LightningCLI on the main README.md file. Also added a unit test to ensure that saving and loading from an embedding_0.npy file works.
    
    * 🐛 Disable masking of patches on predict_step
    
    Previously, 75% of the patches, or 48 out of a total of 64 were masked out, leaving 16 patches plus 1 cls_token = 17 sequences. Disabling the mask gives 64 + 1 cls_token = 65 sequences. Moved some assert statements with a fixed sequence_length dim from the forward function to the training_step. Also updated the unit test to ensure output embeddings have a shape like (batch_size, 65, 768).
    
    * ♻️ Refactor LightningDataModule to not do random split on predict
    
    Refactoring the setup method in the LightningDataModule to not do a random split on the predict stage. I.e. just do the GeoTIFF to torch.Tensor conversion directly, followed by batching and collating.
    
    * ✅ Test predict stage in geotiffdatamodule
    
    Need to explicitly pass an argument to stage in the test_geotiffdatapipemodule unit test. Testing both the fit and predict stages.
    
    * 👔 Ensure that embeddings have no NaN values
    
    Make sure that the generated embeddings do not have NaN values in them.
    
    * 🗃️ Take mean of the embeddings along sequence_length dim
    
    Instead of saving embeddings of shape (1, 65, 768), save out embeddings of shape (1, 768) instead. Done by taking the mean along the sequence_length dim, except for the cls_token part (first index in the 65).
    weiji14 authored Dec 4, 2023 (69ce703)
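    The mean-pooling step described in the last commit above boils down to a one-liner; a toy example with random data:

    ```python
    import torch

    raw_embeddings = torch.randn(1, 65, 768)  # 1 cls_token + 64 patch embeddings
    mean_embeddings = raw_embeddings[:, 1:, :].mean(dim=1)  # average patch tokens only
    assert mean_embeddings.shape == (1, 768)
    ```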

Commits on Dec 6, 2023

  1. Batch setup (#54)

    * Add bucket as argument to cli
    
    * Improve efficiency of datacube
    
    Keep S2 in Uint16 as long as possible, subset using indexing instead of sel
    
    * Simplify print statements
    
    * Add and document batch setup
    
    * Add sample as geopackage
    
    The GeoJSON was too big for the linter to be happy.
    
    * Small edit on README
    yellowcap authored Dec 6, 2023 (84f4509)
  2. Let LightningDataModule return spatiotemporal metadata (#66)

    * 🗃️ Let LightningDataModule return spatiotemporal metadata
    
    Making the LightningDataModule return not only the image, but also spatiotemporal metadata such as the bounding box, coordinate reference system, and date. The bbox and crs are in the raster image's native UTM projection for now, while the date is just a YYYY-MM-DD formatted string. Unit tests have been updated to ensure that the extra metadata is passed through.
    
    * ♻️ Refactor test_geotiffdatapipemodule to use parametrization
    
    Reduce duplicate code by using pytest.mark.parametrize, looping over the fit and predict stages.
    
    * 📝 Document returned outputs from _array_to_torch function
    
    Improved the docstring of the _array_to_torch function, mentioning the input parameters (filepath) and the contents of the output dictionary (image, bbox, crs, date). Also updated the type hint of the function.
    
    * 🚚 Rename crs to epsg
    
    Since we're storing the EPSG integer and not the CRS representation.
    weiji14 authored Dec 6, 2023 (8763bb3)

Commits on Dec 7, 2023

  1. Small pipeline fixes (#72)

    * Remove default for subset
    
    This is easy to forget, which leads to runs on a subset without the intention to actually subset.
    
    * Improve file name
    
    Closes #69
    
    - Zero padding for counter
    - v before version number
    - Underscores instead of hyphen separators
    - Drop hyphens from date stamp
    
    * Bump version to 02
    
    * Make mgrs sample file external
    
    Closes #71
    
    * Add date to raster metadata
    
    Closes #70
    
    * Improve print statement
    yellowcap authored Dec 7, 2023 (018fcbc)
  2. Setting the model license to OpenRail-M (#63)

    * Create model-license.md
    
    * [pre-commit.ci] auto fixes from pre-commit.com hooks
    
    for more information, see https://pre-commit.ci
    
    * Rename model-license.md to LICENSE-MODEL.md
    
    ---------
    
    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    brunosan and pre-commit-ci[bot] authored Dec 7, 2023 (c57bcd2)
  3. Check for no data on a tile level in Sentinel-1 VV and VH, Sentinel-2 and DEM (#60)
    
    * check for no data on a tile level in sentinel 1 vv and vh, sentinel 2 and DEM
    
    * adjust to run consecutively instead of all together, prevents unnecessary calculations
    
    * adjust per nodata type in other bands
    
    * Simplify nodata check by converting to loop
    
    ---------
    
    Co-authored-by: Daniel Wiesmann <daniel@wiesmann.pt>
    lillythomas and yellowcap authored Dec 7, 2023 (8afa6de)
  4. Improve date handling for data pipeline (#76)

    * Improve date handling for data pipeline
    
    If no match is found for a year, other years are tried until a match is found or all years have been tested.
    
    * Increase tile size to 512x512 pixels.
    
    Closes #78
    
    * Increase dates per location to 3
    
    Closes #79
    
    * Prevent printing s3 sync upload progress logs
    
    * Move counter above cloud filter to ensure index consistency
    
    This way, the tile IDs in the file names should be consistent across dates.
    
    * Fix typo in comment
    
    * Update batch run setup to new bucket name
    yellowcap authored Dec 7, 2023 (6946ac3)

Commits on Dec 8, 2023

  1. Save embeddings with spatiotemporal metadata to GeoParquet (#73)

    * ✨ Save embeddings with spatiotemporal metadata to GeoParquet
    
    Storing the vector embeddings alongside some spatial bounding box and datetime information in a tabular GeoParquet format, instead of an npy file! Using geopandas to create a GeoDataFrame with three columns - date, embeddings, geometry. The date is stored in Arrow's date32 format, embeddings are in FixedShapeTensorArray, and geometry is in WKB. Have updated the unit test's sample fixture data with the extra spatiotemporal data, and tested that the saved GeoParquet file can be loaded back.
    
    * 📝 Document how embeddings are generated and saved to geoparquet
    
    Improve the docstring of predict_step in the LightningModule on how the embeddings are generated, and then saved to a GeoParquet file with the spatiotemporal metadata. Included some ASCII art and a markdown table showing what the tabular data looks like.
    
    * 📝 Mention in main README.md that embeddings are saved to geoparquet
    
    Document that the embeddings are stored with spatiotemporal metadata as a GeoParquet file. Increased batch size from 1 to 1024.
    
    * 🎨 Update type hint of batch inputs, and add some inline comments
    
    Should have updated the type hints in #66, but might as well do it here. Also adding some more inline comments and fixed a typo.
    weiji14 authored Dec 8, 2023 (347e1ae)
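    A hedged sketch of the GeoParquet write path described above, using geopandas (the real predict_step builds the GeoDataFrame from the model outputs and batch metadata; the values and output filename here are dummies):

    ```python
    import geopandas as gpd
    import numpy as np
    import pandas as pd
    import shapely.geometry

    gdf = gpd.GeoDataFrame(
        data={
            "date": pd.to_datetime(["2023-12-08"]).date,  # acquisition date per chip
            "embeddings": [np.random.rand(768).astype("float32")],  # one row per chip
        },
        geometry=[shapely.geometry.box(minx=0.0, miny=0.0, maxx=1.0, maxy=1.0)],
        crs="EPSG:4326",
    )
    # A later commit (#86) switches to ZSTD compression and MGRS-based filenames.
    gdf.to_parquet(path="embeddings_0.gpq")
    ```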

Commits on Dec 11, 2023

  1. Adapt model to load 512x512 images from s3 bucket (#85)

    * 🔧 Increase image_size from 256 to 512, patch_size from 32 to 64
    
    Increase the chip image size from 256 to 512 pixels, and the patch size from 32 to 64 pixels. Updated the unit test and an assert statement, and fixed a typo.
    
    * 👽 Get YYYY-MM-DD from GeoTIFF tag instead of filename
    
    Obtaining the YYYY-MM-DD date from the GeoTIFF's tag metadata, instead of parsing it from the filename, thanks to the change at 426aa06/#72.
    
    * ✨ Allow GeoTIFFDataModule to get GeoTIFF data from an s3 bucket
    
    New feature to allow passing in a URL to an s3 bucket, and loading the GeoTIFF data from there directly. Added a unit test that checks that this works to list a GeoTIFF file from s3://copernicus-dem-30m/. Also improved the docstring and type hint of the setup() function's 'stage' parameter.
    
    * 🐛 Add sharding filter before loading GeoTIFF data to torch.Tensor
    
    Need to do this so that the data loading is distributed to the workers, otherwise each worker is doing duplicated work. Also set num_workers to 1 in test_geotiffdatapipemodule to get a consistent result.
    
    * 🙈 Gitignore checkpoints in nested folders
    
    Ensure that *.ckpt files in sub-folders are ignored too.
    
    * ⚡ Set float32 matmul precision to medium
    
    Prevents messages like `You are using a CUDA device ('NVIDIA A10G') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance.`
    
    * 📝 Mention in main README.md that data_path can be an s3 bucket
    
    Just casually documenting in the main README.md on how one can directly generate embeddings from GeoTIFF files stored in an s3 bucket instead of locally.
    weiji14 authored Dec 11, 2023 (eadfea0)

Commits on Dec 15, 2023

  1. Callback function to log Masked Autoencoder reconstructions to WandB (#88)
    
    * ➕ Add wandb
    
    A CLI and library for interacting with the Weights and Biases API!
    
    * 🔊 Log Masked Autoencoder reconstructions to WandB
    
    Created a custom callback function to log visualizations of the input and output images to the Masked Autoencoder. Only showing the RGB bands of Sentinel-2 for now. A sample of 6 image pairs (original + reconstructed, so 12 in total) is uploaded to Weights and Biases.
    
    Example LightningCLI command: `python trainer.py fit --trainer.max_epochs=20 --data.data_path=data/32VLM --trainer.logger=WandbLogger --trainer.logger.project=clay --trainer.logger.save_dir=checkpoints --trainer.callbacks+=LogMAEReconstructedImage`.
    
    * ➕ Add scikit-image
    
    Image processing in Python!
    
    * 📸 Apply histogram equalization to RGB images
    
    Enhance low contrast images by applying a histogram equalization stretching algorithm on the RGB images, instead of dividing by a magic number like 6000.
    
    * 🔧 Increase default sample size from 6 to 8
    
    More samples to look at! Also only running einsum conversion on as many samples as needed rather than the whole batch, and handling cases where num_samples may be more than the batch_size.
    
    * 🧑‍💻 Make wandb a somewhat optional dependency
    
    Allows for `from src.callback_wandb import LogMAEReconstruction` to run, even without wandb being installed. Helpful if someone doesn't want to install wandb for whatever reason.
    
    * ✅ Add unit test for LogMAEReconstruction
    
    Testing that the LogMAEReconstruction callback works to save a set of images to WandB. Testing this in offline mode only, with checks that artifacts are saved locally, and that the wandb images have the correct caption and format.
    
    * 🐛 Compare expected folders using set instead of list
    
    Order of the folders could change, so using set instead of list.
    
    * 🧪 Prevent WandB logger from saving logs to local drive for now
    
    Setting WANDB_MODE="disabled", so no files are logged to disk, though the wandb.Image(s) are still created. See if this helps to resolve the exit code 255 issue on GitHub Actions.
    
    * 📝 Fix a typo and improve docstring
    
    Minor changes to the docstring of the on_validation_batch_end method, and a typo fix.
    weiji14 authored Dec 15, 2023 (f9fe458)
  2. Implement MAE with support for position, time, latlon & channel embeddings (#47)
    
    * Add modified ViT to encode latlon, time, channels & position embeddings
    
    * Add MAE for modified ViT
    
    * Add docstrings & fix issue with complex indexing
    
    * Fix the comments on loss computation
    
    * Add datamodule & trainer to run an epoch of training
    
    * Normalize data before feeding to the model
    
    * Add fixed sincos embedding for position & bands
    
    * Add logging & ckpt options
    
    * Fix the order of coords from lat,lon to lon,lat
    
    * Add clay tiny,small,medium,large model versions
    
    * Remove hardcoded patch size in LogIntermediatePredictions callback
    
    Retrieve the patch size value from the model architecture, rather than hardcoding as 32. Also ensure that the input image shape is the same as the predicted image from the decoder.
    
    * Run clay small on image size 512 for 10 epochs with grad_acc
    
    * Make the clay construction configurable
    
    * Return the data path to reference for vector embeddings
    
    * Remove duplicate dataset.py & geovit.py
    
    * 🔀 Merge srm_trainer.py into trainer.py
    
    Have one entrypoint to run the model using Lightning CLI. Switched model from VitLitModule to CLAYModule, and datamodule from GeoTIFFDataPipeModule to ClayDataModule. Temporarily disabling the logging and monitoring callbacks for now.
    
    * 🔀 Combine clay.py and model.py into model_clay
    
    Putting the CLAYModule (LightningModule) together with the CLAY torch.nn.Module in a single model_clay.py file. Have mentioned in src/README.md that model_clay.py is the one with custom spatiotemporal encoders, while the previous model_vit.py contains vanilla Vision Transformer implementation.
    
    * ➕ Add matplotlib-base
    
    Publication quality figures in Python!
    
    * 🚚 Move ClayDataset and ClayDataModule into datamodule.py
    
    Putting the DataLoader code in one file - datamodule.py. The regular torch Dataset classes are placed on top of the existing torchdata-based functions/classes.
    
    * 🚚 Move LogIntermediatePredictions callback into callbacks_wandb
    
    Moving the LogIntermediatePredictions callback class from callbacks.py into callbacks_wandb.py.
    
    * ♻️ Get WandB logger properly using a shared function
    
    Getting the WandbLogger directly from the trainer, rather than having to pass it through __init__. Adapted from https://github.com/ashleve/lightning-hydra-template/blob/334601c0326a50ff301fbd76057b36408cf97ffa/src/callbacks/wandb_callbacks.py#L16C1-L34C6
    
    * 🚨 Wrap docstring and fix too-many-arguments lint error
    
    Converted docstrings from numpydoc style which uses less horizontal space but more vertical space. Also added a noqa comment for three instances of `PLR0913 Too many arguments in function definition`.
    
    ---------
    
    Co-authored-by: SRM <soumya@developmentseed.org>
    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    Co-authored-by: Wei Ji <23487320+weiji14@users.noreply.github.com>
    4 people authored Dec 15, 2023 (c914913)
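    One of the commits above adds fixed sincos embeddings for position and bands. As a generic illustration (not the repository's exact implementation), a 1D sine-cosine embedding can be built like this:

    ```python
    import torch


    def posemb_sincos_1d(num_positions: int, dim: int, temperature: float = 10000.0):
        """Fixed (non-learned) sine-cosine embeddings of shape (num_positions, dim)."""
        positions = torch.arange(num_positions).unsqueeze(1)  # (N, 1)
        omega = 1.0 / temperature ** (torch.arange(dim // 2) / (dim // 2))  # (dim/2,)
        angles = positions * omega  # (N, dim/2) by broadcasting
        return torch.cat(tensors=[angles.sin(), angles.cos()], dim=1)  # (N, dim)


    pos_encoding = posemb_sincos_1d(num_positions=64, dim=768)
    ```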

Commits on Dec 17, 2023

  1. Initialize Jupyter Book documentation (#89)

    * ➕ Add jupyter-book
    
    Build a book with Jupyter Notebooks and Sphinx!
    
    * 📝 Initialize Jupyter Book
    
    Starting with a minimally modified Jupyter Book initialized with `jupyter-book create docs/`. Changed the `_config.yml` to use a proper title and the Clay logo. Included a Binder launch button and a footer with CC-BY-4.0 license.
    
    Deleted the sample notebooks.ipynb and markdown-notebooks.md files, and excluded the book/requirements.txt (dependencies will be installed from environment.yml). Put in a placeholder installation page for now in the Table of Contents.
    
    * 🚀 Deploy Jupyter Book to GitHub Pages via GitHub Actions
    
    Continuous Integration workflow to build the Jupyter Book's html pages and publish it online to GitHub Pages. Based on https://jupyterbook.org/en/stable/publish/gh-pages.html#automatically-host-your-book-with-github-actions, but modernized to use GitHub Actions based publishing source, see https://github.blog/changelog/2022-07-27-github-pages-custom-github-actions-workflows-beta
    
    * 📝 Add 'About Clay' section with links to GitHub and LinkedIn pages
    
    Add external links to Clay's GitHub organization page, and LinkedIn.
    
    * 🔍 Add badges to main README.md
    
    Add badges pointing to the Jupyter Book page, and for the deploy-book.yml/test.yml GitHub Action statuses below the title in the main README.md page. Also modified the description into something more compelling.
    weiji14 authored Dec 17, 2023 (bf58fae)

Commits on Dec 19, 2023

  1. Rename embeddings file to include MGRS code and store GeoTIFF source_url (#86)
    
    * 🗃️ Store source_url of GeoTIFF to GeoParquet file
    
    Passing the URL or path of the GeoTIFF file through the datapipe, and into the model's prediction loop. The geopandas.GeoDataFrame now has an extra 'source_url' string column, and this is saved to the GeoParquet file too.
    
    * 🚚 Save one GeoParquet file for each unique MGRS tile
    
    For each MGRS code (e.g. 12ABC), save a GeoParquet file with a name formatted like `{MGRS:5}_v{VERSION:2}.gpq`, e.g. 12ABC_v01.gpq. Have updated the unit test to check that rows with different MGRS codes are saved to different files.
    
    * ⚡ Save GeoParquet file with ZSTD compression
    
    Using ZStandard compression instead of Parquet's default Snappy compression. Should result in slightly smaller filesizes, and slightly faster data transfer and compression (especially over the network). Also changed an assert statement to an if-then-raise instead.
    
    * ♻️ Predict with multiple workers and gather results to save
    
    Speed up embedding generation by enabling multiple workers to fetch and load mini-batches of GeoTIFF files independently, and run the prediction. The predictions or generated embeddings from each worker (a geopandas.GeoDataFrame) are then concatenated together row-wise, before getting passed to the GeoParquet output script. This is done via LightningModule's `on_predict_epoch_end` hook. Also documented these new processing steps in the docstring.
    weiji14 authored Dec 19, 2023 (083ce4c)
  2. Let ClayDataModule return same spatiotemporal fields as GeoTIFFDataModule (#91)
    
    * 🔧 Standardize on a data_dir parameter with a str type
    
    The GeoTIFFDataModule was using data_path:str, while ClayDataModule was using data_dir:Path. Standardize both to be data_dir:str instead. Some parts of this commit are adapted from 1009697.
    
    Also placed all the ClayDataModule's setup logic under `if stage=='fit'`, to reduce diff when predict step is implemented later.
    
    * 🎨 Get YYYY-MM-DD from GeoTIFF tag rather than filename
    
    More robust way of obtaining the Sentinel-2 imagery's acquisition date. Also returning the date in the datacube now.
    
    * 🎨 Simplify lonlat centroid calculation and return UTM bbox/epsg
    
    Can use rasterio's built-in lnglat() method to get the geographic center of the chip, instead of calculating it manually. See https://rasterio.readthedocs.io/en/latest/api/rasterio._base.html#rasterio._base.DatasetBase.lnglat Also returning the original UTM bounding box and EPSG code in the datacube.
    weiji14 authored Dec 19, 2023 (ba1bba5)
  3. Allow ClayDataModule to load GeoTIFF files directly from s3 (#92)

    * ✨ Allow ClayDataModule to get GeoTIFF data from an s3 bucket
    
    Allow passing in a URL to an s3 bucket, and loading the GeoTIFF data from there directly. Using the same torchdata based code for the s3 pathway as with commit f288eb8 in #85.
    
    * 🚚 Rename datacube's path key to source_url
    
    Using the same 'source_url' key in the returned datacube dictionary for both ClayDataModule and GeoTIFFDataPipeModule
    
    * 🚑 Use try-except to get absolute chip_path or fallback to str
    
    The getattr doesn't actually work properly, since we need to call chip_path.absolute() with brackets. Using a good ol' try-except statement instead, with the fallback being just the plain chip_path str (for s3 URLs).
    
    * ✨ Implement predict_dataloader for ClayDataModule
    
    Similar to the train/val dataloaders, but shuffling and pin_memory are both disabled.
    
    * ✅ Add parametrized test for checking ClayDataModule
    
    Ensure that outputs of both ClayDataModule and GeoTIFFDataPipeModule are the same-ish. Needed to make the split_ratio in ClayDataModule configurable, and check sorted list outputs instead of unsorted outputs for determinism. Fixed some hardcoded tensor shapes/dtypes, and dictionary keys too. Removed the nan_to_num casting of the image pixels in ClayDataModule so that int16 dtype inputs are accepted.
    
    * 📝 Edit docstrings in test_datamodule.py to be more generic
    
    Not just testing one, but two different LightningDataModules now!
    
    * 🔧 Add GDAL environment variables that might help with s3 loading
    
    Setting GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR and GDAL_HTTP_MERGE_CONSECUTIVE_RANGES=YES is supposed to improve GDAL performance when reading Cloud-Optimized GeoTIFFs. See https://gdal.org/user/configoptions.html.
    weiji14 authored Dec 19, 2023 (904e043)
  4. ignore datadisk

    brunosan committed Dec 19, 2023 (d2143af)
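    Relating to the ClayDataModule commit (#91) in this list, the spatiotemporal fields can be pulled straight from a GeoTIFF with rasterio, roughly like the sketch below (the "date" tag name is an assumption):

    ```python
    import rasterio

    with rasterio.open("chip.tif") as src:
        lon, lat = src.lnglat()        # geographic centroid of the chip
        bbox = src.bounds              # bounding box in the native UTM projection
        epsg = src.crs.to_epsg()       # EPSG integer code, e.g. 32756
        date = src.tags().get("date")  # YYYY-MM-DD stored in the GeoTIFF tags

    print(lon, lat, bbox, epsg, date)
    ```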

Commits on Dec 20, 2023

  1. Refactor model for multi-device usage and easier disabling of masking (#95)
    
    * ♻️ Better handle pos and band encodings across multi-devices
    
    Move the pos_encoding and band_encoding layers to the correct device in a way that allows Lightning to do multi-gpu properly. The reported loss is now synced or reduced/averaged across multiple devices too. Partially cherry-picked from 1a40f56
    
    Co-Authored-By: SRM <soumya@developmentseed.org>
    
    * ♻️ Compute num_masked_patches dynamically based on mask_ratio
    
    So that the masking can be turned off during prediction using `self.model.encoder.mask_ratio = 0`, where self is an instance of CLAYModule. The num_masked_patches integer value is now calculated on-the-fly by multiplying mask_ratio with num_patches.
    
    * 🎨 Register pos_encoding and band_encoding properly on device
    
    Since the pos_encoding and band_encoding tensors are declared in the __init__ method, we'll need to register them so that they are moved to the correct device by Lightning during the forward call. See https://lightning.ai/docs/pytorch/2.1.0/starter/converting.html#remove-any-cuda-or-to-device-calls
    
    ---------
    
    Co-authored-by: SRM <soumya@developmentseed.org>
    weiji14 and SRM authored Dec 20, 2023 (a61185f)
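    The device-placement fix above relies on registering the fixed encodings as buffers; a minimal illustration (names mirror the commit message, the module itself is simplified):

    ```python
    import torch
    from torch import nn


    class Encoder(nn.Module):
        def __init__(self, num_patches: int = 64, num_bands: int = 13, dim: int = 768):
            super().__init__()
            # Buffers follow .to(device) and Lightning's device placement,
            # but are not updated by the optimizer.
            self.register_buffer("pos_encoding", torch.zeros(1, num_patches, dim))
            self.register_buffer("band_encoding", torch.zeros(1, num_bands, dim))

        def forward(self, patches: torch.Tensor) -> torch.Tensor:
            return patches + self.pos_encoding
    ```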

Commits on Dec 22, 2023

  1. Document how to generate vector embeddings (#98)

    * 📝 Document how to generate vector embeddings
    
    Step by step instructions on how to produce embeddings from the pretrained model. From checking that one has permissions to get the GeoTIFF files, to downloading of the model checkpoint, and running the model prediction to get the GeoParquet output. Also gave a tip on what a suitable VM instance would be like.
    
    * 📝 Document details of how the mean embeddings were computed
    
    Extra technical details on how the raw (B, 1538, 768) embeddings are turned into (B, 768) shaped embeddings by taking the mean along the spatial patches.
    
    * 📝 Document format of the GeoParquet table and how to read it
    
    Useful details about the filename convention and table schema of the embeddings stored in GeoParquet format, and some sample GeoPandas code showing how to read a *.gpq file. Also linking to some guides and resources from the Cloud Native Geospatial Foundation.
    
    * ✏️ Typo embedding -> embeddings
    
    Never sure whether it's singular or plural.
    weiji14 authored Dec 22, 2023 (27774b6)
  2. Document how to finetune pretrained model on downstream task (#99)

    * 📝 Document how to finetune pretrained model on downstream task
    
    Explaining how the pre-trained model can be finetuned after attaching a head to the network. Written by Lilly.
    
    ---------
    
    Co-authored-by: Lilly Thomas <lilly@developmentseed.org>
    weiji14 and lillythomas authored Dec 22, 2023 (d658bfd)
  3. Document how the benchmark dataset labels were prepared (#100)

    📝 Document how the benchmark dataset labels were prepared
    
    Mention why we decided to use Cloud to Street for the initial benchmark dataset, and how the imagery and label data was processed to fit into the Clay Foundation model. Written by Lilly.
    
    Co-authored-by: Lilly Thomas <lilly@developmentseed.org>
    weiji14 and lillythomas authored Dec 22, 2023 (452e54d)
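    Tying back to the vector embeddings documentation (#98) in this list, reading a saved GeoParquet file back is a two-liner with geopandas (the path and filename are illustrative, following the {MGRS}_v{VERSION}.gpq convention):

    ```python
    import geopandas as gpd

    gdf = gpd.read_parquet(path="data/embeddings/12ABC_v01.gpq")
    print(gdf.columns)  # e.g. date, embeddings, geometry, source_url
    print(gdf.geometry.head())
    ```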

Commits on Dec 27, 2023

  1. squash fix

    This is fully Soumya's work; just pushing it here for simplicity to share at the moment.
    
    squash fix
    
    nbdev init
    
    added basic read
    
    test rendered
    
    Start release doc
    
    rendered docs
    
    better name for lib
    
    WIP v0 release docs
    
    rendered output
    
    ignore local sync
    
    old docs fully ported
    
    WIP
    
    generated output
    
    generated output
    
    output generated
    
    test new nbdev
    
    generated output
    
    test custom deploy of nbdev docs
    
    add jobs
    
    no description
    
    no inputs
    
    add geopandas
    
    add tqdm
    
    more deps
    
    small cleanup
    
    docs are inside /docs
    
    weird relatives paths
    
    WIP
    
    rel path
    
    trying new workflow script
    
    write permission
    
    bad params
    
    dark color top
    
    I think I can ignore locally generated static content
    
    dark
    
    rename core to model
    
    renamed
    
    GH pages setting
    
    new workflow
    
    path
    
    path
    
    checkout workflow
    
    paths
    
    ../
    
    actions
    
    action
    
    docs/_docs/
    
    ignore rendered output
    
    ignore rendered output
    
    stub
    
    clearer
    
    move release to roadmap, fix outputs
    
    add image
    
    Revert "Add minimal version of clayground"
    
    This reverts commit 6e2aac2.
    
    Remove all files from docs/docs
    
    leave root unchanged
    yellowcap authored and brunosan committed Dec 27, 2023 (3483be2)