Nbdev for Clay Documentation, Clay SDK and Clay notebooks all at once #102

Closed · wants to merge 44 commits

Commits on Sep 29, 2023

  1. Initial commit

    brunosan authored Sep 29, 2023 (06f6097)
  2. Update README.md

    brunosan authored Sep 29, 2023 (d791e4d)

Commits on Oct 27, 2023

  1. Update README.md

    brunosan authored Oct 27, 2023 (c8b1995)
  2. Update README.md

    brunosan authored Oct 27, 2023 (2eb5b2a)

Commits on Nov 8, 2023

  1. Initial conda environment and binder links (#15)

    * Initial conda environment and binder links
    
    Add conda dependency specification file and getting started instructions in the main README.md. The conda environment.yml is paired with a conda-lock.yml lockfile for full reproducibility. The main README.md contains quickstart buttons for Binder/Planetary Computer/SageMaker Studio Lab, and steps for local installation and usage.
    
    * ➕ Add zarr
    
    An implementation of chunked, compressed, N-dimensional arrays for Python! Repo at https://github.com/zarr-developers/zarr-python
    weiji14 authored Nov 8, 2023 (331ed5e)

Commits on Nov 9, 2023

  1. Setup LightningCLI trainer script (#24)

    * ➕ Add jsonargparse[signatures]
    
    Parsing of command line options, yaml/jsonnet config files and/or environment variables based on argparse! Also adding the signatures extras (which includes typeshed-client).
    
    * 🌱 Setup LightningCLI trainer script
    
    Setting up the command-line interface to run Lightning. Created a placeholder BaseDataModule and BaseLitModule to hold the data pipeline and model architecture respectively under the src/ folder. Documented in the main README.md how to run the LightningCLI commands, and created a src/README.md describing the Python modules in that folder.
    weiji14 authored Nov 9, 2023 (bb44d43)
  2. Setup GitHub Actions Continuous Integration tests (#25)

    * 🙈 Add .gitignore file
    
    * 👷 Setup GitHub Actions Continuous Integration tests
    
    Running tests on Ubuntu-22.04 and Python 3.11 only for now. Add a parametrized test to ensure that `python trainer.py fit --print_config=skip_null` works (as well as the validate/test subcommands). Tests are run using `python -m pytest src/tests/`.
    weiji14 authored Nov 9, 2023 (1c6de6a)
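    The two commits above describe, but don't show, the LightningCLI entrypoint and the parametrized smoke test. A minimal sketch of what such a trainer.py could look like is below; the class names BaseLitModule/BaseDataModule come from the commit message, everything else is an assumption rather than the repository's actual code.

    ```python
    # trainer.py -- hedged sketch of a LightningCLI entrypoint with placeholder modules.
    import lightning as L
    from lightning.pytorch.cli import LightningCLI


    class BaseLitModule(L.LightningModule):
        """Placeholder for the model architecture."""


    class BaseDataModule(L.LightningDataModule):
        """Placeholder for the data pipeline."""


    def cli_main():
        # Exposes the fit/validate/test/predict subcommands, e.g.
        # `python trainer.py fit --print_config=skip_null`
        return LightningCLI(model_class=BaseLitModule, datamodule_class=BaseDataModule)


    if __name__ == "__main__":
        cli_main()
    ```

    A corresponding parametrized test (hypothetical src/tests/test_trainer.py), run with `python -m pytest src/tests/`, might look like:

    ```python
    import subprocess

    import pytest


    @pytest.mark.parametrize("subcommand", ["fit", "validate", "test"])
    def test_print_config(subcommand):
        """Check that `python trainer.py <subcommand> --print_config=skip_null` runs."""
        result = subprocess.run(
            ["python", "trainer.py", subcommand, "--print_config=skip_null"],
            capture_output=True,
            text=True,
        )
        assert result.returncode == 0, result.stderr
    ```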

Commits on Nov 10, 2023

  1. Add pre-commit hooks with ruff formatter/linter rules (#26)

    * 🔧 Add pre-commit config with pre-commit-hooks and ruff
    
    Adding a .pre-commit-config.yaml file with some pre-commit hooks and the ruff linter/formatter. The ruff linter is configured to do autofix, and will run on python scripts and jupyter notebooks.
    
    * 🔧 Configure ruff rules with pyproject.toml file
    
    Enforce certain ruff formatting and lint rules such as UNIX-style line-endings, pycodestyle, pyflakes, isort, numpy, pylint and pyupgrade.
    
    * 🚨 Fix F841 by returning cli variable
    
    Fix `F841 Local variable 'cli' is assigned to but never used` in trainer.py.
    
    * [pre-commit.ci] auto fixes from pre-commit.com hooks
    
    for more information, see https://pre-commit.ci
    
    * 🔧 Configure pre-commit.ci to run autoupdates on quarterly basis
    
    Setting up pre-commit.ci to only run updates quarterly instead of the weekly default. Also explicitly stating that Pull Requests will be autofixed.
    
    ---------
    
    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    weiji14 and pre-commit-ci[bot] authored Nov 10, 2023 (4642c1e)

Commits on Nov 15, 2023

  1. Add geopandas-base (#34)

    Geographic pandas extensions!
    weiji14 authored Nov 15, 2023 (cdf2c68)
  2. Datacube (#27)

    * combining data arrays into multi-sensor data cube
    
    * implement cql2-json filters for S2 and S1
    
    * script to generate merged datacube
    
    * wip function for calculating sentinel 1 scene with max coverage of bbox
    
    * formatting
    
    * move script and remove notebook
    
    * use geom instead of geodataframe for initial aoi
    
    * use args in funcs
    
    * use CENTROID for geom in cql2 query
    
    * Use mosaic method, set singular time dimension based on Sentinel 2
    
    * add configurable args for cloud cover percentage and nodata percentage
    
    * use epsg code derived from Sentinel-2 properties, filter by best cloud-free conditions and orbit state
    
    * remove extra filter
    
    * map s2 for best image using datetime to id, set s2 bands as unique vars, mosaic s1 on time
    
    * assign S2 time as dataset variable
    
    * remove orbit filter
    
    * wrap example in main
    
    * move script to subdir
    
    * use cloud variable
    
    * use cloud variable
    
    * [pre-commit.ci] auto fixes from pre-commit.com hooks
    
    for more information, see https://pre-commit.ci
    
    ---------
    
    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    lillythomas and pre-commit-ci[bot] authored Nov 15, 2023 (1d1c013)
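    As a rough illustration of the CQL2-JSON filtering mentioned above (not the actual scripts/datacube.py code), a Sentinel-2 STAC search might look like the sketch below; the Planetary Computer endpoint, point coordinates and cloud-cover threshold are assumptions for the example.

    ```python
    import pystac_client

    catalog = pystac_client.Client.open(
        "https://planetarycomputer.microsoft.com/api/stac/v1"
    )

    # CQL2-JSON filter: Sentinel-2 L2A items intersecting a point, <= 20% clouds.
    cql2_filter = {
        "op": "and",
        "args": [
            {
                "op": "s_intersects",
                "args": [
                    {"property": "geometry"},
                    {"type": "Point", "coordinates": [174.76, -36.85]},
                ],
            },
            {"op": "<=", "args": [{"property": "eo:cloud_cover"}, 20]},
        ],
    }
    search = catalog.search(
        collections=["sentinel-2-l2a"], filter_lang="cql2-json", filter=cql2_filter
    )
    items = search.item_collection()  # replaces the deprecated get_all_items()
    print(len(items), "Sentinel-2 items found")
    ```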

Commits on Nov 16, 2023

  1. Fix lint errors in datacube script (#36)

    * 🚨 Fix E501 Line too long
    
    Wrapping docstrings in scripts/datacube.py to under 88 characters.
    
    * ♻️ Refactor best_nodata and best_clouds into single sort function
    
    Fixes F841 Local variable `best_nodata` is assigned to but never used. Only the best_clouds variable was used, and best_nodata was omitted, but both should be used. Doing this in a single pandas sort_values function.
    
    * 🚑 Quickfix with getting the STAC item with a specific datetime
    
    Patch cc99ae4
    
    * 🏷️ Rename variables to ds_ (xr.Dataset) or da (xr.DataArray)
    
    Using ds_ prefix for xr.Dataset objects, and da_ prefix for xr.DataArray objects.
    
    * 🔧 Set pylint max-args to 6
    
    Increase from the default value of 5 to 6.
    
    * 🗑️ Replace .get_all_items() with .item_collection()
    
    Fixes `FutureWarning: get_all_items() is deprecated, use item_collection() instead`.
    
    * 🔥 Remove sorting by nodata and just sort by least cloud cover
    
    No need to sort by `s2:nodata_pixel_percentage` anymore, just get the Sentinel-2 STAC item with the least cloud cover.
    
    * 📝 More DataArray to Dataset renames
    
    Missed a few more da_ to ds_ renames, following from 2af24be
    weiji14 authored Nov 16, 2023 (a1257d0)
  2. Landcover based sampling strategy. (#29)

    * Add landcover based sampling scripts
    
    Closes #28
    
    * Drop duplicates, fix typo, uncomment compute_stats function.
    
    * Fix comment that was out of sync with code
    yellowcap authored Nov 16, 2023 (8def26e)

Commits on Nov 17, 2023

  1. Small change: replace placeholder with our name and date (#42)

    replace placeholder with our name and date
    brunosan authored Nov 17, 2023 (ef4846d)
  2. Tiler module (#41)

    * Initial tile module that generates 256x256 tiled xarray datasets from the larger scene-level datacube
    
    * update comments
    
    * update comments
    
    * add docstrings and initial cloud and nodata filter
    
    * more efficient cloud and nodata filter
    
    * example script to run the datacube and tiler modules
    
    * adjust cloud filter
    
    * return valid region of datacube pre-tiling
    
    * [pre-commit.ci] auto fixes from pre-commit.com hooks
    
    for more information, see https://pre-commit.ci
    
    * Fix datacube processor (#43)
    
    * Fix bugs introduced in PR #43
    
    * some cosmetic updates
    
    * add a catch for sampled dates which don't have S1 scenes within the +/- 3 day surrounding interval
    
    * lower bad pixel percentage
    
    * [pre-commit.ci] auto fixes from pre-commit.com hooks
    
    for more information, see https://pre-commit.ci
    
    ---------
    
    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    Co-authored-by: Daniel Wiesmann <yellowcap@users.noreply.github.com>
    3 people authored Nov 17, 2023 (3451849)
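    For context on the tiler described above, a bare-bones version of chipping a scene-level datacube into 256x256 tiles with xarray could look like this (a sketch only; the real tile module also applies the cloud/nodata filters and writes the chips out):

    ```python
    import xarray as xr


    def tile_dataset(ds: xr.Dataset, size: int = 256):
        """Yield non-overlapping size x size tiles, dropping partial edge tiles."""
        for y0 in range(0, ds.sizes["y"] - size + 1, size):
            for x0 in range(0, ds.sizes["x"] - size + 1, size):
                yield ds.isel(y=slice(y0, y0 + size), x=slice(x0, x0 + size))
    ```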

Commits on Nov 21, 2023

  1. Fix line-length, boolean comparison and import errors (#45)

    * 🚨 Fix line-length, boolean comparison and import errors
    
    Fix linter errors:
    - E501 Line too long
    - E712 Comparison to `False` should be `cond is False` or `if not cond:`
    - E402 Module level import not at top of file
    
    * ✏️ Remove sys.path.append line
    weiji14 authored Nov 21, 2023 (1a859af)
  2. Bump conda-lock to 2.5.1, add fiona and h5netcdf (#46)

    * ⬆️ Bump conda-lock from 2.4.2 to 2.5.1
    
    Bumps [conda-lock](https://github.com/conda/conda-lock) from 2.4.2 to 2.5.1.
    - [Release notes](https://github.com/conda/conda-lock/releases)
    - [Commits](conda/conda-lock@v2.4.2...v2.5.1)
    
    * ➕ Add fiona
    
    Fiona reads and writes spatial data files!
    
    * ➕ Add h5netcdf
    
    Pythonic interface to netCDF4 via h5py!
    weiji14 authored Nov 21, 2023 (ddf90b7)
  3. Initial Vision Transformer architecture with MAE decoder (#37)

    * 📌 Pin to Pytorch 2.0 and CUDA 11.2
    
    Somehow using the `--with-cuda=11.8` flag in conda-lock didn't work as expected to get the CUDA-built Pytorch instead of the CPU version. Temporarily downgrading from Pytorch 2.1 to 2.0 and CUDA 11.8 to 11.2, to make it possible to install torchvision=0.15.2 from conda-forge later.
    
    * 🚧 Initial Vision Transformer architecture with MAE decoder
    
    Initializing the neural network architecture layers, specifically a Vision Transformer (ViT) B/32 backbone and a Masked Autoencoder (MAE) decoder. Using Lightly for the MAE setup, with the ViT backbone from torchvision. Setup is mostly adapted from https://github.com/lightly-ai/lightly/blob/v1.4.21/examples/pytorch_lightning/mae.py
    
    * ➕ Add transformers
    
    State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow!
    
    * 🏗️ Switch from torchvision to transformers ViTMAE
    
    Changing from lightly/torchvision's ViTMAE implementation to HuggingFace transformers' ViTMAE. This allows us to configure the number of input channels to a number other than 3 (e.g. 12). However, transformers' ViTMAE is an all-in-one class rather than an Encoder/Decoder split (though there's a way to access either once the class is instantiated). Allowed for configuring the masking_ratio instead of the decoder_dim size, and removed the MSE loss because it is implemented in the ViTMAE class already.
    
    * 👔 Implement forward pass and training_step
    
    Run input images through the encoder and decoder, and compute the pixel reconstruction loss from training the Masked Autoencoder.
    
    * ✅ Add unit test for MAELitModule
    
    Ensure that running one training step on a mini-batch works. Created a random torch Dataset that generates tensors of shape (12, 256, 256) until there is real data to train on.
    
    * 📌 Pin to CUDA 11.8
    
    No need to pin to CUDA 11.2 since not using torchvision anymore. Patches 06535cd
    
    * 🗃️ Increase input channels from 12 to 13
    
    The datacube has 13 channels, namely 10 from Sentinel-2's 10m and 20m resolution bands, 2 from Sentinel-1's VV and VH, and 1 from the Copernicus DEM.
    
    * 🐛 Remove hardcoded batch_size in assert statements
    
    Use a variable self.B instead of hardcoding 32 as the batch_size in the assert statements checking the tensor shape, so that the last mini-batch with a size less than 32 can be seen by the model.
    
    * 🚚 Rename to model_vit.py and ViTLitModule
    
    Rename MAELitModule to ViTLitModule, and model.py to model_vit.py, since we might be trying out different neural network model architectures later.
    weiji14 authored Nov 21, 2023 (cdd900b)
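    A hedged sketch of the HuggingFace transformers ViTMAE setup described above, with 13 input channels and a configurable mask ratio (the exact hyperparameters in the repository may differ):

    ```python
    import torch
    from transformers import ViTMAEConfig, ViTMAEForPreTraining

    config = ViTMAEConfig(
        image_size=256,   # chip size at this point in the history (raised to 512 later)
        patch_size=32,    # raised to 64 later
        num_channels=13,  # 10x Sentinel-2 bands + Sentinel-1 VV/VH + Copernicus DEM
        mask_ratio=0.75,  # fraction of patches masked during pretraining
    )
    model = ViTMAEForPreTraining(config)

    # One forward pass on a random mini-batch; the pixel reconstruction (MSE)
    # loss is computed inside ViTMAEForPreTraining itself.
    pixel_values = torch.randn(2, 13, 256, 256)
    outputs = model(pixel_values=pixel_values)
    print(outputs.loss)
    ```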

Commits on Nov 22, 2023

  1. Ready for batch (#44)

    Merging for testing on batch.
    
    * Integrate tiler and s3 upload to data pipeline
    
    * Remove unused file
    yellowcap authored Nov 22, 2023 (e03b030)

Commits on Nov 24, 2023

  1. Bump pytorch from 2.0.0 to 2.1.0, CUDA from 11.8 to 12.0 (#51)

    ⬆️ Bump pytorch from 2.0.0 to 2.1.0, CUDA from 11.8 to 12.0
    
    Bumps [torch](https://github.com/pytorch/pytorch) from 2.0.0 to 2.1.0.
    - [Release notes](https://github.com/pytorch/pytorch/releases)
    - [Changelog](https://github.com/pytorch/pytorch/blob/main/RELEASE.md)
    - [Commits](pytorch/pytorch@v2.0.0...v2.1.0)
    
    Also changing from the CUDA 11.8 build to the CUDA 12.0 build
    weiji14 authored Nov 24, 2023 (c8970c6)

Commits on Nov 28, 2023

  1. LightningDataModule to load GeoTIFF files (#52)

    * ➕ Add torchdata
    
    A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries!
    
    * ♻️ Refactor test_model_vit to use datapipe fixture
    
    Decoupling the neural network model's unit test from the LightningDataModule by implementing a standalone datapipe fixture instead.
    
    * ✨ Implement GeoTIFFDataPipeModule
    
    Create a LightningDataModule to load GeoTIFF files. Uses torchdata to create the data pipeline. Using the FileLister DataPipe to iterate over *.tif files in the data/ folder, and do a random 80/20 split for the training and validation set. The GeoTIFF files are read into numpy.ndarrays using rasterio, and converted to torch.Tensors with the default collate function. Using rasterio instead of rioxarray to reduce an extra layer of overhead in the data loading.
    
    * 🧵 Allow configuring num_workers in DataLoader
    
    Enable setting the number of subprocesses used for data loading. Default to 8 for now, but can be configured on LightningCLI using `python trainer.py fit --data.num_workers=8`.
    
    * 📌 Install torchdata=0.7.1 from conda-forge instead of PyPI
    
    Contains a build of torchdata that is pre-compiled with the correct AWSSDK extension, and won't result in errors like `ValueError: curlCode: 77, Problem with the SSL CA cert (path? access rights?)`.
    
    * 🔧 Allow configuring data path containing the GeoTIFF files
    
    Enable setting the path to the folder containing the GeoTIFF data files. Defaults to data/ for now, but can be configured on LightningCLI using `python trainer.py fit --data.data_path=data/56HKH`. Also setting the recursive=True flag to allow for files in nested directories.
    
    * ✅ Add unit test for GeoTIFFDataModule
    
    Ensure that loading one mini-batch of data from a data folder works. Created two temporary random GeoTIFF files containing arrays of shape (3, 256, 256) in a fixture for the test.
    weiji14 authored Nov 28, 2023 (7935fd2)
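    The datapipe described above could be sketched roughly as follows (assuming torchdata 0.7.x; the actual GeoTIFFDataPipeModule wraps this in a LightningDataModule with the 80/20 train/validation split, and the sharding filter was only added in a later commit):

    ```python
    import numpy as np
    import rasterio
    import torch
    import torchdata.datapipes.iter as dp


    def read_geotiff(filepath: str) -> torch.Tensor:
        """Read a GeoTIFF into a torch.Tensor of shape (channels, height, width)."""
        with rasterio.open(filepath) as src:
            array: np.ndarray = src.read()
        return torch.as_tensor(data=array.astype(np.float32))


    datapipe = (
        dp.FileLister(root="data/", masks="*.tif", recursive=True)
        .sharding_filter()  # distribute files across DataLoader workers
        .map(fn=read_geotiff)
        .batch(batch_size=32)
        .collate()
    )
    dataloader = torch.utils.data.DataLoader(dataset=datapipe, batch_size=None)
    ```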

Commits on Nov 29, 2023

  1. Configure model checkpointing (#55)

    * 🔧 Configure ModelCheckpoint callback
    
    Save model weights to a properly named checkpoint file like vit_epoch-09_train_loss-3250218.25.ckpt, stored in the checkpoints/ folder by default. More configuration can be done through LightningCLI, see `python trainer.py fit --trainer.callbacks.help=ModelCheckpoint`.
    
    * ⚡ Add AsyncCheckpointIO plugin to trainer
    
    Enable the experimental plugin that saves checkpoint files asynchronously in a thread. See https://lightning.ai/docs/pytorch/2.1.0/api/lightning.pytorch.plugins.io.AsyncCheckpointIO.html.
    weiji14 authored Nov 29, 2023 (90f0c6f)
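    In code, the checkpoint setup described above corresponds roughly to the sketch below (the real configuration is passed through LightningCLI; the monitored metric name and filename pattern here are assumptions):

    ```python
    import lightning as L
    from lightning.pytorch.callbacks import ModelCheckpoint
    from lightning.pytorch.plugins.io import AsyncCheckpointIO

    checkpoint_callback = ModelCheckpoint(
        dirpath="checkpoints/",
        filename="vit_epoch-{epoch:02d}_train_loss-{train/loss:.2f}",
        monitor="train/loss",
        auto_insert_metric_name=False,  # keep the custom filename pattern as-is
        save_last=True,
    )

    trainer = L.Trainer(
        callbacks=[checkpoint_callback],
        plugins=[AsyncCheckpointIO()],  # write checkpoints asynchronously in a thread
    )
    ```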

Commits on Nov 30, 2023

  1. Create CODE_OF_CONDUCT.md (#53)

    Using the Contributor Covenant Code of Conduct from https://www.contributor-covenant.org/version/2/1/code_of_conduct.html
    brunosan authored Nov 30, 2023 (750f14e)

Commits on Dec 1, 2023

  1. Add rioxarray (#59)

    ➕ Add rioxarray
    
    Rasterio xarray extension! Repo at https://github.com/corteva/rioxarray
    weiji14 authored Dec 1, 2023 (6f50653)

Commits on Dec 4, 2023

  1. Generate embeddings via prediction loop (#56)

    * 🍻 Generate embeddings via prediction loop
    
    Implement the embedding generator in the LightningModule's predict_step. The embeddings are tensor arrays that are saved to a .npy file in the data/embeddings/ folder. Input data is retrieved from the predict_dataloader, which is currently using the validation datapipe rather than a dedicated datapipe. Have documented how to generate the embedding output file using LightningCLI on the main README.md file. Also added a unit test to ensure that saving and loading from an embedding_0.npy file works.
    
    * 🐛 Disable masking of patches on predict_step
    
    Previously, 75% of the patches, or 48 out of a total of 64 were masked out, leaving 16 patches plus 1 cls_token = 17 sequences. Disabling the mask gives 64 + 1 cls_token = 65 sequences. Moved some assert statements with a fixed sequence_length dim from the forward function to the training_step. Also updated the unit test to ensure output embeddings have a shape like (batch_size, 65, 768).
    
    * ♻️ Refactor LightningDataModule to not do random split on predict
    
    Refactoring the setup method in the LightningDataModule to not do a random split on the predict stage. I.e. just do the GeoTIFF to torch.Tensor conversion directly, followed by batching and collating.
    
    * ✅ Test predict stage in geotiffdatamodule
    
    Need to explicitly pass an argument to stage in the test_geotiffdatapipemodule unit test. Testing both the fit and predict stages.
    
    * 👔 Ensure that embeddings have no NaN values
    
    Make sure that the generated embeddings do not have NaN values in them.
    
    * 🗃️ Take mean of the embeddings along sequence_length dim
    
    Instead of saving embeddings of shape (1, 65, 768), save out embeddings of shape (1, 768) instead. Done by taking the mean along the sequence_length dim, except for the cls_token part (first index in the 65).
    weiji14 authored Dec 4, 2023 (69ce703)
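    The mean-pooling step described in the last commit above boils down to a one-liner; a toy example with random data:

    ```python
    import torch

    raw_embeddings = torch.randn(1, 65, 768)  # 1 cls_token + 64 patch embeddings
    mean_embeddings = raw_embeddings[:, 1:, :].mean(dim=1)  # average patch tokens only
    assert mean_embeddings.shape == (1, 768)
    ```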

Commits on Dec 6, 2023

  1. Batch setup (#54)

    * Add bucket as argument to cli
    
    * Improve efficiency of datacube
    
    Keep S2 in Uint16 as long as possible, subset using indexing instead of sel
    
    * Simplify print statements
    
    * Add and document batch setup
    
    * Add sample as geopackage
    
    The GeoJSON was too big for the linter to be happy.
    
    * Small edit on README
    yellowcap authored Dec 6, 2023 (84f4509)
  2. Let LightningDataModule return spatiotemporal metadata (#66)

    * 🗃️ Let LightningDataModule return spatiotemporal metadata
    
    Making the LightningDataModule return not only the image, but also spatiotemporal metadata such as the bounding box, coordinate reference system, and date. The bbox and crs are in the raster image's native UTM projection for now, while the date is just a YYYY-MM-DD formatted string. Unit tests have been updated to ensure that the extra metadata is passed through.
    
    * ♻️ Refactor test_geotiffdatapipemodule to use parametrization
    
    Reduce duplicate code by using pytest.mark.parametrize, looping over the fit and predict stages.
    
    * 📝 Document returned outputs from _array_to_torch function
    
    Improved the docstring of the _array_to_torch function, mentioning the input parameters (filepath) and the contents of the output dictionary (image, bbox, crs, date). Also updated the type hint of the function.
    
    * 🚚 Rename crs to epsg
    
    Since we're storing the EPSG integer and not the CRS representation.
    weiji14 authored Dec 6, 2023 (8763bb3)

Commits on Dec 7, 2023

  1. Small pipeline fixes (#72)

    * Remove default for subset
    
    This is easy to forget, which leads to runs on a subset without the intention to actually subset.
    
    * Improve file name
    
    Closes #69
    
    - Zero padding for counter
    - v before version number
    - Underscores instead of hyphen separators
    - Drop hyphens from date stamp
    
    * Bump version to 02
    
    * Make mgrs sample file external
    
    Closes #71
    
    * Add date to raster metadata
    
    Closes #70
    
    * Improve print statement
    yellowcap authored Dec 7, 2023 (018fcbc)
  2. Setting the model license to OpenRail-M (#63)

    * Create model-license.md
    
    * [pre-commit.ci] auto fixes from pre-commit.com hooks
    
    for more information, see https://pre-commit.ci
    
    * Rename model-license.md to LICENSE-MODEL.md
    
    ---------
    
    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    brunosan and pre-commit-ci[bot] authored Dec 7, 2023 (c57bcd2)
  3. Check for no data on a tile level in Sentinel-1 VV and VH, Sentinel-2 and DEM (#60)
    
    * check for no data on a tile level in sentinel 1 vv and vh, sentinel 2 and DEM
    
    * adjust to run consecutively instead of all together, prevents unnecessary calculations
    
    * adjust per nodata type in other bands
    
    * Simplify nodata check by converting to loop
    
    ---------
    
    Co-authored-by: Daniel Wiesmann <daniel@wiesmann.pt>
    lillythomas and yellowcap authored Dec 7, 2023 (8afa6de)
  4. Improve date handling for data pipeline (#76)

    * Improve date handling for data pipeline
    
    If no match is found for a year, other years are tried until a match is found or all years have been tested.
    
    * Increase tile size to 512x512 pixels.
    
    Closes #78
    
    * Increase dates per location to 3
    
    Closes #79
    
    * Prevent printing s3 sync upload progress logs
    
    * Move counter above cloud filter to ensure index consistency
    
    This way, the tile IDs in the file names should be consistent across dates.
    
    * Fix typo in comment
    
    * Update batch run setup to new bucket name
    yellowcap authored Dec 7, 2023 (6946ac3)

Commits on Dec 8, 2023

  1. Save embeddings with spatiotemporal metadata to GeoParquet (#73)

    * ✨ Save embeddings with spatiotemporal metadata to GeoParquet
    
    Storing the vector embeddings alongside some spatial bounding box and datetime information in a tabular GeoParquet format, instead of an npy file! Using geopandas to create a GeoDataFrame with three columns - date, embeddings, geometry. The date is stored in Arrow's date32 format, embeddings are in FixedShapeTensorArray, and geometry is in WKB. Have updated the unit test's sample fixture data with the extra spatiotemporal data, and tested that the saved GeoParquet file can be loaded back.
    
    * 📝 Document how embeddings are generated and saved to geoparquet
    
    Improve the docstring of predict_step in the LightningModule on how the embeddings are generated, and then saved to a GeoParquet file with the spatiotemporal metadata. Included some ASCII art and a markdown table showing what the tabular data looks like.
    
    * 📝 Mention in main README.md that embeddings are saved to geoparquet
    
    Document that the embeddings are stored with spatiotemporal metadata as a GeoParquet file. Increased batch size from 1 to 1024.
    
    * 🎨 Update type hint of batch inputs, and add some inline comments
    
    Should have updated the type hints in #66, but might as well do it here. Also adding some more inline comments and fixed a typo.
    weiji14 authored Dec 8, 2023 (347e1ae)
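    A hedged sketch of the GeoParquet write path described above, using geopandas (the real predict_step builds the GeoDataFrame from the model outputs and batch metadata; the values and output filename here are dummies):

    ```python
    import geopandas as gpd
    import numpy as np
    import pandas as pd
    import shapely.geometry

    gdf = gpd.GeoDataFrame(
        data={
            "date": pd.to_datetime(["2023-12-08"]).date,  # acquisition date per chip
            "embeddings": [np.random.rand(768).astype("float32")],  # one row per chip
        },
        geometry=[shapely.geometry.box(minx=0.0, miny=0.0, maxx=1.0, maxy=1.0)],
        crs="EPSG:4326",
    )
    # A later commit (#86) switches to ZSTD compression and MGRS-based filenames.
    gdf.to_parquet(path="embeddings_0.gpq")
    ```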

Commits on Dec 11, 2023

  1. Adapt model to load 512x512 images from s3 bucket (#85)

    * 🔧 Increase image_size from 256 to 512, patch_size from 32 to 64
    
    Increase the chip image size from 256 to 512 pixels, and the patch size from 32 to 64 pixels. Updated the unit test and an assert statement, and fixed a typo.
    
    * 👽 Get YYYY-MM-DD from GeoTIFF tag instead of filename
    
    Obtaining the YYYY-MM-DD date from the GeoTIFF's tag metadata, instead of parsing it from the filename, thanks to the change at 426aa06/#72.
    
    * ✨ Allow GeoTIFFDataModule to get GeoTIFF data from an s3 bucket
    
    New feature to allow passing in a URL to an s3 bucket, and loading the GeoTIFF data from there directly. Added a unit test that checks that this works to list a GeoTIFF file from s3://copernicus-dem-30m/. Also improved the docstring and type hint of the setup() function's 'stage' parameter.
    
    * 🐛 Add sharding filter before loading GeoTIFF data to torch.Tensor
    
    Need to do this so that the data loading is distributed to the workers, otherwise each worker is doing duplicated work. Also set num_workers to 1 in test_geotiffdatapipemodule to get a consistent result.
    
    * 🙈 Gitignore checkpoints in nested folders
    
    Ensure that *.ckpt files in sub-folders are ignored too.
    
    * ⚡ Set float32 matmul precision to medium
    
    Prevents messages like `You are using a CUDA device ('NVIDIA A10G') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance.`
    
    * 📝 Mention in main README.md that data_path can be an s3 bucket
    
    Just casually documenting in the main README.md on how one can directly generate embeddings from GeoTIFF files stored in an s3 bucket instead of locally.
    weiji14 authored Dec 11, 2023 (eadfea0)

Commits on Dec 15, 2023

  1. Callback function to log Masked Autoencoder reconstructions to WandB (#88)
    
    * ➕ Add wandb
    
    A CLI and library for interacting with the Weights and Biases API!
    
    * 🔊 Log Masked Autoencoder reconstructions to WandB
    
    Created a custom callback function to log visualizations of the input and output images to the Masked Autoencoder. Only showing the RGB bands of Sentinel-2 for now. A sample of 6 image pairs (original + reconstructed, so 12 in total) is uploaded to Weights and Biases.
    
    Example LightningCLI command: `python trainer.py fit --trainer.max_epochs=20 --data.data_path=data/32VLM --trainer.logger=WandbLogger --trainer.logger.project=clay --trainer.logger.save_dir=checkpoints --trainer.callbacks+=LogMAEReconstructedImage`.
    
    * ➕ Add scikit-image
    
    Image processing in Python!
    
    * 📸 Apply histogram equalization to RGB images
    
    Enhance low contrast images by applying a histogram equalization stretching algorithm on the RGB images, instead of dividing by a magic number like 6000.
    
    * 🔧 Increase default sample size from 6 to 8
    
    More samples to look at! Also only running einsum conversion on as many samples as needed rather than the whole batch, and handling cases where num_samples may be more than the batch_size.
    
    * 🧑‍💻 Make wandb a somewhat optional dependency
    
    Allows for `from src.callback_wandb import LogMAEReconstruction` to run, even without wandb being installed. Helpful if someone doesn't want to install wandb for whatever reason.
    
    * ✅ Add unit test for LogMAEReconstruction
    
    Testing that the LogMAEReconstruction callback works to save a set of images to WandB. Testing this in offline mode only, with checks that artifacts are saved locally, and that the wandb images have the correct caption and format.
    
    * 🐛 Compare expected folders using set instead of list
    
    Order of the folders could change, so using set instead of list.
    
    * 🧪 Prevent WandB logger from saving logs to local drive for now
    
    Setting WANDB_MODE="disabled", so no files are logged to disk, though the wandb.Image(s) are still created. See if this helps to resolve the exit code 255 issue on GitHub Actions.
    
    * 📝 Fix a typo and improve docstring
    
    Minor changes to the docstring of the on_validation_batch_end method, and a typo fix.
    weiji14 authored Dec 15, 2023 (f9fe458)
  2. Implement MAE with support for position, time, latlon & channel embeddings (#47)
    
    * Add modified ViT to encode latlon, time, channels & position embeddings
    
    * Add MAE for modified ViT
    
    * Add docstrings & fix issue with complex indexing
    
    * Fix the comments on loss computation
    
    * Add datamodule & trainer to run an epoch of training
    
    * Normalize data before feeding to the model
    
    * Add fixed sincos embedding for position & bands
    
    * Add logging & ckpt options
    
    * Fix the order of coords from lat,lon to lon,lat
    
    * Add clay tiny,small,medium,large model versions
    
    * Remove hardcoded patch size in LogIntermediatePredictions callback
    
    Retrieve the patch size value from the model architecture, rather than hardcoding as 32. Also ensure that the input image shape is the same as the predicted image from the decoder.
    
    * Run clay small on image size 512 for 10 epochs with grad_acc
    
    * Make the clay construction configurable
    
    * Return the data path to reference for vector embeddings
    
    * Remove duplicate dataset.py & geovit.py
    
    * 🔀 Merge srm_trainer.py into trainer.py
    
    Have one entrypoint to run the model using Lightning CLI. Switched model from VitLitModule to CLAYModule, and datamodule from GeoTIFFDataPipeModule to ClayDataModule. Temporarily disabling the logging and monitoring callbacks for now.
    
    * 🔀 Combine clay.py and model.py into model_clay
    
    Putting the CLAYModule (LightningModule) together with the CLAY torch.nn.Module in a single model_clay.py file. Have mentioned in src/README.md that model_clay.py is the one with custom spatiotemporal encoders, while the previous model_vit.py contains vanilla Vision Transformer implementation.
    
    * ➕ Add matplotlib-base
    
    Publication quality figures in Python!
    
    * 🚚 Move ClayDataset and ClayDataModule into datamodule.py
    
    Putting the DataLoader code in one file - datamodule.py. The regular torch Dataset classes are placed on top of the existing torchdata-based functions/classes.
    
    * 🚚 Move LogIntermediatePredictions callback into callbacks_wandb
    
    Moving the LogIntermediatePredictions callback class from callbacks.py into callbacks_wandb.py.
    
    * ♻️ Get WandB logger properly using a shared function
    
    Getting the WandbLogger directly from the trainer, rather than having to pass it through __init__. Adapted from https://github.com/ashleve/lightning-hydra-template/blob/334601c0326a50ff301fbd76057b36408cf97ffa/src/callbacks/wandb_callbacks.py#L16C1-L34C6
    
    * 🚨 Wrap docstring and fix too-many-arguments lint error
    
    Converted docstrings from numpydoc style which uses less horizontal space but more vertical space. Also added a noqa comment for three instances of `PLR0913 Too many arguments in function definition`.
    
    ---------
    
    Co-authored-by: SRM <soumya@developmentseed.org>
    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    Co-authored-by: Wei Ji <23487320+weiji14@users.noreply.github.com>
    4 people authored Dec 15, 2023 (c914913)
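    One of the commits above adds fixed sincos embeddings for position and bands. As a generic illustration (not the repository's exact implementation), a 1D sine-cosine embedding can be built like this:

    ```python
    import torch


    def posemb_sincos_1d(num_positions: int, dim: int, temperature: float = 10000.0):
        """Fixed (non-learned) sine-cosine embeddings of shape (num_positions, dim)."""
        positions = torch.arange(num_positions).unsqueeze(1)  # (N, 1)
        omega = 1.0 / temperature ** (torch.arange(dim // 2) / (dim // 2))  # (dim/2,)
        angles = positions * omega  # (N, dim/2) by broadcasting
        return torch.cat(tensors=[angles.sin(), angles.cos()], dim=1)  # (N, dim)


    pos_encoding = posemb_sincos_1d(num_positions=64, dim=768)
    ```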

Commits on Dec 17, 2023

  1. Initialize Jupyter Book documentation (#89)

    * ➕ Add jupyter-book
    
    Build a book with Jupyter Notebooks and Sphinx!
    
    * 📝 Initialize Jupyter Book
    
    Starting with a minimally modified Jupyter Book initialized with `jupyter-book create docs/`. Changed the `_config.yml` to use a proper title and the Clay logo. Included a Binder launch button and a footer with CC-BY-4.0 license.
    
    Deleted the sample notebooks.ipynb and markdown-notebooks.md files, and excluded the book/requirements.txt (dependencies will be installed from environment.yml). Put in a placeholder installation page for now in the Table of Contents.
    
    * 🚀 Deploy Jupyter Book to GitHub Pages via GitHub Actions
    
    Continuous Integration workflow to build the Jupyter Book's html pages and publish it online to GitHub Pages. Based on https://jupyterbook.org/en/stable/publish/gh-pages.html#automatically-host-your-book-with-github-actions, but modernized to use GitHub Actions based publishing source, see https://github.blog/changelog/2022-07-27-github-pages-custom-github-actions-workflows-beta
    
    * 📝 Add 'About Clay' section with links to GitHub and LinkedIn pages
    
    Add external links to Clay's GitHub organization page, and LinkedIn.
    
    * 🔍 Add badges to main README.md
    
    Add badges pointing to the Jupyter Book page, and for the deploy-book.yml/test.yml GitHub Action statuses below the title in the main README.md page. Also modified the description into something more compelling.
    weiji14 authored Dec 17, 2023 (bf58fae)

Commits on Dec 19, 2023

  1. Rename embeddings file to include MGRS code and store GeoTIFF source_url (#86)
    
    * 🗃️ Store source_url of GeoTIFF to GeoParquet file
    
    Passing the URL or path of the GeoTIFF file through the datapipe, and into the model's prediction loop. The geopandas.GeoDataFrame now has an extra 'source_url' string column, and this is saved to the GeoParquet file too.
    
    * 🚚 Save one GeoParquet file for each unique MGRS tile
    
    For each MGRS code (e.g. 12ABC), save a GeoParquet file with a name formatted like `{MGRS:5}_v{VERSION:2}.gpq`, e.g. 12ABC_v01.gpq. Have updated the unit test to check that rows with different MGRS codes are saved to different files.
    
    * ⚡ Save GeoParquet file with ZSTD compression
    
    Using ZStandard compression instead of Parquet's default Snappy compression. Should result in slightly smaller filesizes, and slightly faster data transfer and compression (especially over the network). Also changed an assert statement to an if-then-raise instead.
    
    * ♻️ Predict with multiple workers and gather results to save
    
    Speed up embedding generation by enabling multiple workers to fetch and load mini-batches of GeoTIFF files independently, and run the prediction. The predictions or generated embeddings from each worker (a geopandas.GeoDataFrame) are then concatenated together row-wise, before getting passed to the GeoParquet output script. This is done via LightningModule's `on_predict_epoch_end` hook. Also documented these new processing steps in the docstring.
    weiji14 authored Dec 19, 2023 (083ce4c)
  2. Let ClayDataModule return same spatiotemporal fields as GeoTIFFDataModule (#91)
    
    * 🔧 Standardize on a data_dir parameter with a str type
    
    The GeoTIFFDataModule was using data_path:str, while ClayDataModule was using data_dir:Path. Standardize both to be data_dir:str instead. Some parts of this commit are adapted from 1009697.
    
    Also placed all the ClayDataModule's setup logic under `if stage=='fit'`, to reduce diff when predict step is implemented later.
    
    * 🎨 Get YYYY-MM-DD from GeoTIFF tag rather than filename
    
    More robust way of obtaining the Sentinel-2 imagery's acquisition date. Also returning the date in the datacube now.
    
    * 🎨 Simplify lonlat centroid calculation and return UTM bbox/epsg
    
    Can use rasterio's built-in lnglat() method to get the geographic center of the chip, instead of calculating it manually. See https://rasterio.readthedocs.io/en/latest/api/rasterio._base.html#rasterio._base.DatasetBase.lnglat Also returning the original UTM bounding box and EPSG code in the datacube.
    weiji14 authored Dec 19, 2023 (ba1bba5)
  3. Allow ClayDataModule to load GeoTIFF files directly from s3 (#92)

    * ✨ Allow ClayDataModule to get GeoTIFF data from an s3 bucket
    
    Allow passing in a URL to an s3 bucket, and loading the GeoTIFF data from there directly. Using the same torchdata based code for the s3 pathway as with commit f288eb8 in #85.
    
    * 🚚 Rename datacube's path key to source_url
    
    Using the same 'source_url' key in the returned datacube dictionary for both ClayDataModule and GeoTIFFDataPipeModule
    
    * 🚑 Use try-except to get absolute chip_path or fallback to str
    
    The getattr doesn't actually work properly, since we need to call chip_path.absolute() with brackets. Using a good ol' try-except statement instead, with the fallback being just the plain chip_path str (for s3 URLs).
    
    * ✨ Implement predict_dataloader for ClayDataModule
    
    Similar to the train/val dataloaders, but shuffling and pin_memory are both disabled.
    
    * ✅ Add parametrized test for checking ClayDataModule
    
    Ensure that outputs of both ClayDataModule and GeoTIFFDataPipeModule are the same-ish. Needed to make the split_ratio in ClayDataModule configurable, and check sorted list outputs instead of unsorted outputs for determinism. Fixed some hardcoded tensor shapes/dtypes, and dictionary keys too. Removed the nan_to_num casting of the image pixels in ClayDataModule so that int16 dtype inputs are accepted.
    
    * 📝 Edit docstrings in test_datamodule.py to be more generic
    
    Not just testing one, but two different LightningDataModules now!
    
    * 🔧 Add GDAL environment variables that might help with s3 loading
    
    Setting GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR and GDAL_HTTP_MERGE_CONSECUTIVE_RANGES=YES is supposed to improve GDAL performance when reading Cloud-Optimized GeoTIFFs. See https://gdal.org/user/configoptions.html.
    weiji14 authored Dec 19, 2023 (904e043)
  4. ignore datadisk

    brunosan committed Dec 19, 2023 (d2143af)
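    Relating to the ClayDataModule commit (#91) in this list, the spatiotemporal fields can be pulled straight from a GeoTIFF with rasterio, roughly like the sketch below (the "date" tag name is an assumption):

    ```python
    import rasterio

    with rasterio.open("chip.tif") as src:
        lon, lat = src.lnglat()        # geographic centroid of the chip
        bbox = src.bounds              # bounding box in the native UTM projection
        epsg = src.crs.to_epsg()       # EPSG integer code, e.g. 32756
        date = src.tags().get("date")  # YYYY-MM-DD stored in the GeoTIFF tags

    print(lon, lat, bbox, epsg, date)
    ```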

Commits on Dec 20, 2023

  1. Refactor model for multi-device usage and easier disabling of masking (#95)
    
    * ♻️ Better handle pos and band encodings across multi-devices
    
    Move the pos_encoding and band_encoding layers to the correct device in a way that allows Lightning to do multi-gpu properly. The reported loss is now synced or reduced/averaged across multiple devices too. Partially cherry-picked from 1a40f56
    
    Co-Authored-By: SRM <soumya@developmentseed.org>
    
    * ♻️ Compute num_masked_patches dynamically based on mask_ratio
    
    So that the masking can be turned off during prediction using `self.model.encoder.mask_ratio = 0`, where self is an instance of CLAYModule. The num_masked_patches integer value is now calculated on-the-fly by multiplying mask_ratio with num_patches.
    
    * 🎨 Register pos_encoding and band_encoding properly on device
    
    Since the pos_encoding and band_encoding tensors are declared in the __init__ method, we'll need to register them so that they are moved to the correct device by Lightning during the forward call. See https://lightning.ai/docs/pytorch/2.1.0/starter/converting.html#remove-any-cuda-or-to-device-calls
    
    ---------
    
    Co-authored-by: SRM <soumya@developmentseed.org>
    weiji14 and SRM authored Dec 20, 2023 (a61185f)
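    The device-placement fix above relies on registering the fixed encodings as buffers; a minimal illustration (names mirror the commit message, the module itself is simplified):

    ```python
    import torch
    from torch import nn


    class Encoder(nn.Module):
        def __init__(self, num_patches: int = 64, num_bands: int = 13, dim: int = 768):
            super().__init__()
            # Buffers follow .to(device) and Lightning's device placement,
            # but are not updated by the optimizer.
            self.register_buffer("pos_encoding", torch.zeros(1, num_patches, dim))
            self.register_buffer("band_encoding", torch.zeros(1, num_bands, dim))

        def forward(self, patches: torch.Tensor) -> torch.Tensor:
            return patches + self.pos_encoding
    ```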

Commits on Dec 22, 2023

  1. Document how to generate vector embeddings (#98)

    * 📝 Document how to generate vector embeddings
    
    Step by step instructions on how to produce embeddings from the pretrained model. From checking that one has permissions to get the GeoTIFF files, to downloading of the model checkpoint, and running the model prediction to get the GeoParquet output. Also gave a tip on what a suitable VM instance would be like.
    
    * 📝 Document details of how the mean embeddings were computed
    
    Extra technical details on how the raw (B, 1538, 768) embeddings are turned into (B, 768) shaped embeddings by taking the mean along the spatial patches.
    
    * 📝 Document format of the GeoParquet table and how to read it
    
    Useful details about the filename convention and table schema of the embeddings stored in GeoParquet format, and some sample GeoPandas code showing how to read a *.gpq file. Also linking to some guides and resources from the Cloud Native Geospatial Foundation.
    
    * ✏️ Typo embedding -> embeddings
    
    Never sure whether it's singular or plural.
    weiji14 authored Dec 22, 2023 (27774b6)
  2. Document how to finetune pretrained model on downstream task (#99)

    * 📝 Document how to finetune pretrained model on downstream task
    
    Explaining how the pre-trained model can be finetuned after attaching a head to the network. Written by Lilly.
    
    ---------
    
    Co-authored-by: Lilly Thomas <lilly@developmentseed.org>
    weiji14 and lillythomas authored Dec 22, 2023 (d658bfd)
  3. Document how the benchmark dataset labels were prepared (#100)

    📝 Document how the benchmark dataset labels were prepared
    
    Mention why we decided to use Cloud to Street for the initial benchmark dataset, and how the imagery and label data was processed to fit into the Clay Foundation model. Written by Lilly.
    
    Co-authored-by: Lilly Thomas <lilly@developmentseed.org>
    weiji14 and lillythomas authored Dec 22, 2023 (452e54d)
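    Tying back to the vector embeddings documentation (#98) in this list, reading a saved GeoParquet file back is a two-liner with geopandas (the path and filename are illustrative, following the {MGRS}_v{VERSION}.gpq convention):

    ```python
    import geopandas as gpd

    gdf = gpd.read_parquet(path="data/embeddings/12ABC_v01.gpq")
    print(gdf.columns)  # e.g. date, embeddings, geometry, source_url
    print(gdf.geometry.head())
    ```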

Commits on Dec 27, 2023

  1. squash fix

    This is fully Soumya's work; just pushing it here for simplicity to share at the moment.
    
    squash fix
    
    nbdev init
    
    added basic read
    
    test rendered
    
    Start release doc
    
    rendered docs
    
    better name for lib
    
    WIP v0 release docs
    
    rendered output
    
    ignore local sync
    
    old docs fully ported
    
    WIP
    
    generated output
    
    generated output
    
    output generated
    
    test new nbdev
    
    generated output
    
    test custom deploy of nbdev docs
    
    add jobs
    
    no description
    
    no inputs
    
    add geopandas
    
    add tqdm
    
    more deps
    
    small cleanup
    
    docs are inside /docs
    
    weird relatives paths
    
    WIP
    
    rel path
    
    trying new workflow script
    
    write permission
    
    bad params
    
    dark color top
    
    I think I can ignore locally generated static content
    
    dark
    
    rename core to model
    
    renamed
    
    GH pages setting
    
    new workflow
    
    path
    
    path
    
    checkout workflow
    
    paths
    
    ../
    
    actions
    
    action
    
    docs/_docs/
    
    ignore rendered output
    
    ignore rendered output
    
    stub
    
    clearer
    
    move release to roadmap, fix outputs
    
    add image
    
    Revert "Add minimal version of clayground"
    
    This reverts commit 6e2aac2.
    
    Remove all files from docs/docs
    
    leave root unchanged
    yellowcap authored and brunosan committed Dec 27, 2023 (3483be2)