SatCLIP paper. #57
-
I've reviewed the paper. The graphic is a helpful reference.
-
(thanks for the kind words Bruno 😄) Happy to answer any questions! And here's the code -- https://github.com/microsoft/satclip. cc @konstantinklemmer, who led the project!
-
Finally had time to digest it. It's SO CLOSE to the location-aware spec we talked about for v1 @srmsoumya @yellowcap

A one-liner summary for me would be: turn any lat-lon into a semantic embedding, learnt from images, but expandable to learn from any geolocated data (text, POIs, ...).

CLIP is a training goal, that is, a method to create gradients that update the weights of the models that create the embeddings. The updates aim towards a goal. In this case we have two models: one that goes from lat-lon to embeddings, and another that goes from images to embeddings. CLIP takes a batch of pairs of embeddings (each pair in the batch is made of 1) the embedding of a specific lat-lon and 2) the embedding of the image of that location) and updates the weights of both models so that the embeddings of the matching pair are closest (cosine similarity) while the distance to the rest of the batch is furthest (rough sketches of these pieces at the end of this comment). The larger the batch, the harder the problem: too big and it won't learn much; too small and there isn't much to contrast with. (CLIP for text seems best around a 32k batch size; SatCLIP found 8k better for them.) The lat-lon encoder then becomes an encoder also of the image contents: once trained, it has learnt what to expect given ONLY the lat and lon. The end result is a location encoder that can be fine-tuned to predict with VERY HIGH accuracy Air Temperature, Elevation, Median Income, California, ...

**Training:** 500 epochs on a single A100, with batch size 3k, on 100K patches.

**Location extent:** We also need an extent for that location (saying lat-lon 23,24 also needs an extent). They use Legendre polynomials, which I'm going to rudely summarize as complete spherical components to describe anything on the Earth's surface. The more polynomials, the more resolution. SatCLIP trained a coarser L=10 and a finer L=40. (Note: I've amended a ...)

**Data augmentation:** It was surprising to see they use flips and crops for data augmentation of images. I would have expected that if the goal is to encode location, one should not mess with the location of the pixels.

**Questions for Clay:**
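To make the contrastive objective above concrete, here is a minimal PyTorch sketch of a CLIP-style training step between a location encoder and an image encoder. This is an illustration of the general recipe, not SatCLIP's actual code: the toy encoder, embedding size and temperature are made up, and the real location branch consumes a spherical-harmonic basis rather than raw coordinates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyLocationEncoder(nn.Module):
    """Placeholder lat-lon -> embedding network (not SatCLIP's architecture)."""
    def __init__(self, in_dim=2, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, emb_dim)
        )

    def forward(self, coords):  # coords: (B, 2) lon/lat, or basis features
        return self.net(coords)

def clip_loss(loc_emb, img_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (location, image) pairs."""
    loc = F.normalize(loc_emb, dim=-1)    # unit vectors, so dot product = cosine similarity
    img = F.normalize(img_emb, dim=-1)
    logits = loc @ img.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.shape[0], device=logits.device)
    # Each location should be most similar to its own image, and vice versa;
    # every other item in the batch acts as a negative.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

The batch-size trade-off mentioned above shows up directly here: each row of `logits` is a classification over the whole batch, so the batch size sets how many negatives every pair is contrasted against.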
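And a rough sketch of the "Legendre polynomials" point: expand a lat/lon into a spherical-harmonic basis whose maximum degree L controls the spatial resolution. This uses SciPy's generic `sph_harm`, not SatCLIP's own implementation, and the angle conventions plus the real-part shortcut are my simplifications.

```python
import numpy as np
from scipy.special import sph_harm

def spherical_harmonic_features(lat_deg, lon_deg, L=10):
    """Expand a lat/lon into an L**2-dimensional spherical-harmonic basis vector."""
    theta = np.deg2rad(lon_deg % 360.0)  # azimuth in [0, 2*pi), SciPy's "theta"
    phi = np.deg2rad(90.0 - lat_deg)     # colatitude in [0, pi], SciPy's "phi"
    feats = []
    for l in range(L):                   # higher max degree L = finer spatial detail
        for m in range(-l, l + 1):
            feats.append(sph_harm(m, l, theta, phi).real)
    return np.asarray(feats)

coarse = spherical_harmonic_features(23.0, 24.0, L=10)  # 100 basis values
fine = spherical_harmonic_features(23.0, 24.0, L=40)    # 1600 basis values
```

A small network then maps these basis values to the embedding, which matches the coarse L=10 vs finer L=40 trade-off above: more degrees, more resolution, more inputs.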
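Finally, the "fine-tuned to predict Air Temperature / Elevation / ..." step is basically: freeze the pretrained location encoder, embed the coordinates of your labelled points, and fit a small head on top. A hedged sketch, with `pretrained_location_encoder` standing in for whatever checkpoint you load from the repo above (check the repo README for the actual loading API):

```python
import torch
import torch.nn as nn

def fit_probe(pretrained_location_encoder, coords, targets, epochs=200, lr=1e-3):
    """Regress a geolocated variable (e.g. air temperature) from frozen location embeddings."""
    pretrained_location_encoder.eval()
    with torch.no_grad():
        emb = pretrained_location_encoder(coords)  # (N, D) embeddings, encoder kept frozen
    head = nn.Sequential(nn.Linear(emb.shape[1], 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(head(emb).squeeze(-1), targets)
        loss.backward()
        opt.step()
    return head
```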
-
Thanks for tagging me @calebrob6! And thanks everyone for checking out our paper. Let me know if there are any questions. A few quick thoughts:
-
@calebrob6 et al just released SatCLIP. I've only skimmed it so far but posting here to see what we can learn for, probably, Clay v1.
Let's also consider brainstorming with Caleb at some point to get his view. When I was leading AI for Earth and he was on AI for Good, he always had really good insights (e.g. what became Microsoft PEARL grew out of an MVP he made for one of his projects).
cc @yellowcap @srmsoumya @weiji14