SatCLIP paper. #57
-
I've reviewed the paper. The graphic is a helpful reference.
-
(thanks for the kind words Bruno 😄) Happy to answer any questions! And here's the code -- https://github.com/microsoft/satclip. cc @konstantinklemmer, who led the project!
-
Finally had time to digest it. It's SO CLOSE to the location-aware spec we talked about for v1 @srmsoumya @yellowcap

A one-liner summary for me would be: turn any lat-lon into a semantic embedding, learnt from images, but expandable to learn from any geolocated data (text, POIs, ...).

CLIP is a training goal, that is, a method to create gradients that update the weights of the models that create the embeddings. The updates aim towards a goal. In this case we have two models: one that goes from lat-lon to embeddings, and another that goes from images to embeddings. CLIP takes a batch of pairs of embeddings (each pair in the batch is made of 1) the embedding of a specific lat-lon and 2) the embedding of the image of that location) and updates the weights of both models so that the embeddings of the matching pair are closest (cosine similarity) while the distance to the rest of the batch is furthest (rough sketches of these pieces at the end of this comment). The larger the batch, the harder the problem: too big and it won't learn much; too small and there isn't much to contrast with. (CLIP for text seems best around a 32k batch size; SatCLIP found 8k better for them.) The lat-lon encoder then becomes an encoder also of the image contents: once trained, it has learnt what to expect given ONLY the lat and lon. The end result is a location encoder that can be fine-tuned to predict with VERY HIGH accuracy Air Temperature, Elevation, Median Income, California, ...

**Training:** 500 epochs on a single A100, with batch size 3k, on 100K patches.

**Location extent:** We also need an extent for that location (saying lat-lon 23,24 also needs an extent). They use Legendre polynomials, which I'm going to rudely summarize as complete spherical components to describe anything on the Earth's surface. The more polynomials, the more resolution. SatCLIP trained a coarser L=10 and a finer L=40. (Note: I've amended a ...)

**Data augmentation:** It was surprising to see they use flips and crops for data augmentation of images. I would have expected that if the goal is to encode location, one should not mess with the location of the pixels.

**Questions for Clay:**
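To make the contrastive objective above concrete, here is a minimal PyTorch sketch of a CLIP-style training step between a location encoder and an image encoder. This is an illustration of the general recipe, not SatCLIP's actual code: the toy encoder, embedding size and temperature are made up, and the real location branch consumes a spherical-harmonic basis rather than raw coordinates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyLocationEncoder(nn.Module):
    """Placeholder lat-lon -> embedding network (not SatCLIP's architecture)."""
    def __init__(self, in_dim=2, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, emb_dim)
        )

    def forward(self, coords):  # coords: (B, 2) lon/lat, or basis features
        return self.net(coords)

def clip_loss(loc_emb, img_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (location, image) pairs."""
    loc = F.normalize(loc_emb, dim=-1)    # unit vectors, so dot product = cosine similarity
    img = F.normalize(img_emb, dim=-1)
    logits = loc @ img.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.shape[0], device=logits.device)
    # Each location should be most similar to its own image, and vice versa;
    # every other item in the batch acts as a negative.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

The batch-size trade-off mentioned above shows up directly here: each row of `logits` is a classification over the whole batch, so the batch size sets how many negatives every pair is contrasted against.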
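And a rough sketch of the "Legendre polynomials" point: expand a lat/lon into a spherical-harmonic basis whose maximum degree L controls the spatial resolution. This uses SciPy's generic `sph_harm`, not SatCLIP's own implementation, and the angle conventions plus the real-part shortcut are my simplifications.

```python
import numpy as np
from scipy.special import sph_harm

def spherical_harmonic_features(lat_deg, lon_deg, L=10):
    """Expand a lat/lon into an L**2-dimensional spherical-harmonic basis vector."""
    theta = np.deg2rad(lon_deg % 360.0)  # azimuth in [0, 2*pi), SciPy's "theta"
    phi = np.deg2rad(90.0 - lat_deg)     # colatitude in [0, pi], SciPy's "phi"
    feats = []
    for l in range(L):                   # higher max degree L = finer spatial detail
        for m in range(-l, l + 1):
            feats.append(sph_harm(m, l, theta, phi).real)
    return np.asarray(feats)

coarse = spherical_harmonic_features(23.0, 24.0, L=10)  # 100 basis values
fine = spherical_harmonic_features(23.0, 24.0, L=40)    # 1600 basis values
```

A small network then maps these basis values to the embedding, which matches the coarse L=10 vs finer L=40 trade-off above: more degrees, more resolution, more inputs.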
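Finally, the "fine-tuned to predict Air Temperature / Elevation / ..." step is basically: freeze the pretrained location encoder, embed the coordinates of your labelled points, and fit a small head on top. A hedged sketch, with `pretrained_location_encoder` standing in for whatever checkpoint you load from the repo above (check the repo README for the actual loading API):

```python
import torch
import torch.nn as nn

def fit_probe(pretrained_location_encoder, coords, targets, epochs=200, lr=1e-3):
    """Regress a geolocated variable (e.g. air temperature) from frozen location embeddings."""
    pretrained_location_encoder.eval()
    with torch.no_grad():
        emb = pretrained_location_encoder(coords)  # (N, D) embeddings, encoder kept frozen
    head = nn.Sequential(nn.Linear(emb.shape[1], 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(head(emb).squeeze(-1), targets)
        loss.backward()
        opt.step()
    return head
```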
-
Thanks for tagging me @calebrob6! And thanks everyone for checking out our paper. Let me know if there are any questions. A few quick thoughts:
-
@calebrob6 et al just released SatCLIP. I've only skimmed it so far but posting here to see what we can learn for, probably, Clay v1.
Let's also consider brainstorming with Caleb at some point to get his view. When I was leading AI for Earth and he was on AI for Good, he always had really good insights (e.g. what became Microsoft PEARL grew out of an MVP he made for one of his projects).
cc @yellowcap @srmsoumya @weiji14