-
I have a question about the location and timestep information added to the input sequence and how it might relate to using the encoder outputs for calculating similarity between chips. For my use case, I'm interested in comparing the similarity between chips based only on the pixel values in the data cube (spectral, SAR and DEM). My understanding is that by including the location and timestep in the input sequence, the encoder outputs capture more than just the pixel values.

I'm new to the transformer architecture, so I did an experiment to make sure I was thinking about this right. Some of the tutorials suggest using parts of the embedding to either focus on or exclude the location and timestep (for example: https://clay-foundation.github.io/model/tutorial_digital_earth_pacific_patch_level.html). What I wanted to check was whether excluding the last 2 vectors in the embedding would eliminate location and timestep information.

Even when excluding the last two vectors in the embedding, the embeddings look different, although not very different. Maybe this means that the first 1536 vectors are mostly capturing information in the pixels, and it's okay to use them for calculating similarity?
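For reference, here is roughly what my comparison looks like (a minimal sketch, not the tutorial code; the embedding shape and function names are my assumptions for v0.1, where the last 2 vectors are the lat/lon and time embeddings):

```python
import numpy as np

# Assumed shape: `emb` is the encoder output for one chip, e.g. (1538, 768) --
# 1536 patch embeddings plus the lat/lon and time embeddings at the end.
def chip_vector(emb, drop_latlon_time=True):
    """Mean-pool the patch embeddings into a single vector per chip."""
    if drop_latlon_time:
        emb = emb[:-2]           # exclude the last 2 vectors (lat/lon + time)
    return emb.mean(axis=0)      # (768,)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g. sim = cosine_similarity(chip_vector(emb_a), chip_vector(emb_b))
```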
-
That is correct. Clay is designed to be aware of time and location. The idea is that "AI for Earth" should be aware of Earth, not of images untethered from it like other image models (the intuition is that temporal and spatial proximity are relevant, e.g. Madrid doesn't move from day to day).

If you have a use case where that is not what you want, my hunch is that the easiest thing to do is to drop those inputs. All inputs will be optional in V1, due in May. That is not the case for v0.1, which you used here: v0.1, and the v0.2 we will release in a week or so, do need complete inputs.

The "good news" is that we estimate the model is not really paying much attention to location or time currently. The reason is that the input coverage is just too coarse: the percentage of space and time we cover is extremely small, so the model doesn't have much opportunity to leverage these dependencies. We are still learning to understand the model.

If the above is all correct, my take for your need is that the embeddings do change in all dimensions, but VERY slightly, especially if the randomly set locations happen to be close to those in the training set, which might bias the model towards what it "expects" to see. I do not understand how the last 2 dimensions would contain the location and time (it might be the case, but I personally do not understand how). Makes sense?
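If you want to quantify how much the location and time inputs actually move the embeddings, one rough check (a sketch only; `encode` is a placeholder for however you run the v0.1 encoder, not the actual Clay API) is to embed the same chip with its true metadata and with randomized metadata, then compare:

```python
import numpy as np

def location_time_sensitivity(encode, pixels, latlon, time, seed=0):
    """Compare embeddings of one chip with true vs. randomized lat/lon and time.

    `encode(pixels, latlon, time)` stands in for the real encoder call and is
    assumed to return a 1-D embedding vector.
    """
    rng = np.random.default_rng(seed)
    emb_true = encode(pixels, latlon, time)
    emb_rand = encode(pixels,
                      rng.uniform([-90.0, -180.0], [90.0, 180.0]),  # random lat/lon
                      rng.uniform(0.0, 1.0))                        # placeholder timestep encoding
    diff = np.abs(emb_true - emb_rand)
    cos = np.dot(emb_true, emb_rand) / (np.linalg.norm(emb_true) * np.linalg.norm(emb_rand))
    return diff.mean(), diff.max(), float(cos)
```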