-
From my experience, this is the most important question for all of these embedding-based search products. Excited to see what you can come up with! One path that I started down, but never worked through, was ColBERT. A number of potentially useful utilities here: https://github.com/raphaelsty/neural-cherche
-
Semi-outdated, but here are some potentially interesting references. Some folks at Descartes made some progress here, but note that I think they focused more on the engineering (how to scale up the search when traditional vector algebra is too slow). I'm not sure if you will hit (or have already hit) that issue yet, but posting here regardless in case there are other useful lessons learned.
I'm less familiar with the cutting-edge vector database work that's undergirding some of the recent LLM/RAG advances, but it's possible there is some fruitful advice in that literature. They are developing techniques for exploring/manipulating the semantic embedding space when doing retrievals.
-
I'm not sure if this directly applies, as they aren't directly creating embeddings from what I understand, but EPFL's recent paper on Meta-learning to address diverse Earth observation problems across resolutions has some interesting points.
They only deal with a much smaller set of known classes though, so a simple random forest might be enough in their case.
-
@konstantinklemmer rightfully points to their SatCLIP paper, and the fact that we can directly compute similarities of embeddings of pure locations, trained on a network that was fed Sentinel-2 data. We reviewed the paper here #57. My hunch is that this is great for overall picture similarity, where context is most important (type of forest, or road, ...), but I suspect that we will need a pixel embedding to point to specific semantics that exist across contexts, like houses or rivers, ... Perhaps a SatCLIP fed NAIP and LINZ would expand to those semantics? In practice, it might make sense to add the SatCLIP location embedding and weight it into the mix?
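For the weighting idea, a minimal sketch of how the two similarity signals could be blended; the `alpha` weight and array names are hypothetical, just for illustration:

```python
import numpy as np

def blended_scores(img_sims: np.ndarray, loc_sims: np.ndarray, alpha: float = 0.8) -> np.ndarray:
    """Blend image-embedding similarity with location-embedding similarity.

    img_sims: (n_candidates,) cosine similarities from the image embeddings.
    loc_sims: (n_candidates,) cosine similarities from the location embeddings.
    alpha:    how much to trust the image signal over the location signal
              (hypothetical knob, would need tuning).
    """
    return alpha * img_sims + (1.0 - alpha) * loc_sims
```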
-
Context
In Earth Observation (EO) data, identifying similar features across satellite images is crucial. Embeddings are vector summaries of the images with great potential for this. In text applications, embeddings are often compared via "cosine similarity", but in EO I believe we have unique challenges and opportunities we should explore.
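For reference, a minimal numpy sketch of the cosine-similarity retrieval we borrow from text applications (names are illustrative):

```python
import numpy as np

def rank_by_cosine(query: np.ndarray, candidates: np.ndarray, top_k: int = 10) -> np.ndarray:
    """Return indices of the top_k candidates most similar to the query embedding.

    query:      (d,) embedding vector.
    candidates: (n, d) matrix of candidate embeddings.
    """
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = c @ q  # cosine similarity of every candidate to the query
    return np.argsort(-scores)[:top_k]
```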
UX consideration
Many of our use cases prioritize user experience, so we are less tolerant of false positives than of false negatives. That is, we would rather miss relevant cases than show irrelevant ones: precision > recall.
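As a sketch of that trade-off, one could keep only candidates above a strict similarity cutoff instead of always returning the top k (the threshold value here is hypothetical):

```python
import numpy as np

def high_precision_hits(scores: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Keep only candidates whose similarity clears a strict cutoff.

    Raising the threshold drops borderline matches (fewer false positives)
    at the cost of missing some relevant ones (more false negatives).
    """
    return np.flatnonzero(scores >= threshold)
```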
Why EO embeddings are different:
Semantics can compose spatially: sand + water = beach.
But the average of a house might be a square, not a house.
Status quo
We've seen and explored different current options, taken mostly from NLP:
Simple positive Average
Average the positive embeddings, hoping to average out unintended semantics.
This also works if we purposefully combine intended semantics, e.g. we add a desert tile to a water tile to find beach coasts.
Pseudo Code
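A minimal numpy sketch of this method, assuming embeddings are rows of arrays (names are illustrative):

```python
import numpy as np

def positive_average_query(positives: np.ndarray, candidates: np.ndarray, top_k: int = 10) -> np.ndarray:
    """Average the positive embeddings into one query, then rank by cosine similarity.

    positives:  (n_pos, d) embeddings of the positive examples.
    candidates: (n, d) embeddings of the search corpus.
    """
    query = positives.mean(axis=0)  # averaging should wash out unintended semantics
    query /= np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return np.argsort(-(c @ query))[:top_k]
```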
Assessment
Very basic; there should be room for much improvement. It needs a lot of samples to weed out common associations of semantics.
Simple Net Average
Average the positive embeddings and subtract the average of the negative samples, hoping to average out unintended semantics and remove undesired ones.
Note: It is unclear to me if one can have "net 0" semantics, e.g.
+house - house = 0
Pseudo Code
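A minimal sketch in the same style; the neg_weight knob for down-weighting the negatives is illustrative, not something we've settled on:

```python
import numpy as np

def net_average_query(positives: np.ndarray, negatives: np.ndarray,
                      candidates: np.ndarray, neg_weight: float = 1.0,
                      top_k: int = 10) -> np.ndarray:
    """Average positives, subtract the (optionally down-weighted) average of negatives.

    With neg_weight = 1.0 the negatives are removed at full strength;
    smaller values only nudge the query away from them.
    """
    query = positives.mean(axis=0) - neg_weight * negatives.mean(axis=0)
    query /= np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return np.argsort(-(c @ query))[:top_k]
```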
Common Similar:
The goal here is to find the embeddings most similar to all the positive samples, subtracting those most similar to the negatives.
This method doesn't use averages, so it doesn't traverse to locations in the embedding space outside of real samples. It basically weights all candidates by how close they are to each positive sample and how far from each negative one.
Note: this method scales really poorly.
Pseudo Code
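A minimal sketch of one reading of this, scoring each candidate by its mean distance to the negatives minus its mean distance to the positives (pairwise distances rather than a single averaged query, so we never leave real samples; the full n_candidates × (n_pos + n_neg) distance matrix is also why it scales poorly):

```python
import numpy as np
from scipy.spatial.distance import cdist

def common_similar_query(positives: np.ndarray, negatives: np.ndarray,
                         candidates: np.ndarray, top_k: int = 10) -> np.ndarray:
    """Score every candidate against every positive and negative sample.

    A candidate ranks highly if it sits close to all positives and
    far from all negatives; no averaged query point is ever formed.
    """
    pos_dist = cdist(candidates, positives)  # (n, n_pos) pairwise distances
    neg_dist = cdist(candidates, negatives)  # (n, n_neg)
    scores = neg_dist.mean(axis=1) - pos_dist.mean(axis=1)
    return np.argsort(-scores)[:top_k]
```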
Other methods we've tried
There are plenty of other methods we've tried.