-
Hey Clay team! I've been following the rapidly growing set of use cases. (Kudos, by the way, these are very cool and very helpful examples!) I was curious to hear whether anyone on the team has considered developing a simple regression or classification model on top of Clay's embeddings, but, critically for my use case, at the resolution of the input imagery rather than the resolution of the patches, which are 8x the scale of the inputs. For instance, with Sentinel-2 imagery, we'd be targeting 10m per-pixel predictions rather than something that is effectively 80m (patch resolution). The classification examples that exist on the site seem to use the class embeddings for the whole scene. Two options come to mind: (1) fine-tune with a segmentation head to get dense predictions, or (2) compute embeddings for shifted windows and reassemble them at pixel resolution.

Re: option 1, there are some prediction tasks where we won't have labels of the kind needed for segmentation. Moreover, a full-blown segmentation approach might be more complicated than we need if Clay's embeddings already capture the essence of what's going on in, and around, a pixel.

Re: option 2, I've taken a stab at this logic (by computing the embeddings for a bunch of shifted windows and then reassembling the results), but wasn't sure if perhaps there's an easier way (or if this might be wrong-headed to begin with). Here's what option 2 looks like, for reference:
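In lieu of pasting the full notebook, here is a minimal sketch of the shifted-window logic. The `get_patch_embeddings` helper, the 224-pixel chip size, and the 8-pixel patch size are stand-ins/assumptions for however the Clay encoder is actually run, not its real API:

```python
import numpy as np

CHIP = 224   # chip size fed to the encoder (pixels)
PATCH = 8    # patch size, i.e. how much coarser the embeddings are than the inputs

def get_patch_embeddings(chip: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in: run the Clay encoder on a (bands, 224, 224) chip
    and return its patch embeddings reshaped to (28, 28, D)."""
    raise NotImplementedError

def shifted_window_embeddings(image: np.ndarray) -> np.ndarray:
    """image: (bands, >=231, >=231) Sentinel-2 array, slightly larger than one chip.
    Returns a (224, 224, D) array of 'smooth' per-pixel (10m) embeddings."""
    out = None
    for dy in range(PATCH):                      # shift bottom-to-top, one pixel at a time
        for dx in range(PATCH):                  # shift left-to-right, one pixel at a time
            chip = image[:, dy:dy + CHIP, dx:dx + CHIP]
            emb = get_patch_embeddings(chip)     # (28, 28, D) for this offset
            if out is None:
                out = np.zeros((CHIP, CHIP, emb.shape[-1]), dtype=emb.dtype)
            # Each patch embedding lands on the pixel at its offset within the
            # shifted grid, so every output pixel gets exactly one embedding.
            out[dy::PATCH, dx::PATCH] = emb
    return out
```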
I'll take any guidance or strategies you might have to offer! I noticed that @yellowcap said something perhaps related in the discussion on #231, though I may be taking that well out of context. Thanks again for all of your monumental efforts here!
Replies: 3 comments
-
@lzachmann The Clay model captures all of the features from the image. As we see in this section of the tutorial, each embedding vector captures a unique property of the image. For example, embedding 97 is good at segmenting land and water, while embedding 207 might be good at detecting shorelines. Some embeddings have a clear visual meaning, while others might be too complex for human eyes but still capture underlying representations.

In the regression and segmentation examples, we extract intermediate feature maps from the model (as in the case of U-Nets), upsample them, and fuse them for those tasks.

To answer your query: in cases where you might not have labels for segmentation, you are basically looking to cluster similar features together (am I right in assuming this?). One way might be to experiment with different embedding feature maps and find one that fits your use case. For instance, if you want to detect solar panels and a particular embedding does that, you pick that embedding dimension and work with its feature map.
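As a rough illustration, picking one embedding dimension and upsampling its feature map to the input grid might look like the following. The tensor shapes and the dimension index are placeholders, not Clay's actual API:

```python
import torch
import torch.nn.functional as F

# Hypothetical patch-level embeddings from the encoder for one chip:
# shape (1, 28 * 28, 768) -- 28 x 28 patches, 768-dim embedding per patch.
patch_embeddings = torch.randn(1, 28 * 28, 768)

# Pick a single embedding dimension (e.g. one that happens to separate the
# classes you care about) and reshape it into a 28 x 28 feature map.
dim = 97
feature_map = patch_embeddings[0, :, dim].reshape(1, 1, 28, 28)

# Upsample the patch-resolution map (~80m for Sentinel-2) back to the
# 224 x 224 input grid (~10m). Bilinear interpolation keeps it smooth,
# but the underlying information is still patch-resolution.
upsampled = F.interpolate(feature_map, size=(224, 224), mode="bilinear", align_corners=False)

# Threshold (or cluster) the upsampled map to get a rough per-pixel mask.
mask = (upsampled > upsampled.mean()).squeeze()
```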
-
Thank you @srmsoumya! (And sorry for the delayed response.) This is helpful.

Re: whether I'm basically looking to cluster similar features together: not exactly. I was thinking more about a use case involving land cover data like GLanCE. Unlike the Chesapeake Bay dataset, which provides labels as imagery (every pixel in a given chip has a corresponding label), GLanCE gives us land cover at individual points (lat/lon), which rules out a segmentation-based approach. My original question concerned whether it is possible to fine-tune Clay using Sentinel-2 on a dataset like GLanCE, but make 10m-resolution predictions. I suppose it's superficially similar to the classification example, but rather than classify an entire 224x224 chip as a given land cover type, I'm hoping to get per-pixel predictions (similar to what you see in the segmentation outputs). My thought was to use something simple like a Random Forest to make those per-pixel predictions. However, merely up-sampling the 28x28 feature maps would produce something that is nominally 10m but, to your point, would appear pixelated.

The image I shared was meant to convey one potential answer to that issue. It shows a single embedding dimension (chosen at random). To make it, I loaded Sentinel-2 imagery (somewhat more than I need for a chip) and computed the embeddings for shifted subsets of that imagery: basically, start at the lower left, compute embeddings, store the results, move right one pixel, clip, recompute the embeddings, and so on. We do this eight times left-to-right and eight times bottom-to-top, then stitch all of the results together to get 'smooth' 10m embeddings for an area of interest. I can share more in the way of code at some point if that would be helpful, but just wanted to say thanks for your advice thus far!
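For what it's worth, here is a minimal sketch of the Random Forest idea on top of per-pixel embeddings. Everything below is a placeholder: in practice the embeddings would come from the shifted-window stitching described above, and the point labels and their pixel locations from GLanCE:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder inputs (illustrative shapes and random values only):
#   pixel_embeddings: (H, W, D) per-pixel embeddings for an area of interest
#   rows, cols:       pixel indices of the labeled points within that array
#   labels:           (N,) land cover class for each point
H, W, D = 224, 224, 768
pixel_embeddings = np.random.rand(H, W, D).astype("float32")
rows = np.random.randint(0, H, size=500)
cols = np.random.randint(0, W, size=500)
labels = np.random.randint(0, 7, size=500)

# Train on the embeddings at the labeled point locations only.
X_train = pixel_embeddings[rows, cols]          # (N, D)
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)
clf.fit(X_train, labels)

# Predict a land cover class for every pixel in the area of interest.
X_all = pixel_embeddings.reshape(-1, D)
pred_map = clf.predict(X_all).reshape(H, W)     # (H, W) per-pixel classes
```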
-
Thank you for explaining the visuals! The smallest chip size that Clay can handle is 8 x 8, so that is the native resolution at which you can get the embeddings. For your use case, you might consider looking into pixel-based models as an option. I recommend checking out PRESTO.