RADIOv2.5 Tech Report

This is a tech report for the early-access release of the RADIOv2.5 model family. We plan to publish full papers on the techniques behind this release at upcoming conferences, but wanted to share the latest models with the community as soon as possible.

On 7.22.24 we are releasing ViT-B/16 and ViT-L/16 pretrained models. Under the hood, we've made a bunch of improvements to the training algorithms to produce these models. Fortunately, the API remains exactly the same!

Update: On 10.2.24 we are also releasing a ViT-H/16 model, which is now our best offering!

Usage

```python
import torch

model = torch.hub.load(
    'NVlabs/RADIO', 'radio_model',
    version='radio_v2.5-h',  # Can also be 'radio_v2.5-b' for ViT-B, or 'radio_v2.5-l' for ViT-L
    force_reload=True,  # Make sure you set this to True the first time you request any of these models
)
```
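Once loaded, the model runs directly on RGB tensors scaled to [0, 1]. The sketch below assumes the forward pass returns a (summary, spatial_features) tuple, per the repository's documented usage; the random input is just a stand-in for a real image.

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = torch.hub.load('NVlabs/RADIO', 'radio_model', version='radio_v2.5-h')
model.to(device).eval()

# Batch of one 512x512 RGB image with values in [0, 1] (stand-in for a real image).
x = torch.rand(1, 3, 512, 512, device=device)

with torch.no_grad():
    # The forward pass returns a global summary vector and per-patch spatial features.
    summary, spatial_features = model(x)

print(summary.shape)           # (1, summary_dim)
print(spatial_features.shape)  # (1, num_patches, feature_dim)
```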

What's New?

Smaller Models

First off, our previous releases have all been ViT-H models. While ViT-H is a very powerful architecture, we've heard from the community that there is a need for smaller VFMs (Visual Foundation Models). With this release, we're providing ViT-B/16 and ViT-L/16 models that still achieve very strong quality while being much smaller and faster than ViT-H. In fact, we're so confident in our ViT-L/16 model (RADIOv2.5-L) that we think you should use it instead of RADIOv2.

Mode Switching

A major issue we identified in the paper is that RADIO was "mode switching" based on the input resolution of the image. In effect, below approximately 704px, it ran in a "CLIP + DINOv2" mode in which the features closely matched those two teachers but bore no resemblance to SAM's. Above 720px, RADIO switched modes to produce features relevant to SAM, but suddenly became incapable of modeling CLIP or DINOv2.

This showed up in strange ways. For example, zero-shot classification at high resolution would degrade to random guessing (0.1% top-1 on ImageNet-1k). It also meant that our high-resolution results when integrated into a VLLM (e.g., LLaVA 1.5 / Vicuna 7B) were similarly poor. Starting with RADIOv2.5, we've solved the mode-switching problem, and these models are now truly capable of processing any input resolution without surprising changes in behavior. In fact, RADIOv2.5 loves high resolution: our best classification and VLLM results come from resolutions of 768px and above.

As in the paper, we plot the MSE between the DINOv2-g-reg features and the RADIO features at various resolutions. While RADIOv2 (owing to its ViT-H backbone) achieves lower MSE at lower resolutions, you can see a huge spike in error at 720px from which it never recovers. This is how we quantified the mode switch. We can also visualize the phenomenon:

You can see how the RADIOv2 image (left) abruptly changes representations at 720px, whereas DINOv2 and the RADIOv2.5 models (middle, right) remain consistent and instead produce increasingly fine-grained details. We can also see how RADIOv2 works in reverse with the SAM head: the low-resolution inputs don't produce features that are SAM-like at all, and only at 1024px does RADIOv2 start to produce reasonable SAM features. In contrast, RADIOv2.5-L produces SAM-like features at any resolution, and arguably does a better job of extrapolating to 2048px.
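For reference, a curve like the one above can be produced with a sketch along these lines. The adaptor name 'dino_v2' and the dict-style output keyed by adaptor name are assumptions based on the repo's adaptor mechanism; check the repository for the exact names.

```python
import torch
import torch.nn.functional as F

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# RADIO with a DINOv2 adaptor head. NOTE: the adaptor name 'dino_v2' and the
# dict-style output below are assumptions; check the repo for exact names.
radio = torch.hub.load('NVlabs/RADIO', 'radio_model',
                       version='radio_v2.5-l', adaptor_names=['dino_v2'])
radio.to(device).eval()

# The DINOv2-g-reg teacher, from the official torch.hub entry point.
dino = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_reg')
dino.to(device).eval()

img = torch.rand(1, 3, 1024, 1024, device=device)  # stand-in for a real image;
# real use would apply each model's expected input normalization.

for res in (224, 432, 512, 768, 1024):
    # The patch sizes differ (16 vs. 14), so snap each input to a valid multiple.
    r16, r14 = res // 16 * 16, res // 14 * 14
    x16 = F.interpolate(img, (r16, r16), mode='bilinear')
    x14 = F.interpolate(img, (r14, r14), mode='bilinear')

    with torch.no_grad():
        _, radio_feat = radio(x16)['dino_v2']  # (1, tokens, C) spatial features
        dino_feat = dino.forward_features(x14)['x_norm_patchtokens']

    # Reshape the token sequences into (B, C, H, W) grids and align the grids.
    rg = radio_feat.permute(0, 2, 1).reshape(1, -1, r16 // 16, r16 // 16)
    dg = dino_feat.permute(0, 2, 1).reshape(1, -1, r14 // 14, r14 // 14)
    dg = F.interpolate(dg, rg.shape[-2:], mode='bilinear')

    print(res, F.mse_loss(rg, dg).item())
```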

Just as mode switching is directly observable in the spatial features, it also caused issues with the summary features, as can be seen in zero-shot classification:

Zero-shot ImageNet-1k top-1 accuracy (%):

| Resolution | RADIOv2.1 | RADIOv2.5-B | RADIOv2.5-L | RADIOv2.5-H |
|------------|-----------|-------------|-------------|-------------|
| 224        | 78.892    | 62.344      | 74.852      | 79.158      |
| 256        | 80.780    | 68.892      | 78.220      | 80.156      |
| 336        | 82.320    | 72.626      | 80.004      | 81.426      |
| 432        | 82.800    | 73.628      | 80.460      | 81.944      |
| 512        | 82.882    | 73.894      | 80.542      | 82.162      |
| 768        | 1.292     | 74.386      | 80.804      | 82.088      |
| 1024       | 0.204     | 74.280      | 80.886      | 82.304      |

| Resolution       | RADIOv2.1 | RADIOv2.5-B | RADIOv2.5-L | RADIOv2.5-H |
|------------------|-----------|-------------|-------------|-------------|
| 512 - ViTDet 16  | 82.370    | 70.488      | 78.102      | 80.058      |
| 1024 - ViTDet 16 | 0.192     | 72.182      | 79.878      | 81.834      |

Not only do the RADIOv2.5 models allow classification at any resolution, they also support running in ViTDet mode (windowed attention, here with window size 16) with only a small drop in accuracy.
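For completeness, here's a minimal sketch of loading the model in ViTDet mode. The `vitdet_window_size` argument is an assumption based on the "ViTDet 16" rows above; verify the exact parameter name against the repository before relying on it.

```python
import torch

# ViTDet mode uses windowed attention in most blocks, cutting the cost of
# high-resolution inference. The `vitdet_window_size` argument is our reading
# of the "ViTDet 16" rows above; verify the exact name against the repository.
model = torch.hub.load('NVlabs/RADIO', 'radio_model',
                       version='radio_v2.5-l', vitdet_window_size=16)
model.eval()

# Input sides should tile evenly into windows: 16 patches * 16px = 256px each.
x = torch.rand(1, 3, 1024, 1024)
with torch.no_grad():
    summary, spatial_features = model(x)
```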

An important implication of fixing mode switching is that it's now possible to ask for both the CLIP and SAM features of a given high-res image simultaneously and get meaningful results for both. Or you might want the high-res DINOv2 spatial features as well as the summary token (for classification) for the same image. This wasn't possible with the RADIOv2 model, which couldn't represent CLIP (or DINO) and SAM at the same time, but it works with the v2.5 models.
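As a sketch of what this enables, the snippet below loads the model with both the CLIP and SAM adaptor heads attached. The `adaptor_names` argument and the dict output keyed by adaptor name follow our reading of the repo's examples, so treat the exact structure as an assumption.

```python
import torch

# Attach both the CLIP and SAM adaptor heads to the backbone (assumed API).
model = torch.hub.load('NVlabs/RADIO', 'radio_model',
                       version='radio_v2.5-l', adaptor_names=['clip', 'sam'])
model.eval()

x = torch.rand(1, 3, 1024, 1024)  # one high-res image in [0, 1]
with torch.no_grad():
    out = model(x)

# With mode switching fixed, both heads are meaningful for the same input:
clip_summary, clip_spatial = out['clip']  # e.g. for zero-shot classification
sam_summary, sam_spatial = out['sam']     # e.g. as a drop-in SAM image encoder
```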

LLaVA 1.5 + Vicuna 7B

Last but not least, we tested out the models at various resolutions within LLaVA 1.5 + Vicuna 7B:

| Model       | Resolution | GQA (Val) | GQA (TestDev) | TextVQA* (Tokens) | TextVQA* (No Tokens) | POPE  | VQAv2 |
|-------------|------------|-----------|---------------|-------------------|----------------------|-------|-------|
| RADIOv2.1   | 432        | 71.70     | 63.01         | 56.32             | 42.03                | 86.20 | 79.28 |
| RADIOv2.5-B | 432        | 70.49     | 62.09         | 52.13             | 32.43                | 85.87 | 77.24 |
| RADIOv2.5-B | 512        | 71.08     | 62.70         | 54.36             | 36.39                | 86.59 | 78.03 |
| RADIOv2.5-B | 768        | 71.99     | 63.31         | 56.93             | 43.96                | 87.54 | 79.22 |
| RADIOv2.5-L | 432        | 71.57     | 62.89         | 56.71             | 42.34                | 86.13 | 79.44 |
| RADIOv2.5-L | 512        | 72.04     | 63.58         | 58.52             | 46.50                | 86.66 | 80.04 |
| RADIOv2.5-L | 768        | 72.91     | 64.13         | 61.93             | 53.95                | 87.68 | 81.02 |
| RADIOv2.5-H | 432        | 73.22     | 64.91         | 58.66             | 47.61                | 85.95 | 80.49 |
| RADIOv2.5-H | 512        | 73.60     | 64.98         | 60.03             | 51.99                | 86.73 | 80.96 |
| RADIOv2.5-H | 768        | 74.04     | 65.03         | 62.39             | 56.93                | 87.36 | 81.56 |

*By default, TextVQA adds detected OCR tokens into the context of the LLM. Because we're interested in how well the vision encoder itself is able to represent text, we study TextVQA both with (Tokens) and without (No Tokens) these tokens.

SigLIP

SigLIP is an extraordinary ViT-L model, and we've added it as a teacher in the latest release. If you'd like to use RADIO's adaptor for it, you can get it using the 'siglip' adaptor name. For example, in the examples/zero_shot_imagenet.py script, you'd pass --adaptor-name siglip as an argument to use SigLIP instead of the default DFN CLIP.

The specific SigLIP version we're using is ViT-SO400M-14-SigLIP-384 found in the OpenCLIP library.
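If you'd like to run zero-shot classification against the SigLIP head programmatically, a minimal sketch might look like the following. The `adaptor_names` argument and the dict-style output keyed by adaptor name are assumptions based on the repo's examples, and the prompts are purely illustrative.

```python
import torch
import torch.nn.functional as F
import open_clip

# RADIO with the SigLIP adaptor head attached (adaptor name from the text above;
# the dict-style output keyed by adaptor name is an assumption).
radio = torch.hub.load('NVlabs/RADIO', 'radio_model',
                       version='radio_v2.5-l', adaptor_names=['siglip'])
radio.eval()

# Text side: the matching SigLIP text tower from the OpenCLIP library.
siglip, _, _ = open_clip.create_model_and_transforms(
    'ViT-SO400M-14-SigLIP-384', pretrained='webli')
tokenizer = open_clip.get_tokenizer('ViT-SO400M-14-SigLIP-384')

labels = ['a photo of a cat', 'a photo of a dog']  # illustrative prompts
with torch.no_grad():
    text_emb = siglip.encode_text(tokenizer(labels))
    img_emb, _ = radio(torch.rand(1, 3, 512, 512))['siglip']  # summary vector

# Cosine similarity between the image summary and each text embedding.
scores = F.normalize(img_emb, dim=-1) @ F.normalize(text_emb, dim=-1).T
print(scores)
```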

Zero-shot ImageNet-1k top-1 accuracy (%) using the SigLIP head:

| Resolution | RADIOv2.5-B | RADIOv2.5-L | RADIOv2.5-H |
|------------|-------------|-------------|-------------|
| 224        | 58.670      | 72.492      | 76.796      |
| 256        | 65.190      | 75.962      | 77.984      |
| 336        | 69.110      | 77.830      | 79.400      |
| 432        | 70.276      | 78.582      | 79.990      |
| 512        | 70.694      | 78.828      | 80.258      |
| 768        | 71.102      | 78.930      | 80.172      |
| 1024       | 70.900      | 78.922      | 80.476      |

As can be seen, the classification results using the SigLIP head are slightly worse than those using DFN CLIP, so we suggest defaulting to DFN CLIP unless you specifically need SigLIP compatibility.

Videos!

While RADIOv2 may produce a visually pleasing video at 1024px, you can clearly see how it switches modes between low and high resolution. All models exhibit strong temporal stability.

RADIOv2

- 512px: maggie_hop_512_sbs.mp4
- 1024px: maggie_hop_1024_sbs.mp4

RADIOv2.5-B

- 512px: maggie_hop_512_sbs.mp4
- 1024px: maggie_hop_1024_sbs.mp4

RADIOv2.5-L

- 512px: maggie_hop_512_sbs.mp4
- 1024px: maggie_hop_1024_sbs.mp4

RADIOv2.5-H

- 512px
- 1024px