RADIOv2.5 Tech Report

This is a tech report for the early-access release of the RADIOv2.5 model family. We plan to publish full papers on the techniques behind this release at upcoming conferences, but wanted to share the latest models with the community as soon as possible.

On 7.22.24 we are releasing ViT-B/16 and ViT-L/16 pretrained models. Under the hood, we've made a bunch of improvements to the training algorithms to produce these models. Fortunately, the API remains exactly the same!

Update: On 10.2.24 we are also releasing a ViT-H/16 model, which is now our best offering!

Usage

```python
import torch

model = torch.hub.load(
    'NVlabs/RADIO', 'radio_model',
    version='radio_v2.5-h',  # Can also be 'radio_v2.5-b' for ViT-B, or 'radio_v2.5-l' for ViT-L
    force_reload=True,  # Make sure you set this to True the first time you request any of these models
)
```
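Once loaded, the model runs directly on RGB tensors scaled to [0, 1]. The sketch below assumes the forward pass returns a (summary, spatial_features) tuple, per the repository's documented usage; the random input is just a stand-in for a real image.

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = torch.hub.load('NVlabs/RADIO', 'radio_model', version='radio_v2.5-h')
model.to(device).eval()

# Batch of one 512x512 RGB image with values in [0, 1] (stand-in for a real image).
x = torch.rand(1, 3, 512, 512, device=device)

with torch.no_grad():
    # The forward pass returns a global summary vector and per-patch spatial features.
    summary, spatial_features = model(x)

print(summary.shape)           # (1, summary_dim)
print(spatial_features.shape)  # (1, num_patches, feature_dim)
```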

What's New?

Smaller Models

First off, our previous releases have all been ViT-H models. While ViT-H is a very powerful architecture, we've heard from the community that there is a need for smaller VFMs (Visual Foundation Models). With this release, we're providing ViT-B/16 and ViT-L/16 models that still achieve very strong quality while being much smaller and faster than ViT-H. In fact, we're so confident in our ViT-L/16 model (RADIOv2.5-L) that we think you should use it instead of RADIOv2.

Mode Switching

A major issue we identified in the paper is that RADIO was "mode switching" based on the input resolution of the image. In effect, below approximately 704px, it ran in a "CLIP + DINOv2" mode in which the features closely matched those two teachers but bore no resemblance to SAM's. Above 720px, RADIO switched modes to produce features relevant to SAM, but suddenly became incapable of modeling CLIP or DINOv2.

This showed up in strange ways. For example, zero-shot classification at high resolution would degrade to random guessing (0.1% top-1 on ImageNet-1k). It also meant that our high-resolution results when integrated into a VLLM (e.g., LLaVA 1.5 / Vicuna 7B) were similarly poor. Starting with RADIOv2.5, we've solved the mode-switching problem, and these models are now truly capable of processing any input resolution without surprising changes in behavior. In fact, RADIOv2.5 loves high resolution: our best classification and VLLM results come from resolutions of 768px and above.

As in the paper, we plot the MSE between the DINOv2-g-reg features and the RADIO features at various resolutions. While RADIOv2 (owing to its ViT-H backbone) achieves lower MSE at lower resolutions, you can see a huge spike in error at 720px from which it never recovers. This is how we quantified the mode switch. We can also visualize the phenomenon:

You can see how the RADIOv2 image (left) abruptly changes representations at 720px, whereas DINOv2 and the RADIOv2.5 models (middle, right) remain consistent and instead produce increasingly fine-grained details. We can also see how RADIOv2 works in reverse with the SAM head: the low-resolution inputs don't produce features that are SAM-like at all, and only at 1024px does RADIOv2 start to produce reasonable SAM features. In contrast, RADIOv2.5-L produces SAM-like features at any resolution, and arguably does a better job of extrapolating to 2048px.
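For reference, a curve like the one above can be produced with a sketch along these lines. The adaptor name 'dino_v2' and the dict-style output keyed by adaptor name are assumptions based on the repo's adaptor mechanism; check the repository for the exact names.

```python
import torch
import torch.nn.functional as F

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# RADIO with a DINOv2 adaptor head. NOTE: the adaptor name 'dino_v2' and the
# dict-style output below are assumptions; check the repo for exact names.
radio = torch.hub.load('NVlabs/RADIO', 'radio_model',
                       version='radio_v2.5-l', adaptor_names=['dino_v2'])
radio.to(device).eval()

# The DINOv2-g-reg teacher, from the official torch.hub entry point.
dino = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_reg')
dino.to(device).eval()

img = torch.rand(1, 3, 1024, 1024, device=device)  # stand-in for a real image;
# real use would apply each model's expected input normalization.

for res in (224, 432, 512, 768, 1024):
    # The patch sizes differ (16 vs. 14), so snap each input to a valid multiple.
    r16, r14 = res // 16 * 16, res // 14 * 14
    x16 = F.interpolate(img, (r16, r16), mode='bilinear')
    x14 = F.interpolate(img, (r14, r14), mode='bilinear')

    with torch.no_grad():
        _, radio_feat = radio(x16)['dino_v2']  # (1, tokens, C) spatial features
        dino_feat = dino.forward_features(x14)['x_norm_patchtokens']

    # Reshape the token sequences into (B, C, H, W) grids and align the grids.
    rg = radio_feat.permute(0, 2, 1).reshape(1, -1, r16 // 16, r16 // 16)
    dg = dino_feat.permute(0, 2, 1).reshape(1, -1, r14 // 14, r14 // 14)
    dg = F.interpolate(dg, rg.shape[-2:], mode='bilinear')

    print(res, F.mse_loss(rg, dg).item())
```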

Just as mode switching is directly observable in the spatial features, it also caused issues with the summary features, as can be seen in zero-shot classification:

Zero-shot ImageNet-1k top-1 accuracy (%):

| Resolution | RADIOv2.1 | RADIOv2.5-B | RADIOv2.5-L | RADIOv2.5-H |
|------------|-----------|-------------|-------------|-------------|
| 224        | 78.892    | 62.344      | 74.852      | 79.158      |
| 256        | 80.780    | 68.892      | 78.220      | 80.156      |
| 336        | 82.320    | 72.626      | 80.004      | 81.426      |
| 432        | 82.800    | 73.628      | 80.460      | 81.944      |
| 512        | 82.882    | 73.894      | 80.542      | 82.162      |
| 768        | 1.292     | 74.386      | 80.804      | 82.088      |
| 1024       | 0.204     | 74.280      | 80.886      | 82.304      |

| Resolution       | RADIOv2.1 | RADIOv2.5-B | RADIOv2.5-L | RADIOv2.5-H |
|------------------|-----------|-------------|-------------|-------------|
| 512 - ViTDet 16  | 82.370    | 70.488      | 78.102      | 80.058      |
| 1024 - ViTDet 16 | 0.192     | 72.182      | 79.878      | 81.834      |

Not only do the RADIOv2.5 models allow classification at any resolution, they also support running in ViTDet mode (windowed attention, here with window size 16) with only a small drop in accuracy.
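For completeness, here's a minimal sketch of loading the model in ViTDet mode. The `vitdet_window_size` argument is an assumption based on the "ViTDet 16" rows above; verify the exact parameter name against the repository before relying on it.

```python
import torch

# ViTDet mode uses windowed attention in most blocks, cutting the cost of
# high-resolution inference. The `vitdet_window_size` argument is our reading
# of the "ViTDet 16" rows above; verify the exact name against the repository.
model = torch.hub.load('NVlabs/RADIO', 'radio_model',
                       version='radio_v2.5-l', vitdet_window_size=16)
model.eval()

# Input sides should tile evenly into windows: 16 patches * 16px = 256px each.
x = torch.rand(1, 3, 1024, 1024)
with torch.no_grad():
    summary, spatial_features = model(x)
```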

An important implication of fixing mode switching is that it's now possible to ask for both the CLIP and SAM features of a given high-res image simultaneously and get meaningful results for both. Or you might want the high-res DINOv2 spatial features as well as the summary token (for classification) for the same image. This wasn't possible with the RADIOv2 model, which couldn't represent CLIP (or DINO) and SAM at the same time, but it works with the v2.5 models.
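As a sketch of what this enables, the snippet below loads the model with both the CLIP and SAM adaptor heads attached. The `adaptor_names` argument and the dict output keyed by adaptor name follow our reading of the repo's examples, so treat the exact structure as an assumption.

```python
import torch

# Attach both the CLIP and SAM adaptor heads to the backbone (assumed API).
model = torch.hub.load('NVlabs/RADIO', 'radio_model',
                       version='radio_v2.5-l', adaptor_names=['clip', 'sam'])
model.eval()

x = torch.rand(1, 3, 1024, 1024)  # one high-res image in [0, 1]
with torch.no_grad():
    out = model(x)

# With mode switching fixed, both heads are meaningful for the same input:
clip_summary, clip_spatial = out['clip']  # e.g. for zero-shot classification
sam_summary, sam_spatial = out['sam']     # e.g. as a drop-in SAM image encoder
```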

LLaVA 1.5 + Vicuna 7B

Last but not least, we tested out the models at various resolutions within LLaVA 1.5 + Vicuna 7B:

| Model       | Resolution | GQA (Val) | GQA (TestDev) | TextVQA* (Tokens) | TextVQA* (No Tokens) | POPE  | VQAv2 |
|-------------|------------|-----------|---------------|-------------------|----------------------|-------|-------|
| RADIOv2.1   | 432        | 71.70     | 63.01         | 56.32             | 42.03                | 86.20 | 79.28 |
| RADIOv2.5-B | 432        | 70.49     | 62.09         | 52.13             | 32.43                | 85.87 | 77.24 |
| RADIOv2.5-B | 512        | 71.08     | 62.70         | 54.36             | 36.39                | 86.59 | 78.03 |
| RADIOv2.5-B | 768        | 71.99     | 63.31         | 56.93             | 43.96                | 87.54 | 79.22 |
| RADIOv2.5-L | 432        | 71.57     | 62.89         | 56.71             | 42.34                | 86.13 | 79.44 |
| RADIOv2.5-L | 512        | 72.04     | 63.58         | 58.52             | 46.50                | 86.66 | 80.04 |
| RADIOv2.5-L | 768        | 72.91     | 64.13         | 61.93             | 53.95                | 87.68 | 81.02 |
| RADIOv2.5-H | 432        | 73.22     | 64.91         | 58.66             | 47.61                | 85.95 | 80.49 |
| RADIOv2.5-H | 512        | 73.60     | 64.98         | 60.03             | 51.99                | 86.73 | 80.96 |
| RADIOv2.5-H | 768        | 74.04     | 65.03         | 62.39             | 56.93                | 87.36 | 81.56 |

*By default, TextVQA adds detected OCR tokens into the context of the LLM. Because we're interested in how well the vision encoder itself is able to represent text, we study TextVQA both with (Tokens) and without (No Tokens) these tokens.

SigLIP

SigLIP is an extraordinary ViT-L model, and we've added it as a teacher in the latest release. If you'd like to use RADIO's adaptor for it, you can get it using the 'siglip' adaptor name. For example, in the examples/zero_shot_imagenet.py script, you'd pass --adaptor-name siglip as an argument to use SigLIP instead of the default DFN CLIP.

The specific SigLIP version we're using is ViT-SO400M-14-SigLIP-384 found in the OpenCLIP library.
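If you'd like to run zero-shot classification against the SigLIP head programmatically, a minimal sketch might look like the following. The `adaptor_names` argument and the dict-style output keyed by adaptor name are assumptions based on the repo's examples, and the prompts are purely illustrative.

```python
import torch
import torch.nn.functional as F
import open_clip

# RADIO with the SigLIP adaptor head attached (adaptor name from the text above;
# the dict-style output keyed by adaptor name is an assumption).
radio = torch.hub.load('NVlabs/RADIO', 'radio_model',
                       version='radio_v2.5-l', adaptor_names=['siglip'])
radio.eval()

# Text side: the matching SigLIP text tower from the OpenCLIP library.
siglip, _, _ = open_clip.create_model_and_transforms(
    'ViT-SO400M-14-SigLIP-384', pretrained='webli')
tokenizer = open_clip.get_tokenizer('ViT-SO400M-14-SigLIP-384')

labels = ['a photo of a cat', 'a photo of a dog']  # illustrative prompts
with torch.no_grad():
    text_emb = siglip.encode_text(tokenizer(labels))
    img_emb, _ = radio(torch.rand(1, 3, 512, 512))['siglip']  # summary vector

# Cosine similarity between the image summary and each text embedding.
scores = F.normalize(img_emb, dim=-1) @ F.normalize(text_emb, dim=-1).T
print(scores)
```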

Zero-shot ImageNet-1k top-1 accuracy (%) using the SigLIP head:

| Resolution | RADIOv2.5-B | RADIOv2.5-L | RADIOv2.5-H |
|------------|-------------|-------------|-------------|
| 224        | 58.670      | 72.492      | 76.796      |
| 256        | 65.190      | 75.962      | 77.984      |
| 336        | 69.110      | 77.830      | 79.400      |
| 432        | 70.276      | 78.582      | 79.990      |
| 512        | 70.694      | 78.828      | 80.258      |
| 768        | 71.102      | 78.930      | 80.172      |
| 1024       | 70.900      | 78.922      | 80.476      |

As can be seen, the classification results using the SigLIP head are slightly worse than those using DFN CLIP, so we suggest defaulting to DFN CLIP unless you specifically need SigLIP compatibility.

Videos!

While RADIOv2 may produce a visually pleasing video at 1024px, you can clearly see how it switches modes between low and high resolution. All models exhibit strong temporal stability.

RADIOv2

- 512px: maggie_hop_512_sbs.mp4
- 1024px: maggie_hop_1024_sbs.mp4

RADIOv2.5-B

- 512px: maggie_hop_512_sbs.mp4
- 1024px: maggie_hop_1024_sbs.mp4

RADIOv2.5-L

- 512px: maggie_hop_512_sbs.mp4
- 1024px: maggie_hop_1024_sbs.mp4

RADIOv2.5-H

- 512px
- 1024px