
# Highlights

* Visual capabilities in recent Multimodal LLMs (MLLMs) still exhibit systematic shortcomings.
* They identify *CLIP-blind* pairs and construct the Multimodal Visual Patterns (MMVP) benchmark.
* MLLMs have difficulty answering simple questions about nine visual patterns.
* These errors stem mainly from the pre-trained vision encoder, and scaling alone may not be the solution.
* They propose a Mixture-of-Features (MoF) approach that can reduce these visual limitations.

# Introduction: Is vision good enough for language?

Multimodal Large Language Models (MLLMs) integrate images into LLMs and show remarkable capabilities in tasks such as image understanding and visual question answering.

However, they still exhibit visual shortcomings, some of which are surprisingly elementary and evident (Figure 1).

> Where do these problems originate? Is it a deficiency in visual modality, language understanding, or their alignment?

A natural hypothesis is that any limitation in the pretrained vision models can cascade into the downstream MLLMs that use them.

<div style="text-align:center"><img src="/collections/images/EWS/EWS1.jpg" width=1500></div>
<p style="text-align: center;font-style:italic">Figure 1. MLLMs (here GPT-4V) struggle with seemingly simple questions due to inaccurate visual grounding. Red indicates an incorrect response, green a hallucinated explanation.</p>

# Identifying failure examples

They exploit the *erroneous agreements* in the embedding space. If two visually different images are encoded similarly by CLIP, then at least one of the images is likely ambiguously encoded.
They call such a pair of images a *CLIP-blind* pair.

They use two corpus datasets, ImageNet and LAION-Aesthetics, to collect these CLIP-blind pairs.
For each image, they compute embeddings with CLIP-ViT-L-14 and DINOv2-ViT-L-14.
They return pairs for which the cosine similarity exceeds 0.95 for the CLIP embeddings but is below 0.6 for the DINOv2 embeddings.
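
Below is a minimal sketch of this pair-mining step, assuming `clip_feats` and `dino_feats` are precomputed, row-aligned embedding matrices (one row per image) from the two encoders; the encoder calls themselves and the names used here are illustrative, not the authors' code.

```python
import numpy as np

def _normalize(x: np.ndarray) -> np.ndarray:
    # Unit-normalize rows so a dot product equals cosine similarity
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def clip_blind_pairs(clip_feats: np.ndarray, dino_feats: np.ndarray,
                     clip_thr: float = 0.95, dino_thr: float = 0.6):
    """Return index pairs that CLIP sees as near-identical but DINOv2 separates."""
    sim_clip = _normalize(clip_feats) @ _normalize(clip_feats).T
    sim_dino = _normalize(dino_feats) @ _normalize(dino_feats).T
    pairs = []
    n = sim_clip.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if sim_clip[i, j] > clip_thr and sim_dino[i, j] < dino_thr:
                pairs.append((i, j))
    return pairs
```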

Using these CLIP-blind pairs, they build the Multimodal Visual Patterns (MMVP) benchmark.
They consider a pair of images to be correctly answered only if both of its associated questions are answered accurately.
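
As a small illustration of this pair-level scoring rule, the sketch below assumes a hypothetical `results` list of `(pair_id, is_correct)` entries, one per question.

```python
from collections import defaultdict

def pair_accuracy(results):
    """A pair only counts as correct if both of its questions are answered correctly."""
    per_pair = defaultdict(list)
    for pair_id, is_correct in results:
        per_pair[pair_id].append(is_correct)
    return sum(all(v) for v in per_pair.values()) / len(per_pair)

# Pair 0: both questions right; pair 1: one wrong -> accuracy = 0.5
print(pair_accuracy([(0, True), (0, True), (1, True), (1, False)]))
```
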
<div style="text-align:center"><img src="/collections/images/EWS/EWS3.jpg" width=1500></div>
<p style="text-align: center;font-style:italic">Figure 3. Examples of Questions in the MMVP benchmark.</p>

<div style="text-align:center"><img src="/collections/images/EWS/EWS4.jpg" width=400></div>
<p style="text-align: center;font-style:italic">Figure 4. Benchmark results of current SOTA MLLM models and humans.</p>

# Systematic Failures in CLIP

They study the systematic visual patterns in MMVP with which CLIP models struggle.

They categorize the questions of the MMVP benchmark into 9 categories (see Figure 5) and create a new benchmark, MMVP-VLM, that evaluates CLIP models directly (without MLLMs): the questions are converted into simpler language descriptions.

<div style="text-align:center"><img src="/collections/images/EWS/EWS5.jpg" width=1500></div>
<p style="text-align: center;font-style:italic">Figure 5. Examples from MMVP-VLM.</p>
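
One plausible way a CLIP model can be probed on such an item is sketched below: for each image of a pair, the model should assign the higher image-text score to the description that actually matches it. The `encode_image` / `encode_text` callables stand in for whichever CLIP wrapper is used and are assumptions, not the authors' exact evaluation code.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def picks_correct_description(encode_image, encode_text, image,
                              correct_text, wrong_text) -> bool:
    # The model "sees" the pattern if the matching description scores higher
    img_emb = encode_image(image)
    return cosine(img_emb, encode_text(correct_text)) > cosine(img_emb, encode_text(wrong_text))
```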

# Does scaling up solve the problem?

<div style="text-align:center"><img src="/collections/images/EWS/EWS_T1.jpg" width=1500></div>
<p style="text-align: center;font-style:italic">Table 1. Performance of various CLIP-based models. Blue marks scaling up the input size, green scaling up the number of parameters.</p>

Increasing model size and training data helps with only two visual patterns: “color and appearance” and “state and condition”.

ImageNet-1k zero-shot accuracy does not reflect how well a model handles these visual patterns.

<div style="text-align:center"><img src="/collections/images/EWS/EWS6.jpg" width=600></div>
<p style="text-align: center;font-style:italic">Figure 6. CLIP and MLLM performance on the visual patterns is correlated: LLaVA 1.5 and InstructBLIP (which explicitly use CLIP) have a correlation score greater than 0.7.</p>
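
For intuition, this kind of correlation can be computed with a plain Pearson coefficient over per-pattern scores; the numbers below are hypothetical placeholders for illustration, not the paper's results.

```python
import numpy as np

# One score per visual pattern (9 patterns), hypothetical values for illustration
clip_scores = np.array([0.20, 0.40, 0.10, 0.60, 0.30, 0.50, 0.20, 0.40, 0.30])
mllm_scores = np.array([0.25, 0.45, 0.15, 0.55, 0.35, 0.50, 0.30, 0.40, 0.35])

r = np.corrcoef(clip_scores, mllm_scores)[0, 1]
print(f"Pearson correlation between CLIP and MLLM per-pattern scores: {r:.2f}")
```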

# Mixture-of-Features (MoF) for MLLM

> If the MLLM's visual shortcomings come from the CLIP vision encoder, how can we build a better one?

They mix CLIP features with those of a vision-only self-supervised model (like DINO), which has better visual grounding.

For their experiments, they use the open-source model LLaVA and DINOv2 as the SSL model.

<div style="text-align:center"><img src="/collections/images/EWS/EWS7.jpg" width=1500></div>
<p style="text-align: center;font-style:italic">Figure 7. Different Mixture-of-Features (MoF) strategies in MLLMs.</p>

## Additive MoF

$$F_{A\text{-}MoF} = \alpha \cdot F_{DINOv2} + (1-\alpha) \cdot F_{CLIP}$$
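
A minimal PyTorch sketch of this additive mixing, assuming both encoders have already been projected to visual tokens of the same shape; it illustrates the formula above, not the authors' implementation.

```python
import torch

def additive_mof(f_clip: torch.Tensor, f_dino: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    # f_clip, f_dino: (batch, num_tokens, dim) visual tokens from the two encoders
    return alpha * f_dino + (1.0 - alpha) * f_clip
```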

They evaluate the model’s visual grounding ability with MMVP and instruction-following capability with the LLaVA benchmark.

<div style="text-align:center"><img src="/collections/images/EWS/EWS_T2.jpg" width=600></div>
<p style="text-align: center;font-style:italic">Table 2. Empirical Results of Additive MoF. </p>

There is a trade-off: increasing $$\alpha$$ improves visual grounding but reduces instruction-following capability.

## Interleaved MoF

They try another method in which features of CLIP and DINOv2 are interleaved while maintaining their spatial order.
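
A sketch of this interleaving, under the assumption that both token streams have the same length and feature dimension after projection; tokens are woven together position by position so the spatial order is preserved.

```python
import torch

def interleaved_mof(f_clip: torch.Tensor, f_dino: torch.Tensor) -> torch.Tensor:
    # f_clip, f_dino: (batch, num_tokens, dim) -> (batch, 2 * num_tokens, dim)
    b, n, d = f_clip.shape
    stacked = torch.stack((f_clip, f_dino), dim=2)  # (b, n, 2, d)
    return stacked.reshape(b, 2 * n, d)             # clip_0, dino_0, clip_1, dino_1, ...
```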

<div style="text-align:center"><img src="/collections/images/EWS/EWS_T3.jpg" width=500></div>
<p style="text-align: center;font-style:italic">Table 3. Empirical Results of Interleaved MoF.</p>

Interleaved MoF increases visual grounding abilities without compromising the ability to follow instructions.
