Is the model capable of detecting open-vocabulary objects, such as grounding dino? #2

aixiaodewugege · 2023-08-30T09:14:48Z

Thanks for your brilliant work!

I'm wondering whether the model can detect arbitrary (open-vocabulary) objects, the way Grounding DINO does?

JacobYuan7 · 2023-08-30T11:52:56Z

@aixiaodewugege
Hi, many thanks for your interest in my work. Yes, you've grasped the concept accurately. Thanks to the annotation style of Visual Genome and the pseudo-labelled Objects365, the model has this ability. Nevertheless, I have to admit that it is not as capable as Grounding DINO, primarily due to the dataset's scale and the nature of the pseudo-labelled annotations.

aixiaodewugege · 2023-08-31T02:42:43Z

Thanks for your reply!

Have you considered whether HOI detection could improve accuracy on action recognition benchmarks such as Kinetics-400, given that you report superior zero-shot performance for your model?

JacobYuan7 · 2023-08-31T02:57:11Z

@aixiaodewugege
My answer is yes. It is definitely reasonable to expect that introducing extra cues from a relation detection model would boost action recognition performance. However, if I were you, I would start from a fine-tuned model (Swin-L perhaps), since the HICO-DET dataset covers a wide range of object classes and verb classes.

aixiaodewugege · 2023-08-31T03:05:43Z

Thanks!

Do you have any suggestions on how I can integrate a video-based action recognition model with an image-based HOI model? Should I use the same image encoder, like mPLUG?

JacobYuan7 · 2023-08-31T04:49:07Z

@aixiaodewugege
I doubt that sharing a joint image encoder would work, since detection backbones usually need to be fine-tuned together with the detection head. It is only viable when the backbone for the action recognition model and the backbone for HOI detection are jointly trained.
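One way to combine the two models without sharing an encoder is to keep them separate and fuse their outputs. The following is only a rough sketch under stated assumptions, not a method from the paper or the repository: the function names `pool_hoi_verb_scores` and `late_fuse`, the per-frame score dictionaries, and the example numbers are all hypothetical. It max-pools per-frame HOI verb confidences into a video-level vector and concatenates it with features from an independently trained video model.

```python
from typing import Dict, List


def pool_hoi_verb_scores(
    frame_scores: List[Dict[str, float]], verbs: List[str]
) -> List[float]:
    """Max-pool per-frame verb confidences into one video-level vector.

    Hypothetical helper: each dict maps a verb text to the confidence an
    image-based HOI detector assigned to it on one frame.
    """
    # A verb missing from a frame counts as confidence 0.0.
    return [
        max((frame.get(verb, 0.0) for frame in frame_scores), default=0.0)
        for verb in verbs
    ]


def late_fuse(video_features: List[float], hoi_features: List[float]) -> List[float]:
    """Concatenate video-model features with pooled HOI cues (late fusion)."""
    return list(video_features) + list(hoi_features)


# Illustrative values only; they are not measured results.
frames = [{"drink": 0.2, "hold": 0.9}, {"drink": 0.7}]
pooled = pool_hoi_verb_scores(frames, ["eat", "drink", "hold"])
fused = late_fuse([0.1, 0.5], pooled)
```

The fused vector could then feed a small classifier trained on the action labels, which leaves both backbones untouched.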

aixiaodewugege · 2023-08-31T06:41:40Z

Thanks!

I'm new to HOI (Human-Object Interaction) detection and action recognition. The two tasks use different datasets. I'm curious why there haven't been attempts to combine them, given that it seems logically beneficial for both tasks.

JacobYuan7 · 2023-08-31T07:30:41Z

@aixiaodewugege
It is definitely worth trying. However, I suspect that for anyone aiming to publish a research paper, the novelty of such a combination would need careful assessment.

aixiaodewugege · 2023-08-31T07:48:08Z

Thanks for your patience! It is really helpful!

I have a few questions about your method.

Does 'Label Sequence' mean the caption of the image? If so, does that mean that when running inference on an image, we first have to use BLIP to generate its caption? Or does it mean the candidate relations that I care about? For example, if I only want to detect the two verbs 'eat' and 'drink', should I put these words into the Label Sequence?
RLIPv2 is reused to tag relations. How is the first version of RLIPv2 trained? Is it trained on the VG dataset, since VG already has relation annotations?

JacobYuan7 · 2023-08-31T11:39:58Z

@aixiaodewugege

With respect to the label sequence, it is indeed a sequence of labels rather than a whole caption. You can refer to RLIPv1 for clearer illustrations. Thus, this sequence can be dataset-specific rather than image-specific. For example, when we perform fine-tuning on HICO-DET, the text labels in the label sequence are identical for all images: they are all possible object texts and verb texts in HICO-DET. Back to your question, your second interpretation (putting only the candidate relations you care about, such as the verbs 'eat' and 'drink', into the label sequence) is correct.
I'm not sure what you mean by 'the first version'. If you are asking how the R-Tagger is trained, I would recommend reading Sec. 4.2.2.
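A minimal sketch of the label-sequence idea described above, assuming plain Python lists rather than the repository's actual data structures. The function name `build_label_sequence` and the example labels are hypothetical and are not part of the RLIPv2 codebase. The sketch only illustrates that the sequence is a list of object and verb texts shared across images, and that it can be narrowed to the verbs of interest.

```python
from typing import Dict, List


def build_label_sequence(
    object_texts: List[str], verb_texts: List[str]
) -> Dict[str, object]:
    """Concatenate de-duplicated object and verb texts into one label sequence.

    Hypothetical helper for illustration only; not the RLIPv2 API.
    """
    # dict.fromkeys keeps first-seen order while removing duplicates.
    objects = list(dict.fromkeys(object_texts))
    verbs = list(dict.fromkeys(verb_texts))
    return {
        "texts": objects + verbs,
        # Index ranges let a caller map a position back to an object or a verb.
        "object_range": (0, len(objects)),
        "verb_range": (len(objects), len(objects) + len(verbs)),
    }


# Dataset-specific sequence: the same texts are reused for every image.
full_sequence = build_label_sequence(
    ["person", "cup", "apple", "cup"],
    ["eat", "drink", "hold"],
)

# Restricted sequence: only the verbs the user cares about.
restricted_sequence = build_label_sequence(
    ["person", "cup", "apple"],
    ["eat", "drink"],
)
```

How the texts are actually tokenized and fed to the model is left to the repository's own code; the example labels here are not the HICO-DET vocabulary.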

aixiaodewugege changed the title ~~Is the model capable of detecting open-vocabulary objects, such as grounding a dino?~~ Is the model capable of detecting open-vocabulary objects, such as grounding dino? Aug 31, 2023

JacobYuan7 added the labels good first issue (Good for newcomers) and question (Further information is requested) Jul 23, 2024
