
Is the model capable of detecting open-vocabulary objects, like Grounding DINO? #2

Open
aixiaodewugege opened this issue Aug 30, 2023 · 9 comments

Comments

@aixiaodewugege

Thanks for your brilliant work!

I'm wondering if the model can detect arbitrary objects in an open-vocabulary manner, like Grounding DINO does?

@JacobYuan7 (Owner) commented Aug 30, 2023

@aixiaodewugege
Hi, many thanks for your interest in my work. Yes, you've grasped the concept accurately. Thanks to the annotation style of Visual Genome and the pseudo-labelled Objects365, it does have this ability. Nevertheless, I have to confess that it is not as capable as Grounding DINO, primarily due to the dataset's scale and the nature of the pseudo-labelled annotations.

@aixiaodewugege (Author)

Thanks for your reply!

Have you considered the possibility that HOI could improve accuracy on action recognition benchmarks such as Kinetics-400, given that you claim your model has superior zero-shot performance?

@JacobYuan7 (Owner) commented Aug 31, 2023

@aixiaodewugege
My answer is yes. It's definitely reasonable that introducing extra cues from a relation detection model can boost action recognition performance. However, if I were you, I would start from a fine-tuned model (Swin-L, perhaps), since the HICO-DET dataset covers a wide range of object and verb classes.
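
For illustration, here is a minimal late-fusion sketch of this idea. Everything below (the module name, feature shapes, and the existence of a frozen HOI detector producing per-frame relation embeddings) is my own assumption, not code from this repo:

```python
import torch
import torch.nn as nn

class RelationCueFusion(nn.Module):
    """Fuse pooled HOI relation cues into a video action classifier."""
    def __init__(self, video_dim=768, relation_dim=256, num_actions=400):
        super().__init__()
        # Project pooled relation embeddings into the video feature space.
        self.rel_proj = nn.Linear(relation_dim, video_dim)
        self.classifier = nn.Linear(video_dim * 2, num_actions)

    def forward(self, video_feat, relation_feats):
        # video_feat: (B, video_dim), clip-level feature from a video backbone.
        # relation_feats: (B, T, N, relation_dim), per-frame triplet embeddings
        # from a frozen, fine-tuned HOI detector (e.g. a Swin-L checkpoint).
        rel_cue = self.rel_proj(relation_feats.mean(dim=(1, 2)))  # pool over frames and triplets
        return self.classifier(torch.cat([video_feat, rel_cue], dim=-1))

# Usage sketch:
# logits = RelationCueFusion()(torch.randn(2, 768), torch.randn(2, 8, 4, 256))
```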

@aixiaodewugege (Author)

Thanks!

Do you have any suggestions on how I can integrate a video-based action recognition model with an image-based HOI model? Should I use the same image encoder, like mPLUG?

@JacobYuan7 (Owner)

@aixiaodewugege
I'd be doubtful about sharing a joint image encoder, since detection backbones usually require fine-tuning together with the detection head. It is only viable when the backbone of the action recognition model and the backbone of the HOI detector are jointly trained.
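
To make the point concrete, a small sketch (a hypothetical helper, not project code) of the safer alternative: keep the two encoders separate and freeze the detection branch, instead of forcing a shared image encoder that was never jointly trained:

```python
import torch.nn as nn

def build_two_stream(action_model: nn.Module, hoi_detector: nn.Module):
    # The HOI detector's backbone is coupled to its detection head, so we
    # freeze the whole detector and train only the action recognition stream.
    for p in hoi_detector.parameters():
        p.requires_grad = False
    hoi_detector.eval()
    return action_model, hoi_detector
```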

@aixiaodewugege (Author)

Thanks!

I'm new to HOI (Human-Object Interaction) and action recognition tasks. They have traditionally used different datasets. I'm curious why there haven't been attempts to combine them, given that doing so seems logically beneficial for both tasks.

@JacobYuan7 (Owner)

@aixiaodewugege
It is definitely worth trying. But I speculate that, for anyone aiming to publish a research paper, the novelty of such a combination would need careful assessment.

@aixiaodewugege (Author)

Thanks for your patience! It is really helpful!

I have a few questions about your method.

  1. Does the label sequence mean the caption of the image? If so, does that mean that when running inference on an image, we have to use BLIP to generate its caption first? Or does it mean the candidate relations that I care about? For example, if I only want to detect the two verbs 'eat' and 'drink', should I put those words into the label sequence?

  2. RLIPv2 is reused to tag relations. How was the first version of RLIPv2 trained? Was it trained on the VG dataset, since VG already has relation annotations?

@JacobYuan7 (Owner)

@aixiaodewugege

  1. With respect to the label sequence, it is indeed a sequence of labels rather than a whole caption. You can refer to RLIPv1 for clearer illustrations. Thus, this sequence can be dataset-specific rather than image-specific. For example, when we perform fine-tuning on HICO-DET, the text labels in the label sequence are identical for all images: all possible object texts and verb texts in HICO-DET. Back to your question, your second reading is correct: put the candidate relation texts you care about (e.g. only 'eat' and 'drink') into the label sequence (see the sketch after this list).
  2. I do not understand what 'the first version' refers to. If you mean how the R-Tagger is trained, I would recommend reading Sec. 4.2.2.
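
For illustration, a minimal sketch of point 1 (the RoBERTa text encoder and the exact input format here are my assumptions, not the precise RLIPv2 interface): the label sequence is built once per dataset from the candidate object and verb texts and reused for every image at inference:

```python
# Sketch only: encoder choice and formatting are assumptions, not RLIPv2's API.
from transformers import AutoTokenizer, AutoModel

object_texts = ["person", "cup", "sandwich"]  # candidate object labels
verb_texts = ["eat", "drink"]                 # only the verbs of interest

# One dataset-level sequence, not an image-specific caption.
label_sequence = object_texts + verb_texts

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
text_encoder = AutoModel.from_pretrained("roberta-base")

tokens = tokenizer(label_sequence, padding=True, return_tensors="pt")
# One embedding per label text; region features are matched against these.
label_embeddings = text_encoder(**tokens).last_hidden_state[:, 0]  # (5, 768)
```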
