A curated list of papers and resources related to Described Object Detection, Open-Vocabulary/Open-World Object Detection and Referring Expression Comprehension.
If you find any work or resources missing, please send a pull request. Thanks!
📑 If you find our projects helpful to your research, please consider citing:
    @inproceedings{xie2023DOD,
      title={Described Object Detection: Liberating Object Detection with Flexible Expressions},
      author={Xie, Chi and Zhang, Zhao and Wu, Yixuan and Zhu, Feng and Zhao, Rui and Liang, Shuang},
      booktitle={Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS)},
      year={2023}
    }
A leaderboard of up-to-date DOD methods is available here.
- DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM (ECCV 2024) [paper]
- Learning Compositional Language-based Object Detection with Diffusion-based Synthetic Data (CVPR 2024 Workshop) [paper]
- LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding (arXiv 2024) [paper]
- An Open and Comprehensive Pipeline for Unified Object Grounding and Detection (arXiv 2024) [paper] [code]
- Generating Enhanced Negatives for Training Language-Based Object Detectors (CVPR 2024) [paper] [code]
- Aligning and Prompting Everything All at Once for Universal Visual Perception (arXiv 2023) [paper] [code]
- Described Object Detection: Liberating Object Detection with Flexible Expressions (NeurIPS 2023) [paper] [dataset] [code]
- Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone (NeurIPS 2022) [paper] [code]
- GLIPv2: Unifying Localization and Vision-Language Understanding (NeurIPS 2022) [paper] [code]
- Grounded Language-Image Pre-training (CVPR 2022) [paper] [code]
These methods are either MLLMs with detection/localization-related capabilities, or multi-task models handling both OD/OVD and REC. Though they do not directly handle DOD and are not evaluated on DOD benchmarks in their original papers, they may achieve performance comparable to the DOD baselines.
- Generative Region-Language Pretraining for Open-Ended Object Detection (CVPR 2024) [paper] [code]
- GROUNDHOG: Grounding Large Language Models to Holistic Segmentation (CVPR 2024) [paper]
- LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors (ICLR 2024) [paper]
- SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models (arXiv 2023) [paper] [code]
- Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs (arXiv 2023) [paper] [code (TBD)]
- Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models (arXiv 2023) [paper] [code]
- Ferret: Refer and Ground Anything Anywhere at Any Granularity (arXiv 2023) [paper] [code]
- Contextual Object Detection with Multimodal Large Language Models (arXiv 2023) [paper] [demo] [code]
- Kosmos-2: Grounding Multimodal Large Language Models to the World (ICLR 2024) [paper] [demo] [code]
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (arXiv 2023) [paper] [demo] [code]
- Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic (arXiv 2023) [paper] [demo] [code]
- Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection (arXiv 2023) [paper] [code (eval)] (REC, OD, etc.)
- Universal Instance Perception as Object Discovery and Retrieval (CVPR 2023) [paper] [code] (REC, OVD, etc.)
- FindIt: Generalized Localization with Natural Language Queries (ECCV 2022) [paper] [code] (REC, OD, etc.)
- GRiT: A Generative Region-to-text Transformer for Object Understanding (arXiv 2022) [paper] [demo (colab)] [code]
Note that some generic object detection methods that accept language prompts are also listed here. Though they may not be evaluated on OVD benchmarks, they are inherently capable of this setting; a typical zero-shot usage is sketched below.
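As an illustration of how such language-prompted detectors are used zero-shot, here is a minimal sketch with OWL-ViT ("Simple Open-Vocabulary Object Detection with Vision Transformers", listed below) via its HuggingFace Transformers port. The checkpoint name is a real published one, but the image path, prompts, and score threshold are illustrative assumptions, not recommendations from the original paper:

```python
# Minimal zero-shot detection sketch with OWL-ViT via HuggingFace Transformers.
# Image path, prompts, and threshold are placeholders for illustration.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("street.jpg")  # placeholder local image
texts = [["a red car", "a person riding a bicycle"]]  # free-form text prompts

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Map raw predictions back to pixel coordinates and filter by confidence.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{texts[0][label]}: {score:.2f} at {box.tolist()}")
```

Because the category set is given purely as text at inference time, swapping in new classes requires no retraining, which is what makes these models natural candidates for OVD-style evaluation.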
- MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection (ECCV 2024) [paper] [code]
- LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction (ECCV 2024) [paper]
- OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer (arXiv 2024) [paper] [code (TBD)]
- OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion (arXiv 2024) [paper] [code]
- Learning Background Prompts to Discover Implicit Knowledge for Open Vocabulary Object Detection (CVPR 2024) [paper]
- OV-DQUO: Open-Vocabulary DETR with Denoising Text Query Training and Open-World Unknown Objects Supervision (arXiv 2024) [paper] [code]
- SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection (CVPR 2024) [paper] [code]
- Open-Vocabulary Object Detection via Neighboring Region Attention Alignment (arXiv 2024) [paper]
- Training-free Boost for Open-Vocabulary Object Detection with Confidence Aggregation (arXiv 2024) [paper] [code]
- DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection (CVPR 2024) [paper]
- Hyperbolic Learning with Synthetic Captions for Open-World Detection (CVPR 2024) [paper]
- Retrieval-Augmented Open-Vocabulary Object Detection (CVPR 2024) [paper] [code (TBD)]
- T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy (arXiv 2024) [paper] [code]
- The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding (CVPR 2024) [paper] [code]
- Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head (arXiv 2024) [paper] [code]
- InstaGen: Enhancing Object Detection by Training on Synthetic Dataset (arXiv 2024) [paper]
- YOLO-World: Real-Time Open-Vocabulary Object Detection (arXiv 2024) [paper] [code]
- CLIM: Contrastive Language-Image Mosaic for Region Representation (AAAI 2024) [paper] [code]
- Simple Image-level Classification Improves Open-vocabulary Object Detection (arXiv 2023) [paper] [code]
- ProxyDet: Synthesizing Proxy Novel Classes via Classwise Mixup for Open Vocabulary Object Detection (AAAI 2024) [paper]
- OpenSD: Unified Open-Vocabulary Segmentation and Detection (arXiv 2023) [paper] [code (TBD)]
- Boosting Segment Anything Model Towards Open-Vocabulary Learning (arXiv 2023) [paper]
- Learning Pseudo-Labeler beyond Noun Concepts for Open-Vocabulary Object Detection (arXiv 2023) [paper]
- Language-conditioned Detection Transformer (arXiv 2023) [paper] [code]
- LP-OVOD: Open-Vocabulary Object Detection by Linear Probing (WACV 2024) [paper] [code]
- Meta-Adapter: An Online Few-shot Learner for Vision-Language Model (NeurIPS 2023) [paper]
- Open-Vocabulary Object Detection with Meta Prompt Representation and Instance Contrastive Optimization (BMVC 2023) [paper]
- CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection (NeurIPS 2023) [paper] [code]
- DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection (arXiv 2023) [paper] [code]
- Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection (arXiv 2023) [paper]
- Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection (arXiv 2023) [paper]
- How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection (arXiv 2023) [paper] [dataset]
- Improving Pseudo Labels for Open-Vocabulary Object Detection (arXiv 2023) [paper]
- Scaling Open-Vocabulary Object Detection (arXiv 2023) [paper] [code (jax)]
- Unified Open-Vocabulary Dense Visual Prediction (arXiv 2023) [paper]
- TIB: Detecting Unknown Objects Via Two-Stream Information Bottleneck (TPAMI 2023) [paper]
- Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection (TNNLS 2023) [paper]
- Open-Vocabulary Object Detection via Scene Graph Discovery (ACM MM 2023) [paper]
- Three Ways to Improve Feature Alignment for Open Vocabulary Detection (arXiv 2023) [paper]
- Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection (arXiv 2023) [paper]
- Open-Vocabulary Object Detection using Pseudo Caption Labels (arXiv 2023) [paper]
- What Makes Good Open-Vocabulary Detector: A Disassembling Perspective (KDD 2023 Workshop) [paper]
- Open-Vocabulary Object Detection With an Open Corpus (ICCV 2023) [paper]
- Distilling DETR with Visual-Linguistic Knowledge for Open-Vocabulary Object Detection (ICCV 2023) [paper] [code]
- A Simple Framework for Open-Vocabulary Segmentation and Detection (ICCV 2023) [paper] [code]
- EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment (ICCV 2023) [paper] [website]
- Contrastive Feature Masking Open-Vocabulary Vision Transformer (ICCV 2023) [paper]
- Multi-Modal Classifiers for Open-Vocabulary Object Detection (ICML 2023) [paper] [code (eval)]
- CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching (CVPR 2023) [paper] [code]
- Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection (CVPR 2023) [paper] [code]
- Aligning Bag of Regions for Open-Vocabulary Object Detection (CVPR 2023) [paper] [code]
- Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers (CVPR 2023) [paper] [code]
- DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment (CVPR 2023) [paper]
- Learning to Detect and Segment for Open Vocabulary Object Detection (CVPR 2023) [paper]
- F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models (ICLR 2023) [paper] [code] [website]
- Learning Object-Language Alignments for Open-Vocabulary Object Detection (ICLR 2023) [paper] [code]
- Simple Open-Vocabulary Object Detection with Vision Transformers (ECCV 2022) [paper] [code (jax)] [code (huggingface)]
- Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization (arXiv 2022) [paper] [code]
- Localized Vision-Language Matching for Open-vocabulary Object Detection (GCPR 2022) [paper] [code]
- Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection (NeurIPS 2022) [paper] [code]
- X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks (ECCV 2022) [paper]
- Exploiting Unlabeled Data with Vision and Language Models for Object Detection (ECCV 2022) [paper] [code]
- PromptDet: Towards Open-vocabulary Detection using Uncurated Images (ECCV 2022) [paper] [website] [code]
- Open-Vocabulary DETR with Conditional Matching (ECCV 2022) [paper] [code]
- Open Vocabulary Object Detection with Pseudo Bounding-Box Labels (ECCV 2022) [paper] [code]
- RegionCLIP: Region-Based Language-Image Pretraining (CVPR 2022) [paper] [code]
- Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling (CVPR 2022) [paper] [code]
- Open-Vocabulary One-Stage Detection With Hierarchical Visual-Language Knowledge Distillation (CVPR 2022) [paper] [code]
- Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model (CVPR 2022) [paper] [code]
- Open-vocabulary Object Detection via Vision and Language Knowledge Distillation (ICLR 2022) [paper] [code]
- Open-Vocabulary Object Detection Using Captions (CVPR 2021) [paper] [code]
- Visual Grounding with Dual Knowledge Distillation (TCSVT 2024) [paper]
- Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models (arXiv 2024) [paper] [code]
- HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding (arXiv 2024) [paper] [code]
- Learning from Models and Data for Visual Grounding (arXiv 2024) [paper]
- GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection (arXiv 2023) [paper] [code]
- Context Disentangling and Prototype Inheriting for Robust Visual Grounding (TPAMI 2023) [paper] [code]
- Cycle-Consistency Learning for Captioning and Grounding (AAAI 2024) [paper]
- Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions (arXiv 2023) [paper]
- Continual Referring Expression Comprehension via Dual Modular Memorization (arXiv 2023) [paper] [code]
- ViLaM: A Vision-Language Model with Enhanced Visual Grounding and Generalization Capability (arXiv 2023) [paper]
- OV-VG: A Benchmark for Open-Vocabulary Visual Grounding (arXiv 2023) [paper] [code]
- VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders (arXiv 2023) [paper]
- Language-Guided Diffusion Model for Visual Grounding (arXiv 2023) [paper] [code (TBD)]
- Fine-Grained Visual Prompting (arXiv 2023) [paper]
- ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities (arXiv 2023) [paper] [code]
- CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding (TMM 2023) [paper] [code]
- Unleashing Text-to-Image Diffusion Models for Visual Perception (ICCV 2023) [paper] [website] [code]
- Focusing On Targets For Improving Weakly Supervised Visual Grounding (ICASSP 2023) [paper]
- Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks (ICLR 2023) [paper] [code (eval)]
- PolyFormer: Referring Image Segmentation as Sequential Polygon Generation (CVPR 2023) [paper] [website] [code] [demo]
- Advancing Visual Grounding With Scene Knowledge: Benchmark and Method (CVPR 2023) [paper] [code]
- Language Adaptive Weight Generation for Multi-task Visual Grounding (CVPR 2023) [paper]
- From Coarse to Fine-grained Concept based Discrimination for Phrase Detection (CVPR 2023 Workshop) [paper]
- Referring Expression Comprehension Using Language Adaptive Inference (AAAI 2023) [paper]
- DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding (AAAI 2023) [paper] [code]
- One for All: One-stage Referring Expression Comprehension with Dynamic Reasoning (arXiv 2022) [paper]
- Self-paced Multi-grained Cross-modal Interaction Modeling for Referring Expression Comprehension (arXiv 2022) [paper]
- SeqTR: A Simple yet Universal Network for Visual Grounding (ECCV 2022) [paper] [code]
- SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding (ECCV 2022) [paper]
- Towards Unifying Reference Expression Generation and Comprehension (EMNLP 2022) [paper]
- Correspondence Matters for Video Referring Expression Comprehension (ACM MM 2022) [paper]
- Visual Grounding with Transformers (ICME 2022) [paper] [code]
- Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning (CVPR 2022) [paper] [code]
- Multi-Modal Dynamic Graph Transformer for Visual Grounding (CVPR 2022) [paper] [code]
- Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding (CVPR 2022) [paper] [code]
- OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework (ICML 2022) [paper] [code]
- Towards Language-guided Visual Recognition via Dynamic Convolutions (arXiv 2021) [paper]
- Referring Transformer: A One-step Approach to Multi-task Visual Grounding (NeurIPS 2021) [paper] [code]
- InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring (ICCV 2021) [paper] [code]
- MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding (ICCV 2021) [paper] [website] [code]
- Look Before You Leap: Learning Landmark Features for One-Stage Visual Grounding (CVPR 2021) [paper] [code]
- Co-Grounding Networks with Semantic Attention for Referring Expression Comprehension in Videos (CVPR 2021) [paper] [code]
- Relation-aware Instance Refinement for Weakly Supervised Visual Grounding (CVPR 2021) [paper] [code]
- Large-Scale Adversarial Training for Vision-and-Language Representation Learning (NeurIPS 2020) [paper] [code] [poster]
- Improving One-stage Visual Grounding by Recursive Sub-query Construction (ECCV 2020) [paper] [code]
- UNITER: UNiversal Image-TExt Representation Learning (ECCV 2020) [paper] [code]
- Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation (CVPR 2020) [paper] [code]
- A Real-Time Cross-modality Correlation Filtering Method for Referring Expression Comprehension (CVPR 2020) [paper]
- Dynamic Graph Attention for Referring Expression Comprehension (ICCV 2019) [paper]
- A Fast and Accurate One-Stage Approach to Visual Grounding (ICCV 2019) [paper] [code]
- Neighbourhood Watch: Referring Expression Comprehension via Language-Guided Graph Attention Networks (CVPR 2019) [paper]
- Interactive Visual Grounding of Referring Expressions for Human-Robot Interaction (RSS 2018) [paper] [code]
- Rethinking Diversified and Discriminative Proposal Generation for Visual Grounding (IJCAI 2018) [paper] [code]
- MAttNet: Modular Attention Network for Referring Expression Comprehension (CVPR 2018) [paper] [code]
- Comprehension-Guided Referring Expressions (CVPR 2017) [paper]
- Modeling Context Between Objects for Referring Expression Understanding (ECCV 2016) [paper]
This part is still in progress.
Name | Paper | Website | Code | Train/Eval | Notes |
---|---|---|---|---|---|
D³ | Described Object Detection: Liberating Object Detection with Flexible Expressions (NeurIPS 2023) | - | Github | eval only | - |
OmniLabel | OmniLabel: A Challenging Benchmark for Language-Based Object Detection (ICCV 2023) | Project | Github | eval only | - |
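Both benchmarks report COCO-style AP through their official toolkits (linked in the Code column). For orientation, the underlying evaluation loop resembles standard COCO detection evaluation; a generic sketch with pycocotools is shown below, with placeholder file names, noting that the actual toolkits differ in how free-form descriptions are mapped to categories:

```python
# Generic COCO-style mAP evaluation with pycocotools (pip install pycocotools).
# File names are placeholders; benchmark toolkits wrap a similar flow.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations.json")             # ground-truth annotations
coco_dt = coco_gt.loadRes("predictions.json")  # [{image_id, category_id, bbox, score}, ...]

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP/AR summaries, including AP@[.50:.95]
```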
Name | Paper | Task | Website | Code | Train/Eval | Notes |
---|---|---|---|---|---|---|
Bamboo | Bamboo: Building Mega-Scale Vision Dataset Continually with Human-Machine Synergy | OD | - | Github | detector pretraining | built upon public datasets; 69M image classification annotations and 32M object bounding boxes |
BigDetection | BigDetection: A Large-scale Benchmark for Improved Object Detector Pre-training (CVPR 2022 Workshop) | OD | - | Github | detector pretraining | - |
Objects365 | Objects365: A Large-Scale, High-Quality Dataset for Object Detection (ICCV 2019) | OD | Link | BAAI platform for download | detector pretraining; train & eval | - |
OpenImages | - | OD | Link | Tensorflow API | train & eval | - |
LVIS | LVIS: A Dataset for Large Vocabulary Instance Segmentation (CVPR 2019) | OD&OVD | Link | Github | train & eval | long-tail; federated annotation; also used for OVD |
COCO | Microsoft COCO: Common Objects in Context (ECCV 2014) | OD&OVD | Link | Github | train & eval | also used for OVD |
VOC | The PASCAL Visual Object Classes (VOC) Challenge (IJCV 2010) | OD | Link | - | train & eval | - |
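Most of these datasets ship COCO-style JSON annotations with matching Python APIs. As one example, a minimal sketch for browsing LVIS annotations with the official lvis-api (the annotation path is a placeholder):

```python
# Browsing LVIS v1 annotations with the official lvis-api (pip install lvis).
from lvis import LVIS

lvis = LVIS("lvis_v1_val.json")                   # placeholder annotation path
img_ids = lvis.get_img_ids()
ann_ids = lvis.get_ann_ids(img_ids=img_ids[:1])   # annotations of the first image
anns = lvis.load_anns(ann_ids)
cats = lvis.load_cats([ann["category_id"] for ann in anns])
print([cat["name"] for cat in cats])              # long-tail category names
```

Note that LVIS uses federated annotation: each image is exhaustively labeled only for a subset of categories, which its evaluation protocol explicitly accounts for.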
Some survey papers on relevant tasks (open-vocabulary learning, etc.):
- Towards Open Vocabulary Learning: A Survey (arxiv 2023) [paper] [repo]
- A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future (arxiv 2023) [paper]
- Referring Expression Comprehension: A Survey of Methods and Datasets (TMM 2020) [paper]
Some similar GitHub repos (awesome lists):
- daqingliu/awesome-rec: A curated list of REC papers. Not maintained in recent years.
- qy-feng/awesome-visual-grounding: A curated list of visual grounding papers. Not maintained in recent years.
- MarkMoHR/Awesome-Referring-Image-Segmentation: A list of Referring Expression Segmentation (RES) papers and resources.
- TheShadow29/awesome-grounding: A list of visual grounding (REC) paper roadmaps and datasets.
- witnessai/Awesome-Open-Vocabulary-Object-Detection: A list of Open-Vocabulary Object Detection papers.
The structure and format of this repo are inspired by BradyFU/Awesome-Multimodal-Large-Language-Models.