Awesome-Scene-Graph-for-Cross-Modal-Learning

🎨 Introduction

A scene graph is a topological structure representing a scene described in text, image, video, or another modality. In this graph, the nodes correspond to objects (bounding boxes with their category labels and attributes), while the edges represent the pairwise relationships between objects.
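As a rough illustration of the definition above, a scene graph can be held in a small data structure. This is only a sketch: the class names `ObjectNode` and `SceneGraph`, the `(x_min, y_min, x_max, y_max)` box convention, and the example objects are assumptions made for illustration and do not come from any listed dataset or toolkit.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    """A node: one object with its category label, bounding box, and attributes."""
    label: str
    bbox: tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)
    attributes: list[str] = field(default_factory=list)


@dataclass
class SceneGraph:
    """Nodes are objects; edges are (subject_index, predicate, object_index)."""
    nodes: list[ObjectNode] = field(default_factory=list)
    edges: list[tuple[int, str, int]] = field(default_factory=list)

    def add_node(self, node: ObjectNode) -> int:
        # Return the index of the new node so edges can refer to it.
        self.nodes.append(node)
        return len(self.nodes) - 1

    def add_edge(self, subj: int, predicate: str, obj: int) -> None:
        self.edges.append((subj, predicate, obj))

    def triplets(self) -> list[tuple[str, str, str]]:
        # Resolve node indices to labels: (subject, predicate, object).
        return [(self.nodes[s].label, p, self.nodes[o].label)
                for s, p, o in self.edges]


# Hypothetical example scene.
graph = SceneGraph()
man = graph.add_node(ObjectNode("man", (10, 20, 110, 300), ["standing"]))
horse = graph.add_node(ObjectNode("horse", (90, 80, 400, 320), ["brown"]))
graph.add_edge(man, "next to", horse)
print(graph.triplets())  # [('man', 'next to', 'horse')]
```

Storing edges as index triples keeps two objects with the same label distinct, which matters when a scene contains several instances of one category.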

📕 Table of Contents

🎨 Introduction
📕 Table of Contents
🌷 Scene Graph Datasets
🍕 Scene Graph Generation
🥝 Scene Graph Application
🤶 Evaluation Metrics
🐱‍🚀 Miscellaneous
⭐️ Star History

🌷 Scene Graph Datasets

Dataset	Modality	Obj. Class	BBox	Rela. Class	Triplets	Instances
Visual Phrase	Image	8	3,271	9	1,796	2,769
Scene Graph	Image	266	69,009	68	109,535	5,000
VRD	Image	100	-	70	37,993	5,000
Open Images v7	Image	57	3,290,070	329	374,768	9,178,275
Visual Genome	Image	33,877	3,843,636	40,480	2,347,187	108,077
GQA	Image	200	-	100	-	-
VrR-VG	Image	1,600	282,460	117	203,375	58,983
UnRel	Image	-	-	18	76	1,071
SpatialSense	Image	3,679	-	9	13,229	11,569
SpatialVOC2K	Image	20	5,775	34	9,804	2,026
OpenSG	Image (panoptic)	133	-	56	-	49K
AUG	Image (Overhead View)	76	-	61	-	-
STAR	Satellite Imagery	48	219,120	58	400,795	31,096
ReCon1M	Satellite Imagery	60	859,751	64	1,149,342	21,392
SkySenseGPT	Satellite Imagery (Instruction)	-	-	-	-	-
ImageNet-VidVRD	Video	35	-	132	3,219	100
VidOR	Video	80	-	50	-	10,000
Action Genome	Video	35	0.4M	25	1.7M	10,000
AeroEye	Video (Drone-View)	56	-	384	-	2.2M
PVSG	Video (panoptic)	126	-	57	-	400
ASPIRe	Video (Interlacements)	-	-	4.5K	-	1.5K
3D Semantic Scene Graphs (3DSSG)	3D	40	-	-	-	48K
PSG4D	4D	46	-	15	-	-
4D-OR	4D (operating room)	12	-	14	-	-
FACTUAL	Image, Text	4,042	-	1,607	40,149	40,369

🍕 Scene Graph Generation

2D (Image) Scene Graph Generation

There are three subtasks:

Predicate classification: given ground-truth labels and bounding boxes of object pairs, predict the predicate label.
Scene graph classification: jointly classify the predicate labels and the objects' categories, given the ground-truth bounding boxes.
Scene graph detection: detect the objects and their categories, and predict the predicate between object pairs.
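The three subtasks differ only in how much ground truth the model is allowed to see. The sketch below makes that explicit. The short names `predcls`, `sgcls`, and `sgdet` are common abbreviations for the three subtasks; the dictionary keys, the helper `visible_ground_truth`, and the example annotation are hypothetical.

```python
# Which ground-truth fields are handed to the model in each subtask.
GIVEN = {
    "predcls": ("boxes", "labels"),  # predicate classification
    "sgcls": ("boxes",),             # scene graph classification
    "sgdet": (),                     # scene graph detection
}


def visible_ground_truth(subtask, annotation):
    """Return only the annotation fields the model may use as input."""
    return {key: annotation[key] for key in GIVEN[subtask]}


# Hypothetical annotation for one image.
annotation = {
    "boxes": [(10, 20, 110, 300), (90, 80, 400, 320)],
    "labels": ["man", "horse"],
    "predicates": [(0, "next to", 1)],
}

for subtask in GIVEN:
    print(subtask, sorted(visible_ground_truth(subtask, annotation)))
```

In every subtask the predicates are withheld, since they are always part of what must be predicted.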

LLM-based

Non-LLM-based

Panoptic Scene Graph Generation

Compared with a traditional scene graph, each object in PSG is grounded by a panoptic segmentation mask, yielding a comprehensive structured scene representation.

Spatio-Temporal (Video) Scene Graph Generation

Spatio-Temporal (Video) Scene Graph Generation, a.k.a. dynamic scene graph generation, aims to provide a detailed and structured interpretation of the whole scene by parsing an event into a sequence of interactions between different visual entities. It usually involves two subtasks:

Scene graph detection: generate scene graphs for given videos, comprising detection results for each subject-object pair and the associated predicates. An object's localization is considered accurate when the Intersection over Union (IoU) between the prediction and the ground truth is greater than 0.5.
Predicate classification: classify predicates for given oracle detection results of subject-object pairs.
Note: Evaluation is conducted under two settings: ***With Constraint*** and ***No Constraint***. In the former, the generated graphs are restricted to at most one edge per subject-object pair, i.e., each pair is allowed only one predicate; in the latter, the graphs can have multiple edges per pair. For more details, refer to Evaluation Metrics.
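The IoU criterion and the two evaluation settings can be sketched in a few lines. This is an illustrative sketch, not the evaluation code of any listed paper: the `(x_min, y_min, x_max, y_max)` box format, the function names `iou` and `select_predictions`, and the example scores are all assumptions.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x_min, y_min, x_max, y_max) form."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def select_predictions(scored, with_constraint):
    """Rank (subject_id, object_id, predicate, score) tuples by score.

    With the constraint, keep only the top-scoring predicate per
    subject-object pair; without it, keep every predicate.
    """
    ranked = sorted(scored, key=lambda t: t[3], reverse=True)
    if not with_constraint:
        return ranked
    seen_pairs = set()
    kept = []
    for subj, obj, predicate, score in ranked:
        if (subj, obj) not in seen_pairs:
            seen_pairs.add((subj, obj))
            kept.append((subj, obj, predicate, score))
    return kept


# A localization counts as accurate only if IoU exceeds 0.5.
print(iou((0, 0, 10, 10), (0, 0, 10, 8)) > 0.5)   # True (IoU = 0.8)
print(iou((0, 0, 10, 10), (5, 0, 15, 10)) > 0.5)  # False (IoU = 1/3)

# Hypothetical scored predictions for one frame.
scored = [(0, 1, "holding", 0.9), (0, 1, "touching", 0.7), (1, 2, "on", 0.6)]
print(len(select_predictions(scored, with_constraint=True)))   # 2
print(len(select_predictions(scored, with_constraint=False)))  # 3
```

The selected predictions would then be truncated to the top K and matched against ground-truth triplets to compute a recall-style metric.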

LLM-based

Tri-modal Confluence with Temporal Dynamics for Scene Graph Generation in Operating Rooms

Non-LLM-based

CYCLO: Cyclic Graph Transformer Approach to Multi-Object Relationship Modeling in Aerial Videos
OED: Towards One-stage End-to-End Dynamic Scene Graph Generation
HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding
Summary
Introduces a new dataset that explores interactivity understanding in visual content by deriving scene graph representations from dense interactivities among humans and objects.
Action Scene Graphs for Long-Form Understanding of Egocentric Videos
End-to-End Video Scene Graph Generation With Temporal Propagation Transformer
Unbiased scene graph generation in videos
Panoptic Video Scene Graph Generation
Cross-Modality Time-Variant Relation Learning for Generating Dynamic Scene Graphs
Video Scene Graph Generation from Single-Frame Weak Supervision
Prior Knowledge-driven Dynamic Scene Graph Generation with Causal Inference
Dynamic scene graph generation via temporal prior inference
VRDFormer: End-to-End Video Visual Relation Detection with Transformers
Dynamic Scene Graph Generation via Anticipatory Pre-training
Meta Spatio-Temporal Debiasing for Video Scene Graph Generation
Spatial-temporal transformer for dynamic scene graph generation
Target adaptive context aggregation for video scene graph generation
Video Visual Relation Detection

Audio Scene Graph Generation

3D Scene Graph Generation

Given a 3D point cloud $P \in \mathbb{R}^{N \times 3}$ consisting of $N$ points, we assume there is a set of class-agnostic instance masks $M = \{M_1, ..., M_K\}$ corresponding to $K$ entities in $P$. 3D Scene Graph Generation aims to map the input 3D point cloud to a reliable, semantically structured scene graph $G = \{O, R\}$. Compared with 2D scene graph generation, the input of 3D SGG is a point cloud rather than an image.
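To make the shapes in this formulation concrete, the sketch below builds the skeleton of such a graph from a toy point cloud: one candidate node per instance mask and one candidate edge per ordered pair of entities. It is a sketch under stated assumptions only. Masks are represented as non-empty lists of point indices, the function name `candidate_graph` is hypothetical, and the centroid stands in for the learned features that an actual 3D SGG model would extract before predicting object and relation classes.

```python
from itertools import permutations


def candidate_graph(points, masks):
    """Build candidate nodes and edges from a point cloud and instance masks.

    points: list of N (x, y, z) tuples, i.e. the point cloud P.
    masks:  list of K non-empty lists of point indices, i.e. M_1, ..., M_K.
    """
    nodes = []
    for indices in masks:
        n = len(indices)
        # Centroid as a placeholder for a learned per-entity feature.
        centroid = tuple(sum(points[i][d] for i in indices) / n for d in range(3))
        nodes.append({"num_points": n, "centroid": centroid})
    # Every ordered pair (subject, object) is a candidate relation.
    edges = list(permutations(range(len(masks)), 2))
    return nodes, edges


# Toy point cloud with N = 4 points and K = 2 entities.
points = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 4.0, 1.0), (0.0, 6.0, 1.0)]
masks = [[0, 1], [2, 3]]
nodes, edges = candidate_graph(points, masks)
print(nodes[0]["centroid"])  # (1.0, 0.0, 0.0)
print(edges)                 # [(0, 1), (1, 0)]
```

With $K$ entities there are $K(K-1)$ ordered candidate pairs, which is why relation prediction cost grows quadratically with the number of instances.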

4D Scene Graph Generation

4D Panoptic Scene Graph Generation

Textual Scene Graph Generation

🥝 Scene Graph Application

Image Retrieval

Image Caption

Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
Introducing a new dataset, GBC10M
Humans describe complex scenes with compositionality, using simple text descriptions enriched with links and relationships. While vision-language research has aimed to develop models with compositional understanding capabilities, this is not reflected yet in existing datasets which, for the most part, still use plain text to describe images. In this work, we propose a new annotation strategy, graph-based captioning (GBC), that describes an image using a labelled graph structure, with nodes of various types. We demonstrate that GBC can be produced automatically, using off-the-shelf multimodal LLMs and open-vocabulary detection models, by building a new dataset, GBC10M, gathering GBC annotations for about 10M images of the CC12M dataset.
Transforming Visual Scene Graphs to Image Captions
Cross2StrA: Unpaired Cross-lingual Image Captioning with Cross-lingual Cross-modal Structure-pivoted Alignment
UNISON: Unpaired Cross-Lingual Image Captioning
Comprehensive Image Captioning via Scene Graph Decomposition
From Show to Tell: A Survey on Deep Learning-based Image Captioning
Image captioning based on scene graphs: A survey

2D Image Generation

Visual Reasoning

SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models
SceneGPT: A Language Model for 3D Scene Understanding
Towards Flexible Visual Relationship Segmentation
A single model that seamlessly integrates visual relationship understanding, which has been studied separately in human-object interaction (HOI) detection, scene graph generation (SGG), and referring relationships (RR) tasks.
FleVRS leverages the synergy between text and image modalities to ground various types of relationships from images, and uses textual features from vision-language models for visual conceptual understanding.
LLaVA-SG: Leveraging Scene Graphs as Visual Semantic Expression in Vision-Language Models
R2G: Reasoning to Ground in 3D Scenes
Multi-modal Situated Reasoning in 3D Scenes
Introducing a large-scale multimodal situated reasoning dataset, scalably collected leveraging 3D scene graphs and vision-language models (VLMs) across a diverse range of real-world 3D scenes
MSQA includes 251K situated question-answering pairs across 9 distinct question categories, covering complex scenarios and object modalities within 3D scenes. We introduce a novel interleaved multi-modal input setting in our benchmark to provide texts, images, and point clouds for situation and question description, aiming to resolve ambiguity in describing situations with single-modality inputs (e.g., texts).
SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge
VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering

Enhanced VLM/MLLM

Semantic Compositions Enhance Vision-Language Contrastive Learning
Compositional Chain-of-Thought Prompting for Large Multimodal Models
Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
New dataset and new task (Relation Conversation)
We propose a novel task, termed Relation Conversation (ReC), which unifies the formulation of text generation, object localization, and relation comprehension. Based on the unified formulation, we construct the AS-V2 dataset, which consists of 127K high-quality relation conversation samples, to unlock the ReC capability for Multi-modal Large Language Models (MLLMs).
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
New dataset and a unified vision-language model for open-world panoptic visual recognition and understanding
We propose a new large-scale dataset (AS-1B) for open-world panoptic visual recognition and understanding, using an economical semi-automatic data engine that combines the power of off-the-shelf vision/language models and human feedback. Moreover, we develop a unified vision-language foundation model (ASM) for open-world panoptic visual recognition and understanding. Aligning with LLMs, our ASM supports versatile image-text retrieval and generation tasks, demonstrating impressive zero-shot capability.
Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs
Fine-Grained Semantically Aligned Vision-Language Pre-Training
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs

Information Extraction

3D Generation

Mitigate Hallucination

Reefknot: A Comprehensive Benchmark for Relation Hallucination Evaluation, Analysis and Mitigation in Multimodal Large Language Models
Introducing a benchmark based on a scene graph dataset
Specifically, we first provide a systematic definition of relation hallucinations, integrating perspectives from perceptive and cognitive domains. Furthermore, we construct the relation-based corpus utilizing the representative scene graph dataset Visual Genome (VG), from which semantic triplets follow real-world distributions.
BACON: Supercharge Your VLM with Bag-of-Concept Graph to Mitigate Hallucinations
Mitigating Hallucination in Visual Language Models with Visual Supervision

Dynamic Environment Guidance

Open Scene Graphs for Open World Object-Goal Navigation
LLM-enhanced Scene Graph Learning for Household Rearrangement
household rearrangement
The household rearrangement task involves spotting misplaced objects in a scene and moving them to proper places.
Situational Instructions Database: Task Guidance in Dynamic Environments
Situational Instructions Database (SID)
Situational Instructions Database (SID) is a dataset for dynamic task guidance. It contains situationally-aware instructions for performing a wide range of everyday tasks or completing scenarios in 3D environments. The dataset provides step-by-step instructions for these scenarios which are grounded in the context of the situation. This context is defined through a scenario-specific scene graph that captures the objects, their attributes, and their relations in the environment. The dataset is designed to enable research in the areas of grounded language learning, instruction following, and situated dialogue.
RoboHop: Segment-based Topological Map Representation for Open-World Visual Navigation
LLM-Personalize: Aligning LLM Planners with Human Preferences via Reinforced Self-Training for Housekeeping Robots

Privacy-sensitive Object Identification

Beyond Visual Appearances: Privacy-sensitive Objects Identification via Hybrid Graph Reasoning

Referring Expression Comprehension

Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions
A triplet-matching objective to fine-tune the vision-language alignment models.
To mitigate this gap, we leverage large foundation models to disentangle both images and texts into triplets in the format of (subject, predicate, object). After that, grounding is accomplished by calculating the structural similarity matrix between visual and textual triplets with a VLA model, and subsequently propagating it to an instance-level similarity matrix. Furthermore, to equip VLA models with the ability of relationship understanding, we design a triplet-matching objective to fine-tune the VLA models on a collection of curated datasets containing abundant entity relationships.

Video Retrieval

A Review and Efficient Implementation of Scene Graph Generation Metrics

🤶 Evaluation Metrics

🐱‍🚀 Miscellaneous

Toolkit

Here, we provide some toolkits for parsing scene graphs, along with other useful tools for reference.


ChocoWu/Awesome-Scene-Graph-for-CrossModal-Learning
