This is a repository for listing papers on scene graph generation and application.

ChocoWu/Awesome-Scene-Graph-for-CrossModal-Learning

Awesome-Scene-Graph-for-Cross-Modal-Learning

🎨 Introduction

A scene graph is a topological structure representing a scene described in text, an image, a video, or another modality. In this graph, the nodes correspond to object bounding boxes with their category labels and attributes, while the edges represent the pairwise relationships between objects.
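As an illustration of the definition above, here is a minimal sketch of a scene graph as a data structure. The class and field names (`SceneObject`, `SceneGraph`, `bbox`, and so on) are hypothetical and chosen for this example only; they do not come from any toolkit listed in this repository.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    """A node: one object with its box, category label, and attributes."""
    label: str
    bbox: tuple  # assumed format: (x_min, y_min, x_max, y_max) in pixels
    attributes: list = field(default_factory=list)


@dataclass
class SceneGraph:
    """Nodes plus directed, labeled edges between pairs of nodes."""
    objects: list = field(default_factory=list)
    # Each relation is (subject_index, predicate, object_index).
    relations: list = field(default_factory=list)

    def add_object(self, label, bbox, attributes=()):
        self.objects.append(SceneObject(label, bbox, list(attributes)))
        return len(self.objects) - 1  # index used to reference the node

    def add_relation(self, subject_idx, predicate, object_idx):
        self.relations.append((subject_idx, predicate, object_idx))

    def triplets(self):
        """Return (subject label, predicate, object label) triplets."""
        return [
            (self.objects[s].label, p, self.objects[o].label)
            for s, p, o in self.relations
        ]


graph = SceneGraph()
man = graph.add_object("man", (10, 20, 110, 300), ["tall"])
horse = graph.add_object("horse", (90, 120, 400, 330), ["brown"])
graph.add_relation(man, "riding", horse)
print(graph.triplets())
```

Storing relations as index triples keeps the edges valid even when two objects share the same category label.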


📕 Table of Contents


🌷 Scene Graph Datasets

| Dataset | Modality | Obj. Class | BBox | Rela. Class | Triplets | Instances |
|---|---|---|---|---|---|---|
| Visual Phrase | Image | 8 | 3,271 | 9 | 1,796 | 2,769 |
| Scene Graph | Image | 266 | 69,009 | 68 | 109,535 | 5,000 |
| VRD | Image | 100 | - | 70 | 37,993 | 5,000 |
| Open Images v7 | Image | 57 | 3,290,070 | 329 | 374,768 | 9,178,275 |
| Visual Genome | Image | 33,877 | 3,843,636 | 40,480 | 2,347,187 | 108,077 |
| GQA | Image | 200 | - | 100 | - | - |
| VrR-VG | Image | 1,600 | 282,460 | 117 | 203,375 | 58,983 |
| UnRel | Image | - | - | 18 | 76 | 1,071 |
| SpatialSense | Image | 3,679 | - | 9 | 13,229 | 11,569 |
| SpatialVOC2K | Image | 20 | 5,775 | 34 | 9,804 | 2,026 |
| OpenSG | Image (panoptic) | 133 | - | 56 | - | 49K |
| AUG | Image (Overhead View) | 76 | - | 61 | - | - |
| STAR | Satellite Imagery | 48 | 219,120 | 58 | 400,795 | 31,096 |
| ReCon1M | Satellite Imagery | 60 | 859,751 | 64 | 1,149,342 | 21,392 |
| SkySenseGPT | Satellite Imagery (Instruction) | - | - | - | - | - |
| ImageNet-VidVRD | Video | 35 | - | 132 | 3,219 | 100 |
| VidOR | Video | 80 | - | 50 | - | 10,000 |
| Action Genome | Video | 35 | 0.4M | 25 | 1.7M | 10,000 |
| AeroEye | Video (Drone-View) | 56 | - | 384 | - | 2.2M |
| PVSG | Video (panoptic) | 126 | - | 57 | - | 400 |
| ASPIRe | Video (Interlacements) | - | - | 4.5K | - | 1.5K |
| 3D Semantic Scene Graphs (3DSSG) | 3D | 40 | - | - | - | 48K |
| PSG4D | 4D | 46 | - | 15 | - | - |
| 4D-OR | 4D (operating room) | 12 | - | 14 | - | - |
| FACTUAL | Image, Text | 4,042 | - | 1,607 | 40,149 | 40,369 |


🍕 Scene Graph Generation

2D (Image) Scene Graph Generation

There are three subtasks:

  • Predicate classification: given ground-truth labels and bounding boxes of object pairs, predict the predicate label.
  • Scene graph classification: jointly classify the predicate labels and the objects' categories, given the ground-truth bounding boxes.
  • Scene graph detection: detect the objects and their categories, and predict the predicate between object pairs.
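The three subtasks differ only in how much ground truth the model receives. A minimal sketch of that difference follows; the function name `task_inputs` and the shorthand task keys `predcls`, `sgcls`, and `sgdet` are assumptions made for this example.

```python
def task_inputs(gt_objects, task):
    """Return the ground-truth information exposed to the model per subtask.

    gt_objects: list of dicts with keys 'bbox' and 'label' (assumed layout).
    task: 'predcls', 'sgcls', or 'sgdet' (shorthand used in this sketch).
    """
    if task == "predcls":
        # Boxes and labels are given; only predicates are predicted.
        return [{"bbox": o["bbox"], "label": o["label"]} for o in gt_objects]
    if task == "sgcls":
        # Boxes are given; object labels and predicates are predicted.
        return [{"bbox": o["bbox"]} for o in gt_objects]
    if task == "sgdet":
        # Nothing is given; boxes, labels, and predicates are all predicted.
        return []
    raise ValueError(f"unknown task: {task}")


gt = [
    {"bbox": (10, 20, 110, 300), "label": "man"},
    {"bbox": (90, 120, 400, 330), "label": "horse"},
]
for name in ("predcls", "sgcls", "sgdet"):
    print(name, task_inputs(gt, name))
```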

LLM-based

Non-LLM-based

Panoptic Scene Graph Generation

Compared with a traditional scene graph, each object in a panoptic scene graph (PSG) is grounded by a panoptic segmentation mask, which gives a comprehensive structured scene representation.

Spatio-Temporal (Video) Scene Graph Generation

Spatio-Temporal (Video) Scene Graph Generation, a.k.a. dynamic scene graph generation, aims to provide a detailed and structured interpretation of the whole scene by parsing an event into a sequence of interactions between different visual entities. It usually involves two subtasks:

  • Scene graph detection: generate scene graphs for given videos, comprising the detected subject-object pairs and the associated predicates. An object is considered correctly localized when the Intersection over Union (IoU) between the predicted and ground-truth boxes is greater than 0.5.
  • Predicate classification: classify predicates given oracle detection results for the subject-object pairs.
  • Note: evaluation is conducted under two settings, With Constraint and No Constraint. In the former, each subject-object pair is allowed at most one edge, i.e., only one predicate; in the latter, a pair can have multiple edges. See Metrics for more details.
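The localization criterion and the two evaluation settings can be sketched as follows. This is a minimal illustration, not an official evaluation script: the box format `(x_min, y_min, x_max, y_max)`, the tuple layout of the predictions, and the function names are assumptions.

```python
def box_iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_localized(pred_box, gt_box, threshold=0.5):
    """A prediction counts as localized when IoU is greater than threshold."""
    return box_iou(pred_box, gt_box) > threshold


def select_edges(scored_triplets, with_constraint=True):
    """Rank predicted edges under one of the two settings.

    scored_triplets: list of (subject_id, predicate, object_id, score).
    With constraint: keep only the best-scoring predicate per pair.
    No constraint: keep every predicate, so a pair may have several edges.
    """
    if with_constraint:
        best = {}
        for triplet in scored_triplets:
            pair = (triplet[0], triplet[2])
            if pair not in best or triplet[3] > best[pair][3]:
                best[pair] = triplet
        scored_triplets = list(best.values())
    return sorted(scored_triplets, key=lambda t: t[3], reverse=True)


preds = [
    (0, "holding", 1, 0.9),
    (0, "looking_at", 1, 0.7),
    (1, "on", 2, 0.6),
]
print(select_edges(preds, with_constraint=True))
print(select_edges(preds, with_constraint=False))
```

Under the With Constraint setting the pair (0, 1) keeps only its highest-scoring predicate, while the No Constraint setting retains both predicates for that pair.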

LLM-based

Non-LLM-based

Audio Scene Graph Generation

3D Scene Graph Generation

Given a 3D point cloud $P \in \mathbb{R}^{N \times 3}$ consisting of $N$ points, and assuming a set of class-agnostic instance masks $M = \{M_1, \dots, M_K\}$ corresponding to $K$ entities in $P$, 3D scene graph generation aims to map the input point cloud to a reliable, semantically structured scene graph $G = \{O, R\}$, where $O$ denotes the object nodes and $R$ the relationships between them. Unlike 2D scene graph generation, the input to 3D SGG is a point cloud rather than an image.
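To make the input format concrete, the sketch below turns a point cloud and its instance masks into candidate graph nodes and ordered node pairs, which is the stage that comes before any label or predicate prediction. Representing each mask as a list of N booleans and summarizing each entity by its centroid are assumptions made for this example; the function names are hypothetical.

```python
def instance_nodes(points, masks):
    """Build one candidate node per instance mask.

    points: list of N (x, y, z) tuples, i.e. the point cloud P.
    masks: list of K masks, each a list of N booleans selecting one entity.
    """
    nodes = []
    for k, mask in enumerate(masks):
        if len(mask) != len(points):
            raise ValueError("each mask must have one entry per point")
        selected = [p for p, keep in zip(points, mask) if keep]
        if not selected:
            continue  # skip empty masks
        centroid = tuple(sum(axis) / len(selected) for axis in zip(*selected))
        nodes.append({"id": k, "num_points": len(selected), "centroid": centroid})
    return nodes


def candidate_pairs(nodes):
    """Ordered (subject, object) pairs for which a predicate is predicted."""
    return [(a["id"], b["id"]) for a in nodes for b in nodes if a["id"] != b["id"]]


points = [(0, 0, 0), (2, 0, 0), (0, 0, 4), (0, 0, 6)]
masks = [
    [True, True, False, False],
    [False, False, True, True],
]
nodes = instance_nodes(points, masks)
print(nodes)
print(candidate_pairs(nodes))
```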

4D Scene Graph Generation

Textual Scene Graph Generation


🥝 Scene Graph Application

Image Retrieval

Image Caption

2D Image Generation

Visual Reasoning

Enhanced VLM/MLLM

Information Extraction

3D Generation

Mitigate Hallucination

Dynamic Environment Guidance

Privacy-sensitive Object Identification

Referring Expression Comprehension

  • Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions
    Uses large foundation models to disentangle both images and texts into triplets in the format of (subject, predicate, object). Grounding is then accomplished by computing a structural similarity matrix between visual and textual triplets with a vision-language alignment (VLA) model and propagating it to an instance-level similarity matrix. To equip VLA models with relationship understanding, the authors also design a triplet-matching objective to fine-tune them on a curated collection of datasets containing abundant entity relationships.

Video Retrieval


🤶 Evaluation Metrics


🐱‍🚀 Miscellaneous

Toolkit

Here, we list some toolkits for parsing scene graphs, along with other useful tools, for reference.

Workshop

Survey

Interesting Works

⭐️ Star History

Star History Chart