CVPR 2023 Papers: Explore a comprehensive collection of cutting-edge research papers presented at CVPR 2023, the premier computer vision conference. Stay up to date with the latest advances in computer vision and deep learning. Code implementations are included. ⭐ the repository to support the development of visual intelligence!
Explore the CVPR 2023 online conference list, a comprehensive collection of accepted papers. Access additional resources such as PDFs, supplementary material, arXiv links, and BibTeX citations for an in-depth exploration of the research presented.
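For reference, BibTeX entries exported from the CVPR proceedings follow the standard `@InProceedings` shape. The entry below is an illustrative placeholder with hypothetical author, title, and page values, not a citation of any specific paper:

```bibtex
@InProceedings{Author_2023_CVPR,
    author    = {Lastname, Firstname},
    title     = {Paper Title Goes Here},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {1-10}
}
```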
Other collections of papers from top AI conferences
❗ The conference table is kept up to date at all times.
Conference | Year |
---|---|
**Computer Vision (CV)** | |
ICCV | 2023 |
**Speech (SP)** | |
ICASSP | 2023 |
INTERSPEECH | 2023 |
Contributions that improve the completeness of this list are greatly appreciated. If you come across any overlooked papers, please feel free to open a pull request or an issue, or contact me via email. Your participation is crucial to making this repository even better.
List of sections
- 3D from Multi-View and Sensors
- Image and Video Synthesis and Generation
- Humans: Face, Body, Pose, Gesture, Movement
- Transfer, Meta, Low-Shot, Continual, or Long-Tail Learning
- Recognition: Categorization, Detection, Retrieval
- Vision, Language, and Reasoning
- Low-Level Vision
- Segmentation, Grouping and Shape Analysis
- Deep Learning Architectures and Techniques
- Multi-Modal Learning
- 3D from Single Images
- Medical and Biological Vision, Cell Microscopy
- Video: Action and Event Understanding
- Autonomous Driving
- Self-Supervised or Unsupervised Representation Learning
- Datasets and Evaluation
- Scene Analysis and Understanding
- Adversarial Attack and Defense
- Efficient and Scalable Vision
- Computational Imaging
- Video: Low-Level Analysis, Motion, and Tracking
- Vision Applications and Systems
- Vision and Graphics
- Robotics
- Transparency, Fairness, Accountability, Privacy, Ethics in Vision
- Explainable Computer Vision
- Embodied Vision: Active Agents, Simulation
- Document Analysis and Understanding
- Machine Learning (other than Deep Learning)
- Physics-based Vision and Shape-from-X
- Biometrics
- Optimization Methods (other than Deep Learning)
- Photogrammetry and Remote Sensing
- Computer Vision Theory
- Computer Vision for Social Good
- Others
Image and Video Synthesis and Generation

Title | Repo | Paper | Video |
---|---|---|---|
Towards Universal Fake Image Detectors that Generalize Across Generative Models | |||
Implicit Diffusion Models for Continuous Super-Resolution | |||
High-Fidelity Guided Image Synthesis with Latent Diffusion Models | |||
DBARF: Deep Bundle-Adjusting Generalizable Neural Radiance Fields | |||
Deep Arbitrary-Scale Image Super-Resolution via Scale-Equivariance Pursuit | |||
Balanced Spherical Grid for Egocentric View Synthesis | |||
SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation | |||
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation | |||
Self-guided Diffusion Models | |||
Multi-Concept Customization of Text-to-Image Diffusion | |||
3D-Aware Conditional Image Synthesis | |||
QuantArt: Quantizing Image Style Transfer Towards High Visual Fidelity | |||
SceneComposer: Any-Level Semantic Image Synthesis | |||
DiffCollage: Parallel Generation of Large Content with Diffusion Models | |||
Putting People in Their Place: Affordance-Aware Human Insertion into Scenes | |||
Hybrid Neural Rendering for Large-Scale Scenes with Motion Blur | |||
Binary Latent Diffusion | |||
StyleRes: Transforming the Residuals for Real Image Editing with StyleGAN | |||
KD-DLGAN: Data Limited Image Generation via Knowledge Distillation | |||
SeaThru-NeRF: Neural Radiance Fields in Scattering Media | |||
PointAvatar: Deformable Point-based Head Avatars from Videos | |||
3DAvatarGAN: Bridging Domains for Personalized Editable Avatars | |||
Neural Preset for Color Style Transfer | |||
Zero-Shot Generative Model Adaptation via Image-Specific Prompt Learning | |||
DyNCA: Real-Time Dynamic Texture Synthesis using Neural Cellular Automata | |||
Exploring Incompatible Knowledge Transfer in Few-Shot Image Generation | |||
HouseDiffusion: Vector Floorplan Generation via a Diffusion Model with Discrete and Continuous Denoising | |||
Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization | |||
RiDDLE: Reversible and Diversified De-Identification with Latent Encryptor | |||
LayoutDiffusion: Controllable Diffusion Model for Layout-to-Image Generation | |||
LipFormer: High-Fidelity and Generalizable Talking Face Generation with A Pre-learned Facial Codebook | |||
Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation | |||
GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis | |||
High-Fidelity Generalized Emotional Talking Face Generation with Multi-Modal Emotion Space Learning | |||
Consistent View Synthesis with Pose-guided Diffusion Models | |||
StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-based Generator | |||
Imagic: Text-based Real Image Editing with Diffusion Models | |||
Large-Capacity and Flexible Video Steganography via Invertible Neural Network | |||
Quantitative Manipulation of Custom Attributes on 3D-Aware Image Synthesis | |||
Learning Detailed Radiance Manifolds for High-Fidelity and 3D-Consistent Portrait Synthesis from Monocular Image | |||
CF-Font: Content Fusion for Few-Shot Font Generation | |||
One-Shot High-Fidelity Talking-Head Synthesis with Deformable Neural Radiance Field | |||
Unsupervised Domain Adaption with Pixel-Level Discriminator for Image-Aware Layout Generation | |||
Diffusion Probabilistic Model Made Slim | |||
Collaborative Diffusion for Multi-Modal Face Generation and Editing | |||
High-Fidelity Facial Avatar Reconstruction from Monocular Video with Generative Priors | |||
Network-Free, Unsupervised Semantic Segmentation with Synthetic Images | |||
Visual Prompt Tuning for Generative Transfer Learning | |||
Specialist Diffusion: Plug-and-Play Sample-Efficient Fine-Tuning of Text-to-Image Diffusion Models to Learn Any Unseen Style | |||
Catch Missing Details: Image Reconstruction with Frequency Augmented Variational Autoencoder | |||
Towards Bridging the Performance Gaps of Joint Energy-based Models | |||
GLeaD: Improving GANs with a Generator-Leading Task | |||
Structural Multiplane Image: Bridging Neural View Synthesis and 3D Reconstruction | |||
SPARF: Neural Radiance Fields from Sparse and Noisy Poses | |||
DeltaEdit: Exploring Text-Free Training for Text-Driven Image Manipulation | |||
Inferring and Leveraging Parts from Object Shape for Improving Semantic Image Synthesis | |||
VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation | |||
MaskSketch: Unpaired Structure-guided Masked Image Generation | |||
Affordance Diffusion: Synthesizing Hand-Object Interactions | |||
Interactive Cartoonization with Controllable Perceptual Factors | |||
MetaPortrait: Identity-Preserving Talking Head Generation with Fast Personalized Adaptation | |||
Paint by Example: Exemplar-based Image Editing with Diffusion Models | |||
GLIGEN: Open-Set Grounded Text-to-Image Generation | |||
L-CoIns: Language-based Colorization with Instance Awareness | |||
DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation | |||
Evading DeepFake Detectors via Adversarial Statistical Consistency | |||
GlassesGAN: Eyewear Personalization using Synthetic Appearance Discovery and Targeted Subspace Modeling | |||
GP-VTON: Towards General Purpose Virtual Try-on via Collaborative Local-Flow Global-Parsing Learning | |||
Where is My Spot? Few-Shot Image Generation via Latent Subspace Optimization | |||
Regularized Vector Quantization for Tokenized Image Synthesis | |||
EDICT: Exact Diffusion Inversion via Coupled Transformations | |||
Scaling up GANs for Text-to-Image Synthesis | |||
Shape-Aware Text-Driven Layered Video Editing | |||
A Unified Pyramid Recurrent Network for Video Frame Interpolation | |||
TAPS3D: Text-guided 3D Textured Shape Generation from Pseudo Supervision | |||
Fine-grained Face Swapping via Regional GAN Inversion | |||
OTAvatar: One-Shot Talking Face Avatar with Controllable Tri-Plane Rendering | |||
Deep Stereo Video Inpainting | |||
StyleGAN Salon: Multi-View Latent Optimization for Pose-Invariant Hairstyle Transfer | |||
Cross-GAN Auditing: Unsupervised Identification of Attribute Level Similarities and Differences between Pretrained Generative Models | |||
Unsupervised Volumetric Animation | |||
SINE: SINgle Image Editing with Text-to-Image Diffusion Models | |||
Progressive Disentangled Representation Learning for Fine-grained Controllable Talking Head Synthesis | |||
CAP-VSTNet: Content Affinity Preserved Versatile Style Transfer | |||
DeepVecFont-v2: Exploiting Transformers to Synthesize Vector Fonts with Higher Quality | |||
LEMaRT: Label-Efficient Masked Region Transform for Image Harmonization | |||
SINE: Semantic-Driven Image-based NeRF Editing with Prior-guided Editing Field | |||
Exploring Intra-Class Variation Factors with Learnable Cluster Prompts for Semi-Supervised Image Synthesis | |||
Image Cropping with Spatial-Aware Feature and Rank Consistency | |||
Picture that Sketch: Photorealistic Image Generation from Abstract Sketches | |||
MonoHuman: Animatable Human Neural Field from Monocular Video | |||
PixHt-Lab: Pixel Height based Light Effect Generation for Image Compositing | |||
Neural Pixel Composition for 3D-4D View Synthesis from Multi-Views | |||
SpaText: Spatio-Textual Representation for Controllable Image Generation | |||
Exploring Motion Ambiguity and Alignment for High-Quality Video Frame Interpolation | |||
MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation | |||
Synthesizing Photorealistic Virtual Humans Through Cross-Modal Disentanglement | |||
Video Probabilistic Diffusion Models in Projected Latent Space | |||
Variational Distribution Learning for Unsupervised Text-to-Image Generation | |||
Linking Garment with Person via Semantically Associated Landmarks for Virtual Try-On | |||
UV Volumes for Real-Time Rendering of Editable Free-View Human Performance | |||
Null-Text Inversion for Editing Real Images using Guided Diffusion Models | |||
Polynomial Implicit Neural Representations for Large Diverse Datasets | |||
Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation | |||
Conditional Image-to-Video Generation with Latent Flow Diffusion Models | |||
Local 3D Editing via 3D Distillation of CLIP Knowledge | |||
Private Image Generation with Dual-Purpose Auxiliary Classifier | |||
MAGVIT: Masked Generative Video Transformer | |||
Dimensionality-Varying Diffusion Process | |||
VIVE3D: Viewpoint-Independent Video Editing using 3D-Aware GANs | |||
LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data | |||
DATID-3D: Diversity-Preserved Domain Adaptation using Text-to-Image Diffusion for 3D Generative Model | |||
Delving StyleGAN Inversion for Image Editing: A Foundation Latent Space Viewpoint | |||
High-Fidelity and Freely Controllable Talking Head Video Generation | |||
SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation | |||
StyleRF: Zero-Shot 3D Style Transfer of Neural Radiance Fields | |||
MOSO: Decomposing MOtion, Scene and Object for Video Prediction | |||
Multi Domain Learning for Motion Magnification | |||
GazeNeRF: 3D-Aware Gaze Redirection with Neural Radiance Fields | |||
Hierarchical B-frame Video Coding using Two-Layer CANF without Motion Coding | |||
Blemish-Aware and Progressive Face Retouching with Limited Paired Data | |||
Text-guided Unsupervised Latent Transformation for Multi-attribute Image Manipulation | |||
NeuralField-LDM: Scene Generation with Hierarchical Latent Diffusion Models | |||
Fix the Noise: Disentangling Source Feature for Controllable Domain Translation | |||
Class-Balancing Diffusion Models | |||
DPE: Disentanglement of Pose and Expression for General Video Portrait Editing | |||
Inversion-based Style Transfer with Diffusion Models | |||
Deep Curvilinear Editing: Commutative and Nonlinear Image Manipulation for Pretrained Deep Generative Model | |||
FlowGrad: Controlling the Output of Generative ODEs with Gradients | |||
Graph Transformer GANs for Graph-Constrained House Generation | |||
Master: Meta Style Transformer for Controllable Zero-Shot and Few-Shot Artistic Style Transfer | |||
Next3D: Generative Neural Texture Rasterization for 3D-Aware Head Avatars | |||
Ham2Pose: Animating Sign Language Notation into Pose Sequences | |||
Neural Transformation Fields for Arbitrary-Styled Font Generation | |||
LayoutDM: Transformer-based Diffusion Model for Layout Generation | |||
Removing Objects from Neural Radiance Fields | |||
Person Image Synthesis via Denoising Diffusion Model | |||
AdaptiveMix: Improving GAN Training via Feature Space Shrinkage | |||
Learning Joint Latent Space EBM Prior Model for Multi-Layer Generator | |||
3D Neural Field Generation using Triplane Diffusion | |||
OmniAvatar: Geometry-guided Controllable 3D Head Synthesis | |||
RWSC-Fusion: Region-Wise Style-Controlled Fusion Network for the Prohibited X-ray Security Image Synthesis | |||
ObjectStitch: Object Compositing with Diffusion Model | |||
Persistent Nature: A Generative Model of Unbounded 3D Worlds | |||
Masked and Adaptive Transformer for Exemplar based Image Translation | |||
Spider GAN: Leveraging Friendly Neighbors to Accelerate GAN Training | |||
Re-IQA: Unsupervised Learning for Image Quality Assessment in the Wild | |||
Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models | |||
All are Worth Words: A ViT Backbone for Diffusion Models | |||
Few-Shot Semantic Image Synthesis with Class Affinity Transfer | |||
Blowing in the Wind: CycleNet for Human Cinemagraphs from Still Images | |||
StyleGene: Crossover and Mutation of Region-Level Facial Genes for Kinship Face Synthesis | |||
MixNeRF: Modeling a Ray with Mixture Density for Novel View Synthesis from Sparse Inputs | |||
MoStGAN-V: Video Generation with Temporal Motion Styles | |||
Frame Interpolation Transformer and Uncertainty Guidance | |||
Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers | |||
HOLODIFFUSION: Training a 3D Diffusion Model using 2D Images | |||
Neural Texture Synthesis with Guided Correspondence | |||
PanoHead: Geometry-Aware 3D Full-Head Synthesis in 360° | |||
InstructPix2Pix: Learning to Follow Image Editing Instructions | |||
Unpaired Image-to-Image Translation with Shortest Path Regularization | |||
Freestyle Layout-to-Image Synthesis | |||
On Distillation of Guided Diffusion Models | |||
Single Image Backdoor Inversion via Robust Smoothed Classifiers | |||
Make-a-Story: Visual Memory Conditioned Consistent Story Generation | |||
Towards Practical Plug-and-Play Diffusion Models | |||
Efficient Scale-Invariant Generator with Column-Row Entangled Pixel Synthesis | |||
Wavelet Diffusion Models are Fast and Scalable Image Generators | |||
3D GAN Inversion with Facial Symmetry Prior | |||
Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert | |||
PCT-Net: Full Resolution Image Harmonization Using Pixel-Wise Color Transformations | |||
ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts | |||
Video Compression with Entropy-Constrained Neural Representations | |||
Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models | |||
CoralStyleCLIP: Co-optimized Region and Layer Selection for Image Editing | |||
Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video Encoding | |||
Sequential Training of GANs Against GAN-classifiers Reveals Correlated Knowledge Gaps Present among Independently Trained GAN Instances | |||
Attribute-Preserving Face Dataset Anonymization via Latent Code Optimization | |||
Shifted Diffusion for Text-to-Image Generation | |||
HandsOff: Labeled Dataset Generation with no Additional Human Annotations | |||
Lookahead Diffusion Probabilistic Models for Refining Mean Estimation | |||
Imagen Editor and EditBench: Advancing and Evaluating Text-guided Image Inpainting | |||
Re-GAN: Data-Efficient GANs Training via Architectural Reconfiguration | |||
BBDM: Image-to-Image Translation with Brownian Bridge Diffusion Models | |||
VectorFusion: Text-to-SVG by Abstracting Pixel-based Diffusion Models | |||
Humans: Face, Body, Pose, Gesture, Movement

Title | Repo | Paper | Video |
---|---|---|---|
Micron-BERT: BERT-based Facial Micro-Expression Recognition | |||
NIKI: Neural Inverse Kinematics with Invertible Neural Networks for 3D Human Pose and Shape Estimation | |||
A Characteristic Function-based Method for Bottom-Up Human Pose Estimation | |||
Executing your Commands via Motion Diffusion in Latent Space | |||
MSINet: Twins Contrastive Search of Multi-Scale Interaction for Object ReID | |||
Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation | |||
Global-to-Local Modeling for Video-based 3D Human Pose and Shape Estimation | |||
Dynamic Aggregated Network for Gait Recognition | |||
Object Pop-Up: Can We Infer 3D Objects and Their Poses from Human Interactions Alone? | |||
Unsupervised Sampling Promoting for Stochastic Human Trajectory Prediction | |||
ECON: Explicit Clothed humans Optimized via Normal integration | |||
Neuron Structure Modeling for Generalizable Remote Physiological Measurement | |||
Continuous Sign Language Recognition with Correlation Network | |||
Parametric Implicit Face Representation for Audio-Driven Facial Reenactment | |||
CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model | |||
PoseExaminer: Automated Testing of Out-of-Distribution Robustness in Human Pose and Shape Estimation | |||
3D Human Mesh Estimation from Virtual Markers | |||
3D Human Pose Estimation via Intuitive Physics | |||
ARCTIC: A Dataset for Dexterous Bimanual Hand-Object Manipulation | |||
Generating Holistic 3D Human Motion from Speech | |||
HARP: Personalized Hand Reconstruction from a Monocular RGB Video | |||
Learning Locally Editable Virtual Humans | |||
Reconstructing Signing Avatars from Video using Linguistic Priors | |||
DrapeNet: Garment Generation and Self-Supervised Draping | |||
X-Avatar: Expressive Human Avatars | |||
Hi4D: 4D Instance Segmentation of Close Human Interaction | |||
Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-Supervised Scene Decomposition | |||
CloSET: Modeling Clothed Humans on Continuous Surface with Explicit Template Decomposition | |||
Graphics Capsule: Learning Hierarchical 3D Face Representations from 2D Images | |||
Rethinking the Learning Paradigm for Dynamic Facial Expression Recognition | |||
HandNeRF: Neural Radiance Fields for Animatable Interacting Hands | |||
Relightable Neural Human Assets from Multi-View Gradient Illuminations | |||
Being Comes from Not-being: Open-Vocabulary Text-to-Motion Generation with Wordless Training | |||
DeFeeNet: Consecutive 3D Human Motion Prediction with Deviation Feedback | |||
BioNet: A Biologically-Inspired Network for Face Recognition | |||
Boosting Detection in Crowd Analysis via Underutilized Output Features | |||
Learning Analytical Posterior Probability for Human Mesh Recovery | |||
Listening Human Behavior: 3D Human Pose Estimation with Acoustic Signals | |||
Detecting and Grounding Multi-Modal Media Manipulation | |||
RelightableHands: Efficient Neural Relighting of Articulated Hand Models | |||
MEGANE: Morphable Eyeglass and Avatar Network | |||
SunStage: Portrait Reconstruction and Relighting using the Sun as a Light Stage | |||
TryOnDiffusion: A Tale of Two UNets | |||
Semi-Supervised Hand Appearance Recovery via Structure Disentanglement and Dual Adversarial Discrimination | |||
POTTER: Pooling Attention Transformer for Efficient Human Mesh Recovery | |||
Scene-Aware Egocentric 3D Human Pose Estimation | |||
PSVT: End-to-End Multi-Person 3D Pose and Shape Estimation with Progressive Video Transformers | |||
Trajectory-Aware Body Interaction Transformer for Multi-Person Pose Forecasting | |||
A2J-Transformer: Anchor-to-Joint Transformer Network for 3D Interacting Hand Pose Estimation from a Single RGB Image | |||
TRACE: 5D Temporal Regression of Avatars with Dynamic Cameras in 3D Environments | |||
Skinned Motion Retargeting with Residual Perception of Motion Semantics & Geometry | |||
Generating Human Motion from Textual Descriptions with Discrete Representations | |||
Learning Human Mesh Recovery in 3D Scenes | |||
AVFace: Towards Detailed Audio-Visual 4D Face Reconstruction | |||
3D-Aware Face Swapping | |||
Neural Residual Radiance Fields for Streamably Free-Viewpoint Videos | |||
GFPose: Learning 3D Human Pose Prior with Gradient Fields | |||
Rethinking Feature-based Knowledge Distillation for Face Recognition | |||
One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer | |||
Towards Stable Human Pose Estimation via Cross-View Fusion and Foot Stabilization | |||
Ego-Body Pose Estimation via Ego-Head Pose Estimation | |||
TOPLight: Lightweight Neural Networks with Task-Oriented Pretraining for Visible-Infrared Recognition | |||
StyleIPSB: Identity-Preserving Semantic Basis of StyleGAN for High Fidelity Face Swapping | |||
Improving Fairness in Facial Albedo Estimation via Visual-Textual Cues | |||
FLEX: Full-Body Grasping without Full-Body Grasps | |||
EDGE: Editable Dance Generation From Music | |||
Complete 3D Human Reconstruction from a Single Incomplete Image | |||
Zero-Shot Pose Transfer for Unrigged Stylized 3D Characters | |||
Hand Avatar: Free-Pose Hand Animation and Rendering from Monocular Video | |||
Human-Art: A Versatile Human-Centric Dataset Bridging Natural and Artificial Scenes | |||
Learning Neural Proto-Face Field for Disentangled 3D Face Modeling in the Wild | |||
CLAMP: Prompt-based Contrastive Learning for Connecting Language and Animal Pose | |||
Invertible Neural Skinning | |||
DiffusionRig: Learning Personalized Priors for Facial Appearance Editing | |||
Harmonious Feature Learning for Interactive Hand-Object Pose Estimation | |||
Leapfrog Diffusion Model for Stochastic Trajectory Prediction | |||
NeuFace: Realistic 3D Neural Face Rendering from Multi-View Images | |||
DiffSwap: High-Fidelity and Controllable Face Swapping via 3D-Aware Masked Diffusion | |||
GFIE: A Dataset and Baseline for Gaze-Following from 2D to 3D in Indoor Environments | |||
Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos | |||
Decompose more and Aggregate Better: Two Closer Looks at Frequency Representation Learning for Human Motion Prediction | |||
Human Pose as Compositional Tokens | |||
Normal-guided Garment UV Prediction for Human Re-Texturing | |||
Dynamic Graph Learning with Content-guided Spatial-Frequency Relation Reasoning for Deepfake Detection | |||
VGFlow: Visibility Guided Flow Network for Human Reposing | |||
Mutual Information-based Temporal Difference Learning for Human Pose Estimation in Video | |||
PREIM3D: 3D Consistent Precise Image Attribute Editing from a Single Image | |||
HuManiFlow: Ancestor-Conditioned Normalising Flows on SO(3) Manifolds for Human Pose and Shape Distribution Estimation | |||
Implicit Identity Driven Deepfake Face Swapping Detection | |||
Trace and Pace: Controllable Pedestrian Animation via Guided Trajectory Diffusion | |||
3D-Aware Facial Landmark Detection via Multi-View Consistent Training on Synthetic Data | |||
SLOPER4D: A Scene-Aware Dataset for Global 4D Human Pose Estimation in Urban Environments | |||
Zero-Shot Text-to-Parameter Translation for Game Character Auto-Creation | |||
AssemblyHands: Towards Egocentric Activity Understanding via 3D Hand Pose Estimation | |||
UDE: A Unified Driving Engine for Human Motion Generation | |||
CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior | |||
Semi-Supervised 2D Human Pose Estimation Driven by Position Inconsistency Pseudo Label Correction Module | |||
Learning Personalized High Quality Volumetric Head Avatars from Monocular RGB Videos | |||
HOOD: Hierarchical Graphs for Generalized Modelling of Clothing Dynamics | |||
ACR: Attention Collaboration-based Regressor for Arbitrary Two-Hand Reconstruction | |||
HumanBench: Towards General Human-Centric Perception with Projector Assisted Pretraining | |||
CIMI4D: A Large Multimodal Climbing Motion Dataset under Human-Scene Interactions | |||
Human Pose Estimation in Extremely Low-Light Conditions | |||
DistilPose: Tokenized Pose Regression with Heatmap Distillation | |||
Human Body Shape Completion with Implicit Shape and Flow Learning | |||
Source-Free Adaptive Gaze Estimation by Uncertainty Reduction | |||
Music-Driven Group Choreography | |||
Robust Model-based Face Reconstruction through Weakly-Supervised Outlier Segmentation | |||
MARLIN: Masked Autoencoder for Facial Video Representation LearnINg | |||
Transformer-based Unified Recognition of Two Hands Manipulating Objects | |||
Implicit Identity Leakage: The Stumbling Block to Improving Deepfake Detection Generalization | |||
ScarceNet: Animal Pose Estimation with Scarce Annotations | |||
FFHQ-UV: Normalized Facial UV-Texture Dataset for 3D Face Reconstruction | |||
MoDi: Unconditional Motion Synthesis from Diverse Data | |||
Feature Representation Learning with Adaptive Displacement Generation and Transformer Fusion for Micro-Expression Recognition | |||
MeMaHand: Exploiting Mesh-Mano Interaction for Single Image Two-Hand Reconstruction | |||
Stimulus Verification is a Universal and Effective Sampler in Multi-Modal Human Trajectory Prediction | |||
TokenHPE: Learning Orientation Tokens for Efficient Head Pose Estimation via Transformers | |||
Handy: Towards a High Fidelity 3D Hand Shape and Appearance Model | |||
CIRCLE: Capture in Rich Contextual Environments | |||
Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention | |||
Implicit Neural Head Synthesis via Controllable Local Deformation Fields | |||
Continuous Intermediate Token Learning with Implicit Motion Manifold for Keyframe based Motion Interpolation | |||
JRDB-Pose: A Large-Scale Dataset for Multi-Person Pose Estimation and Tracking | |||
STAR Loss: Reducing Semantic Ambiguity in Facial Landmark Detection | |||
GM-NeRF: Learning Generalizable Model-based Neural Radiance Fields from Multi-View Images | |||
Decoupled Multimodal Distilling for Emotion Recognition | |||
HaLP: Hallucinating Latent Positives for Skeleton-based Self-Supervised Learning of Actions | |||
ReDirTrans: Latent-to-Latent Translation for Gaze and Head Redirection | |||
QPGesture: Quantization-based and Phase-guided Motion Matching for Natural Speech-Driven Gesture Generation | |||
Multi-Modal Gait Recognition via Effective Spatial-Temporal Feature Fusion | |||
Probabilistic Knowledge Distillation of Face Ensembles | |||
Learning Semantic-Aware Disentangled Representation for Flexible 3D Human Body Editing | |||
Parameter Efficient Local Implicit Image Function Network for Face Segmentation | |||
HumanGen: Generating Human Radiance Fields with Explicit Priors | |||
Biomechanics-guided Facial Action Unit Detection through Force Modeling | |||
Decoupling Human and Camera Motion from Videos in the Wild | |||
Overcoming the Trade-Off Between Accuracy and Plausibility in 3D Hand Shape Reconstruction | |||
Instant-NVR: Instant Neural Volumetric Rendering for Human-Object Interactions from Monocular RGBD Stream | |||
PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation | |||
Analyzing and Diagnosing Pose Estimation with Attributions | |||
Unsupervised Visible-Infrared Person Re-Identification via Progressive Graph Matching and Alternate Learning | |||
Shape-Erased Feature Learning for Visible-Infrared Person Re-Identification | |||
Distilling Cross-Temporal Contexts for Continuous Sign Language Recognition | |||
Avatars Grow Legs: Generating Smooth Human Motion from Sparse Tracking Inputs with Diffusion Model | |||
Local Connectivity-based Density Estimation for Face Clustering | |||
SelfME: Self-Supervised Motion Learning for Micro-Expression Recognition | |||
Detecting Human-Object Contact in Images | |||
Controllable Light Diffusion for Portraits | |||
InstantAvatar: Learning Avatars from Monocular Video in 60 Seconds | |||
NeMo: 3D Neural Motion Fields from Multiple Video Instances of the Same Action | |||
Privacy-Preserving Adversarial Facial Features | |||
Self-Correctable and Adaptable Inference for Generalizable Human Pose Estimation | |||
DSFNet: Dual Space Fusion Network for Occlusion-Robust 3D Dense Face Alignment | |||
Clothed Human Performance Capture with a Double-Layer Neural Radiance Fields | |||
Continuous Landmark Detection with 3D Queries | |||
Learning a 3D Morphable Face Reflectance Model from Low-Cost Data | |||
AUNet: Learning Relations between Action Units for Face Forgery Detection | |||
3D Human Pose Estimation with Spatio-Temporal Criss-Cross Attention | |||
Implicit 3D Human Mesh Recovery using Consistency with Pose and Shape from Unseen-View | |||
3D Human Keypoints Estimation from Point Clouds in the Wild without Human Labels | |||
Multi-Label Compound Expression Recognition: C-EXPR Database & Network | |||
FlexNeRF: Photorealistic Free-Viewpoint Rendering of Moving Humans from Sparse Views | |||
Two-Stage Co-Segmentation Network based on Discriminative Representation for Recovering Human Mesh from Videos | |||
Co-Speech Gesture Synthesis by Reinforcement Learning with Contrastive Pre-trained Rewards | |||
FeatER: An Efficient Network for Human Reconstruction via Feature Map-based TransformER | |||
Transfer, Meta, Low-Shot, Continual, or Long-Tail Learning

Title | Repo | Paper | Video |
---|---|---|---|
Dynamically Instance-guided Adaptation: A Backward-free Approach for Test-Time Domain Adaptive Semantic Segmentation | |||
DETR with Additional Global Aggregation for Cross-Domain Weakly Supervised Object Detection | |||
Mind the Label Shift of Augmentation-based Graph OOD Generalization | |||
Long-Tailed Visual Recognition via Self-Heterogeneous Integration with Knowledge Excavation | |||
Understanding and Improving Visual Prompting: A Label-Mapping Perspective | |||
A Whac-A-Mole Dilemma: Shortcuts Come in Multiples where Mitigating One Amplifies Others | |||
Improved Distribution Matching for Dataset Condensation | |||
Divide and Adapt: Active Domain Adaptation via Customized Learning | |||
Class Relationship Embedded Learning for Source-Free Unsupervised Domain Adaptation | |||
Diversity-Aware Meta Visual Prompting | |||
Uncertainty-Aware Optimal Transport for Semantically Coherent Out-of-Distribution Detection | |||
Zero-Shot Object Counting | |||
Learning with Fantasy: Semantic-Aware Virtual Contrastive Constraint for Few-Shot Class-Incremental Learning | |||
Distribution Shift Inversion for Out-of-Distribution Prediction | |||
Endpoints Weight Fusion for Class Incremental Semantic Segmentation | |||
Promoting Semantic Connectivity: Dual Nearest Neighbors Contrastive Learning for Unsupervised Domain Generalization | |||
Class-Conditional Sharpness-Aware Minimization for Deep Long-tailed Recognition | |||
Meta-Causal Learning for Single Domain Generalization | |||
VoP: Text-Video Co-Operative Prompt Tuning for Cross-Modal Retrieval | |||
Learning Imbalanced Data with Vision Transformers | |||
Sharpness-Aware Gradient Matching for Domain Generalization | |||
Geometry and Uncertainty-Aware 3D Point Cloud Class-Incremental Semantic Segmentation | |||
Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation | |||
Regularizing Second-Order Influences for Continual Learning | |||
I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification | |||
FREDOM: Fairness Domain Adaptation Approach to Semantic Scene Understanding | |||
Dense Network Expansion for Class Incremental Learning | |||
Batch Model Consolidation: A Multi-Task Model Consolidation Framework | |||
DiGeo: Discriminative Geometry-Aware Learning for Generalized Few-Shot Object Detection | |||
Supervised Masked Knowledge Distillation for Few-Shot Transformers | |||
ALOFT: A Lightweight MLP-Like Architecture with Dynamic Low-Frequency Transform for Domain Generalization | |||
ZegCLIP: Towards Adapting CLIP for Zero-Shot Semantic Segmentation | |||
DiGA: Distil to Generalize and then Adapt for Domain Adaptive Semantic Segmentation | |||
Adjustment and Alignment for Unbiased Open Set Domain Adaptation | |||
Adapting Shortcut with Normalizing Flow: An Efficient Tuning Framework for Visual Recognition | |||
CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning | |||
ConStruct-VL: Data-Free Continual Structured VL Concepts Learning | |||
Generalizing Dataset Distillation via Deep Generative Prior | |||
Few-Shot Learning with Visual Distribution Calibration and Cross-Modal Distribution Alignment | |||
Multi-Centroid Task Descriptor for Dynamic Class Incremental Inference | |||
DAA: A Delta Age AdaIN Operation for Age Estimation via Binary Code Transformer | |||
Bilateral Memory Consolidation for Continual Learning | |||
Texts as Images in Prompt Tuning for Multi-Label Image Recognition | |||
Learning Transformations To Reduce the Geometric Shift in Object Detection | |||
CLIP the Gap: A Single Domain Generalization Approach for Object Detection | |||
Transfer Knowledge from Head to Tail: Uncertainty Calibration under Long-tailed Distribution | |||
Bi-Directional Distribution Alignment for Transductive Zero-Shot Learning | |||
DARE-GRAM: Unsupervised Domain Adaptation Regression by Aligning Inverse Gram Matrices | |||
LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models | |||
Open-Set Likelihood Maximization for Few-Shot Learning | |||
WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation | |||
Federated Domain Generalization with Generalization Adjustment | |||
ProtoCon: Pseudo-Label Refinement via Online Clustering and Prototypical Consistency for Efficient Semi-Supervised Learning | |||
DA-DETR: Domain Adaptive Detection Transformer with Information Fusion | |||
Harmonious Teacher for Cross-Domain Object Detection | |||
AutoLabel: CLIP-based Framework for Open-Set Video Domain Adaptation | |||
Task Difficulty Aware Parameter Allocation & Regularization for Lifelong Learning | |||
Revisiting Prototypical Network for Cross Domain Few-Shot Learning | |||
Federated Incremental Semantic Segmentation | |||
Semantic Prompt for Few-Shot Image Recognition | |||
Rethinking Gradient Projection Continual Learning: Stability/Plasticity Feature Space Decoupling | |||
No One Left Behind: Improving the Worst Categories in Long-Tailed Learning | |||
Meta Omnium: A Benchmark for General-Purpose Learning-to-Learn | |||
Transductive Few-Shot Learning with Prototype-based Label Propagation by Iterative Graph Refinement | |||
COT: Unsupervised Domain Adaptation with Clustering and Optimal Transport | |||
Semi-Supervised Domain Adaptation with Source Label Adaptation | |||
MetaMix: Towards Corruption-Robust Continual Learning with Temporally Self-Adaptive Data Transformation | |||
Visual-Language Prompt Tuning with Knowledge-guided Context Optimization | |||
Modeling Inter-Class and Intra-Class Constraints in Novel Class Discovery | |||
Real-Time Evaluation in Online Continual Learning: A New Hope | |||
Partial Network Cloning | |||
Rebalancing Batch Normalization for Exemplar-based Class-Incremental Learning | |||
EcoTTA: Memory-Efficient Continual Test-Time Adaptation via Self-Distilled Regularization | |||
Feature Alignment and Uniformity for Test Time Adaptation | |||
Bootstrap Your Own Prior: Towards Distribution-Agnostic Novel Class Discovery | |||
Towards Realistic Long-Tailed Semi-Supervised Learning: Consistency Is All You Need | |||
Balanced Product of Calibrated Experts for Long-Tailed Recognition | |||
Unsupervised Continual Semantic Adaptation through Neural Rendering | |||
Computationally Budgeted Continual Learning: What Does Matter? | |||
AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning | |||
Ground-Truth Free Meta-Learning for Deep Compressive Sampling | |||
Multi-Level Logit Distillation | |||
StyleAdv: Meta Style Adversarial Training for Cross-Domain Few-Shot Learning | |||
MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation | |||
On the Stability-Plasticity Dilemma of Class-Incremental Learning | |||
TeSLA: Test-Time Self-Learning With Automatic Adversarial Augmentation | |||
MHPL: Minimum Happy Points Learning for Active Source Free Domain Adaptation | |||
CIGAR: Cross-Modality Graph Reasoning for Domain Adaptive Object Detection | |||
Adaptive Plasticity Improvement for Continual Learning | |||
Achieving a Better Stability-Plasticity Trade-Off via Auxiliary Networks in Continual Learning | |||
Few-Shot Geometry-Aware Keypoint Localization | |||
Spatio-Temporal Pixel-Level Contrastive Learning-based Source-Free Domain Adaptation for Video Semantic Segmentation | |||
Both Style and Distortion Matter: Dual-Path Unsupervised Domain Adaptation for Panoramic Semantic Segmentation | |||
Bi-Level Meta-Learning for Few-Shot Domain Generalization | |||
Few-Shot Referring Relationships in Videos | |||
Exploring Data Geometry for Continual Learning | |||
Masked Images Are Counterfactual Samples for Robust Fine-Tuning | |||
DKT: Diverse Knowledge Transfer Transformer for Class Incremental Learning | |||
CoMFormer: Continual Learning in Semantic and Panoptic Segmentation | |||
Global and Local Mixture Consistency Cumulative Learning for Long-tailed Visual Recognitions | |||
Class Attention Transfer based Knowledge Distillation | |||
Hard Sample Matters a Lot in Zero-Shot Quantization | |||
Back to the Source: Diffusion-Driven Adaptation to Test-Time Corruption | |||
SuperDisco: Super-Class Discovery Improves Visual Recognition for the Long-Tail | |||
Architecture, Dataset and Model-Scale Agnostic Data-Free Meta-Learning | |||
Preserving Linear Separability in Continual Learning by Backward Feature Projection | |||
Upcycling Models under Domain and Category Shift | |||
Class-Incremental Exemplar Compression for Class-Incremental Learning | |||
Learning Conditional Attributes for Compositional Zero-Shot Learning | |||
BlackVIP: Black-Box Visual Prompting for Robust Transfer Learning | |||
NoisyTwins: Class-Consistent and Diverse Image Generation through StyleGANs | |||
Semi-Supervised Learning Made Simple with Self-Supervised Clustering | |||
Guiding Pseudo-Labels with Uncertainty Estimation for Source-Free Unsupervised Domain Adaptation | |||
PCR: Proxy-based Contrastive Replay for Online Class-Incremental Continual Learning | |||
Modality-Agnostic Debiasing for Single Domain Generalization | |||
Robust Mean Teacher for Continual and Gradual Test-Time Adaptation | |||
Foundation Model Drives Weakly Incremental Learning for Semantic Segmentation | |||
Hubs and Hyperspheres: Reducing Hubness and Improving Transductive Few-Shot Learning with Hyperspherical Embeddings | |||
Robust Test-Time Adaptation in Dynamic Scenarios | |||
Source-Free Video Domain Adaptation with Spatial-Temporal-Historical Consistency Learning | |||
Heterogeneous Continual Learning | |||
Continual Detection Transformer for Incremental Object Detection | |||
NIFF: Alleviating Forgetting in Generalized Few-Shot Object Detection via Neural Instance Feature Forging | |||
ViewNet: A Novel Projection-based Backbone with View Pooling for Few-Shot Point Cloud Classification | |||
C-SFDA: A Curriculum Learning Aided Self-Training Framework for Efficient Source Free Domain Adaptation | |||
Train/Test-Time Adaptation with Retrieval | |||
Dealing with Cross-Task Class Discrimination in Online Continual Learning | |||
Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning | |||
Decoupling Learning and Remembering: A Bilevel Memory Framework with Knowledge Projection for Task-Incremental Learning | |||
Neuro-Modulated Hebbian Learning for Fully Test-Time Adaptation | |||
TIPI: Test Time Adaptation with Transformation Invariance | |||
Meta-Learning with a Geometry-Adaptive Preconditioner | |||
Meta-Tuning Loss Functions and Data Augmentation for Few-Shot Object Detection | |||
A Probabilistic Framework for Lifelong Test-Time Adaptation | |||
Few-Shot Class-Incremental Learning via Class-Aware Bilateral Distillation | |||
CafeBoost: Causal Feature Boost to Eliminate Task-Induced Bias for Class Incremental Learning | |||
A Strong Baseline for Generalized Few-Shot Semantic Segmentation | |||
Towards Better Stability and Adaptability: Improve Online Self-Training for Model Adaptation in Semantic Segmentation | |||
A New Benchmark: On the Utility of Synthetic Data with Blender for Bare Supervised Learning and Downstream Domain Adaptation | |||
Cross-Image-Attention for Conditional Embeddings in Deep Metric Learning | |||
Principles of Forgetting in Domain-Incremental Semantic Segmentation in Adverse Weather Conditions | |||
Data-Free Knowledge Distillation via Feature Exchange and Activation Region Constraint | |||
(ML)2P-Encoder: On Exploration of Channel-Class Correlation for Multi-Label Zero-Shot Learning | |||
Finetune Like You Pretrain: Improved Finetuning of Zero-Shot Vision Models | |||
Simulated Annealing in Early Layers Leads to Better Generalization | |||
A Data-based Perspective on Transfer Learning | |||
Learning Expressive Prompting with Residuals for Vision Transformers | |||
Boosting Transductive Few-Shot Fine-Tuning with Margin-based Uncertainty Weighting and Probability Regularization | |||
Improving Generalization with Domain Convex Game | |||
Patch-Mix Transformer for Unsupervised Domain Adaptation: A Game Perspective | |||
Guided Recommendation for Model Fine-Tuning | |||
Improving Generalization of Meta-Learning with Inverted Regularization at Inner-Level | |||
Hint-Aug: Drawing Hints from Foundation Vision Transformers towards Boosted Few-Shot Parameter-Efficient Tuning | |||
Title | Repo | Paper | Video |
---|---|---|---|
R2Former: Unified Retrieval and Reranking Transformer for Place Recognition | |||
Mask-Free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations | |||
StructVPR: Distill Structural Knowledge with Weighting Samples for Visual Place Recognition | |||
MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining | |||
One-to-Few Label Assignment for End-to-End Dense Detection | |||
Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization | |||
Semi-DETR: Semi-Supervised Object Detection with Detection Transformers | |||
Universal Instance Perception as Object Discovery and Retrieval | |||
CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection | |||
Phase-Shifting Coder: Predicting Accurate Orientation in Oriented Object Detection | |||
FrustumFormer: Adaptive Instance-Aware Resampling for Multi-View 3D Detection | |||
Box-Level Active Detection | |||
Learning with Noisy Labels via Self-Supervised Adversarial Noisy Masking | |||
Ambiguity-Resistant Semi-Supervised Learning for Dense Object Detection | |||
Aligning Bag of Regions for Open-Vocabulary Object Detection | |||
Asymmetric Feature Fusion for Image Retrieval | |||
3D Video Object Detection with Learnable Object-Centric Global Optimization | |||
Enhanced Training of Query-based Object Detection via Selective Query Recollection | |||
Dense Distinct Query for End-to-End Object Detection | |||
On-the-Fly Category Discovery | |||
ProD: Prompting-to-Disentangle Domain Knowledge for Cross-Domain Few-Shot Image Classification | |||
Q-DETR: An Efficient Low-Bit Quantized Detection Transformer | |||
SAP-DETR: Bridging the Gap between Salient Points and Queries-based Transformer Detector for Fast Model Convergency | |||
An Erudite Fine-grained Visual Classification Model | |||
Self-Supervised Implicit Glyph Attention for Text Recognition | |||
Multi-View Adversarial Discriminator: Mine the Non-Causal Factors for Object Detection in Unseen Domains | |||
HIER: Metric Learning Beyond Class Labels via Hierarchical Regularization | |||
DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets | |||
Progressive Semantic-Visual Mutual Adaption for Generalized Zero-Shot Learning | |||
Fake it Till You make it: Learning Transferable Representations from Synthetic ImageNet Clones | |||
FFF: Fragment-guided Flexible Fitting for Building Complete Protein Structures | |||
Revisiting Self-Similarity: Structural Embedding for Image Retrieval | |||
Neural Koopman Pooling: Control-Inspired Temporal Dynamics Encoding for Skeleton-based Action Recognition | |||
MixTeacher: Mining Promising Labels with Mixed Scale Teacher for Semi-Supervised Object Detection | |||
Learning Attention as Disentangler for Compositional Zero-Shot Learning | |||
Towards Building Self-Aware Object Detectors via Reliable Uncertainty Quantification and Calibration | |||
Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection | |||
SOOD: Towards Semi-Supervised Oriented Object Detection | |||
Bias-Eliminating Augmentation Learning for Debiased Federated Learning | |||
Towards Efficient use of Multi-Scale Features in Transformer-based Object Detectors | |||
AsyFOD: An Asymmetric Adaptation Paradigm for Few-Shot Domain Adaptive Object Detection | |||
CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching | |||
Explicit Boundary Guided Semi-Push-Pull Contrastive Learning for Supervised Anomaly Detection | |||
Disentangled Representation Learning for Unsupervised Neural Quantization | |||
YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors | |||
Virtual Sparse Convolution for Multimodal 3D Object Detection | |||
TranSG: Transformer-based Skeleton Graph Prototype Contrastive Learning with Structure-Trajectory Prompted Reconstruction for Person Re-Identification | |||
Adaptive Sparse Pairwise Loss for Object Re-Identification | |||
Multi-Granularity Archaeological Dating of Chinese Bronze Dings based on a Knowledge-guided Relation Graph | |||
Event-guided Person Re-Identification via Sparse-Dense Complementary Learning | |||
Vector Quantization with Self-Attention for Quality-Independent Representation Learning | |||
Siamese Image Modeling for Self-Supervised Vision Representation Learning | |||
FCC: Feature Clusters Compression for Long-tailed Visual Recognition | |||
Towards All-in-One Pre-Training via Maximizing Multi-Modal Mutual Information | |||
Soft Augmentation for Image Classification | |||
Correspondence Transformers with Asymmetric Feature Learning and Matching Flow Super-Resolution | |||
Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models | |||
Out-of-Distributed Semantic Pruning for Robust Semi-Supervised Learning | |||
Glocal Energy-based Learning for Few-Shot Open-Set Recognition | |||
Improving Image Recognition by Retrieving from Web-Scale Image-Text Data | |||
Deep Factorized Metric Learning | |||
Learning to Detect and Segment for Open Vocabulary Object Detection | |||
ConQueR: Query Contrast Voxel-DETR for 3D Object Detection | |||
Photo Pre-Training, But for Sketch | |||
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | |||
Detecting Everything in the Open World: Towards Universal Object Detection | |||
Twin Contrastive Learning with Noisy Labels | |||
Feature Aggregated Queries for Transformer-based Video Object Detectors | |||
Learning on Gradients: Generalized Artifacts Representation for GAN-Generated Images Detection | |||
Deep Hashing with Minimal-Distance-Separated Hash Centers | |||
Knowledge Combination to Learn Rotated Detection without Rotated Annotation | |||
Good is Bad: Causality Inspired Cloth-Debiasing for Cloth-Changing Person Re-Identification | |||
Discriminating Known from Unknown Objects via Structure-Enhanced Recurrent Variational AutoEncoder | |||
2PCNet: Two-Phase Consistency Training for Day-to-Night Unsupervised Domain Adaptive Object Detection | |||
LINe: Out-of-Distribution Detection by Leveraging Important Neurons | |||
Progressive Transformation Learning for Leveraging Virtual Images in Training | |||
Instance Relation Graph Guided Source-Free Domain Adaptive Object Detection | |||
Decoupling MaxLogit for Out-of-Distribution Detection | |||
Pixels, Regions, and Objects: Multiple Enhancement for Salient Object Detection | |||
Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding | |||
BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision | |||
D2Former: Jointly Learning Hierarchical Detectors and Contextual Descriptors via Agent-based Transformers | |||
CapDet: Unifying Dense Captioning and Open-World Detection Pretraining | |||
Mapping Degeneration Meets Label Evolution: Learning Infrared Small Target Detection with Single Point Supervision | |||
Generalized UAV Object Detection via Frequency Domain Disentanglement | |||
Deep Frequency Filtering for Domain Generalization | |||
Adaptive Sparse Convolutional Networks with Global Context Enhancement for Faster Object Detection on Drone Images | |||
Improved Test-Time Adaptation for Domain Generalization | |||
Matching Is Not Enough: A Two-Stage Framework for Category-Agnostic Pose Estimation | |||
Recurrence without Recurrence: Stable Video Landmark Detection with Deep Equilibrium Models | |||
VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision | |||
DETRs with Hybrid Matching | |||
Query-Dependent Video Representation for Moment Retrieval and Highlight Detection | |||
Clothing-Change Feature Augmentation for Person Re-Identification | |||
Learning Attribute and Class-Specific Representation Duet for Fine-grained Fashion Analysis | |||
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks | |||
Optimal Proposal Learning for Deployable End-to-End Pedestrian Detection | |||
DynamicDet: A Unified Dynamic Architecture for Object Detection | |||
Switchable Representation Learning Framework with Self-Compatibility | |||
DATE: Domain Adaptive Product Seeker for E-Commerce | |||
PromptCAL: Contrastive Affinity Learning via Auxiliary Prompts for Generalized Novel Category Discovery | |||
Dynamic Neural Network for Multi-Task Learning Searching across Diverse Network Topologies | |||
OvarNet: Towards Open-Vocabulary Object Attribute Recognition | |||
HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models | |||
Learning from Noisy Labels with Decoupled Meta Label Purifier | |||
A Light Touch Approach to Teaching Transformers Multi-View Geometry | |||
OpenMix: Exploring Outlier Samples for Misclassification Detection | |||
Revisiting Reverse Distillation for Anomaly Detection | |||
PROB: Probabilistic Objectness for Open World Object Detection | |||
Equiangular Basis Vectors | |||
Weakly Supervised Posture Mining for Fine-grained Classification | |||
An Actor-Centric Causality Graph for Asynchronous Temporal Inference in Group Activity | |||
Weak-Shot Object Detection through Mutual Knowledge Transfer | |||
Zero-Shot Everything Sketch-based Image Retrieval, and in Explainable Style | |||
Exploring Structured Semantic Prior for Multi Label Recognition with Incomplete Labels | |||
Learning Partial Correlation based Deep Visual Representation for Image Classification | |||
Boundary-aware Backward-Compatible Representation via Adversarial Learning in Image Retrieval | |||
PHA: Patch-Wise High-Frequency Augmentation for Transformer-based Person Re-Identification | |||
Unknown Sniffer for Object Detection: Don't Turn a Blind Eye to Unknown Objects | |||
BoxTeacher: Exploring High-Quality Pseudo Labels for Weakly Supervised Instance Segmentation | |||
Annealing-based Label-Transfer Learning for Open World Object Detection | |||
Diversity-Measurable Anomaly Detection | |||
Recurrent Vision Transformers for Object Detection with Event Cameras | |||
AShapeFormer: Semantics-guided Object-Level Active Shape Encoding for 3D Object Detection via Transformers | |||
Ranking Regularization for Critical Rare Classes: Minimizing False Positives at a High True Positive Rate | |||
Contrastive Mean Teacher for Domain Adaptive Object Detectors | |||
Bridging the Gap between Model Explanations in Partially Annotated Multi-Label Classification | |||
PartMix: Regularization Strategy to Learn Part Discovery for Visible-Infrared Person Re-Identification | |||
BiasAdv: Bias-Adversarial Augmentation for Model Debiasing | |||
ViPLO: Vision Transformer based Pose-Conditioned Self-Loop Graph for Human-Object Interaction Detection | |||
Robust 3D Shape Classification via Non-Local Graph Attention Network | |||
Two-Way Multi-Label Loss | |||
Normalizing Flow based Feature Synthesis for Outlier-Aware Object Detection | |||
Object Detection with Self-Supervised Scene Adaptation | |||
Data-Efficient Large Scale Place Recognition with Graded Similarity Supervision | |||
Generating Features with Increased Crop-related Diversity for Few-Shot Object Detection | |||
Recognizing Rigid Patterns of Unlabeled Point Clouds by Complete and Continuous Isometry Invariants with no False Negatives and no False Positives | |||
Deep Semi-Supervised Metric Learning with Mixed Label Propagation | |||
Fine-grained Classification with Noisy Labels | |||
Title | Repo | Paper | Video |
---|---|---|---|
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | |||
Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models | |||
Iterative Proposal Refinement for Weakly-Supervised Video Grounding | |||
MetaCLUE: Towards Comprehensive Visual Metaphors Research | |||
PolyFormer: Referring Image Segmentation as Sequential Polygon Generation | |||
GeneCIS: A Benchmark for General Conditional Image Similarity | |||
FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks | |||
Generative Bias for Robust Visual Question Answering | |||
Advancing Visual Grounding with Scene Knowledge: Benchmark and Method | |||
Gloss Attention for Gloss-Free Sign Language Translation | |||
You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos | |||
Generalized Decoding for Pixel, Image, and Language | |||
Accelerating Vision-Language Pretraining with Free Language Modeling | |||
GRES: Generalized Referring Expression Segmentation | |||
BUFFER: Balancing Accuracy, Efficiency, and Generalizability in Point Cloud Registration | |||
RGB no more: Minimally-decoded JPEG Vision Transformers | |||
Scaling Language-Image Pre-Training via Masking | |||
EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding | |||
RefTeacher: A Strong Baseline for Semi-Supervised Referring Expression Comprehension | |||
Mobile User Interface Element Detection Via Adaptively Prompt Tuning | |||
Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training | |||
Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation | |||
Meta Compositional Referring Expression Segmentation | |||
VindLU: A Recipe for Effective Video-and-Language Pretraining | |||
Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning | |||
GIVL: Improving Geographical Inclusivity of Vision-Language Models with Pre-Training Methods | |||
Learning Customized Visual Models with Retrieval-Augmented Knowledge | |||
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | |||
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | |||
NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations | |||
Clover: Towards a Unified Video-Language Alignment and Fusion Model | |||
Dual Alignment Unsupervised Domain Adaptation for Video-Text Retrieval | |||
Task Residual for Tuning Vision-Language Models | |||
Dream3D: Zero-Shot Text-to-3D Synthesis using 3D Shape Prior and Text-to-Image Diffusion Models | |||
End-to-End 3D Dense Captioning with Vote2Cap-DETR | |||
Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training | |||
Adaptive Zone-Aware Hierarchical Planner for Vision-Language Navigation | |||
Visual Programming: Compositional Visual Reasoning without Training | |||
Exploring the Effect of Primitives for Compositional Generalization in Vision-and-Language | |||
Referring Multi-Object Tracking | |||
MoFusion: A Framework for Denoising-Diffusion-based Motion Synthesis | |||
MIST: Multi-Modal Iterative Spatial-Temporal Transformer for Long-Form Video Question Answering | |||
Learning to Segment Every Referring Object Point by Point | |||
Contrastive Grouping with Transformer for Referring Image Segmentation | |||
Prototype-based Embedding Network for Scene Graph Generation | |||
Collaborative Static and Dynamic Vision-Language Streams for Spatio-Temporal Video Grounding | |||
S3C: Semi-Supervised VQA Natural Language Explanation via Self-Critical Learning | |||
REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory | |||
Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering | |||
Cap4Video: What can Auxiliary Captions do for Text-Video Retrieval? | |||
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning | |||
HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning | |||
Zero-Shot Referring Image Segmentation with Global-Local Context Features | |||
Hierarchical Semantic Correspondence Networks for Video Paragraph Grounding | |||
Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training | |||
Probabilistic Prompt Learning for Dense Prediction | |||
DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-Training via Word-Region Alignment | |||
All in One: Exploring Unified Video-Language Pre-Training | |||
Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding | |||
Divide and Conquer: Answering Questions with Object Factorization and Compositional Reasoning | |||
ConZIC: Controllable Zero-Shot Image Captioning by Sampling-based Polishing | |||
RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension | |||
KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation | |||
ANetQA: A Large-Scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos | |||
ViLEM: Visual-Language Error Modeling for Image-Text Retrieval | |||
Multi-Modal Representation Learning with Text-Driven Soft Masks | |||
Meta-Personalizing Vision-Language Models to Find Named Instances in Video | |||
ReCo: Region-Controlled Text-to-Image Generation | |||
Are Deep Neural Networks SMARTer than Second Graders? | |||
Graph Representation for Order-Aware Visual Transformation | |||
3D Concept Learning and Reasoning from Multi-View Images | |||
Text with Knowledge Graph Augmented Transformer for Video Captioning | |||
Crossing the Gap: Domain Generalization for Image Captioning | |||
MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering | |||
VQACL: A Novel Visual Question Answering Continual Learning Setting | |||
Improving Selective Visual Question Answering by Learning from Your Peers | |||
High-Fidelity 3D Face Generation from Natural Language Descriptions | |||
Language-guided Audio-Visual Source Separation via Trimodal Consistency | |||
Test of Time: Instilling Video-Language Models with a Sense of Time | |||
Learning Situation Hyper-Graphs for Video Question Answering | |||
Pic2Word: Mapping Pictures to Words for Zero-Shot Composed Image Retrieval | |||
Fine-grained Audible Video Description | |||
Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering | |||
A-Cap: Anticipation Captioning with Commonsense Knowledge | |||
Cross-Domain Image Captioning with Discriminative Finetuning | |||
Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations | |||
The Dialog Must Go on: Improving Visual Dialog via Generative Self-Training | |||
Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images! | |||
Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP | |||
Similarity Maps for Self-Training Weakly-Supervised Phrase Grounding | |||
Language Adaptive Weight Generation for Multi-Task Visual Grounding | |||
CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment | |||
Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers | |||
A Simple Framework for Text-Supervised Semantic Segmentation | |||
Learning to Name Classes for Vision and Language Models | |||
Iterative Vision-and-Language Navigation | |||
Behavioral Analysis of Vision-and-Language Navigation Agents | |||
Towards Fast Adaptation of Pretrained Contrastive Models for Multi-Channel Video-Language Retrieval | |||
SynthVSR: Scaling Up Visual Speech Recognition with Synthetic Supervision | |||
METransformer: Radiology Report Generation by Transformer with Multiple Learnable Expert Tokens | |||
Fusing Pre-trained Language Models with Multimodal Prompts through Reinforcement Learning | |||
Abstract Visual Reasoning: An Algebraic Approach for Solving Raven's Progressive Matrices | |||
Hierarchical Prompt Learning for Multi-Task Learning | |||
Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval | |||
SViTT: Temporal Learning of Sparse Video-Text Transformers | |||
How You Feelin'? Learning Emotions and Mental States in Movie Scenes | |||
Logical Implications for Visual Question Answering Consistency | |||
Image as a Foreign Language: BEIT Pretraining for Vision and Vision-Language Tasks | |||
DeCo: Decomposition and Reconstruction for Compositional Temporal Grounding via Coarse-to-Fine Contrastive Ranking | |||
iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition | |||
Semantic-Conditional Diffusion Networks for Image Captioning | |||
CREPE: Can Vision-Language Foundation Models Reason Compositionally? | |||
RMLVQA: A Margin Loss Approach for Visual Question Answering with Language Biases | |||
Improving Vision-and-Language Navigation by Generating Future-View Image Semantics | |||
Prefix Conditioning Unifies Language and Label Supervision | |||
A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning | |||
From Images to Textual Prompts: Zero-Shot Visual Question Answering with Frozen Large Language Models | |||
Hierarchical Video-Moment Retrieval and Step-Captioning | |||
Title | Repo | Paper | Video |
---|---|---|---|
Activating More Pixels in Image Super-Resolution Transformer | |||
MetaFusion: Infrared and Visible Image Fusion via Meta-Feature Embedding from Object Detection | |||
Omni Aggregation Networks for Lightweight Image Super-Resolution | |||
Blur Interpolation Transformer for Real-World Motion from Blur | |||
Equivalent Transformation and Dual Stream Network Construction for Mobile Image Super-Resolution | |||
Masked Image Training for Generalizable Deep Image Denoising | |||
CutMIB: Boosting Light Field Super-Resolution via Multi-View Image Blending | |||
Learning Semantic-Aware Knowledge Guidance for Low-Light Image Enhancement | |||
Learning A Sparse Transformer Network for Effective Image Deraining | |||
Deep Discriminative Spatial and Temporal Network for Efficient Video Deblurring | |||
Learning Weather-General and Weather-Specific Features for Image Restoration Under Multiple Adverse Weather Conditions | |||
AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation | |||
Self-Supervised Non-Uniform Kernel Estimation with Flow-based Motion Prior for Blind Image Deblurring | |||
OSRT: Omnidirectional Image Super-Resolution with Distortion-Aware Transformer | |||
Toward Accurate Post-Training Quantization for Image Super Resolution | |||
Learning a Simple Low-Light Image Enhancer from Paired Low-Light Instances | |||
Joint Appearance and Motion Learning for Efficient Rolling Shutter Correction | |||
Cascaded Local Implicit Transformer for Arbitrary-Scale Super-Resolution | |||
Unsupervised Cumulative Domain Adaptation for Foggy Scene Optical Flow | |||
PyramidFlow: High-Resolution Defect Contrastive Localization using Pyramid Normalizing Flow | |||
DR2: Diffusion-based Robust Degradation Remover for Blind Face Restoration | |||
DNF: Decouple and Feedback Network for Seeing in the Dark | |||
Optimization-Inspired Cross-Attention Transformer for Compressive Sensing | |||
Local Implicit Normalizing Flow for Arbitrary-Scale Image Super-Resolution | |||
Event-based Frame Interpolation with Ad-hoc Deblurring | |||
Better CMOS Produces Clearer Images: Learning Space-Variant Blur Estimation for Blind Image Super-Resolution | |||
SMAE: Few-Shot Learning for HDR Deghosting with Saturation-Aware Masked Autoencoders | |||
A Unified HDR Imaging Method with Pixel and Patch Level | |||
DegAE: A New Pretraining Paradigm for Low-Level Vision | |||
CABM: Content-Aware Bit Mapping for Single Image Super-Resolution Network with Large Input | |||
Blind Video Deflickering by Neural Filtering with a Flawed Atlas | |||
Efficient and Explicit Modelling of Image Hierarchies for Image Restoration | |||
Learning Distortion Invariant Representation for Image Restoration from a Causality Perspective | |||
Human Guided Ground-truth Generation for Realistic Image Super-Resolution | |||
Raw Image Reconstruction with Learned Compact Metadata | |||
Curricular Contrastive Regularization for Physics-Aware Single Image Dehazing | |||
ShadowDiffusion: When Degradation Prior Meets Diffusion Model for Shadow Removal | |||
N-Gram in Swin Transformers for Efficient Lightweight Image Super-Resolution | |||
Real-time 6K Image Rescaling with Rate-Distortion Optimization | |||
GamutMLP: A Lightweight MLP for Color Loss Recovery | |||
CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion | |||
Quality-Aware Pre-trained Models for Blind Image Quality Assessment | |||
Recurrent Homography Estimation using Homography-guided Image Warping and Focus Transformer | |||
Learning Spatial-Temporal Implicit Neural Representations for Event-guided Video Super-Resolution | |||
RIDCP: Revitalizing Real Image Dehazing via High-Quality Codebook Priors | |||
Generating Aligned Pseudo-Supervision from Non-Aligned Data for Image Restoration in Under-Display Camera | |||
Structure Aggregation for Cross-Spectral Stereo Image Guided Denoising | |||
Rethinking Optical Flow from Geometric Matching Consistent Perspective | |||
Video Dehazing via a Multi-Range Temporal Alignment Network with Physical Prior | |||
Perception-Oriented Single Image Super-Resolution using Optimal Objective Estimation | |||
Zero-Shot Dual-Lens Super-Resolution | |||
Efficient Frequency Domain-based Transformers for High-Quality Image Deblurring | |||
A Simple Baseline for Video Restoration with Grouped Spatial-Temporal Shift | |||
Learning Generative Structure Prior for Blind Text Image Super-Resolution | |||
Motion Information Propagation for Neural Video Compression | |||
Joint Video Multi-Frame Interpolation and Deblurring under Unknown Exposure Time | |||
Event-based Video Frame Interpolation with Cross-Modal Asymmetric Bidirectional Motion Fields | |||
Learning Sample Relationship for Exposure Correction | |||
Spatially Adaptive Self-Supervised Learning for Real-World Image Denoising | |||
Context-Aware Pretraining for Efficient Blind Image Decomposition | |||
Physics-guided ISO-Dependent Sensor Noise Modeling for Extreme Low-Light Photography | |||
AnyFlow: Arbitrary Scale Optical Flow with Implicit Neural Representation | |||
Complexity-guided Slimmable Decoder for Efficient Deep Video Compression | |||
Bitstream-Corrupted JPEG Images are Restorable: Two-Stage Compensation and Alignment Framework for Image Restoration | |||
Spectral Enhanced Rectangle Transformer for Hyperspectral Image Denoising | |||
Learning from Unique Perspectives: User-Aware Saliency Modeling | |||
DINN360: Deformable Invertible Neural Network for Latitude-Aware 360° Image Rescaling | |||
ABCD: Arbitrary Bitwise Coefficient for De-Quantization | |||
Temporal Attention Unit: Towards Efficient Spatiotemporal Predictive Learning | |||
Learning Steerable Function for Efficient Image Resampling | |||
Revisiting the Stack-based Inverse Tone Mapping | |||
Generative Diffusion Prior for Unified Image Restoration and Enhancement | |||
LG-BPN: Local and Global Blind-Patch Network for Self-Supervised Real-World Denoising | |||
Adaptive Spot-guided Transformer for Consistent Local Feature Matching | |||
SFD2: Semantic-guided Feature Detection and Description | |||
Burstormer: Burst Image Restoration and Enhancement Transformer | |||
DeepLSD: Line Segment Detection and Refinement with Deep Image Gradients | |||
Gated Multi-Resolution Transfer Network for Burst Restoration and Enhancement | |||
Structured Sparsity Learning for Efficient Video Super-Resolution | |||
DNeRV: Modeling Inherent Dynamics via Difference Neural Representation for Videos | |||
Exploring Discontinuity for Video Frame Interpolation | |||
Neural Video Compression with Diverse Contexts | |||
FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation | |||
OPE-SR: Orthogonal Position Encoding for Designing a Parameter-Free Upsampling Module in Arbitrary-Scale Image Super-Resolution | |||
Context-based Trit-Plane Coding for Progressive Image Compression | |||
All-in-One Image Restoration for Unknown Degradations using Adaptive Discriminative Filters for Specific Degradations | |||
Learning to Exploit the Sequence-Specific Prior Knowledge for Image Processing Pipelines Optimization | |||
Nighttime Smartphone Reflective Flare Removal using Optical Center Symmetry Prior | |||
Enhancing Deformable Local Features by Jointly Learning to Detect and Describe Keypoints | |||
Real-Time Controllable Denoising for Image and Video | |||
Compression-Aware Video Super-Resolution | |||
Spatial-Frequency Mutual Learning for Face Super-Resolution | |||
The Treasure Beneath Multiple Annotations: An Uncertainty-Aware Edge Detector | |||
Toward Stable, Interpretable, and Lightweight Hyperspectral Super-Resolution | |||
Modernizing Old Photos Using Multiple References via Photorealistic Style Transfer | |||
Data-Driven Feature Tracking for Event Cameras | |||
LVQAC: Lattice Vector Quantization Coupled with Spatially Adaptive Companding for Efficient Learned Image Compression | |||
Backdoor Attacks Against Deep Image Compression via Adaptive Frequency Trigger | |||
Learning to Detect Mirrors from Videos via Dual Correspondences | |||
Robust Unsupervised StyleGAN Image Restoration | |||
Ingredient-oriented Multi-Degradation Learning for Image Restoration | |||
CiaoSR: Continuous Implicit Attention-in-Attention Network for Arbitrary-Scale Image Super-Resolution | |||
Semi-Supervised Parametric Real-World Image Harmonization | |||
SmartAssign: Learning a Smart Knowledge Assignment Strategy for Deraining and Desnowing | |||
Robust Single Image Reflection Removal Against Adversarial Attacks | |||
PMatch: Paired Masked Image Modeling for Dense Geometric Matching | |||
Extracting Motion and Appearance via Inter-Frame Attention for Efficient Video Frame Interpolation | |||
Residual Degradation Learning Unfolding Framework with Mixing Priors Across Spectral and Spatial for Compressive Spectral Imaging | |||
Visual Recognition-Driven Image Restoration for Multiple Degradation with Intrinsic Semantics Recovery | |||
sRGB Real Noise Synthesizing with Neighboring Correlation-Aware Noise Model | |||
Rethinking Image Super Resolution from Long-Tailed Distribution Learning Perspective | |||
Comprehensive and Delicate: An Efficient Transformer for Image Restoration | |||
Super-Resolution Neural Operator | |||
Neumann Network with Recursive Kernels for Single Image Defocus Deblurring | |||
Discriminative Co-Saliency and Background Mining Transformer for Co-Salient Object Detection | |||
Learning Rotation-Equivariant Features for Visual Correspondence | |||
Patch-Craft Self-Supervised Training for Correlated Image Denoising | |||
Metadata-based RAW Reconstruction via Implicit Neural Functions | |||
Contrastive Semi-Supervised Learning for Underwater Image Restoration via Reliable Bank | |||
Decomposed Soft Prompt Guided Fusion Enhancing for Compositional Zero-Shot Learning | |||
Spectral Bayesian Uncertainty for Image Super-Resolution | |||
DINER: Disorder-Invariant Implicit Neural Representation | |||
NVTC: Nonlinear Vector Transform Coding | |||
HyperCUT: Video Sequence from a Single Blurry Image using Unsupervised Ordering | |||
You Do Not Need Additional Priors or Regularizers in Retinex-based Low-light Image Enhancement | |||
Learning a Practical SDR-to-HDRTV Up-Conversion using New Dataset and Degradation Models | |||
Title | Repo | Paper | Video |
---|---|---|---|
Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos | |||
Vision Transformers are Good Mask Auto-Labelers | |||
Visual Recognition by Request | |||
Ultra-High Resolution Segmentation with Ultra-Rich Context: A Novel Benchmark | |||
AttentionShift: Iteratively Estimated Part-based Attention Map for Pointly Supervised Instance Segmentation | |||
MDQE: Mining Discriminative Query Embeddings to Segment Occluded Instances on Challenging Videos | |||
Look Before You Match: Instance Understanding Matters in Video Object Segmentation | |||
SIM: Semantic-Aware Instance Mask Generation for Box-Supervised Instance Segmentation | |||
EFEM: Equivariant Neural Field Expectation Maximization for 3D Object Segmentation without Scene Supervision | |||
Camouflaged Object Detection with Feature Decomposition and Edge Reconstruction | |||
LOCATE: Localize and Transfer Object Parts for Weakly Supervised Affordance Grounding | |||
OneFormer: One Transformer to Rule Universal Image Segmentation | |||
Mask-Free Video Instance Segmentation | |||
Less is More: Reducing Task and Model Complexity for 3D Point Cloud Semantic Segmentation | |||
InstMove: Instance Motion for Object-Centric Video Segmentation | |||
The Devil is in the Points: Weakly Semi-Supervised Instance Segmentation via Point-guided Mask Representation | |||
Edge-Aware Regional Message Passing Controller for Image Forgery Localization | |||
Interactive Segmentation as Gaussian Process Classification | |||
Semantic-Promoted Debiasing and Background Disambiguation for Zero-Shot Instance Segmentation | |||
Adversarially Masking Synthetic to Mimic Real: Adaptive Noise Injection for Point Cloud Segmentation Adaptation | |||
Generative Semantic Segmentation | |||
Modeling the Distributional Uncertainty for Salient Object Detection Models | |||
Simultaneously Short- and Long-Term Temporal Modeling for Semi-Supervised Video Semantic Segmentation | |||
Out-of-Candidate Rectification for Weakly Supervised Semantic Segmentation | |||
DynaMask: Dynamic Mask Selection for Instance Segmentation | |||
MSeg3D: Multi-Modal 3D Semantic Segmentation for Autonomous Driving | |||
Generalizable Local Feature Pre-Training for Deformable Shape Analysis | |||
Understanding and Improving Features Learned in Deep Functional Maps | |||
G-MSM: Unsupervised Multi-Shape Matching with Graph-based Affinity Priors | |||
Continual Semantic Segmentation with Automatic Memory Sample Selection | |||
FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation | |||
Object Discovery from Motion-guided Tokens | |||
Efficient Mask Correction for Click-based Interactive Image Segmentation | |||
Balancing Logit Variation for Long-tailed Semantic Segmentation | |||
Fuzzy Positive Learning for Semi-Supervised Semantic Segmentation | |||
Learning Open-Vocabulary Semantic Segmentation Models From Natural Language Supervision | |||
Improving Graph Representation for Point Cloud Segmentation via Attentive Filtering | |||
BUOL: A Bottom-Up Framework with Occupancy-Aware Lifting for Panoptic 3D Scene Reconstruction from a Single Image | |||
ACSeg: Adaptive Conceptualization for Unsupervised Semantic Segmentation | |||
CCuantuMM: Cycle-Consistent Quantum-Hybrid Matching of Multiple Shapes | |||
Hierarchical Dense Correlation Distillation for Few-Shot Segmentation | |||
UniDAformer: Unified Domain Adaptive Panoptic Segmentation Transformer via Hierarchical Mask Calibration | |||
FedSeg: Class-Heterogeneous Federated Learning for Semantic Segmentation | |||
Understanding Imbalanced Semantic Segmentation through Neural Collapse | |||
Revisiting Weak-to-Strong Consistency in Semi-Supervised Semantic Segmentation | |||
PartSLIP: Low-Shot Part Segmentation for 3D Point Clouds via Pretrained Image-Language Models | |||
PartDistillation: Learning Parts from Instance Segmentation | |||
Sketch2Saliency: Learning to Detect Salient Objects from Human Drawings | |||
FastInst: A Simple Query-based Model for Real-Time Instance Segmentation | |||
SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation | |||
Semantic Human Parsing via Scalable Semantic Transfer over Multiple Label Domains | |||
Open-Set Semantic Segmentation for Point Clouds via Adversarial Prototype Framework | |||
Hunting Sparsity: Density-guided Contrastive Learning for Semi-Supervised Semantic Segmentation | |||
A Generalized Framework for Video Instance Segmentation | |||
SimpSON: Simplifying Photo Cleanup with Single-Click Distracting Object Segmentation Network | |||
Complete-to-Partial 4D Distillation for Self-Supervised Point Cloud Sequence Representation Learning | |||
Self-Supervised Learning for Multimodal Non-Rigid 3D Shape Matching | |||
Ultrahigh Resolution Image/Video Matting with Spatio-Temporal Sparsity | |||
Style Projected Clustering for Domain Generalized Semantic Segmentation | |||
MarS3D: A Plug-and-Play Motion-Aware Model for Semantic Segmentation on Multi-Scan 3D Point Clouds | |||
Compositor: Bottom-Up Clustering and Compositing for Robust Part and Object Segmentation | |||
Dynamic Focus-Aware Positional Queries for Semantic Segmentation | |||
HGFormer: Hierarchical Grouping Transformer for Domain Generalized Semantic Segmentation | |||
Marching-Primitives: Shape Abstraction from Signed Distance Function | |||
Multimodal Industrial Anomaly Detection via Hybrid Fusion | |||
CLIP is also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation | |||
Weakly Supervised Semantic Segmentation via Adversarial Learning of Classifier and Reconstructor | |||
Conjugate Product Graphs for Globally Optimal 2D-3D Shape Matching | |||
Interactive Segmentation of Radiance Fields | |||
Boundary-enhanced Co-Training for Weakly Supervised Semantic Segmentation | |||
Learning Multi-Modal Class-Specific Tokens for Weakly Supervised Dense Object Localization | |||
Quantum Multi-Model Fitting | |||
Two-Shot Video Object Segmentation | |||
End-to-End Video Matting with Trimap Propagation | |||
ISBNet: A 3D Point Cloud Instance Segmentation Network with Instance-Aware Sampling and Box-Aware Dynamic Convolution | |||
On Calibrating Semantic Segmentation Models: Analyses and an Algorithm | |||
Explicit Visual Prompting for Low-Level Structure Segmentations | |||
Neural Intrinsic Embedding for Non-rigid Point Cloud Matching | |||
Incrementer: Transformer for Class-Incremental Semantic Segmentation with Knowledge Distillation Focusing on Old Class | |||
Camouflaged Instance Segmentation via Explicit De-Camouflaging | |||
Leveraging Hidden Positives for Unsupervised Semantic Segmentation | |||
Rethinking the Correlation in Few-Shot Segmentation: A Buoys View | |||
Sparsely Annotated Semantic Segmentation with Adaptive Gaussian Mixtures | |||
Mask-guided Matting in the Wild | |||
Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention | |||
Conflict-based Cross-View Consistency for Semi-Supervised Semantic Segmentation | |||
Augmentation Matters: A Simple-yet-Effective Approach to Semi-Supervised Semantic Segmentation | |||
Attention-based Point Cloud Edge Sampling | |||
DA Wand: Distortion-Aware Selection using Neural Mesh Parameterization | |||
Extracting Class Activation Maps from Non-Discriminative Features as well | |||
Focused and Collaborative Feedback Integration for Interactive Image Segmentation | |||
Boosting Low-Data Instance Segmentation by Unsupervised Pre-Training with Saliency Prompt | |||
Unsupervised 3D Shape Reconstruction by Part Retrieval and Assembly | |||
MobileVOS: Real-Time Video Object Segmentation Contrastive Learning meets Knowledge Distillation | |||
Transformer Scale Gate for Semantic Segmentation | |||
PIDNet: A Real-Time Semantic Segmentation Network Inspired by PID Controllers | |||
Side Adapter Network for Open-Vocabulary Semantic Segmentation | |||
Test Time Adaptation with Regularized Loss for Weakly Supervised Salient Object Detection | |||
Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers | |||
Reliability in Semantic Segmentation: Are We on the Right Track? | |||
Beyond mAP: Towards Better Evaluation of Instance Segmentation | |||
Heat Diffusion based Multi-Scale and Geometric Structure-Aware Transformer for Mesh Segmentation | |||
Tree Instance Segmentation with Temporal Contour Graph | |||
Exemplar-FreeSOLO: Enhancing Unsupervised Instance Segmentation with Exemplars | |||
Omnimatte3D: Associating Objects and their Effects in Unconstrained Monocular Video | |||
Learning Orthogonal Prototypes for Generalized Few-Shot Semantic Segmentation | |||
Instance-Specific and Model-Adaptive Supervision for Semi-Supervised Semantic Segmentation | |||
Improving Robustness of Semantic Segmentation to Motion-Blur using Class-Centric Augmentation | |||
IFSeg: Image-Free Semantic Segmentation via Vision-Language Model | |||
CLIP-S4: Language-guided Self-Supervised Semantic Segmentation | |||
Pruning Parameterization with Bi-Level Optimization for Efficient Semantic Segmentation on the Edge | |||
Title | Repo | Paper | Video |
---|---|---|---|
Pix2Map: Cross-Modal Retrieval for Inferring Street Maps from Images | |||
Audio-Visual Grouping Network for Sound Localization from Mixtures | |||
Learning Semantic Relationship among Instances for Image-Text Matching | |||
Identity-Preserving Talking Face Generation with Landmark and Appearance Priors | |||
ImageBind: One Embedding Space to Bind Them All | |||
Learning to Dub Movies via Hierarchical Prosody Models | |||
OmniMAE: Single Model Masked Pretraining on Images and Videos | |||
CNVid-3.5M: Build, Filter, and Pre-Train the Large-Scale Public Chinese Video-Text Dataset | |||
Egocentric Audio-Visual Object Localization | |||
Learning Visual Representations via Language-guided Sampling | |||
Unite and Conquer: Plug & Play Multi-Modal Synthesis using Diffusion Models | |||
iQuery: Instruments as Queries for Audio-Visual Sound Separation | |||
Diverse Embedding Expansion Network and Low-Light Cross-Modality Benchmark for Visible-Infrared Person Re-Identification | |||
PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection | |||
Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-Shot Learners | |||
Non-Contrastive Learning Meets Language-Image Pre-Training | |||
Highly Confident Local Structure based Consensus Graph Learning for Incomplete Multi-View Clustering | |||
Vision Transformers are Parameter-Efficient Audio-Visual Learners | |||
Teaching Structured Vision & Language Concepts to Vision & Language Models | |||
Data-Free Sketch-based Image Retrieval | |||
Align and Attend: Multimodal Summarization with Dual Contrastive Losses | |||
Efficient Multimodal Fusion via Interactive Prompting | |||
Multimodal Prompting with Missing Modalities for Visual Recognition | |||
Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-Commerce | |||
What Happened 3 Seconds Ago? Inferring the Past with Thermal Imaging | |||
MMANet: Margin-Aware Distillation and Modality-Aware Regularization for Incomplete Multimodal Learning | |||
Multi-Modal Learning with Missing Modality via Shared-Specific Feature Modelling | |||
The ObjectFolder Benchmark: Multisensory Learning with Neural and Real Objects | |||
Position-guided Text Prompt for Vision-Language Pre-Training | |||
Conditional Generation of Audio from Video via Foley Analogies | |||
OSAN: A One-Stage Alignment Network to Unify Multimodal Alignment and Unsupervised Domain Adaptation | |||
Self-Supervised Video Forensics by Audio-Visual Anomaly Detection | |||
ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding | |||
AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR | |||
Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring | |||
SceneTrilogy: On Human Scene-Sketch and its Complementarity with Photo and Text | |||
Exploring and Exploiting Uncertainty for Incomplete Multi-View Classification | |||
EXIF as Language: Learning Cross-Modal Associations between Images and Camera Metadata | |||
Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens | |||
RONO: Robust Discriminative Learning with Noisy Labels for 2D-3D Cross-Modal Retrieval | |||
CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective | |||
Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning | |||
ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Regeneration | |||
Look, Radiate, and Learn: Self-Supervised Localisation via Radio-Visual Correspondence | |||
Learning Emotion Representations from Verbal and Nonverbal Communication | |||
Enhanced Multimodal Representation Learning with Cross-Modal KD | |||
MELTR: Meta Loss Transformer for Learning to Fine-Tune Video Foundation Models | |||
Multilateral Semantic Relations Modeling for Image Text Retrieval | |||
GeoVLN: Learning Geometry-enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation | |||
Noisy Correspondence Learning with Meta Similarity Correction | |||
Improving Cross-Modal Retrieval with Set of Diverse Embeddings | |||
Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment | |||
MaPLe: Multi-Modal Prompt Learning | |||
Fine-grained Image-Text Matching by Cross-Modal Hard Aligning Network | |||
Towards Modality-Agnostic Person Re-Identification with Descriptive Query | |||
Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos | |||
FashionSAP: Symbols and Attributes Prompt for Fine-grained Fashion Vision-Language Pre-Training | |||
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-Training Model | |||
Egocentric Auditory Attention Localization in Conversations | |||
Improving Zero-Shot Generalization and Robustness of Multi-Modal Models | |||
Understanding and Constructing Latent Modality Structures in Multi-Modal Representation Learning | |||
Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles | |||
GCFAgg: Global and Cross-View Feature Aggregation for Multi-View Clustering | |||
BiCro: Noisy Correspondence Rectification for Multi-Modality Data via Bi-Directional Cross-Modal Similarity Consistency | |||
DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training | |||
Referring Image Matting | |||
Leveraging per Image-Token Consistency for Vision-Language Pre-Training | |||
Seeing what You Miss: Vision-Language Pre-Training with Semantic Completion Learning | |||
Sample-Level Multi-View Graph Clustering | |||
SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation | |||
On the Effects of Self-Supervision and Contrastive Alignment in Deep Multi-View Clustering | |||
SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model | |||
Novel-View Acoustic Synthesis | |||
MAGVLT: Masked Generative Vision-and-Language Transformer | |||
Reproducible Scaling Laws for Contrastive Language-Image Learning | |||
PMR: Prototypical Modal Rebalance for Multimodal Learning | |||
Language-guided Music Recommendation for Video via Prompt Analogies | |||
RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-Training | |||
MMG-Ego4D: Multi-Modal Generalization in Egocentric Action Recognition | |||
Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning | |||
PRISE: Demystifying Deep Lucas-Kanade with Strongly Star-Convex Constraints for Multimodel Image Alignment | |||
Masked Autoencoding Does Not Help Natural Language Supervision at Scale | |||
CLIPPO: Image-and-Language Understanding from Pixels Only | |||
Chat2Map: Efficient Scene Mapping from Multi-Ego Conversations | |||
Critical Learning Periods for Multisensory Integration in Deep Networks | |||
CLIPPING: Distilling CLIP-based Models with a Student base for Video-Language Retrieval | |||
NUWA-LIP: Language-guided Image Inpainting with Defect-Free VQGAN | |||
WINNER: Weakly-Supervised hIerarchical DecompositioN and aligNment for spatio-tEmporal Video gRounding | |||
Multivariate, Multi-Frequency and Multimodal: Rethinking Graph Neural Networks for Emotion Recognition in Conversation | |||
Title | Repo | Paper | Video |
---|---|---|---|
3D-Aware Multi-Class Image-to-Image Translation with NeRFs | |||
DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-Aware Scene Synthesis | |||
MagicPony: Learning Articulated 3D Animals in the Wild | |||
Seeing a Rose in Five Thousand Ways | |||
FitMe: Deep Photorealistic 3D Morphable Model Avatars | |||
Scalable, Detailed and Mask-Free Universal Photometric Stereo | |||
Spatio-Focal Bidirectional Disparity Estimation from a Dual-Pixel Image | |||
ShapeClipper: Scalable 3D Shape Learning from Single-View Images via Geometric and CLIP-based Consistency | |||
High-Fidelity Clothed Avatar Reconstruction from a Single Image | |||
TexPose: Neural Texture Learning for Self-Supervised 6D Object Pose Estimation | |||
Behind the Scenes: Density Fields for Single View Reconstruction | |||
Reconstructing Animatable Categories from Videos | |||
RenderDiffusion: Image Diffusion for 3D Reconstruction, Inpainting and Generation | |||
Self-Supervised Geometry-Aware Encoder for Style-based 3D GAN Inversion | |||
3D Cinemagraphy from a Single Image | |||
NeuralLift-360: Lifting An In-the-Wild 2D Photo to a 3D Object with 360° Views | |||
iDisc: Internal Discretization for Monocular Depth Estimation | |||
HairStep: Transfer Synthetic to Real using Strand and Depth Maps for Single-View 3D Hair Modeling | |||
NeRFInvertor: High Fidelity NeRF-GAN Inversion for Single-Shot Real Image Animation | |||
NeurOCS: Neural NOCS Supervision for Monocular 3D Object Localization | |||
Multiview Compressive Coding for 3D Reconstruction | |||
FaceLit: Neural 3D Relightable Faces | |||
Rigidity-Aware Detection for 6D Object Pose Estimation | |||
Shape-Constraint Recurrent Flow for 6D Object Pose Estimation | |||
Bringing Inputs to Shared Domains for 3D Interacting Hands Recovery in the Wild | |||
Ref-NPR: Reference-based Non-Photorealistic Radiance Fields for Controllable Scene Stylization | |||
DiffPose: Toward More Reliable 3D Pose Estimation | |||
High-Fidelity 3D GAN Inversion by Pseudo-Multi-View Optimization | |||
Semantic Scene Completion with Cleaner Self | |||
Learned Two-Plane Perspective Prior based Image Resampling for Efficient Object Detection | |||
Mask3D: Pre-Training 2D Vision Transformers by Learning Masked 3D Priors | |||
Omni3D: A Large Benchmark and Model for 3D Object Detection in the Wild | |||
Masked Scene Contrast: A Scalable Framework for Unsupervised 3D Representation Learning | |||
Paired-Point Lifting for Enhanced Privacy-Preserving Visual Localization | |||
Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation | |||
gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction | |||
Accidental Light Probes | |||
Learning to Predict Scene-Level Implicit 3D from Posed RGBD Data | |||
DPF: Learning Dense Prediction Fields with Weak Supervision | |||
DIFu: Depth-guided Implicit Function for Clothed Human Reconstruction | |||
OrienterNet: Visual Localization in 2D Public Maps with Neural Matching | |||
Im2Hands: Learning Attentive Implicit Representation of Interacting Two-Hand Shapes | |||
Structured 3D Features for Reconstructing Controllable Avatars | |||
Delving into Discrete Normalizing Flows on SO(3) Manifold for Probabilistic Rotation Modeling | |||
High-Fidelity 3D Human Digitization from Single 2K Resolution Images | |||
Learning 3D-Aware Image Synthesis with Unknown Pose Distribution | |||
DP-NeRF: Deblurred Neural Radiance Field with Physical Scene Priors | |||
Recovering 3D Hand Mesh Sequence from a Single Blurry Image: A New Dataset and Temporal Unfolding | |||
Visibility Aware Human-Object Interaction Tracking from Single RGB Camera | |||
SMOC-Net: Leveraging Camera Pose for Self-Supervised Monocular Object Pose Estimation | |||
Curricular Object Manipulation in LiDAR-based Object Detection | |||
SeSDF: Self-evolved Signed Distance Field for Implicit 3D Clothed Human Reconstruction | |||
MonoATT: Online Monocular 3D Object Detection with Adaptive Token Transformer | |||
Shape, Pose, and Appearance from a Single Image via Bootstrapped Radiance Field Inversion | |||
High Fidelity 3D Hand Shape Reconstruction via Scalable Graph Frequency Decomposition | |||
NeRDi: Single-View NeRF Synthesis with Language-guided Diffusion as General Image Priors | |||
ACL-SPC: Adaptive Closed-Loop System for Self-Supervised Point Cloud Completion | |||
Self-Positioning Point-based Transformer for Point Cloud Understanding | |||
H2ONet: Hand-Occlusion-and-Orientation-Aware Network for Real-Time 3D Hand Mesh Reconstruction | |||
A Probabilistic Attention Model with Occlusion-Aware Texture Regression for 3D Hand Reconstruction from a Single RGB Image | |||
Neural Voting Field for Camera-Space 3D Hand Pose Estimation | |||
PLIKS: A Pseudo-Linear Inverse Kinematic Solver for 3D Human Body Estimation | |||
Distilling Neural Fields for Real-Time Articulated Shape Reconstruction | |||
Power Bundle Adjustment for Large-Scale 3D Reconstruction | |||
What You Can Reconstruct from a Shadow | |||
SparseFusion: Distilling View-conditioned Diffusion for 3D Reconstruction | |||
Diverse 3D Hand Gesture Prediction from Body Dynamics by Bilateral Hand Disentanglement | |||
Trap Attention: Monocular Depth Estimation with Manual Traps | |||
Crowd3D: Towards Hundreds of People Reconstruction from a Single Image | |||
PAniC-3D: Stylized Single-View 3D Reconstruction from Portraits of Anime Characters | |||
HS-Pose: Hybrid Scope Feature Extraction for Category-Level Object Pose Estimation | |||
A Hierarchical Representation Network for Accurate and Detailed Face Reconstruction from In-the-Wild Images | |||
Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation | |||
SfM-TTR: Using Structure from Motion for Test-Time Refinement of Single-View Depth Networks | |||
BITE: Beyond Priors for Improved Three-D Dog Pose Estimation | |||
SinGRAF: Learning a 3D Generative Radiance Field for a Single Scene | |||
Flow Supervision for Deformable NeRF | |||
Single Image Depth Prediction Made Better: A Multivariate Gaussian Take | |||
CARTO: Category and Joint Agnostic Reconstruction of ARTiculated Objects | |||
PanoSwin: A Pano-Style Swin Transformer for Panorama Understanding | |||
CP3: Channel Pruning Plug-In for Point-based Networks | |||
PC2: Projection-Conditioned Point Cloud Diffusion for Single-Image 3D Reconstruction | |||
On the Importance of Accurate Geometry Data for Dense 3D Vision Tasks | |||
Cross-Domain 3D Hand Pose Estimation with Dual Modalities | |||
RealFusion: 360° Reconstruction of Any Object from a Single Image | |||
Sampling is Matter: Point-guided 3D Human Mesh Reconstruction | |||
Knowledge Distillation for 6D Pose Estimation by Aligning Distributions of Local Predictions | |||
BAAM: Monocular 3D Pose and Shape Reconstruction with Bi-Contextual Attention Module and Attention-guided Modeling | |||
Single View Scene Scale Estimation using Scale Field | |||
Learning Articulated Shape with Keypoint Pseudo-Labels from Web Images | |||
Deformable Mesh Transformer for 3D Human Mesh Recovery | |||
Title | Repo | Paper | Video |
---|---|---|---|
Decoupled Semantic Prototypes Enable Learning from Diverse Annotation Types for Semi-Weakly Segmentation in Expert-Driven Domains | |||
Geometric Visual Similarity Learning in 3D Medical Image Self-Supervised Pre-Training | |||
Flexible-Cm GAN: Towards Precise 3D Dose Prediction in Radiotherapy | |||
Bidirectional Copy-Paste for Semi-Supervised Medical Image Segmentation | |||
MagicNet: Semi-Supervised Multi-Organ Segmentation via Magic-Cube Partition and Recovery | |||
Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images | |||
Label-Free Liver Tumor Segmentation | |||
Devil is in the Queries: Advancing Mask Transformers for Real-World Medical Image Segmentation and Out-of-Distribution Localization | |||
DoNet: Deep De-Overlapping Network for Cytology Instance Segmentation | |||
SQUID: Deep Feature In-Painting for Unsupervised Anomaly Detection | |||
Learning Federated Visual Prompt in Null Space for MRI Reconstruction | |||
Pseudo-Label Guided Contrastive Learning for Semi-Supervised Medical Image Segmentation | |||
Image Quality-Aware Diagnosis via Meta-Knowledge Co-Embedding | |||
Iterative Next Boundary Detection for Instance Segmentation of Tree Rings in Microscopy Images of Shrub Cross Sections | |||
Dynamic Graph Enhanced Contrastive Learning for Chest X-Ray Report Generation | |||
Seeing Beyond the Brain: Conditional Diffusion Model with Sparse Masked Modeling for Vision Decoding | |||
Bi-Directional Feature Fusion Generative Adversarial Network for Ultra-High Resolution Pathological Image Virtual Re-Staining | |||
KiUT: Knowledge-Injected U-Transformer for Radiology Report Generation | |||
Weakly Supervised Segmentation with Point Annotations for Histopathology Images via Contrast-based Variational Model | |||
Ambiguous Medical Image Segmentation using Diffusion Models | |||
Causally-Aware Intraoperative Imputation for Overall Survival Time Prediction | |||
Best of Both Worlds: Multimodal Contrastive Learning with Tabular and Imaging Data | |||
GradICON: Approximate Diffeomorphisms via Gradient Inverse Consistency | |||
Fair Federated Medical Image Segmentation via Client Contribution Estimation | |||
Histopathology Whole Slide Image Analysis with Heterogeneous Graph Representation Learning | |||
Unsupervised Contour Tracking of Live Cells by Mechanical and Cycle Consistency Losses | |||
Learning to Exploit Temporal Structure for Biomedical Vision–Language Processing | |||
RepMode: Learning to Re-Parameterize Diverse Experts for Subcellular Structure Prediction | |||
Sparse Multi-Modal Graph Transformer with Shared-Context Processing for Representation Learning of Giga-Pixel Images | |||
Towards Trustable Skin Cancer Diagnosis via Rewriting Model's Decision | |||
Task-Specific Fine-Tuning via Variational Information Bottleneck for Weakly-Supervised Pathology Whole Slide Image Classification | |||
TINC: Tree-Structured Implicit Neural Compression | |||
Topology-guided Multi-Class Cell Context Generation for Digital Pathology | |||
Directional Connectivity-based Segmentation of Medical Images | |||
A Soma Segmentation Benchmark in Full Adult Fly Brain | |||
Constrained Evolutionary Diffusion Filter for Monocular Endoscope Tracking | |||
Benchmarking Self-Supervised Learning on Diverse Pathology Datasets | |||
DualRel: Semi-Supervised Mitochondria Segmentation from a Prototype Perspective | |||
SDC-UDA: Volumetric Unsupervised Domain Adaptation Framework for Slice-Direction Continuous Cross-Modality Medical Image Segmentation | |||
OCELOT: Overlapped Cell on Tissue Dataset for Histopathology | |||
Orthogonal Annotation Benefits Barely-Supervised Medical Image Segmentation | |||
DeGPR: Deep Guided Posterior Regularization for Multi-Class Cell Detection and Counting | |||
Interactive and Explainable Region-guided Radiology Report Generation | |||
A Loopback Network for Explainable Microvascular Invasion Classification | |||
Interventional Bag Multi-Instance Learning On Whole-Slide Pathological Images | |||
MAESTER: Masked Autoencoder Guided Segmentation at Pixel Resolution for Accurate, Self-Supervised Subcellular Structure Recognition | |||
Neuralizer: General Neuroimage Analysis without Re-Training | |||
Why is the Winner the Best? | |||
Rethinking Few-Shot Medical Segmentation: A Vector Quantization View | |||
PEFAT: Boosting Semi-Supervised Medical Image Classification via Pseudo-Loss Estimation and Feature Adversarial Training | |||
Indescribable Multi-Modal Spatial Evaluator | |||
Multiple Instance Learning via Iterative Self-paced Supervised Contrastive Learning | |||
Hierarchical Discriminative Learning Improves Visual Representations of Biomedical Microscopy |
Title | Repo | Paper | Video |
---|---|---|---|
Open Set Action Recognition via Multi-Label Evidential Learning | |||
FLAG3D: A 3D Fitness Activity Dataset with Language Instruction | |||
MoLo: Motion-augmented Long-Short Contrastive Learning for Few-Shot Action Recognition | |||
The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction | |||
Use Your Head: Improving Long-Tail Video Recognition | |||
Decomposed Cross-Modal Distillation for RGB-based Temporal Action Detection | |||
Video Test-Time Adaptation for Action Recognition | |||
How Can Objects Help Action Recognition? | |||
Text-Visual Prompting for Efficient 2D Temporal Video Grounding | |||
Enlarging Instance-Specific and Class-Specific Information for Open-Set Action Recognition | |||
TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition | |||
Learning Video Representations from Large Language Models | |||
Fine-tuned CLIP Models are Efficient Video Learners | |||
Efficient Movie Scene Detection using State-Space Transformers | |||
AdamsFormer for Spatial Action Localization in the Future | |||
A Light Weight Model for Active Speaker Detection | |||
System-Status-Aware Adaptive Network for Online Streaming Video Understanding | |||
STMixer: A One-Stage Sparse Action Detector | |||
Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | |||
Distilling Vision-Language Pre-Training to Collaborate with Weakly-Supervised Temporal Action Localization | |||
Real-Time Multi-Person Eyeblink Detection in the Wild for Untrimmed Video | |||
Modeling Video as Stochastic Processes for Fine-grained Video Representation Learning | |||
Re2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization | |||
Learning Discriminative Representations for Skeleton based Action Recognition | |||
Learning Procedure-Aware Video Representation from Instructional Videos and Their Narrations | |||
Collecting Cross-Modal Presence-Absence Evidence for Weakly-Supervised Audio-Visual Event Perception | |||
PivoTAL: Prior-Driven Supervision for Weakly-Supervised Temporal Action Localization | |||
Cascade Evidential Learning for Open-World Weakly-Supervised Temporal Action Localization | |||
Soft-Landing Strategy for Alleviating the Task Discrepancy Problem in Temporal Action Localization Tasks | |||
SVFormer: Semi-Supervised Video Transformer for Action Recognition | |||
AutoAD: Movie Description in Context | |||
STMT: A Spatial-Temporal Mesh Transformer for MoCap-based Action Recognition | |||
Boosting Weakly-Supervised Temporal Action Localization with Text Information | |||
Aligning Step-by-Step Instructional Diagrams to Video Demonstrations | |||
Improving Weakly Supervised Temporal Action Localization by Bridging Train-Test Gap in Pseudo Labels | |||
Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos | |||
Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline | |||
LOGO: A Long-Form Video Dataset for Group Action Quality Assessment | |||
Search-Map-Search: A Frame Selection Paradigm for Action Recognition | |||
3Mformer: Multi-Order Multi-Mode Transformer for Skeletal Action Recognition | |||
ProTeGe: Untrimmed Pretraining for Video Temporal Grounding by Video Temporal Grounding | |||
Egocentric Video Task Translation | |||
Look Around for Anomalies: Weakly-Supervised Anomaly Detection via Context-Motion Relational Learning | |||
Proposal-based Multiple Instance Learning for Weakly-Supervised Temporal Action Localization | |||
TriDet: Temporal Action Detection with Relative Boundary Modeling | |||
Actionlet-Dependent Contrastive Learning for Unsupervised Skeleton-based Action Recognition | |||
EVAL: Explainable Video Anomaly Localization | |||
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | |||
StepFormer: Self-Supervised Step Discovery and Localization in Instructional Videos | |||
Weakly Supervised Temporal Sentence Grounding with Uncertainty-guided Self-Training | |||
Leveraging Temporal Context in Low Representational Power Regimes | |||
PIVOT: Prompting for Video Continual Learning | |||
On the Benefits of 3D Pose and Tracking for Human Action Recognition | |||
NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory | |||
Selective Structured State-Spaces for Long-Form Video Understanding | |||
Frame Flexible Network | |||
ASPnet: Action Segmentation with Shared-Private Representation of Multiple Data Sources | |||
Unified Keypoint-based Action Recognition Framework via Structured Keypoint Pooling | |||
Learning Transferable Spatiotemporal Representations from Natural Script Knowledge | |||
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-Supervised Video Representation Learning | |||
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models | |||
Procedure-Aware Pretraining for Instructional Video Understanding | |||
Latency Matters: Real-Time Action Forecasting Transformer | |||
Generating Anomalies for Video Anomaly Detection with Prompt-based Feature Mapping | |||
HierVL: Learning Hierarchical Video-Language Embeddings | |||
Two-Stream Networks for Weakly-Supervised Temporal Action Localization with Semantic-Aware Mechanisms | |||
Hybrid Active Learning via Deep Clustering for Video Action Detection | |||
Prompt-guided Zero-Shot Anomaly Action Recognition using Pretrained Deep Skeleton Features | |||
Unbiased Multiple Instance Learning for Weakly Supervised Video Anomaly Detection | |||
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | |||
PDPP: Projected Diffusion for Procedure Planning in Instructional Videos | |||
Learning Action Changes by Measuring Verb-Adverb Textual Relationships | |||
Reducing the Label Bias for Timestamp Supervised Temporal Action Segmentation | |||
Video Event Restoration based on Keyframes for Video Anomaly Detection | |||
Active Exploration of Multimodal Complementarity for Few-Shot Action Recognition | |||
Vita-CLIP: Video and Text Adaptive CLIP via Multimodal Prompting | |||
Post-Processing Temporal Action Detection | |||
Relational Space-Time Query in Long-Form Videos | |||
Therbligs in Action: Video Understanding through Motion Primitives | |||
Dual-Path Adaptation from Image to Video Transformers | |||
Hierarchical Semantic Contrast for Scene-Aware Video Anomaly Detection | |||
Exploiting Completeness and Uncertainty of Pseudo Labels for Weakly Supervised Video Anomaly Detection | |||
Unbiased Scene Graph Generation in Videos |
Title | Repo | Paper | Video |
---|---|---|---|
GraVoS: Voxel Selection for 3D Point-Cloud Detection | |||
BEV@DC: Bird's-Eye View Assisted Training for Depth Completion | |||
Are We Ready for Vision-Centric Driving Streaming Perception? The ASAP Benchmark | |||
PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer | |||
End-to-End Vectorized HD-Map Construction with Piecewise Bezier Curve | |||
MoDAR: Using Motion Forecasting for 3D Object Detection in Point Cloud Sequences | |||
LaserMix for Semi-Supervised LiDAR Semantic Segmentation | |||
MSMDFusion: Fusing LiDAR and Camera at Multiple Scales with Multi-Depth Seeds for 3D Object Detection | |||
LiDAR2Map: In Defense of LiDAR-based Semantic Map Construction using Online Camera Distillation | |||
Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving | |||
Planning-oriented Autonomous Driving | |||
Distilling Focal Knowledge from Imperfect Expert for 3D Object Detection | |||
Anchor3DLane: Learning to Regress 3D Anchors for Monocular 3D Lane Detection | |||
SliceMatch: Geometry-guided Aggregation for Cross-View Pose Estimation | |||
Azimuth Super-Resolution for FMCW Radar in Autonomous Driving | |||
V2V4Real: A Real-World Large-Scale Dataset for Vehicle-to-Vehicle Cooperative Perception | |||
Implicit Occupancy Flow Fields for Perception and Prediction in Self-Driving | |||
Coaching a Teachable Student | |||
BEV-SAN: Accurate BEV 3D Object Detection via Slice Attention Networks | |||
Center Focusing Network for Real-Time LiDAR Panoptic Segmentation | |||
IPCC-TP: Utilizing Incremental Pearson Correlation Coefficient for Joint Multi-Agent Trajectory Prediction | |||
Weakly Supervised Monocular 3D Object Detection using Multi-View Projection and Direction Consistency | |||
CXTrack: Improving 3D Point Cloud Tracking with Contextual Information | |||
ReasonNet: End-to-End Driving with Temporal and Global Reasoning | |||
Seeing with Sound: Long-Range Acoustic Beamforming for Multimodal Scene Understanding | |||
LinK: Linear Kernel for LiDAR-based 3D Perception | |||
Understanding the Robustness of 3D Object Detection with Bird's-Eye-View Representations in Autonomous Driving | |||
Tri-Perspective View for Vision-based 3D Semantic Occupancy Prediction | |||
SkyEye: Self-Supervised Bird's-Eye-View Semantic Mapping using Monocular Frontal View Images | |||
BEV-LaneDet: An Efficient 3D Lane Detection based on Virtual Camera via Key-Points | |||
OcTr: Octree-based Transformer for 3D Object Detection | |||
Instant Domain Augmentation for LiDAR Semantic Segmentation | |||
ViP3D: End-to-End Visual Trajectory Prediction via 3D Agent Queries | |||
UniSim: A Neural Closed-Loop Sensor Simulator | |||
Learning Compact Representations for LiDAR Completion and Generation | |||
Towards Unsupervised Object Detection from LiDAR Point Clouds | |||
Standing Between Past and Future: Spatio-Temporal Modeling for Multi-Camera 3D Multi-Object Tracking | |||
Benchmarking Robustness of 3D Object Detection to Common Corruptions in Autonomous Driving | |||
X3KD: Knowledge Distillation Across Modalities, Tasks and Stages for Multi-Camera 3D Object Detection | |||
PeakConv: Learning Peak Receptive Field for Radar Semantic Segmentation | |||
GD-MAE: Generative Decoder for MAE Pre-Training on LiDAR Point Clouds | |||
Neural Map Prior for Autonomous Driving | |||
Lift3D: Synthesize 3D Training Data by Lifting 2D GAN to 3D Generative Radiance Field | |||
Continuous Pseudo-Label Rectified Domain Adaptive Semantic Segmentation with Implicit Neural Representations | |||
Single Domain Generalization for LiDAR Semantic Segmentation | |||
Uncertainty-Aware Vision-based Metric Cross-View Geolocalization | |||
MixSim: A Hierarchical Framework for Mixed Reality Traffic Simulation | |||
PillarNeXt: Rethinking Network Designs for 3D Object Detection in LiDAR Point Clouds | |||
Uni3D: A Unified Baseline for Multi-Dataset 3D Object Detection | |||
CAPE: Camera View Position Embedding for Multi-View 3D Object Detection | |||
LiDAR-in-the-Loop Hyperparameter Optimization | |||
Bi3D: Bi-Domain Active Learning for Cross-Domain 3D Object Detection | |||
FEND: A Future Enhanced Distribution-Aware Contrastive Learning Framework for Long-Tail Trajectory Prediction | |||
Temporal Consistent 3D LiDAR Representation Learning for Semantic Perception in Autonomous Driving | |||
Density-Insensitive Unsupervised Domain Adaption on 3D Object Detection | |||
SGLoc: Scene Geometry Encoding for Outdoor LiDAR Localization | |||
TBP-Former: Learning Temporal Bird's-Eye-View Pyramid for Joint Perception and Prediction in Vision-Centric Autonomous Driving | |||
Localized Semantic Feature Mixers for Efficient Pedestrian Detection in Autonomous Driving | |||
Deep Dive into Gradients: Better Optimization for 3D Object Detection with Gradient-corrected IoU Supervision | |||
ProphNet: Efficient Agent-Centric Motion Forecasting with Anchor-informed Proposals | |||
BEVHeight: A Robust Framework for Vision-based Roadside 3D Object Detection | |||
VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion | |||
Hidden Gems: 4D Radar Scene Flow Learning using Cross-Modal Supervision | |||
Self-Supervised Image-to-Point Distillation via Semantically Tolerant Contrastive Loss | |||
Query-Centric Trajectory Prediction | |||
Efficient Hierarchical Entropy Model for Learned Point Cloud Compression | |||
Novel Class Discovery for 3D Point Cloud Semantic Segmentation | |||
MotionDiffuser: Controllable Multi-Agent Motion Prediction using Diffusion | |||
FJMP: Factorized Joint Multi-Agent Motion Prediction over Learned Directed Acyclic Interaction Graphs |
Title | Repo | Paper | Video |
---|---|---|---|
Large-Scale Training Data Search for Object Re-Identification | |||
Class Prototypes based Contrastive Learning for Classifying Multi-Label and Fine-grained Educational Videos | |||
V2X-Seq: A Large-Scale Sequential Dataset for Vehicle-Infrastructure Cooperative Perception and Forecasting | |||
NewsNet: A Novel Dataset for Hierarchical Temporal Segmentation | |||
CLOTH4D: A Dataset for Clothed Human Reconstruction | |||
Accelerating Dataset Distillation via Model Augmentation | |||
ImageNet-E: Benchmarking Neural Network Robustness via Attribute Editing | |||
Visual Atoms: Pre-Training Vision Transformers with Sinusoidal Waves | |||
Infinite Photorealistic Worlds using Procedural Generation | |||
CelebV-Text: A Large-Scale Facial Text-Video Dataset | |||
Spring: A High-Resolution High-Detail Dataset and Benchmark for Scene Flow, Optical Flow and Stereo | |||
Connecting Vision and Language with Video Localized Narratives | |||
Towards Artistic Image Aesthetics Assessment: A Large-scale Dataset and a New Method | |||
MD-VQA: Multi-Dimensional Quality Assessment for UGC Live Videos | |||
Toward RAW Object Detection: A New Benchmark and A New Model | |||
Objaverse: A Universe of Annotated 3D Objects | |||
Habitat-Matterport 3D Semantics Dataset | |||
Similarity Metric Learning for RGB-Infrared Group Re-Identification | |||
MISC210K: A Large-Scale Dataset for Multi-Instance Semantic Correspondence | |||
WeatherStream: Light Transport Automation of Single Image Deweathering | |||
MobileBrick: Building LEGO for 3D Reconstruction on Mobile Devices | |||
GeoNet: Benchmarking Unsupervised Adaptation Across Geographies | |||
Logical Consistency and Greater Descriptive Power for Facial Hair Attribute Learning | |||
PACO: Parts and Attributes of Common Objects | |||
Understanding Deep Generative Models with Generalized Empirical Likelihoods | |||
BEDLAM: A Synthetic Dataset of Bodies Exhibiting Detailed Lifelike Animated Motion | |||
Unicode Analogies: An Anti-Objectivist Visual Reasoning Challenge | |||
A New Comprehensive Benchmark for Semi-Supervised Video Anomaly Detection and Anticipation | |||
An In-Depth Exploration of Person Re-Identification and Gait Recognition in Cloth-Changing Conditions | |||
Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation | |||
BiasBed – Rigorous Texture Bias Evaluation | |||
A Large-Scale Homography Benchmark | |||
Exploring and Utilizing Pattern Imbalance | |||
Full or Weak Annotations? An Adaptive Strategy for Budget-constrained Annotation Campaigns | |||
ReLight My NeRF: A Dataset for Novel View Synthesis and Relighting of Real World Objects | |||
Open-Vocabulary Attribute Detection | |||
Visual DNA: Representing and Comparing Images using Distributions of Neuron Activations | |||
Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective | |||
An Image Quality Assessment Dataset for Portraits | |||
Multi-Sensor Large-Scale Dataset for Multi-View 3D Reconstruction | |||
3D-POP - An Automated Annotation Approach to Facilitate Markerless 2D-3D Tracking of Freely Moving Birds with Marker-based Motion Capture | |||
Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation | |||
Visual Localization using Imperfect 3D Models from the Internet | |||
Fantastic Breaks: A Dataset of Paired 3D Scans of Real-World Broken Objects and Their Complete Counterparts | |||
StarCraftImage: A Dataset for Prototyping Spatial Reasoning Methods for Multi-Agent Environments | |||
MammalNet: A Large-Scale Video Benchmark for Mammal Recognition and Behavior Understanding | |||
A Large-Scale Robustness Analysis of Video Action Recognition Models | |||
Affection: Learning Affective Explanations for Real-World Visual Data | |||
ShapeTalk: A Language Dataset and Framework for 3D Shape Edits and Deformations | |||
Deep Depth Estimation from Thermal Image | |||
DF-Platter: Multi-Face Heterogeneous Deepfake Dataset | |||
A New Dataset based on Images Taken by Blind People for Testing the Robustness of Image Classification Models Trained for ImageNet Categories | |||
RealImpact: A Dataset of Impact Sound Fields for Real Objects | |||
NICO++: Towards Better Benchmarking for Domain Generalization |
Title | Repo | Paper | Video |
---|---|---|---|
TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization | |||
Sibling-Attack: Rethinking Transferable Adversarial Attacks Against Face Recognition | |||
T-SEA: Transfer-based Self-Ensemble Attack on Object Detection | |||
The Enemy of My Enemy is My Friend: Exploring Inverse Adversaries for Improving Adversarial Training | |||
Trade-Off between Robustness and Accuracy of Vision Transformers | |||
Physically Realizable Natural-Looking Clothing Textures Evade Person Detectors via 3D Modeling | |||
Proximal Splitting Adversarial Attack for Semantic Segmentation | |||
Feature Separation and Recalibration for Adversarial Robustness | |||
Enhancing the Self-Universality for Transferable Targeted Attacks | |||
Backdoor Defense via Adaptively Splitting Poisoned Dataset | |||
Dynamic Generative Targeted Attacks with Pattern Injection | |||
Exploring the Relationship between Architectural Design and Adversarially Robust Generalization | |||
Discrete Point-Wise Attack Is Not Enough: Generalized Manifold Adversarial Attack for Face Recognition | |||
Towards Benchmarking and Assessing Visual Naturalness of Physical World Adversarial Attacks | |||
MaLP: Manipulation Localization using a Proactive Scheme | |||
TrojDiff: Trojan Attacks on Diffusion Models with Diverse Targets | |||
Minimizing Maximum Model Discrepancy for Transferable Black-Box Targeted Attacks | |||
Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization | |||
AGAIN: Adversarial Training with Attribution Span Enlargement and Hybrid Feature Fusion | |||
Backdoor Defense via Deconfounded Representation Learning | |||
Adversarially Robust Neural Architecture Search for Graph Neural Networks | |||
PointCert: Point Cloud Classification with Deterministic Certified Robustness Guarantees | |||
Towards Compositional Adversarial Robustness: Generalizing Adversarial Training to Composite Semantic Perturbations | |||
Physically Adversarial Infrared Patches with Learnable Shapes and Locations | |||
Color Backdoor: A Robust Poisoning Attack in Color Space | |||
Towards Effective Adversarial Textured 3D Meshes on Physical Face Recognition | |||
Turning Strengths into Weaknesses: A Certified Robustness Inspired Attack Framework Against Graph Neural Networks | |||
Randomized Adversarial Training via Taylor Expansion | |||
Backdoor Cleansing with Unlabeled Data | |||
The Best Defense is a Good Offense: Adversarial Augmentation Against Adversarial Attacks | |||
Ensemble-based Blackbox Attacks on Dense Prediction | |||
Defending Against Patch-based Backdoor Attacks on Self-Supervised Learning | |||
Adversarial Robustness via Random Projection Filters | |||
Boundary Unlearning: Rapid Forgetting of Deep Networks via Shifting the Decision Boundary | |||
Physical-World Optical Adversarial Attacks on 3D Face Recognition | |||
Black-Box Sparse Adversarial Attack via Multi-Objective Optimisation | |||
How to Backdoor Diffusion Models? | |||
The Resource Problem of using Linear Layer Leakage Attack in Federated Learning | |||
Efficient Loss Function by Minimizing the Detrimental Effect of Floating-Point Errors on Gradient-based Attacks | |||
Detecting Backdoors in Pre-trained Encoders | |||
Can't Steal? Cont-Steal! Contrastive Stealing Attacks Against Image Encoders | |||
CFA: Class-Wise Calibrated Fair Adversarial Training | |||
Towards Transferable Targeted Adversarial Examples | |||
Hierarchical Fine-grained Image Forgery Detection and Localization | |||
RIATIG: Reliable and Imperceptible Adversarial Text-to-Image Generation with Natural Prompts | |||
SlowLiDAR: Increasing the Latency of LiDAR-based Detection using Adversarial Examples | |||
Progressive Backdoor Erasing via Connecting Backdoor and Adversarial Attacks | |||
Improving the Transferability of Adversarial Samples by Path-Augmented Method | |||
Boosting Accuracy and Robustness of Student Models via Adaptive Adversarial Distillation | |||
StyLess: Boosting the Transferability of Adversarial Examples | |||
Introducing Competition to Boost the Transferability of Targeted Adversarial Examples through Clean Feature Mixup | |||
Transferable Adversarial Attacks on Vision Transformers with Token Gradient Regularization | |||
Jedi: Entropy-based Localization and Removal of Adversarial Patches | |||
Seasoning Model Soups for Robustness to Adversarial and Natural Distribution Shifts | |||
CUDA: Convolution-based Unlearnable Datasets | |||
Demystifying Causal Features on Adversarial Examples and Causal Inoculation for Robust Network by Adversarial Instrumental Variable Regression | |||
Generalist: Decoupling Natural and Robust Generalization | |||
The Dark Side of Dynamic Routing Neural Networks: Towards Efficiency Backdoor Injection | |||
Revisiting Residual Networks for Adversarial Robustness | |||
Detecting Backdoors During the Inference Stage based on Corruption Robustness Consistency | |||
Cooperation or Competition: Avoiding Player Domination for Multi-Target Robustness via Adaptive Budgets |
Title | Repo | Paper | Video |
---|---|---|---|
Polarimetric iToF: Measuring High-Fidelity Depth through Scattering Media | |||
All-in-Focus Imaging from Event Focal Stack | |||
Learning Event Guided High Dynamic Range Video Reconstruction | |||
Propagate and Calibrate: Real-Time Passive Non-Line-of-Sight Tracking | |||
Efficient View Synthesis and 3D-based Multi-Frame Denoising with Multiplane Feature Representations | |||
Occlusion-Free Scene Recovery via Neural Radiance Fields | |||
Image Super-Resolution using T-Tetromino Pixels | |||
Event-based Blurry Frame Interpolation under Blind Exposure | |||
Decoupling-and-Aggregating for Image Exposure Correction | |||
VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining | |||
The Differentiable Lens: Compound Lens Search over Glass Surfaces and Materials for Object Detection | |||
Megahertz Light Steering without Moving Parts | |||
Text2Scene: Text-Driven Indoor Scene Stylization with Part-Aware Details | |||
RankMix: Data Augmentation for Weakly Supervised Learning of Classifying Whole Slide Images with Diverse Sizes and Imbalanced Categories | |||
Guided Depth Super-Resolution by Deep Anisotropic Diffusion | |||
K3DN: Disparity-Aware Kernel Estimation for Dual-Pixel Defocus Deblurring | |||
Rawgment: Noise-Accounted RAW Augmentation Enables Recognition in a Wide Variety of Environments | |||
Low-Light Image Enhancement via Structure Modeling and Guidance | |||
Analyzing Physical Impacts using Transient Surface Wave Imaging | |||
DC2: Dual-Camera Defocus Control by Learning to Refocus | |||
pCON: Polarimetric Coordinate Networks for Neural Scene Representations | |||
Joint HDR Denoising and Fusion: A Real-World Mobile HDR Image Dataset | |||
NLOST: Non-Line-of-Sight Imaging with Transformer | |||
1000 FPS HDR Video with a Spike-RGB Hybrid Camera | |||
Thermal Spread Functions (TSF): Physics-guided Material Classification | |||
Structured Kernel Estimation for Photon-Limited Deconvolution | |||
EfficientSCI: Densely Connected Network with Space-Time Factorization for Large-Scale Video Snapshot Compressive Imaging | |||
EvShutter: Transforming Events for Unconstrained Rolling Shutter Correction | |||
Tunable Convolutions with Parametric Multi-Loss Optimization | |||
Non-Line-of-Sight Imaging with Signal Superresolution Network | |||
Few-Shot Non-Line-of-Sight Imaging with Signal-Surface Collaborative Regularization | |||
Seeing Electric Network Frequency from Events | |||
Realistic Saliency Guided Image Enhancement | |||
Learned Image Compression with Mixed Transformer-CNN Architectures | |||
Self-Supervised Blind Motion Deblurring with Deep Expectation Maximization | |||
Solving 3D Inverse Problems using Pre-trained 2D Diffusion Models | |||
Parallel Diffusion Models of Operator and Image for Blind Inverse Problems | |||
Range-Nullspace Video Frame Interpolation with Focalized Motion Estimation | |||
Combining Implicit-Explicit View Correlation for Light Field Semantic Segmentation | |||
Document Image Shadow Removal Guided by Color-Aware Background | |||
Kernel Aware Resampler | |||
Polarized Color Image Denoising | |||
Constructing Deep Spiking Neural Networks from Artificial Neural Networks with Knowledge Distillation | |||
Role of Transients in Two-Bounce Non-Line-of-Sight Imaging | |||
Inverting the Imaging Process by Learning an Implicit Camera Model | |||
Deep Polarization Reconstruction with PDAVIS Events | |||
A Unified Spatial-Angular Structured Light for Single-View Acquisition of Shape and Reflectance | |||
Energy-Efficient Adaptive 3D Sensing | |||
HDR Imaging with Spatially Varying Signal-to-Noise Ratios | |||
Swept-Angle Synthetic Wavelength Interferometry | |||
Passive Micron-Scale Time-of-Flight with Sunlight Interferometry | |||
Implicit View-Time Interpolation of Stereo Videos using Multi-Plane Disparities and Non-Uniform Coordinates | |||
Learning a Deep Color Difference Metric for Photographic Images |
Title | Repo | Paper | Video |
---|---|---|---|
Uncovering the Missing Pattern: Unified Framework Towards Trajectory Imputation and Prediction | |||
Tracking Multiple Deformable Objects in Egocentric Videos | |||
Tracking through Containers and Occluders in the Wild | |||
TarViS: A Unified Approach for Target-based Video Segmentation | |||
VideoTrack: Learning to Track Objects via Video Transformer | |||
ARKitTrack: A New Diverse Dataset for Tracking using Mobile RGB-D Data | |||
A Dynamic Multi-Scale Voxel Flow Network for Video Prediction | |||
Representation Learning for Visual Object Tracking by Masked Appearance Transfer | |||
EqMotion: Equivariant Multi-Agent Motion Prediction with Invariant Interaction Reasoning | |||
Semi-Supervised Video Inpainting with Cycle Consistency Constraints | |||
Generalized Relation Modeling for Transformer Tracking | |||
Breaking the Object in Video Object Segmentation | |||
Unifying Short and Long-Term Tracking with Graph Hierarchies | |||
Simple Cues Lead to a Strong Multi-Object Tracker | |||
Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation | |||
MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors | |||
SeqTrack: Sequence to Sequence Learning for Visual Object Tracking | |||
Joint Visual Grounding and Tracking with Natural Language Specification | |||
Boosting Video Object Segmentation via Space-Time Correspondence Learning | |||
Visual Prompt Multi-Modal Tracking | |||
OVTrack: Open-Vocabulary Multiple Object Tracking | |||
TransFlow: Transformer as Flow Learner | |||
Focus on Details: Online Multi-Object Tracking with Diverse Fine-grained Representation | |||
Autoregressive Visual Tracking | |||
Bootstrapping Objectness from Videos by Relaxed Common Fate and Visual Grouping | |||
Tangentially Elongated Gaussian Belief Propagation for Event-based Incremental Optical Flow Estimation | |||
Bridging Search Region Interaction with Template for RGB-T Tracking | |||
Efficient RGB-T Tracking via Cross-Modality Distillation | |||
MotionTrack: Learning Robust Short-Term and Long-Term Motions for Multi-Object Tracking | |||
Self-Supervised AutoFlow | |||
UTM: A Unified Multiple Object Tracking Model with Identity-Aware Feature Enhancement | |||
BiFormer: Learning Bilateral Motion Estimation via Bilateral Transformer for 4K Video Frame Interpolation | |||
Spatial-then-Temporal Self-Supervised Learning for Video Correspondence | |||
BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects | |||
MED-VT: Multiscale Encoder-Decoder Video Transformer with Application to Object Segmentation | |||
Context-Aware Relative Object Queries to Unify Video Instance and Panoptic Segmentation | |||
Unsupervised Space-Time Network for Temporally-Consistent Segmentation of Multiple Motions | |||
Resource-Efficient RGBD Aerial Tracking | |||
MMVC: Learned Multi-Mode Video Compression with Block-based Prediction Mode Selection and Density-Adaptive Entropy Coding | |||
Streaming Video Model | |||
Weakly Supervised Class-Agnostic Motion Prediction for Autonomous Driving | |||
LSTFE-Net: Long Short-Term Feature Enhancement Network for Video Small Object Detection | |||
DistractFlow: Improving Optical Flow Estimation via Realistic Distractions and Pseudo-Labeling | |||
SCOTCH and SODA: A Transformer Video Shadow Detection Framework | |||
ZBS: Zero-Shot Background Subtraction via Instance-Level Background Modeling and Foreground Selection | |||
Frame-Event Alignment and Fusion Network for High Frame Rate Tracking | |||
Title | Repo | Paper | Video |
---|---|---|---|
Context De-confounded Emotion Recognition | |||
Intrinsic Physical Concepts Discovery with Object-Centric Predictive Models | |||
Automatic High Resolution Wire Segmentation and Removal | |||
Class Balanced Adaptive Pseudo Labeling for Federated Semi-Supervised Learning | |||
Weakly Supervised Video Emotion Detection and Prediction via Cross-Modal Temporal Erasing Network | |||
Probing Sentiment-Oriented Pre-Training Inspired by Human Sentiment Perception Mechanism | |||
DIP: Dual Incongruity Perceiving Network for Sarcasm Detection | |||
Adaptive Human Matting for Dynamic Videos | |||
LayoutFormer++: Conditional Graphic Layout Generation via Constraint Serialization and Decoding Space Restriction | |||
Prototypical Residual Networks for Anomaly Detection and Localization | |||
Are Binary Annotations Sufficient? Video Moment Retrieval via Hierarchical Uncertainty-based Active Learning | |||
Affordance Grounding from Demonstration Video to Target Image | |||
Natural Language-Assisted Sign Language Recognition | |||
CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning | |||
Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation | |||
Collaborative Noisy Label Cleaner: Learning Scene-Aware Trailers for Multi-Modal Highlight Detection in Movies | |||
Open-Set Fine-grained Retrieval via Prompting Vision-Language Evaluator | |||
Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking | |||
Visual Exemplar Driven Task-Prompting for Unified Perception in Autonomous Driving | |||
Exploiting Unlabelled Photos for Stronger Fine-grained SBIR | |||
What Can Human Sketches Do for Object Detection? | |||
Dynamic Conceptional Contrastive Learning for Generalized Category Discovery | |||
Balanced Energy Regularization Loss for Out-of-Distribution Detection | |||
Lite DETR : An Interleaved Multi-Scale Encoder for Efficient DETR | |||
CLIP for All Things Zero-Shot Sketch-based Image Retrieval, Fine-grained or Not | |||
PosterLayout: A New Benchmark and Approach for Content-Aware Visual-Textual Presentation Layout | |||
Re-Thinking Federated Active Learning based on Inter-Class Diversity | |||
Consistent-Teacher: Towards Reducing Inconsistent Pseudo-Targets in Semi-Supervised Object Detection | |||
Cloud-Device Collaborative Adaptation to Continual Changing Environments in the Real-World | |||
Bridging Precision and Confidence: A Train-Time Loss for Calibrating Object Detection | |||
AccelIR: Task-Aware Image Compression for Accelerating Neural Restoration | |||
Multiclass Confidence and Localization Calibration for Object Detection | |||
Auto-CARD: Efficient and Robust Codec Avatar Driving for Real-Time Mobile Telepresence | |||
Deep Random Projector: Accelerated Deep Image Prior | |||
SIEDOB: Semantic Image Editing by Disentangling Object and Background | |||
Title | Repo | Paper | Video |
---|---|---|---|
Object-Goal Visual Navigation via Effective Exploration of Relations among Historical Navigation States | |||
TTA-COPE: Test-Time Adaptation for Category-Level Object Pose Estimation | |||
Meta-Explore: Exploratory Hierarchical Vision-and-Language Navigation using Scene Object Spectrum Grounding | |||
Learning Human-to-Robot Handovers from Point Clouds | |||
Robot Structure Prior Guided Temporal Attention for Camera-to-Robot Pose Estimation from Image Sequence | |||
PartManip: Learning Cross-Category Generalizable Part Manipulation Policy from Point Cloud Observations | |||
DexArt: Benchmarking Generalizable Dexterous Manipulation with Articulated Objects | |||
PyPose: A Library for Robot Learning with Physics-based Optimization | |||
Target-Referenced Reactive Grasping for Dynamic Objects | |||
Autonomous Manipulation Learning for Similar Deformable Objects via only One Demonstration | |||
Renderable Neural Radiance Map for Visual Navigation | |||
Efficient Map Sparsification based on 2D and 3D Discretized Grids | |||
Policy Adaptation from Foundation Model Feedback | |||
NeRF in the Palm of Your Hand: Corrective Augmentation for Robotics via Novel-View Synthesis | |||
Markerless Camera-to-Robot Pose Estimation via Self-Supervised Sim-to-Real Transfer | |||
Affordances from Human Videos as a Versatile Representation for Robotics | |||
DeepMapping2: Self-Supervised Large-Scale LiDAR Map Optimization | |||
GrowSP: Unsupervised Semantic Segmentation of 3D Point Clouds | |||
Neural Volumetric Memory for Visual Locomotion Control | |||
Multi-Object Manipulation via Object-Centric Neural Scattering Functions | |||
Local-guided Global: Paired Similarity Representation for Visual Reinforcement Learning | |||
HypLiLoc: Towards Effective LiDAR Pose Regression with Hyperbolic Fusion | |||
Imitation Learning as State Matching via Differentiable Physics | |||
Title | Repo | Paper | Video |
---|---|---|---|
Effective Ambiguity Attack Against Passport-based DNN Intellectual Property Protection Schemes through Fully Connected Layer Substitution | |||
Progressive Open Space Expansion for Open-Set Model Attribution | |||
Breaching FedMD: Image Recovery via Paired-Logits Inversion Attack | |||
DartBlur: Privacy Preservation with Detection Artifact Suppression | |||
Reinforcement Learning-based Black-Box Model Inversion Attacks | |||
Model-Agnostic Gender Debiased Image Captioning | |||
Uncurated Image-Text Datasets: Shedding Light on Demographic Bias | |||
AltFreezing for more General Video Face Forgery Detection | |||
Make Landscape Flatter in Differentially Private Federated Learning | |||
DynaFed: Tackling Client Data Heterogeneity with Global Dynamics | |||
Re-Thinking Model Inversion Attacks Against Deep Neural Networks | |||
Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models | |||
TrojViT: Trojan Insertion in Vision Transformers | |||
Difficulty-based Sampling for Debiased Contrastive Representation Learning | |||
Model Barrier: A Compact Un-Transferable Isolation Domain for Model Intellectual Property Protection | |||
Fair Scratch Tickets: Finding Fair Sparse Networks without Weight Training | |||
CLIP2Protect: Protecting Facial Privacy using Text-guided Makeup via Adversarial Latent Search | |||
Bias in Pruned Vision Models: In-Depth Analysis and Countermeasures | |||
Learning to Generate Image Embeddings with User-Level Differential Privacy | |||
Bias Mimicking: A Simple Sampling Approach for Bias Mitigation | |||
CaPriDe Learning: Confidential and Private Decentralized Learning based on Encryption-Friendly Distillation Loss | |||
DeAR: Debiasing Vision-Language Models with Additive Residuals | |||
Deep Deterministic Uncertainty: A New Simple Baseline | |||
Manipulating Transfer Learning for Property Inference | |||
Training Debiased Subnetworks with Contrastive Weight Pruning | |||
Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models | |||
STDLens: Model Hijacking-Resilient Federated Learning for Object Detection | |||
Architectural Backdoors in Neural Networks | |||
MEDIC: Remove Model Backdoors via Importance Driven Cloning | |||
Learning Debiased Representations via Conditional Attribute Interpolation | |||
Title | Repo | Paper | Video |
---|---|---|---|
Open-World Multi-Task Control through Goal-Aware Representation Learning and Adaptive Horizon Prediction | |||
Layout-based Causal Inference for Object Navigation | |||
EC2: Emergent Communication for Embodied Control | |||
GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts | |||
Phone2Proc: Bringing Robust Robots into Our Chaotic World | |||
PIRLNav: Pretraining with Imitation and RL Finetuning for ObjectNav | |||
CoWs on PASTURE: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation | |||
3D-Aware Object Goal Navigation via Simultaneous Exploration and Identification | |||
Modality-Invariant Visual Odometry for Embodied Vision | |||
UniDexGrasp: Universal Robotic Dexterous Grasping via Learning Diverse Proposal Generation and Goal-Conditioned Policy | |||
EXCALIBUR: Encouraging and Evaluating Embodied Exploration | |||
Leverage Interactive Affinity for Affordance Learning | |||
LANA: A Language-Capable Navigator for Instruction Following and Generation | |||
Galactic: Scaling End-to-End Reinforcement Learning for Rearrangement at 100k Steps-Per-Second | |||
Title | Repo | Paper | Video |
---|---|---|---|
Towards Flexible Multi-Modal Document Models | |||
Improving Table Structure Recognition with Visual-Alignment Sequential Coordinate Modeling | |||
Unifying Layout Generation with a Decoupled Diffusion Model | |||
Conditional Text Image Generation with Diffusion Models | |||
Turning a CLIP Model into a Scene Text Detector | |||
Unifying Vision, Text, and Layout for Universal Document Processing | |||
Modeling Entities as Semantic Points for Visual Information Extraction in the Wild | |||
GeoLayoutLM: Geometric Pre-Training for Visual Information Extraction | |||
Handwritten Text Generation from Visual Archetypes | |||
Towards Robust Tampered Text Detection in Document Image: New Dataset and New Solution | |||
M6Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis | |||
Disentangling Writer and Character Styles for Handwriting Generation | |||
Title | Repo | Paper | Video |
---|---|---|---|
Deep Incomplete Multi-View Clustering with Cross-View Partial Sample and Prototype Alignment | |||
Towards Better Decision Forests: Forest Alternating Optimization | |||
Class Adaptive Network Calibration | |||
Defining and Quantifying the Emergence of Sparse Concepts in DNNs | |||
MOT: Masked Optimal Transport for Partial Domain Adaptation | |||
Adaptive Graph Convolutional Subspace Clustering | |||
Reliable and Interpretable Personalized Federated Learning | |||
Confidence-Aware Personalized Federated Learning via Variational Expectation Maximization | |||
Efficient Verification of Neural Networks Against LVM-based Specifications | |||
You Are Catching My Attention: Are Vision Transformers Bad Learners under Backdoor Attacks? | |||
Leveraging Inter-Rater Agreement for Classification in the Presence of Noisy Labels | |||
Sliced Optimal Partial Transport | |||
A Meta-Learning Approach to Predicting Performance and Data Requirements | |||
Towards Effective Visual Representations for Partial-Label Learning | |||
Title | Repo | Paper | Video |
---|---|---|---|
Learning Anchor Transformations for 3D Garment Animation | |||
High-Fidelity Event-Radiance Recovery via Transient Event Frequency | |||
Complementary Intrinsics from Neural Radiance Fields and CNNs for Outdoor Scene Relighting | |||
Fresnel Microfacet BRDF: Unification of Polari-Radiometric Surface-Body Reflection | |||
Event-based Shape from Polarization | |||
Weakly-Supervised Single-View Image Relighting | |||
DANI-Net: Uncalibrated Photometric Stereo by Differentiable Shadow Handling, Anisotropic Reflectance Modeling, and Neural Inverse Rendering | |||
Learning Accurate 3D Shape based on Stereo Polarimetric Imaging | |||
Visibility Constrained Wide-Band Illumination Spectrum Design for Seeing-in-the-Dark | |||
Light Source Separation and Intrinsic Image Decomposition under AC Illumination | |||
OReX: Object Reconstruction from Planar Cross-Sections using Neural Fields | |||
Unsupervised Intrinsic Image Decomposition with LiDAR Intensity | |||
Title | Repo | Paper | Video |
---|---|---|---|
Pose Synchronization under Multiple Pair-Wise Relative Poses | |||
Adaptive Global Decay Process for Event Cameras | |||
Wide-Angle Rectification via Content-Aware Conformal Mapping | |||
On the Convergence of IRLS and its Variants in Outlier-Robust Estimation | |||
A General Regret Bound of Preconditioned Gradient Method for DNN Training | |||
Robust and Scalable Gaussian Process Regression and its Applications | |||
EMT-NAS: Transferring Architectural Knowledge between Tasks from Different Datasets | |||
Transformer-based Learned Optimization | |||
Efficient Robust Principal Component Analysis via Block Krylov Iteration and CUR Decomposition | |||
Solving Relaxations of MAP-MRF Problems: Combinatorial In-Face Frank-Wolfe Directions | |||
Robust Generalization Against Photon-Limited Corruptions via Worst-Case Sharpness Minimization | |||
Elastic Aggregation for Federated Optimization | |||
Title | Repo | Paper | Video |
---|---|---|---|
MethaneMapper: Spectral Absorption Aware Hyperspectral Transformer for Methane Detection | |||
Probability-based Global Cross-Modal Upsampling for Pansharpening | |||
Learning Correspondence Uncertainty via Differentiable Nonlinear Least Squares | |||
Dynamic Coarse-to-Fine Learning for Oriented Tiny Object Detection | |||
ViTs for SITS: Vision Transformers for Satellite Image Time Series | |||
Quantum-Inspired Spectral-Spatial Pyramid Network for Hyperspectral Image Classification | |||
TopDiG: Class-Agnostic Topological Directional Graph Extraction from Remote Sensing Images | |||
OmniCity: Omnipotent City Understanding with Multi-Level and Multi-View Images | |||
Title | Repo | Paper | Video |
---|---|---|---|
Neural Dependencies Emerging from Learning Massive Categories | |||
Gaussian Label Distribution Learning for Spherical Image Object Detection | |||
Unbalanced Optimal Transport: A Unified Framework for Object Detection | |||
DropKey for Vision Transformer | |||
SplineCam: Exact Visualization and Characterization of Deep Network Geometry and Decision Boundaries | |||
Title | Repo | Paper | Video |
---|---|---|---|
A Bag-of-Prototypes Representation for Dataset-Level Applications | |||
Learning to Retain while Acquiring: Combating Distribution-Shift in Adversarial Data-Free Knowledge Distillation | |||
Label Information Bottleneck for Label Enhancement | |||
DISC: Learning from Noisy Labels via Dynamic Instance-Specific Selection and Correction | |||
Restoration of Hand-Drawn Architectural Drawings using Latent Space Mapping with Degradation Generator | |||
DaFKD: Domain-Aware Federated Knowledge Distillation | |||
Enhanced Stable View Synthesis | |||
ScaleFL: Resource-Adaptive Federated Learning with Heterogeneous Clients | |||
GradMA: A Gradient-Memory-based Accelerated Federated Learning with Alleviated Catastrophic Forgetting | |||
High-Resolution Image Reconstruction with Latent Diffusion Models from Human Brain Activity | |||
A Unified Knowledge Distillation Framework for Deep Directed Graphical Models | |||
How to Prevent the Poor Performance Clients for Personalized Federated Learning? | |||