An up-to-date list of vision-language (VL) pre-training papers! Maintained by Haofan Wang (haofanwang.ai@gmail.com).
Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text, Google.
ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation, Baidu.
*FLAVA: A Foundational Language And Vision Alignment Model, Meta AI.
A Fistful of Words: Learning Transferable Visual Models from Bag-of-Words Supervision, Meta AI.
*SLIP: Self-supervision meets Language-Image Pre-training, [code], Meta AI.
*MLIM: Vision-and-Language Model Pre-training with Masked Language and Image Modeling, Amazon.
*Data Efficient Masked Language Modeling for Vision and Language, [code], Ben Gurion University.
VL-BERT: Pre-training of Generic Visual-Linguistic Representations, [code], MSRA.
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, [code], MSRA.
UNITER: UNiversal Image-TExt Representation Learning, [code], Microsoft.
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, [code], Meta AI.
Masked Feature Prediction for Self-Supervised Visual Pre-Training, Meta AI.
SimMIM: A Simple Framework for Masked Image Modeling, [code], MSRA.
PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers, [code], MSRA.
Masked Autoencoders Are Scalable Vision Learners, [Unofficial code], Meta AI.
iBOT: Image BERT Pre-Training with Online Tokenizer, [code], ByteDance.
BEiT: BERT Pre-Training of Image Transformers, [code], MSRA.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, [code], Google.
Align and Prompt: Video-and-Language Pre-training with Entity Prompts, Salesforce.
FILIP: Fine-grained Interactive Language-Image Pre-Training, [Unofficial code], Huawei.
LiT: Zero-Shot Transfer with Locked-image Text Tuning, [Unofficial code], Google.
Multimodal Few-Shot Learning with Frozen Language Models, DeepMind.
Prompting Visual-Language Models for Efficient Video Understanding, [code], SJTU.
Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling, [code], CUHK.