An up-to-date list of vision-language (VL) pre-training papers! Maintained by Haofan Wang (haofanwang.ai@gmail.com).
Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text, Google.
ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation, Baidu.
*FLAVA: A Foundational Language And Vision Alignment Model, Meta AI.
A Fistful of Words: Learning Transferable Visual Models from Bag-of-Words Supervision, Meta AI.
*SLIP: Self-supervision meets Language-Image Pre-training, [code], Meta AI.
*MLIM: Vision-and-Language Model Pre-training with Masked Language and Image Modeling, Amazon.
*Data Efficient Masked Language Modeling for Vision and Language, [code], Ben Gurion University.
VL-BERT: Pre-training of Generic Visual-Linguistic Representations, [code], MSRA.
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, [code], MSRA.
UNITER: UNiversal Image-TExt Representation Learning, [code], Microsoft.
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, [code], Meta AI.
Masked Feature Prediction for Self-Supervised Visual Pre-Training, Meta AI.
SimMIM: A Simple Framework for Masked Image Modeling, [code], MSRA.
PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers, [code], MSRA.
Masked Autoencoders Are Scalable Vision Learners, [Unofficial code], Meta AI.
iBOT: Image BERT Pre-Training with Online Tokenizer, [code], ByteDance.
BEiT: BERT Pre-Training of Image Transformers, [code], MSRA.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, [code], Google.
Align and Prompt: Video-and-Language Pre-training with Entity Prompts, Salesforce.
FILIP: Fine-grained Interactive Language-Image Pre-Training, [Unofficial code], Huawei.
LiT: Zero-Shot Transfer with Locked-image Text Tuning, [Unofficial code], Google.
Multimodal Few-Shot Learning with Frozen Language Models, DeepMind.
Prompting Visual-Language Models for Efficient Video Understanding, [code], SJTU.
Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling, [code], CUHK.