Awesome-Text-only-Training

This project collects papers on text-only (image-free, language-free) training for multimodal tasks. A minimal sketch of the recipe shared by many of these papers is given below the topic list.

It covers text-only training for:

  • zero-shot image/audio/video captioning
  • zero-shot composed image retrieval
  • visual storytelling, visual question answering, ...
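
For orientation, here is a minimal, illustrative sketch (in PyTorch) of the pattern shared by many of the zero-shot captioning papers listed here: train a lightweight mapper and decoder on CLIP text embeddings only, injecting noise to bridge the modality gap, then feed CLIP image embeddings at inference. The class name, dimensions, and noise level below are assumptions for illustration, not taken from any specific paper in this list.

```python
import torch
import torch.nn as nn

class EmbeddingToPrefix(nn.Module):
    """Maps a (noised) CLIP embedding to a prefix for a language decoder.

    Illustrative sketch only; dimensions and noise level are assumptions.
    """

    def __init__(self, clip_dim=512, prefix_len=10, lm_dim=768):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        self.proj = nn.Linear(clip_dim, prefix_len * lm_dim)

    def forward(self, clip_embed, noise_std=0.1):
        # Text-only training: perturb the CLIP *text* embedding with Gaussian
        # noise so the decoder learns to tolerate the text-to-image shift
        # (the "modality gap").
        if self.training:
            clip_embed = clip_embed + noise_std * torch.randn_like(clip_embed)
        clip_embed = clip_embed / clip_embed.norm(dim=-1, keepdim=True)
        # Project to a sequence of pseudo-token embeddings ("prefix") that a
        # language model can condition on.
        prefix = self.proj(clip_embed)
        return prefix.view(-1, self.prefix_len, self.lm_dim)

# Training (text only): prefix = mapper(clip_text_embed); condition a language
# model on `prefix` and minimize cross-entropy against the caption itself.
# Inference (zero-shot): prefix = mapper.eval()(clip_image_embed); then decode.
```

Because the caption supplies both the input embedding and the supervision signal at training time, no paired images are needed; the noise injection is what lets the decoder accept image embeddings at test time.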

Papers


> 2024

  • [AAAI] [1] Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training [paper] [code] [⭐8]

  • [ACM] [1] TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model [paper]

  • [IJCV] [5] Learning to Prompt with Text Only Supervision for Vision-Language Models [paper] [code] [⭐80]

  • [arXiv] [0] Text Data-Centric Image Captioning with Interactive Prompts [paper]

  • [arXiv] [2] MeaCap: Memory-Augmented Zero-shot Image Captioning [paper] [code] [⭐27]

  • [arXiv] [0] ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks [paper]

  • [arXiv] [0] Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion [paper]

  • [arXiv] [0] IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning [paper] [code] [⭐4]

  • [arXiv] [0] From Unimodal to Multimodal: Scaling up Projectors to Align Modalities [paper]

  • [arXiv] [0] Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning [paper] [code] [⭐0]

  • [arXiv] [0] DRCap: Decoding CLAP Latents with Retrieval-augmented Generation for Zero-shot Audio Captioning [paper]


> 2023

  • [arXiv] [1] Improved Factorized Neural Transducer Model For text-only Domain Adaptation [paper]

  • [AAAI] [2] Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning [paper]

  • [ACM] [0] Text-Only Training for Visual Storytelling [paper]

  • [ACM] [1] VLIS: Unimodal Language Models Guide Multimodal Language Generation [paper] [code] [⭐24]

  • [NeurIPS] [7] LOVM: Language-Only Vision Model Selection [paper] [code] [⭐18]

  • [IJCAI] [4] From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping [paper] [code] [⭐11]

  • [ICLR] [47] Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training [paper] [code] [⭐118]

  • [DCASE] [4] Weakly-supervised Automated Audio Captioning via text only training [paper] [code] [⭐1]

  • [ACM] [3] CgT-GAN: CLIP-guided Text GAN for Image Captioning [paper] [code] [⭐16]


> 2022

  • [EMNLP] [64] Text-Only Training for Image Captioning using Noise-Injected CLIP [paper] [code] [⭐179]

  • [ICCV] [11] I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision [paper] [code] [⭐55]

  • [arXiv] [28] Multimodal Knowledge Alignment with Reinforcement Learning [paper] [code] [⭐22]


> 2021

  • [CVPR] [124] LAFITE: Towards Language-Free Training for Text-to-Image Generation [paper] [code] [⭐180]