This project collects papers on text-only (image-free) training for multimodal tasks.
It covers text-only training for:
- zero-shot image/audio/video captioning
- zero-shot composed image retrieval
- visual storytelling, visual question answering, ...
- [AAAI] | [1] Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training [paper] [code] [⭐8]
- [ACM] | [1] TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model [paper]
- [IJCV] | [5] Learning to Prompt with Text Only Supervision for Vision-Language Models [paper] [code] [⭐80]
- [arxiv] | [0] Text Data-Centric Image Captioning with Interactive Prompts [paper]
- [arxiv] | [2] MeaCap: Memory-Augmented Zero-shot Image Captioning [paper] [code] [⭐27]
- [arxiv] | [0] ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks [paper]
- [arxiv] | [0] Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion [paper]
- [arxiv] | [0] IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning [paper] [code] [⭐4]
- [arxiv] | [0] From Unimodal to Multimodal: Scaling up Projectors to Align Modalities [paper]
- [arxiv] | [0] Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning [paper] [code] [⭐0]
- [arxiv] | [0] DRCap: Decoding CLAP Latents with Retrieval-augmented Generation for Zero-shot Audio Captioning [paper]
- [arxiv] | [1] Improved Factorized Neural Transducer Model For text-only Domain Adaptation [paper]
- [AAAI] | [2] Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning [paper]
- [ACM] | [0] Text-Only Training for Visual Storytelling [paper]
- [ACM] | [1] VLIS: Unimodal Language Models Guide Multimodal Language Generation [paper] [code] [⭐24]
- [NeurIPS] | [7] LOVM: Language-Only Vision Model Selection [paper] [code] [⭐18]
- [IJCAI] | [4] From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping [paper] [code] [⭐11]
- [ICLR] | [47] Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training [paper] [code] [⭐118]
- [DCASE] | [4] Weakly-supervised Automated Audio Captioning via text only training [paper] [code] [⭐1]
- [ACM] | [3] CgT-GAN: CLIP-guided Text GAN for Image Captioning [paper] [code] [⭐16]
- [EMNLP] | [64] Text-Only Training for Image Captioning using Noise-Injected CLIP [paper] [code] [⭐179]
- [ICCV] | [11] I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision [paper] [code] [⭐55]
- [arxiv] | [28] Multimodal Knowledge Alignment with Reinforcement Learning [paper] [code] [⭐22]