This is the official code repository for "Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning", published at ICLR 2023.
Multimodal few-shot learning is challenging due to the large domain gap between the vision and language modalities. Existing methods try to communicate visual concepts as prompts to frozen language models, but they rely on hand-engineered task induction to reduce the hypothesis space. To address these limitations and make the process learnable, we propose a multimodal meta-learning approach.
Our approach frames model training as observing a collection of multimodal few-shot tasks. We introduce a meta-mapper network, which serves as a meta-learner, effectively bridging the gap between frozen large-scale vision and language models and leveraging their pre-existing learned capacity. By updating only the learnable parameters of the meta-mapper, the model learns to accumulate shared meta-knowledge across these tasks.
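As a rough illustration of this frozen-backbones-plus-learnable-mapper setup (not the exact architecture or hyperparameters from the paper), the PyTorch sketch below maps frozen CLIP image features to a short prefix of GPT-2 input embeddings; only the mapper's parameters receive gradients. The module name, prefix length, and the simple linear projection are illustrative assumptions.

```python
# Sketch only: a small trainable mapper bridges a frozen CLIP vision encoder
# and a frozen GPT-2; module name, prefix length, and projection are assumptions.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, GPT2LMHeadModel

class MetaMapper(nn.Module):
    """Maps a CLIP image embedding to a short prefix of GPT-2 input embeddings."""
    def __init__(self, clip_dim, gpt_dim, prefix_len=4):
        super().__init__()
        self.prefix_len, self.gpt_dim = prefix_len, gpt_dim
        self.proj = nn.Linear(clip_dim, prefix_len * gpt_dim)   # the only trainable part

    def forward(self, clip_feats):                              # clip_feats: (B, clip_dim)
        return self.proj(clip_feats).view(-1, self.prefix_len, self.gpt_dim)

vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
for p in list(vision.parameters()) + list(lm.parameters()):
    p.requires_grad = False                                     # both backbones stay frozen

mapper = MetaMapper(vision.config.hidden_size, lm.config.n_embd)

def caption_loss(mapper, pixel_values, input_ids, labels):
    """Language-modeling loss of the frozen GPT-2 conditioned on the visual prefix."""
    with torch.no_grad():
        clip_feats = vision(pixel_values=pixel_values).pooler_output  # (B, clip_dim)
    prefix = mapper(clip_feats)                                  # (B, prefix_len, gpt_dim)
    token_embeds = lm.transformer.wte(input_ids)                 # caption token embeddings
    inputs_embeds = torch.cat([prefix, token_embeds], dim=1)
    ignore = torch.full(prefix.shape[:2], -100, dtype=labels.dtype)  # no loss on prefix
    return lm(inputs_embeds=inputs_embeds,
              labels=torch.cat([ignore, labels], dim=1)).loss
```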
First clone the project, create the environment and install dependencies:
```
git clone https://github.com/ivonajdenkoska/multimodal-meta-learn.git
conda env create -f environment.yml
conda activate multimodal_meta_learn
```
Download the multimodal few-shot datasets from here and place them in your data folder, which will be assigned to `--data_path`. Also, download the COCO image captioning dataset from here.
To perform meta-training with the COCO captioning dataset, first run `parse_coco.py` to obtain the preprocessed COCO pickle file.
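As a hedged illustration of what this preprocessing amounts to (encoding each image once with a frozen CLIP vision encoder and pickling the embeddings alongside the captions), here is a minimal sketch; the file names, dictionary keys, and model choice are assumptions, and the actual output format is defined by `parse_coco.py`.

```python
# Sketch only: encode images with frozen CLIP and pickle embedding/caption pairs.
# Paths, keys, and model choice are illustrative assumptions, not the script's interface.
import pickle
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").eval()

entries = []
for image_path, caption in [("example.jpg", "a dog on the beach")]:  # placeholder pairs
    pixel_values = processor(images=Image.open(image_path), return_tensors="pt").pixel_values
    with torch.no_grad():
        embedding = encoder(pixel_values=pixel_values).pooler_output.squeeze(0)
    entries.append({"clip_embedding": embedding, "caption": caption})

with open("coco_preprocessed.pkl", "wb") as f:
    pickle.dump(entries, f)
```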
To train the full model, run `python main.py`. In this script, you can choose the `episodic` method to perform meta-training or the `non_episodic` one to perform standard mini-batched training.
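As a simplified picture of the episodic option, the sketch below runs a first-order, MAML-style loop: a copy of the meta-mapper is adapted on each task's support set, and the query loss updates the shared meta-parameters. The episode format, learning rates, number of inner steps, and the reuse of the hypothetical `caption_loss` helper from the sketch above are assumptions; the actual loop is implemented in `main.py`.

```python
# First-order, MAML-style sketch of episodic meta-training over the meta-mapper.
# `episodes` is assumed to yield (support_batch, query_batch) pairs, each batch a
# (pixel_values, input_ids, labels) tuple compatible with the caption_loss sketch above.
import copy
import torch

def episodic_meta_train(mapper, episodes, caption_loss,
                        inner_lr=1e-2, outer_lr=1e-4, inner_steps=1):
    meta_opt = torch.optim.AdamW(mapper.parameters(), lr=outer_lr)
    for support_batch, query_batch in episodes:          # one episode = one few-shot task
        # Inner loop: adapt a throwaway copy of the meta-mapper on the support set.
        fast = copy.deepcopy(mapper)
        inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            inner_opt.zero_grad()
            caption_loss(fast, *support_batch).backward()
            inner_opt.step()
        # Outer loop (first-order approximation): gradients of the query loss
        # w.r.t. the adapted copy update the shared meta-parameters.
        fast.zero_grad()
        caption_loss(fast, *query_batch).backward()
        meta_opt.zero_grad()
        for meta_p, fast_p in zip(mapper.parameters(), fast.parameters()):
            meta_p.grad = fast_p.grad.clone()
        meta_opt.step()
```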
To perform inference with trained models, run `python main_inference.py`.
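For intuition, the sketch below shows a minimal greedy-decoding loop with a trained mapper, reusing the `vision`, `lm`, and `mapper` objects from the first sketch; support-set conditioning and the exact decoding strategy are simplified assumptions, and `main_inference.py` implements the actual procedure.

```python
# Sketch only: map the query image to prefix embeddings and let the frozen GPT-2
# continue from them with greedy decoding. Assumes a single preprocessed image
# (batch of 1) and the vision/lm/mapper objects from the first sketch.
import torch
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

@torch.no_grad()
def generate_caption(vision, lm, mapper, pixel_values, max_new_tokens=20):
    prefix = mapper(vision(pixel_values=pixel_values).pooler_output)  # (1, prefix_len, gpt_dim)
    embeds, generated = prefix, []
    for _ in range(max_new_tokens):
        next_id = lm(inputs_embeds=embeds).logits[:, -1, :].argmax(dim=-1)  # greedy token
        if next_id.item() == tokenizer.eos_token_id:
            break
        generated.append(next_id.item())
        embeds = torch.cat([embeds, lm.transformer.wte(next_id).unsqueeze(1)], dim=1)
    return tokenizer.decode(generated)
```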
If you find this code or the paper useful for your work, please cite:
```bibtex
@inproceedings{najdenkoska2023meta,
  title={Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning},
  author={Ivona Najdenkoska and Xiantong Zhen and Marcel Worring},
  booktitle={The Eleventh International Conference on Learning Representations},
  year={2023},
  url={https://openreview.net/forum?id=3oWo92cQyxL}
}
```
This repository uses HuggingFace and is based on the ClipCap and MAML code repositories.