Visual Language Models (VLMs) have made significant progress on various downstream tasks through the development of large-scale multimodal models. However, they often lack reasoning and in-context learning abilities. Large Language Models (LLMs), on the other hand, have revolutionized the NLP community with their strong reasoning and in-context learning capabilities: at inference time they can quickly adapt to new tasks such as question answering and commonsense reasoning, without fine-tuning or parameter updates.
Studying in-context learning abilities helps VLMs generalize to new knowledge in lifelong-learning settings and is an important step toward more capable multimodal AI. We therefore propose the MIC (Multimodality In-Context Learning) dataset, a manually constructed instruction-tuning dataset that supports interleaved text-image inputs, inter-related multi-image inputs, and multimodal in-context learning inputs. By fine-tuning VLMs on MIC, we equip them with multimodal in-context learning capabilities and the ability to understand complex relationships between instructions and multiple images.
The start_genate.sh script generates one-shot instances of the various datasets in the template format we provide. The clip_video.py and split_image.py scripts preprocess the video datasets and the VCR dataset, respectively.
The template and config for each dataset can be found in the prompts folder. The dataloader classes for the different datasets can be found in the datasets_new folder.
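For orientation, here is a purely illustrative sketch of what a one-shot, interleaved instance might look like. The field names and the `<image0>`/`<image1>` placeholder tokens below are assumptions for illustration only; the actual schema and wording are defined by the templates and configs in the prompts folder.

```python
# Hypothetical one-shot instance: one in-context demonstration followed by
# the query. Field names and placeholder tokens are illustrative, not the
# exact MIC schema.
example = {
    "input_text": (
        "image 0 is <image0>. Question: What is the animal doing? "
        "Answer: sleeping. "
        "image 1 is <image1>. Question: What is the animal doing? Answer:"
    ),
    "output_text": "running",
    "input_image": ["in_context_demo.jpg", "query_image.jpg"],
}
```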
Afterwards, you can use the data_preprocess_save_arrow.py script to load the data and save it as Arrow files.
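Conceptually, the save step could look like the following minimal sketch. This is only an assumption-level illustration: the real preprocessing, sharding, and file naming live in data_preprocess_save_arrow.py, and ArrowWriter is an internal helper of huggingface datasets rather than a stable public API.

```python
from datasets.arrow_writer import ArrowWriter

# Hypothetical preprocessed examples; the real logic is in
# data_preprocess_save_arrow.py and may differ.
examples = [
    {"input_text": "...", "output_text": "...", "input_image": ["img.jpg"]},
]

# ArrowWriter writes a single .arrow file that Dataset.from_file
# (used by the loader below) can read back.
with ArrowWriter(path="MIC_data/vqa/train/mmicl_0.arrow") as writer:
    for ex in examples:
        writer.write(ex)
    writer.finalize()
```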
You can easily load the saved Arrow files into a huggingface datasets Dataset with the following code:
```python
from glob import glob
import os

from datasets import Dataset, concatenate_datasets


def load_instruct_dataset_from_arrow(dataset_folder):
    """Collect the train/val/test Arrow shards of every dataset under
    `dataset_folder` and concatenate them into three Datasets."""
    train_files, test_files, val_files = [], [], []
    for dataset_name in os.listdir(dataset_folder):
        dataset_path = os.path.join(dataset_folder, dataset_name)
        for split_dir in os.listdir(dataset_path):
            folder = os.path.join(dataset_path, split_dir)
            arrow_files = glob(os.path.join(folder, "mmicl*.arrow"))
            if 'train' in folder:
                train_files.extend(arrow_files)
            elif 'test' in folder:
                test_files.extend(arrow_files)
            elif 'val' in folder:
                val_files.extend(arrow_files)
    train_ds = concatenate_datasets([Dataset.from_file(f) for f in train_files])
    test_ds = concatenate_datasets([Dataset.from_file(f) for f in test_files])
    val_ds = concatenate_datasets([Dataset.from_file(f) for f in val_files])
    return train_ds, val_ds, test_ds


def load_dataset_from_arrow(data_files):
    """Load and concatenate all Arrow shards in a single split folder."""
    files = glob(os.path.join(data_files, "mmicl*[0-9].arrow"))
    return concatenate_datasets([Dataset.from_file(f) for f in files])
```
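For example, assuming your Arrow files live under a root folder such as MIC_data (the name is illustrative) laid out as `<dataset_name>/<split_dir>/mmicl*.arrow`, which is the structure the loader above expects:

```python
# Load all datasets and splits at once.
train_ds, val_ds, test_ds = load_instruct_dataset_from_arrow("MIC_data")
print(len(train_ds), train_ds.column_names)

# Or load a single split folder directly.
vqa_train = load_dataset_from_arrow("MIC_data/vqa/train")
```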