This document provides examples of fine-tuning Aria on three types of data (single-image, multi-image, and video), along with a code dataset.
We use a 30k subset of the RefCOCO dataset as an example.
RefCOCO is a visual grounding task. Given an image and a description of the referenced object as input, the model is expected to output the corresponding bounding box. For a given bounding box, we normalize its coordinates to [0,1000) and format it as "(x1,y1), (x2,y2)". Please refer to RefCOCO_Example for more details!
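The normalization described above can be sketched as follows. This is a minimal illustration, not the preprocessing code used for the example dataset: the helper names `normalize_box` and `format_box` are hypothetical, and the sketch assumes the box is given in pixels as (x1, y1, x2, y2), that coordinates are floored to integers, and that values on the image edge are clamped to 999 so they stay inside [0,1000).

```python
def normalize_box(box, width, height, scale=1000):
    """Map a pixel-space (x1, y1, x2, y2) box to integers in [0, scale).

    Assumption: flooring and clamping are illustrative choices; the
    actual dataset preparation may round differently.
    """
    x1, y1, x2, y2 = box

    def to_bin(value, size):
        # Scale by the image size, truncate, then clamp into [0, scale - 1]
        # so a coordinate on the right or bottom edge does not reach `scale`.
        return min(max(int(value * scale / size), 0), scale - 1)

    return (
        to_bin(x1, width),
        to_bin(y1, height),
        to_bin(x2, width),
        to_bin(y2, height),
    )


def format_box(box):
    """Render a normalized box in the "(x1,y1), (x2,y2)" text form."""
    x1, y1, x2, y2 = box
    return f"({x1},{y1}), ({x2},{y2})"


if __name__ == "__main__":
    # A box in a 640x480 image whose bottom edge touches the image border.
    normalized = normalize_box((64, 48, 320, 480), width=640, height=480)
    print(format_box(normalized))
```

If the source annotations store boxes as (x, y, width, height) instead, convert them to corner form before calling `normalize_box`.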
We use the NLVR2 dataset as an example. NLVR2 (Natural Language for Visual Reasoning) is a task in which the model is given two images and must determine whether a claim about them is true, answering yes or no. Please refer to NLVR2_Example for details!
We use the NextQA dataset as an example. NextQA requires the model to select an answer from several options based on the video input and the question. The model is expected to output the character that labels the correct option. Please refer to NextQA_Example for details!
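As a rough illustration of the multiple-choice setup, the sketch below labels each option with a character and maps the index of the correct option to the character the model should output. The helper names `format_question` and `answer_label`, the use of uppercase letters, and the "A. option" layout are all assumptions made for this example; NextQA_Example defines the actual prompt format.

```python
import string


def format_question(question, options):
    """Build a text prompt with one labeled option per line.

    Assumption: options are labeled A, B, C, ... in order.
    """
    labels = string.ascii_uppercase[: len(options)]
    lines = [question]
    lines += [f"{label}. {option}" for label, option in zip(labels, options)]
    return "\n".join(lines)


def answer_label(answer_index):
    """Return the label (A, B, C, ...) for a zero-based option index."""
    return string.ascii_uppercase[answer_index]


if __name__ == "__main__":
    prompt = format_question(
        "What did the boy do after picking up the ball?",
        ["threw it", "sat down", "walked away"],
    )
    print(prompt)
    # The training target for this sample is the label of the correct option.
    print(answer_label(0))
```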
We use the Magicoder-Evol-Instruct-110k dataset as an example to further fine-tune Aria to generate high-quality code. Please refer to Code-SFT_Example for details!