[Paper] [Project Page] [Demo 🤗][Dataset 🤗] [Model 🤗]
Authors: Hritik Bansal (UCLA), Yonatan Bitton (Google), Idan Szpektor (Google), Kai-Wei Chang (UCLA), Aditya Grover (UCLA)
This repository contains the data and instructions to reproduce the results of the paper "VideoCon: Robust Video-Language Alignment via Contrast Captions".
The following steps are relevant for training and evaluating the model.
- Creating conda environment
conda create -n videocon python=3.10
conda activate videocon
- Install PyTorch
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
- Install other dependencies
pip install -r requirements.txt
We present the fully processed dataset for training your models on the entailment and natural language explanation generation tasks.
source: one of MSR-VTT, VaTeX, TEMPO-HL
videopath: path to the video in the source dataset
caption: video caption
neg_caption: PaLM-2 generated caption
split: one of train, val, and test
misalignment: one of the seven misalignments described in the paper
youtube_key: MSR-VTT and VaTeX videos have youtube ids (metadata)
source: one of MSR-VTT, VaTeX, TEMPO-HL
videopath: path to the video in the source dataset
caption: video caption
neg_caption: PaLM-2 generated caption
nle: PaLM-2 generated natural language explanation
split: one of train, val, and test
misalignment: one of the seven misalignments described in the paper
youtube_key: MSR-VTT and VaTeX videos have youtube ids (metadata)
Note: Original Dataset Licenses apply to individual source data.
We provide detailed steps to download the source dataset videos in the individual README files in the datasets folder.
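As a quick sanity check on the LLM-generated CSVs described above, the sketch below counts rows per split and misalignment type. It relies only on the column names listed above; the sample rows and the misalignment values in them are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_by_split_and_misalignment(csv_file):
    """Count rows per (split, misalignment) pair in a VideoCon-style CSV."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[(row["split"], row["misalignment"])] += 1
    return counts


# In-memory stand-in for the real file; every value below is made up.
sample_csv = io.StringIO(
    "source,videopath,caption,neg_caption,split,misalignment,youtube_key\n"
    "MSR-VTT,videos/a.mp4,a man runs,a man walks,train,action,key_a\n"
    "VaTeX,videos/b.mp4,a dog barks,a cat barks,train,object,key_b\n"
    "MSR-VTT,videos/c.mp4,a girl sings,a girl dances,val,action,key_c\n"
)
counts = count_by_split_and_misalignment(sample_csv)
for (split, misalignment), n in sorted(counts.items()):
    print(split, misalignment, n)
```

To run it on the real data, pass a file opened with `open(path, newline="")` instead of the in-memory sample.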
We collect the video-caption pairs from the validation set of the ActivityNet dataset.
video_url: s3 link to the video
caption: caption associated with the video
neg_caption: human-written negative caption
nle: human-written natural language explanation
hard: True or False (see the definition of Human-Hard in our paper)
We finetune mPLUG-Owl-7B-Video from this repo using Low-Rank Adaptation (LoRA).
Specifically, the model is finetuned on the entailment and natural language explanation generation tasks together. First, we need to process the data files so that they match the format used for mPLUG-Owl-7B-Video training.
1. Change the videopath in videocon_llm_entailment.csv and videocon_llm_human.csv such that the paths point to the videos on your local machine.
2. Run the following command to create the entailment task prompt:
python src/prepare_data_for_train.py --input_csv data/videocon_llm_entailment.csv --output_csv data/train_llm_entailment.csv --entailment
It will generate three files -- train, val, test.
3. Run the following command to create the feedback task prompt:
python src/prepare_data_for_train.py --input_csv data/videocon_llm_feedback.csv --output_csv data/train_llm_feedback.csv --feedback
It will generate three files -- train, val, test.
4. Now merge these files before we can start finetuning the model.
python src/merge.py
This will create data/train_llm_mix_entail_feedback.csv, data/val_llm_mix_entail_feedback.csv, and data/test_llm_mix_entail_feedback.csv.
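The first step above (pointing videopath at your local copies) can be scripted. The following is a minimal sketch, not part of the repository: it assumes every path in the csv shares a common prefix, and `old_root` and `new_root` are hypothetical values you would replace with your own.

```python
import csv
import io


def repoint_videopaths(in_file, out_file, old_root, new_root):
    """Copy a CSV, swapping old_root for new_root at the start of videopath.

    Returns the number of rows whose videopath was rewritten.
    """
    reader = csv.DictReader(in_file)
    writer = csv.DictWriter(out_file, fieldnames=reader.fieldnames)
    writer.writeheader()
    changed = 0
    for row in reader:
        path = row["videopath"]
        if path.startswith(old_root):
            row["videopath"] = new_root + path[len(old_root):]
            changed += 1
        writer.writerow(row)
    return changed


# In-memory demo; the paths and captions are made up.
src = io.StringIO(
    "videopath,caption\n"
    "/orig/root/msrvtt/video1.mp4,a man runs\n"
    "/elsewhere/video2.mp4,a dog barks\n"
)
dst = io.StringIO()
n_changed = repoint_videopaths(src, dst, "/orig/root/", "/home/me/videos/")
rewritten = list(csv.DictReader(io.StringIO(dst.getvalue())))
print(n_changed, [r["videopath"] for r in rewritten])
```

Rows that do not start with `old_root` are written through unchanged, so the returned count is a simple check that the prefix matched what you expected.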
- We add the prompts for generating contrast captions from PaLM-2 in misalignment_prompts.py.
- The prompts will work well with other LLM APIs too.
- Example code for PaLM-2 is present in this colab notebook. You will need to create a project on the Google console first.
- Download the mPLUG-Owl-7B-Video pretrained checkpoint to your local machine.
- Add the data file paths and mplug-owl-7b path to video.yaml.
- Add the save path, experiment name, nproc_per_node, path to mplug-owl-7b, and CUDA_VISIBLE_DEVICES in the train_it.sh script.
- Run the following command to launch the training:
bash train_it.sh
- You will find the finetuned checkpoints in your SAVE_PATH.
Our finetuned VideoCon model is present 🤗 here.
Download mPLUG-Owl-7B-Video and Owl-Con to your local machine. Their paths are necessary for evaluation.
1. Create a csv with two columns: videopath and text. Example csv is here.
2. Run the following command to embed the entailment prompt to the text field:
python src/prepare_data_for_inference.py --input_csv examples/test.csv --output_csv examples/final_test.csv
3. Run the following command to get the scores for the video and text in final_test.csv using the entailment_inference script:
CUDA_VISIBLE_DEVICES=0 python entailment_inference.py --input_csv ../../examples/final_test.csv --output_csv ../../examples/final_test_scores.csv --trained_ckpt <path to pytorch.bin of videocon ckpt> --pretrained_ckpt <path to mplugowl-7b-video folder> --use_lora --all-params
This will save the entailment scores as an additional column in final_test_scores.csv.
4. (Optional) Remove the --use_lora and --trained_ckpt arguments from the above command to use the pretrained model to perform the entailment task.
- It is straightforward to calculate the ROC-AUC score using the Custom Inference code discussed above.
- First, convert your data into a csv with videopath and caption using steps 1 and 2 from the above section.
- Run the command to get the entailment score for every videopath and caption.
- You can write your own logic to assign a caption label = 1 if it is grounded in the video, otherwise label = 0.
- Use the roc_auc_score function from sklearn (here). The predicted score is the model's entailment score.
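To make the last two steps concrete, here is a dependency-free sketch of what ROC-AUC measures: the probability that a randomly chosen grounded caption (label = 1) gets a higher entailment score than a randomly chosen ungrounded one (label = 0), with ties counted as half. The labels and scores below are made up; on real data, `sklearn.metrics.roc_auc_score(labels, scores)` should return the same value and is the recommended tool.

```python
def roc_auc(labels, scores):
    """Pairwise ROC-AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = caption grounded in the video, 0 = not grounded.
labels = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6]  # stand-ins for model entailment scores
auc = roc_auc(labels, scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly
```

The quadratic loop is fine for small files; for large evaluations the sklearn function is considerably faster.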
- Create a csv with two columns: videopath and neg_caption. Example csv is here.
- Run the following command to get the generated NLE using the nle_inference script:
CUDA_VISIBLE_DEVICES=0 python nle_inference.py --input_file ../../examples/test_nle.csv --output_file ../../examples/final_test_nle.csv --pretrained_ckpt <path to mplugowl-7b-video folder> --trained_ckpt <path to pytorch.bin of videocon ckpt> --use_lora --all_params
This will save the generated NLE in final_test_nle.csv.
In our work, we propose two methods which achieve high agreement with human evaluation.
- We use the prompt in nle_eval_prompt to get the LLM (PaLM2) decision.
- Replace c1 with the positive caption, c2 with the negative caption, c3 with the ground-truth NLE, and c4 with the Owl-Con generated NLE.
- Note: the prompt should work well with any other LLM API.
- We use this script to get the ENTAILMENT SCORE.
- We set the premise as the ground-truth feedback and the hypothesis as the model-generated NLE.
1. We provide the SSv2-Temporal data in eval_ssv2_temporal.csv. Here, each caption has 216 candidate videos. The number of comparisons will be 216 * 18 (query actions).
2. You can use the above file directly to get the entailment scores from our finetuned model using:
CUDA_VISIBLE_DEVICES=0 python entailment_inference.py --input_csv datasets/eval_ssv2_temporal.csv --output_csv eval_ssv2_temporal_scores.csv --trained_ckpt <path to pytorch.bin of videocon ckpt> --pretrained_ckpt <path to mplugowl-7b-video folder> --use_lora --all-params
(Optional) Remove the --use_lora and --trained_ckpt arguments from the above command to use the pretrained model to perform the entailment task.
It will generate an output file just like eval_ssv2_temporal_scores.csv. Ignore the values in the last two columns. The number in the third column is the entailment score.
3. Use calc_ssv2.py to get the mAP and Recall scores using the following command:
python src/calc_ssv2.py --input_file_1 datasets/eval_ssv2_temporal.csv --input_file_2 datasets/eval_ssv2_temporal_scores.csv --vid_per_caption 216
1. We provide the SSv2-Events data in eval_ssv2_events.csv. Here, each caption has 588 candidate videos. The number of comparisons will be 588 * 49 (query actions).
2. You can use the above file directly to get the entailment scores from our finetuned model using:
CUDA_VISIBLE_DEVICES=0 python entailment_inference.py --input_csv datasets/eval_ssv2_events.csv --output_csv eval_ssv2_events_scores.csv --trained_ckpt <path to pytorch.bin of videocon ckpt> --pretrained_ckpt <path to mplugowl-7b-video folder> --use_lora --all-params
(Optional) Remove the --use_lora and --trained_ckpt arguments from the above command to use the pretrained model to perform the entailment task.
It will generate an output file just like eval_ssv2_temporal_scores.csv. Ignore the values in the last two columns. The number in the third column is the entailment score.
3. Use the calc_ssv2.py to get the mAP and Recall scores using the following command:
python src/calc_ssv2.py --input_file_1 datasets/eval_ssv2_events.csv --input_file_2 datasets/eval_ssv2_events_scores.csv --vid_per_caption 588
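For orientation, the two SSv2 benchmarks involve 216 * 18 = 3,888 and 588 * 49 = 28,812 video-caption comparisons respectively. The sketch below shows the standard definitions of average precision and Recall@k for one caption whose candidate videos are ranked by entailment score. It is an illustration with made-up numbers; whether calc_ssv2.py uses exactly these definitions is an assumption, so use the script for reported results.

```python
def ranked_relevance(scores, relevant):
    """Return 0/1 relevance flags for candidates sorted by score, best first.

    scores: one entailment score per candidate video.
    relevant: set of candidate indices that truly match the caption.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [1 if i in relevant else 0 for i in order]


def average_precision(relevance):
    """Mean of precision@rank taken at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def recall_at_k(relevance, k):
    """Fraction of all relevant items that appear in the top k."""
    n_relevant = sum(relevance)
    return sum(relevance[:k]) / n_relevant if n_relevant else 0.0


def mean_average_precision(per_query_relevance):
    """Average the per-caption AP values."""
    aps = [average_precision(r) for r in per_query_relevance]
    return sum(aps) / len(aps)


# Made-up example: one caption, four candidate videos, videos 0 and 3 match.
rel = ranked_relevance([0.9, 0.1, 0.8, 0.3], relevant={0, 3})
ap = average_precision(rel)
r1 = recall_at_k(rel, 1)
r3 = recall_at_k(rel, 3)
print(rel, round(ap, 4), r1, r3)
```

In the real setting each caption would contribute a relevance list of length 216 (SSv2-Temporal) or 588 (SSv2-Events).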
1. The videos for the dataset are available here, i.e., the NextQA dataset.
2. The original NextQA validation set questions and answers are present here. The ATP-Hard subset consists of the following indices: here. We present the ATP-Hard subset here.
3. We use an LLM API to convert the question-answer pairs into imperative statements. The prompt for this is present in atp_hard_prompt.py.
4. The LLM-generated statements are added to atp_hard_statements.csv.
5. Use the eval_nextqa script to prepare the data for entailment score generation.
python src/create_data_for_eval_nextqa.py --input_csv datasets/nextqa-atphard-statements.csv --output_csv eval-nextqa-atphard.csv --map_json map_vid_vidorID.json (from the nextqa dataset itself) --video_dir <location of the videos>
It will generate a csv like eval-nextqa-atphard.csv.
6. Use the Entailment Inference code to generate the entailment scores. It can be used to generate a file like atphard_scores. Ignore the last two columns in this file.
7. Use the following code to get the final accuracies:
python src/eval_atphard.py --input_csv_1 datasets/eval-nextqa-atphard.csv [ground-truth] --input_csv_2 datasets/atphard_scores.csv [prediction] --input_csv_3 datasets/nextqa-atphard.csv
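As a rough picture of the final step, the sketch below computes multiple-choice accuracy under the assumption that, for each question, the candidate statement with the highest entailment score is taken as the prediction. The row layout and all values are hypothetical; eval_atphard.py works on the three csv files above and may differ in detail.

```python
def multiple_choice_accuracy(rows):
    """Accuracy when the highest-scoring candidate per question is the prediction.

    rows: iterable of (question_id, entailment_score, is_correct) tuples,
    one tuple per candidate statement.
    """
    best = {}  # question_id -> (best score so far, whether that candidate is correct)
    for qid, score, is_correct in rows:
        if qid not in best or score > best[qid][0]:
            best[qid] = (score, is_correct)
    if not best:
        raise ValueError("no rows given")
    n_right = sum(1 for _, is_correct in best.values() if is_correct)
    return n_right / len(best)


# Made-up scores for two questions.
rows = [
    ("q1", 0.2, False),
    ("q1", 0.7, True),   # highest for q1, and correct
    ("q1", 0.1, False),
    ("q2", 0.6, False),  # highest for q2, but wrong
    ("q2", 0.5, True),
]
accuracy = multiple_choice_accuracy(rows)
print(accuracy)
```

Ties are resolved in favor of the first candidate seen, which is one arbitrary choice among several reasonable ones.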