This repository is the official implementation of "VTC: Improving Video-Text Retrieval with User Comments". Code to download the dataset can be found here.
Our main contributions:
- We introduce VTC, a new dataset of videos, titles, and comments.
- We introduce a new hierarchical attention method that learns to identify relevant auxiliary information and can learn representations that generalize even to other datasets.
- We quantify the value of the comments modality for video-text learning and show that, by using comments, our method learns better, more contextualised image, video, and audio representations.
Project template is from https://github.com/victoresque/pytorch-template.
To create the conda environment, Anaconda needs to be installed first.
# Clone the repository
git clone https://github.com/unitaryai/VTC
cd VTC
# Create a new conda environment with dependencies
conda env create -f environment.yml
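After creating the environment, activate it before running any of the commands below. The environment name is defined in environment.yml; vtc here is only an assumed example, so check the name field in that file.
# Activate the environment (replace vtc with the name defined in environment.yml)
conda activate vtc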
All relevant configs need to be updated with the correct paths to the downloaded files, e.g. the dataset CSV file, the root directory containing the images/videos, and the model paths.
Generally, the Context Adapter Module (CAM) models in the paper are trained using:
# image preview
python train.py --config "configs/pretrained_clip_comments_attention.jsonc"
# video
python train.py --config "configs/pretrained_clip_timesformer_comments_attention.jsonc"
For more configurable flag options, run python train.py --help.
To replicate our experiments in the paper, run the following scripts:
# training image baselines (Table 2)
bash experiments/train/image_baselines.sh
# training CAM module with varying number of comments (Figure 4)
bash experiments/train/image_vary_num_comments.sh
# training video models using timesformer (Table 7)
bash experiments/train/video_timesformer.sh
When training the frozen models, we recommend caching the visual-side embeddings to speed up training. This can be done by running:
python scripts/get_clip_vit_embeddings.py
The saved embeddings can be passed to the dataset loader in the config as cached_vision_features, or simply passed as an additional flag when training:
python train.py --cached_vision_features $SAVED_FEATURES_PATH
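For example, a minimal end-to-end sketch of the cached-features workflow (where scripts/get_clip_vit_embeddings.py saves its output depends on your setup, so $SAVED_FEATURES_PATH is a placeholder pointing at that location):
# 1. Pre-compute and save the frozen visual embeddings
python scripts/get_clip_vit_embeddings.py
# 2. Train the image CAM model, reusing the cached features
python train.py --config "configs/pretrained_clip_comments_attention.jsonc" --cached_vision_features $SAVED_FEATURES_PATH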
For audio experiments, first clone the GDT repository and download the model weights trained on the IG65M dataset.
git clone https://github.com/facebookresearch/GDT.git
wget https://dl.fbaipublicfiles.com/GDT/gdt_IG65M.pth
Then compute the audio embeddings for the VTC videos that will be used in training:
python scripts/get_audio_embeddings.py
To evaluate a model, update the relevant configs with the correct paths to the downloaded files, e.g. the dataset CSV file, the root directory containing the images/videos, and the model paths. Then you can run:
# image
python evaluation/eval.py --config "configs/pretrained_clip_comments_attention.jsonc" --resume $CKPT_PATH
# video
python evaluation/eval.py --config "configs/pretrained_clip_timesformer_comments_attention.jsonc" --resume $CKPT_PATH
For more configurable flag options, run python evaluation/eval.py --help.
To replicate our experiments in the paper, run the following scripts:
# testing image baselines (Table 2)
bash experiments/eval/image_baselines.sh
# testing CAM module with varying number of comments (Figure 4)
bash experiments/eval/image_vary_num_comments.sh
# testing video models using timesformer (Table 7) and testing on different datasets (VTC, KineticsComments, Livebot)
bash experiments/eval/video_timesformer.sh
We will be releasing pre-trained models soon.
We show that our method is robust to different combinations of modalities, both at training time and at test time.
training | inference | TVR R@1 | TVR R@10 | VTR R@1 | VTR R@10 |
---|---|---|---|---|---|
CLIP | img+title | 11.1 | 26.0 | 11.1 | 25.3 |
img+title | img+title | 15.5 | 34.9 | 14.4 | 33.4 |
img+title+cmts | img+title | 15.5 | 34.5 | 14.4 | 33.3 |
img+title+cmts | img+title+cmts | 18.0 | 43.2 | 18.7 | 43.9 |
img+title+cmts | img+title | 14.9 | 34.2 | 14.2 | 32.9 |
img+title+cmts | img+title+cmts | 28.2 | 51.2 | 25.1 | 49.9 |
img+title+cmts+audio | img+title | 15.4 | 34.0 | 14.3 | 32.9 |
img+title+cmts+audio | img+title+audio | 15.8 | 36.9 | 12.2 | 30.4 |
img+title+cmts+audio | img+title+cmts+audio | 19.6 | 45.6 | 20.6 | 47.2 |
These experiments use video frames: the models were trained by adapting the video branch with comments, using either one or eight frames per video. All numbers are Recall@10.
inference | #frames | VTC VTR | VTC TVR | KineticsComms VTR | KineticsComms TVR | LiveBotEN VTR | LiveBotEN TVR |
---|---|---|---|---|---|---|---|
video | 1 | 28.9 | 28.3 | 48.8 | 46.9 | 48.0 | 49.0 |
video+comments | 1 | 40.8 | 41.0 | 61.1 | 59.2 | 64.0 | 64.0 |
mean-pooling | 8 | 19.3 | 24.2 | 54.1 | 49.8 | 69.0 | 66.0 |
video | 8 | 28.9 | 27.6 | 56.9 | 55.8 | 70.0 | 72.0 |
video+comments | 8 | 41.5 | 41.9 | 68.0 | 66.1 | 69.0 | 80.0 |
@inproceedings{hanu2022vtc,
title={VTC: Improving Video-Text Retrieval with User Comments},
author={Laura Hanu and James Thewlis and Yuki M. Asano and Christian Rupprecht},
booktitle={ECCV},
year={2022}
}