Pritam Sarkar · Ali Etemad
We present CrissCross, a self-supervised framework for learning audio-visual representations. A novel notion is introduced in our framework: in addition to learning the intra-modal and standard synchronous cross-modal relations, CrissCross also learns asynchronous cross-modal relationships. We perform in-depth studies showing that by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong generalized representations useful for a variety of downstream tasks. To pretrain our proposed solution, we use three datasets of varying sizes: Kinetics-Sound, Kinetics400, and AudioSet. The learned representations are evaluated on a number of downstream tasks, namely action recognition, sound classification, and action retrieval. Our experiments show that CrissCross either outperforms or achieves performance on par with current state-of-the-art self-supervised methods on action recognition and action retrieval with UCF101 and HMDB51, as well as sound classification with ESC50 and DCASE. Moreover, CrissCross outperforms fully-supervised pretraining when pretrained on Kinetics-Sound.
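For intuition, below is a rough, hypothetical sketch of the kind of objective described above. The module names, pairing scheme, and loss form are illustrative placeholders rather than the authors' exact implementation; please see the paper and the training code for the real formulation.

```python
import torch.nn.functional as F

def neg_cosine(p, z):
    # Negative cosine similarity between predictions p and (detached) targets z.
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

def crisscross_style_loss(v0, v1, a0, a1, video_enc, audio_enc, video_pred, audio_pred):
    """Illustrative objective combining intra-modal, synchronous cross-modal,
    and asynchronous (time-shifted) cross-modal terms. All names are placeholders."""
    zv0, zv1 = video_enc(v0), video_enc(v1)      # video embeddings from two temporal crops
    za0, za1 = audio_enc(a0), audio_enc(a1)      # audio embeddings from two temporal crops
    pv0, pa0 = video_pred(zv0), audio_pred(za0)  # predictor heads

    # Intra-modal: two temporal crops of the same modality should agree.
    l_intra = neg_cosine(pv0, zv1) + neg_cosine(pa0, za1)
    # Synchronous cross-modal: audio and video from the same time window should agree.
    l_sync = neg_cosine(pv0, za0) + neg_cosine(pa0, zv0)
    # Asynchronous cross-modal: audio and video from different time windows should agree.
    l_async = neg_cosine(pv0, za1) + neg_cosine(pa0, zv1)
    return l_intra + l_sync + l_async
```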
- Paper
- Pretrained model weights
- Evaluation codes
- Training codes
We report top-1 accuracy averaged over all splits of each dataset. Please note that the results below are obtained with full finetuning on UCF101 and HMDB51, and with a linear classifier on ESC50 and DCASE.
Pretraining Dataset | Pretraining Size | UCF101 | HMDB51 | ESC50 | DCASE | Model |
---|---|---|---|---|---|---|
Kinetics-Sound | 22K | 88.3% | 60.5% | 82.8% | 93.0% | visual; audio |
Kinetics400 | 240K | 91.5% | 64.7% | 86.8% | 96.0% | visual; audio |
AudioSet | 1.8M | 92.4% | 67.4% | 90.5% | 97.0% | visual; audio |
The list of dependencies can be found here. You can create an environment with: `conda create --name crisscross --file requirements.txt`
Please make sure to keep the datasets in their respective directories and update the paths in /tools/paths accordingly; a rough sketch of what such a paths file might contain is shown after the dataset list below. The sources of all the public datasets used in this study are mentioned here.
- AudioSet: Please check this repository to download AudioSet.
- Kinetics400: You can either use a crawler (similar to the one available for AudioSet) to download Kinetics400, or simply download it from Amazon AWS, as prepared by the CVD Foundation.
- UCF101: Website to download.
- HMDB51: Website to download.
- ESC50: Website to download.
- DCASE: Website to download.
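The exact format of /tools/paths depends on the repository; as a purely hypothetical illustration, it amounts to mapping each dataset name to its local root directory, along these lines (all keys and paths below are placeholders):

```python
# Hypothetical illustration only; check /tools/paths in this repo for the actual format.
DATASET_ROOTS = {
    "kinetics400": "/data/kinetics400",
    "kinetics_sound": "/data/kinetics_sound",
    "audioset": "/data/audioset",
    "ucf101": "/data/ucf101",
    "hmdb51": "/data/hmdb51",
    "esc50": "/data/esc50",
    "dcase": "/data/dcase",
}
```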
Here are a few examples of how to train CrissCross in different GPU setups. A batch size of 2048 can be used to train on 8× RTX 6000 or 8× V100 GPUs (or similar). To learn more about PyTorch distributed training, please see the official PyTorch documentation.
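# Single-process (non-distributed) training: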
cd train
python main_pretext_audiovisual.py \
--world-size 1 --rank 0 \
--quiet --sub_dir 'pretext' \
--config-file 'audvid_crisscross' \
--db 'kinetics400'
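# Single-node, multi-GPU distributed training (one process per GPU via --multiprocessing-distributed):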
# MASTER="127.0.0.1" or HOSTNAME
# MPORT="8888" OR ANY FREE PORT
cd train
python main_pretext_audiovisual.py \
--dist-url tcp://${MASTER}:${MPORT} \
--dist-backend 'nccl' \
--multiprocessing-distributed \
--world-size 1 --rank 0 \
--quiet --sub_dir 'pretext' \
--config-file 'audvid_crisscross' \
--db 'kinetics400'
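# Multi-node distributed training across 2 nodes (run one command on each node):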
# MASTER="127.0.0.1" or HOSTNAME
# MPORT="8888" OR ANY FREE PORT
cd train
# Node 0:
python main_pretext_audiovisual.py \
--dist-url tcp://${MASTER}:${MPORT} \
--dist-backend 'nccl' \
--multiprocessing-distributed \
--world-size 2 --rank 0 \
--quiet --sub_dir 'pretext' \
--config-file 'audvid_crisscross' \
--db 'kinetics400'
# Node 1:
python main_pretext_audiovisual.py \
--dist-url tcp://${MASTER}:${MPORT} \
--dist-backend 'nccl' \
--multiprocessing-distributed \
--world-size 2 --rank 1 \
--quiet --sub_dir 'pretext' \
--config-file 'audvid_crisscross' \
--db 'kinetics400'
You can directly use the provided weights to evaluate the model on the following benchmarks, using the commands given below. Please make sure to save the model weights to /path/to/model. Downstream evaluation is performed on a single NVIDIA RTX 6000 GPU. Note that the code has been tested on a Linux machine.
UCF101
# full-finetuning
cd evaluate
# 8 frame evaluation
python eval_video.py --world-size 1 --rank 0 --gpu 0 --db 'ucf101' --config-file kinetics400/full_ft_8f_fold1 --pretext_model /path/to/model
# 32 frame evaluation
python eval_video.py --world-size 1 --rank 0 --gpu 0 --db 'ucf101' --config-file kinetics400/full_ft_32f_fold1 --pretext_model /path/to/model
HMDB51
# full-finetuning
cd evaluate
# 8 frame evaluation
python eval_video.py --world-size 1 --rank 0 --gpu 0 --db 'hmdb51' --config-file kinetics400/full_ft_8f_fold1 --pretext_model /path/to/model
# 32 frame evaluation
python eval_video.py --world-size 1 --rank 0 --gpu 0 --db 'hmdb51' --config-file kinetics400/full_ft_32f_fold1 --pretext_model /path/to/model
ESC50
# linear evaluation using SVM
cd evaluate
# 2-second evaluation
python eval_audio.py --world-size 1 --rank 0 --gpu 0 --db 'esc50' --config-file config_fold1_2s --pretext_model /path/to/model
# 5-second evaluation
python eval_audio.py --world-size 1 --rank 0 --gpu 0 --db 'esc50' --config-file config_fold1_5s --pretext_model /path/to/model
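For reference, "linear evaluation using SVM" here means fitting a linear SVM on frozen features extracted by the pretrained audio encoder; eval_audio.py handles this for you. A minimal sketch of that protocol, assuming the features have already been extracted (shapes and names are illustrative):

```python
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def linear_svm_eval(train_feats, train_labels, test_feats, test_labels, C=1.0):
    # train_feats/test_feats: frozen embeddings from the pretrained audio encoder,
    # shaped (num_clips, feat_dim); labels are the ESC50 class indices.
    clf = LinearSVC(C=C)  # linear classifier on top of frozen features
    clf.fit(train_feats, train_labels)
    return accuracy_score(test_labels, clf.predict(test_feats))
```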
DCASE
# linear evaluation using fc tuning
cd evaluate
# 2-second evaluation
python eval_audio.py --world-size 1 --rank 0 --gpu 0 --db 'dcase' --config-file config_2s --pretext_model /path/to/model
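If you want to inspect or reuse the released checkpoints outside of the evaluation scripts above, they can be loaded with standard PyTorch. A minimal sketch follows; the checkpoint structure and key names are assumptions and may differ in the released files:

```python
import torch

# Load a released checkpoint on CPU and inspect its contents.
ckpt = torch.load("/path/to/model", map_location="cpu")
print(type(ckpt), list(ckpt.keys()) if isinstance(ckpt, dict) else None)

# Hypothetical example: load matching weights into your own backbone,
# ignoring keys (e.g., projector/predictor heads) that do not match.
# backbone.load_state_dict(state_dict, strict=False)
```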
If you find this repository useful, please consider giving it a star ⭐ and citing it using the BibTeX entry below:
@misc{sarkar2021crisscross,
title={Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity},
author={Pritam Sarkar and Ali Etemad},
year={2021},
eprint={2111.05329},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
We are grateful to the Bank of Montreal and Mitacs for funding this research. We are also thankful to the SciNet HPC Consortium for helping with the computation resources.
You may directly contact me at pritam.sarkar@queensu.ca or connect with me on LinkedIn.