This is the code implementation for the ICASSP 2021 paper Learning Audio-Visual Correlations from Variational Cross-Modal Generation. In this work, we propose a Variational Autoencoder with Multiple encoders and a Shared decoder (MS-VAE) framework for processing data from the visual and audio modalities. We use the AVE dataset for our experiments, and we thank the authors of the previous work for sharing their code and data.
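For a quick sense of the framework, below is a minimal sketch of the MS-VAE idea: modality-specific encoders feeding a single shared decoder through the standard VAE reparameterization. The module names, layer sizes, and feature dimensions here are illustrative assumptions, not the exact configuration used in the paper or in msvae.py.

```python
# Illustrative sketch only: modality-specific encoders, a shared decoder,
# and VAE reparameterization. All dimensions are assumptions.
import torch
import torch.nn as nn

class MSVAE(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=512, latent_dim=100):
        super().__init__()
        # One encoder per modality, each mapped to a latent mean and log-variance.
        self.audio_encoder = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU())
        self.visual_encoder = nn.Sequential(nn.Linear(visual_dim, 256), nn.ReLU())
        self.fc_mu = nn.Linear(256, latent_dim)
        self.fc_logvar = nn.Linear(256, latent_dim)
        # A single decoder shared by both modalities reconstructs the target features.
        self.shared_decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, visual_dim))

    def encode(self, x, modality):
        h = self.audio_encoder(x) if modality == "audio" else self.visual_encoder(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, x, modality):
        mu, logvar = self.encode(x, modality)
        z = self.reparameterize(mu, logvar)
        return self.shared_decoder(z), mu, logvar
```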
Python 3.6
PyTorch 1.2
Please download the audio and visual features from here, and place the data files in the data folder. Note that we use the features for the CML task in our experiments.
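As a sanity check after downloading, the feature files can be inspected as sketched below. The file name and dataset keys are hypothetical and only serve as an example; check the downloaded files for the actual names.

```python
# Sketch only: inspecting a downloaded feature file placed under ./data.
# The file name and keys are assumptions, not guaranteed by this repository.
import h5py

with h5py.File("data/audio_feature.h5", "r") as f:
    for key in f.keys():
        print(key, f[key].shape)  # e.g., (num_videos, num_segments, feature_dim)
```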
To train the model, run msvae.py.
For the cross-modal localization (CML) task, run cml.py.
For the cross-modal retrieval task, run retrieval.py.
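Both downstream tasks compare audio and visual representations in the learned latent space. The snippet below is a hedged illustration of retrieval by nearest-neighbor matching of latent means; it reuses the hypothetical MSVAE sketch above, and the choice of the latent mean as the embedding is an assumption, not the exact procedure in cml.py or retrieval.py.

```python
# Illustrative only: nearest-neighbor matching between audio and visual
# latent embeddings. The model class and the use of the latent mean as the
# embedding are assumptions, not the repository's exact procedure.
import torch

@torch.no_grad()
def retrieve_visual_for_audio(model, audio_feats, visual_feats):
    """Return, for each audio query, the index of the closest visual segment."""
    audio_mu, _ = model.encode(audio_feats, modality="audio")
    visual_mu, _ = model.encode(visual_feats, modality="visual")
    # Pairwise Euclidean distances between audio and visual latent means.
    dists = torch.cdist(audio_mu, visual_mu)
    return dists.argmin(dim=1)
```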
The pre-trained models are also available for download: audio and visual.
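A downloaded checkpoint can be restored roughly as sketched below; the checkpoint path and its contents (a plain state dict for the hypothetical MSVAE class above) are assumptions, so adjust to the actual files.

```python
# Sketch: loading a downloaded checkpoint. The file name and its format
# are assumptions; adapt to the released pre-trained model files.
import torch

model = MSVAE()
state = torch.load("pretrained/msvae.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()
```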
Please consider citing our paper if you find it useful.
@InProceedings{zhu2021learning,
author = {Zhu, Ye and Wu, Yu and Latapie, Hugo and Yang, Yi and Yan, Yan},
title = {Learning Audio-Visual Correlations from Variational Cross-Modal Generation},
booktitle = {International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
year = {2021}
}