This is the code implementation for the ICASSP 2021 paper Learning Audio-Visual Correlations from Variational Cross-Modal Generation. In this work, we propose a Variational Autoencoder with Multiple encoders and a Shared decoder (MS-VAE) framework for processing data from the visual and audio modalities. We use the AVE dataset for our experiments, and we thank the authors of the previous work for sharing their code and data.
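For a quick sense of the framework, below is a minimal sketch of the MS-VAE idea: modality-specific encoders feeding a single shared decoder through the standard VAE reparameterization. The module names, layer sizes, and feature dimensions here are illustrative assumptions, not the exact configuration used in the paper or in msvae.py.

```python
# Illustrative sketch only: modality-specific encoders, a shared decoder,
# and VAE reparameterization. All dimensions are assumptions.
import torch
import torch.nn as nn

class MSVAE(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=512, latent_dim=100):
        super().__init__()
        # One encoder per modality, each mapped to a latent mean and log-variance.
        self.audio_encoder = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU())
        self.visual_encoder = nn.Sequential(nn.Linear(visual_dim, 256), nn.ReLU())
        self.fc_mu = nn.Linear(256, latent_dim)
        self.fc_logvar = nn.Linear(256, latent_dim)
        # A single decoder shared by both modalities reconstructs the target features.
        self.shared_decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, visual_dim))

    def encode(self, x, modality):
        h = self.audio_encoder(x) if modality == "audio" else self.visual_encoder(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, x, modality):
        mu, logvar = self.encode(x, modality)
        z = self.reparameterize(mu, logvar)
        return self.shared_decoder(z), mu, logvar
```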
Python 3.6
PyTorch 1.2
Please download the audio and visual features from here, and place the data files in the data folder. Note that we use the features for the CML task in our experiments.
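As a sanity check after downloading, the feature files can be inspected as sketched below. The file name and dataset keys are hypothetical and only serve as an example; check the downloaded files for the actual names.

```python
# Sketch only: inspecting a downloaded feature file placed under ./data.
# The file name and keys are assumptions, not guaranteed by this repository.
import h5py

with h5py.File("data/audio_feature.h5", "r") as f:
    for key in f.keys():
        print(key, f[key].shape)  # e.g., (num_videos, num_segments, feature_dim)
```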
To train the model, run msvae.py.
For the cross-modal localization (CML) task, run cml.py.
For the cross-modal retrieval task, run retrieval.py.
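Both downstream tasks compare audio and visual representations in the learned latent space. The snippet below is a hedged illustration of retrieval by nearest-neighbor matching of latent means; it reuses the hypothetical MSVAE sketch above, and the choice of the latent mean as the embedding is an assumption, not the exact procedure in cml.py or retrieval.py.

```python
# Illustrative only: nearest-neighbor matching between audio and visual
# latent embeddings. The model class and the use of the latent mean as the
# embedding are assumptions, not the repository's exact procedure.
import torch

@torch.no_grad()
def retrieve_visual_for_audio(model, audio_feats, visual_feats):
    """Return, for each audio query, the index of the closest visual segment."""
    audio_mu, _ = model.encode(audio_feats, modality="audio")
    visual_mu, _ = model.encode(visual_feats, modality="visual")
    # Pairwise Euclidean distances between audio and visual latent means.
    dists = torch.cdist(audio_mu, visual_mu)
    return dists.argmin(dim=1)
```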
The pre-trained models are also available for download: audio and visual.
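A downloaded checkpoint can be restored roughly as sketched below; the checkpoint path and its contents (a plain state dict for the hypothetical MSVAE class above) are assumptions, so adjust to the actual files.

```python
# Sketch: loading a downloaded checkpoint. The file name and its format
# are assumptions; adapt to the released pre-trained model files.
import torch

model = MSVAE()
state = torch.load("pretrained/msvae.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()
```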
Please consider citing our paper if you find it useful.
@InProceedings{zhu2021learning,
author = {Zhu, Ye and Wu, Yu and Latapie, Hugo and Yang, Yi and Yan, Yan},
title = {Learning Audio-Visual Correlations from Variational Cross-Modal Generation},
booktitle = {International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
year = {2021}
}