This repository provides the code for "Improving Query-by-Vocal Imitation with Contrastive Learning and Audio Pretraining", presented at DCASE 2024. The paper addresses the challenge of audio retrieval using vocal imitations as queries, proposing a dual encoder architecture that leverages pretrained CNNs and an adapted NT-Xent loss for fine-tuning.

Query-by-Vocal Imitation (QBV)

In this repository, we publish the model checkpoints and the code described in the paper "Improving Query-by-Vocal Imitation with Contrastive Learning and Audio Pretraining" (DCASE 2024).

Abstract

Query-by-Vocal Imitation (QBV) is the task of searching audio files in databases using vocal imitations produced with the user's voice. Since most humans can effectively communicate sound concepts through voice, QBV offers a more intuitive and convenient approach than text-based search. To fully leverage QBV, developing robust audio feature representations for both the vocal imitation and the original sound is crucial. In this paper, we present a new system for QBV that utilizes the feature extraction capabilities of Convolutional Neural Networks pre-trained on large-scale general-purpose audio datasets. We integrate these pre-trained models into a dual encoder architecture and fine-tune them end-to-end using contrastive learning. A distinctive aspect of our proposed method is the fine-tuning strategy of pre-trained models using an adapted NT-Xent loss for contrastive learning, creating a shared embedding space for reference recordings and vocal imitations. The proposed system significantly enhances audio retrieval performance, establishing a new state of the art on both coarse- and fine-grained QBV tasks.
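To illustrate the contrastive objective mentioned above: an NT-Xent loss over a dual encoder treats each (imitation, reference) pair in the batch as the positive and all other batch items as negatives, in both retrieval directions. The following is a minimal NumPy sketch of such a symmetric NT-Xent loss, not the repository's actual implementation; the function name and the temperature value are illustrative.

```python
import numpy as np

def nt_xent_loss(imitations, references, temperature=0.1):
    """Symmetric NT-Xent loss over a batch of paired embeddings.

    imitations, references: (N, D) arrays of L2-normalised embeddings;
    row i of each array is a matching (imitation, reference) pair.
    """
    # Scaled cosine-similarity matrix between all imitations and references.
    sim = imitations @ references.T / temperature  # shape (N, N)
    n = sim.shape[0]

    def xent(logits):
        # Cross-entropy with the diagonal (matching pair) as the target class.
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average over both directions: imitation->reference and reference->imitation.
    return 0.5 * (xent(sim) + xent(sim.T))
```

Fine-tuning with this loss pulls matching imitation/reference embeddings together and pushes non-matching ones apart, which is what creates the shared embedding space described in the abstract.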

Getting Started

conda create -n qbv python=3.8

conda activate qbv

pip install -r requirements.txt

Experiments

Training

Coarse-grained

python ex_qbv.py --roll --fold=0 --id=001

Fine-grained

python ex_qbv.py --roll --fine_grained --id=001

Testing

Coarse-grained

python test_coarse.py --own_module

python test_coarse.py --arch=M-VGGish --sr_down=16000 --dur=15.4

python test_coarse.py --arch=2DFT --sr_down=8000 --dur=15.4

Fine-grained

python test_fine.py --own_module

python test_fine.py --arch=M-VGGish --sr_down=16000 --dur=15.4

python test_fine.py --arch=2DFT --sr_down=8000 --dur=15.4
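The test scripts above evaluate retrieval quality: each vocal imitation is embedded as a query, the reference recordings are ranked by similarity, and a ranking metric is reported. As a rough sketch of this kind of evaluation, here is a mean-reciprocal-rank computation over cosine similarities (the function name and setup are illustrative, not taken from the repository's scripts):

```python
import numpy as np

def mean_reciprocal_rank(query_embs, ref_embs, targets):
    """MRR of retrieving each query's true reference.

    query_embs: (Q, D) L2-normalised query (imitation) embeddings.
    ref_embs:   (R, D) L2-normalised reference embeddings.
    targets:    for each query, the index of its true reference.
    """
    sims = query_embs @ ref_embs.T  # cosine similarities, shape (Q, R)
    ranks = []
    for i, target in enumerate(targets):
        order = np.argsort(-sims[i])              # references, best first
        rank = np.where(order == target)[0][0] + 1  # 1-based rank of the match
        ranks.append(rank)
    return float(np.mean(1.0 / np.array(ranks)))
```

A perfect system ranks every true reference first and scores 1.0; random ranking drives the score toward zero.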

Contact

For questions or inquiries, please contact me at jonathan.greif@jku.at.

If you use this code, please cite our paper:

@inproceedings{Greif2024,
    author = "Greif, Jonathan and Schmid, Florian and Primus, Paul and Widmer, Gerhard",
    title = "Improving Query-By-Vocal Imitation with Contrastive Learning and Audio Pretraining",
    booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024)",
    address = "Tokyo, Japan",
    month = "October",
    year = "2024",
    pages = "51--55"
}
