Implementation of our paper Wase: Learning When to Attend for Speaker Extraction in Cocktail Party Environments. WASE is the first model to explicitly exploit the start/end times of speech (onset/offset cues) in the speaker extraction problem.
WASE is adapted from our previously proposed framework and comprises five modules: a voiceprint encoder, an onset/offset detector, a speech encoder, a speech decoder, and a speaker extraction module.
In this work, we focus on the onset/offset cues of speech and verify their effectiveness in the speaker extraction task. We also combine the onset/offset cues with the voiceprint cue: onset/offset cues model the start/end times of speech, while the voiceprint cue models the voice characteristics of the target speaker. Combining the two perceptual cues brings a significant performance improvement, while the number of extra parameters is negligible. Please see the figure below for the detailed model structure.
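As a rough illustration of how the two cues could condition extraction, the hypothetical PyTorch sketch below gates the mixture features in time with onset/offset probabilities ("when to attend") and biases them with the voiceprint embedding ("whom to attend"). Module names and dimensions are illustrative and do not mirror the repository code.

```python
# Hypothetical sketch of combining onset/offset cues with a voiceprint cue.
import torch
import torch.nn as nn

class CueFusion(nn.Module):
    def __init__(self, feat_dim=256, voiceprint_dim=256):
        super().__init__()
        # Onset/offset head: per-frame probabilities that the target speaker's
        # speech starts / ends at this frame.
        self.onset_offset_head = nn.Conv1d(feat_dim, 2, kernel_size=1)
        # Projects the utterance-level voiceprint embedding to the feature space.
        self.voiceprint_proj = nn.Linear(voiceprint_dim, feat_dim)

    def forward(self, mixture_feats, voiceprint):
        # mixture_feats: (batch, feat_dim, frames), voiceprint: (batch, voiceprint_dim)
        onset_offset = torch.sigmoid(self.onset_offset_head(mixture_feats))  # (B, 2, T)
        temporal_gate = onset_offset.mean(dim=1, keepdim=True)               # when to attend
        speaker_bias = self.voiceprint_proj(voiceprint).unsqueeze(-1)        # whom to attend
        return mixture_feats * temporal_gate + speaker_bias

feats = torch.randn(4, 256, 100)
vp = torch.randn(4, 256)
print(CueFusion()(feats, vp).shape)  # torch.Size([4, 256, 100])
```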
- Python 3.7
- PyTorch 1.0.1
- pysoundfile 0.10.2
- librosa 0.7.2
- Please refer to environment.yml for details.
The training samples are generated by randomly selecting utterances of different speakers from the si_tr_s set of WSJ0 and mixing them at various signal-to-noise ratios (SNRs). The evaluation samples are generated from the fixed list ./data/wsj/mix_2_spk_voiceP_tt_WSJ.txt. Please modify the dataset paths in ./data/preparedata.py to match your actual paths.
data_config['speechWavFolderList'] = ['/home/aa/WSJ/wsj0/si_tr_s/']
data_config['spk_test_voiceP_path'] = './data/wsj/mix_2_spk_voiceP_tt_WSJ.txt'
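For reference, the snippet below is a minimal sketch of mixing two speakers' utterances at a random SNR; the file names and the SNR range are placeholders, not values taken from ./data/preparedata.py.

```python
# Minimal sketch: mix a target and an interferer utterance at a given SNR.
import numpy as np
import soundfile as sf

def mix_at_snr(target, interferer, snr_db):
    """Scale the interferer so the target-to-interferer power ratio equals snr_db."""
    length = min(len(target), len(interferer))
    target, interferer = target[:length], interferer[:length]
    target_power = np.mean(target ** 2) + 1e-8
    interferer_power = np.mean(interferer ** 2) + 1e-8
    scale = np.sqrt(target_power / (interferer_power * 10 ** (snr_db / 10)))
    return target + scale * interferer

target, sr = sf.read('speaker1_utt.wav')        # hypothetical file names
interferer, _ = sf.read('speaker2_utt.wav')
mixture = mix_at_snr(target, interferer, snr_db=np.random.uniform(-2.5, 2.5))
sf.write('mixture.wav', mixture, sr)
```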
You may need the command below to update the paths in the evaluation list.
sed -i 's/home\/aa/YOUR PATH/g' data/wsj/mix_2_spk_voiceP_tt_WSJ.txt
We recommend using the pickle files, which speed up experiments by avoiding repeated sample-rate resampling. You can set the pickle paths to any location you like.
data_config['train_sample_pickle'] = '/data1/haoyunzhe/interspeech/dataset/wsj0_pickle/train_sample.pickle'
data_config['test_sample_pickle'] = '/data1/haoyunzhe/interspeech/dataset/wsj0_pickle/test_sample.pickle'
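The caching idea can be sketched as follows; the target sample rate and the helper name are assumptions, not taken from the repository code.

```python
# Minimal sketch: resample audio once, cache it in a pickle, reuse it later.
import os
import pickle
import librosa

def load_samples(pickle_path, wav_paths, target_sr=8000):
    if os.path.exists(pickle_path):
        with open(pickle_path, 'rb') as f:
            return pickle.load(f)           # reuse cached, already-resampled audio
    samples = [librosa.load(p, sr=target_sr)[0] for p in wav_paths]
    with open(pickle_path, 'wb') as f:
        pickle.dump(samples, f)             # cache for subsequent experiments
    return samples
```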
Simply run this command:
python eval.py
This will load the model onset_offset_voiceprint.pt and evaluate its performance. It takes about an hour.
Note: the default setting uses both onset/offset and voiceprint cues. If you want to train a model with only onset/offset cues or only the voiceprint cue, please modify the corresponding parameters in config.yaml.
ONSET: 1
OFFSET: 1
VOICEPRINT: 1
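A minimal sketch of reading these switches from config.yaml is shown below; the keys match the snippet above, but how the repository code actually consumes them is assumed.

```python
# Minimal sketch: read the cue switches from config.yaml.
import yaml

with open('config.yaml') as f:
    cfg = yaml.safe_load(f)

use_onset = bool(cfg.get('ONSET', 1))
use_offset = bool(cfg.get('OFFSET', 1))
use_voiceprint = bool(cfg.get('VOICEPRINT', 1))
print(f'onset={use_onset}, offset={use_offset}, voiceprint={use_voiceprint}')
```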
python train.py
python eval.py
tensorboard --logdir ./log
- Listen to audio samples at ./assets/demo.
- Spectrogram samples (clean/mixture/prediction).
Methods | #Params | SDRi (dB)
---|---|---
SBF-MTSAL | 19.3M | 7.30
SBF-MTSAL-Concat | 8.9M | 8.39
SpEx | 10.8M | 14.6
SpEx+ | 13.3M | 17.2
WASE (onset/offset + voiceprint) | 7.5M | 17.05
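For clarity, the sketch below shows the SDR-improvement (SDRi) metric reported in the table: the SDR of the extracted speech minus the SDR of the unprocessed mixture. The plain SDR definition here is a simplification; published numbers are typically computed with the BSS Eval toolkit.

```python
# Minimal sketch of SDR and SDRi; assumes reference, estimate, and mixture
# are aligned 1-D arrays of equal length.
import numpy as np

def sdr(reference, estimate):
    noise = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + 1e-8) + 1e-8)

def sdr_improvement(reference, estimate, mixture):
    # SDRi = SDR of the estimate minus SDR of the mixture w.r.t. the same reference.
    return sdr(reference, estimate) - sdr(reference, mixture)
```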
To reproduce the results above, decay the learning rate and freeze the voiceprint encoder in config.yaml when the model is close to convergence.
FREEZE_VOICEPRINT: 0
learning_rate: 0.001
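A hypothetical PyTorch sketch of this fine-tuning step is shown below; the module name `voiceprint_encoder` and the decayed learning rate are assumptions rather than values from the repository.

```python
# Hypothetical sketch: freeze the voiceprint encoder and lower the learning rate.
import torch
import torch.nn as nn

class TinyWASE(nn.Module):           # stand-in for the real model
    def __init__(self):
        super().__init__()
        self.voiceprint_encoder = nn.Linear(256, 256)
        self.extractor = nn.Linear(256, 256)

model = TinyWASE()
for p in model.voiceprint_encoder.parameters():
    p.requires_grad = False          # corresponds to FREEZE_VOICEPRINT

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-4,                         # decayed from the initial learning_rate of 0.001
)
```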
If you find this repo helpful, please consider citing:
@article{hao2020wase,
  title={Wase: Learning When to Attend for Speaker Extraction in Cocktail Party Environments},
  author={Hao, Yunzhe and Xu, Jiaming and Zhang, Peng and Xu, Bo}
}
@article{hao2020unified,
  title={A Unified Framework for Low-Latency Speaker Extraction in Cocktail Party Environments},
  author={Hao, Yunzhe and Xu, Jiaming and Shi, Jing and Zhang, Peng and Qin, Lei and Xu, Bo},
  journal={Proc. Interspeech 2020},
  pages={1431--1435},
  year={2020}
}
For commercial use of this code and the models, please contact Yunzhe Hao (haoyunzhe2017@ia.ac.cn).
This repository contains code adapted or copied from the following:
- ./models/tasnet.py from Conv-TasNet (CC BY-NC-SA 3.0);
- ./models/tcn.py from Conv-TasNet (CC BY-NC-SA 3.0).