This repository contains the source code for the paper Generative Models for Improved Naturalness, Intelligibility, and Voicing of Whispered Speech. The goal was to adapt MelGAN and VQ-VAE systems to convert whispered speech into normal speech.
The MelGAN code used as the basis for this project can be found here.
The VQ-VAE model is based on DeepMind's VQ-VAE implementation (see here), Andrej Karpathy's implementation, and this repo.
The WaveGlow system is a slightly adapted version of the code provided by NVIDIA.
Please visit our demo website for samples.
The repo is structured as follows:
.
└── speech-conversion
    ├── melgan      -> Sources for training MelGAN models
    │   └── mel2wav
    ├── vqvae       -> Sources for training VQ-VAE models
    └── waveglow    -> Sources for training WaveGlow models
        └── tacotron2
The code is designed to be used with the wTIMIT corpus, which can be downloaded here (note: authentication is required). The wTIMIT dataset is sampled at 44 kHz and needs to be resampled to 16 kHz. The 16 kHz setting is hardcoded in several places in this project, so using a different sample rate without modifying the source code will likely lead to errors.
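One minimal way to do the resampling, assuming sox is installed (the directory names wavs_44k/ for the original corpus and wavs/ for the resampled copies are placeholders):

```bash
# Resample every wTIMIT utterance to 16 kHz (directory names are placeholders)
mkdir -p wavs
for f in wavs_44k/*.WAV; do
    sox "$f" -r 16000 "wavs/$(basename "$f")"
done
```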
Create a directory containing all samples, stored for example in a wavs/ subfolder.
You'll need to provide filelists containing your test and training data.
A simple way to create these filelists is as follows:
ls wavs/*n.WAV | tail -n+11 > train_files.txt
ls wavs/*w.WAV | head -n10 > test_files.txt
ls wavs/*n.WAV | head -n10 > normal_test_files.txt   # normal test data for WaveGlow
Note that we only grab the whispered utterances (the ones with "w" at the end) for the test set.
See the following scripts for examples of how to train the MelGAN, VQ-VAE, and WaveGlow models:

train_melgan.sh
- Add your own paths to the variables SAVE_PATH, DATA_PATH, and LOAD_PATH (see the sketch after this list for what these might look like)

train_vqvae.sh
- Add your own paths to the variables SAVE_PATH, DATA_PATH, LOAD_PATH, and WG_PATH

train_waveglow.sh
- Create your own config file for the WaveGlow model or use an existing one and point to it via the --config flag
- Note that the original WaveGlow model is incompatible with the Mel spectrogram features generated for MelGAN and VQ-VAE training. Hence, using a pretrained WaveGlow model will not yield good results when spectrograms generated by the VQ-VAE are used as input.
- A training script that provides a compatible model can be found in speech-conversion/waveglow/train_melgan_comapt.py and is also referenced in train_waveglow.sh.
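As a rough sketch of what the path variables (SAVE_PATH, DATA_PATH, LOAD_PATH, WG_PATH) might be set to; all paths below are placeholders, and the scripts themselves should be checked for how each variable is actually used:

```bash
# Placeholder paths -- adapt to your own setup before running the training scripts
SAVE_PATH=/data/checkpoints/melgan       # where checkpoints and logs will be written
DATA_PATH=/data/wtimit                   # directory containing wavs/ and the filelists
LOAD_PATH=/data/checkpoints/melgan_old   # presumably an existing checkpoint to resume from
WG_PATH=/data/checkpoints/waveglow       # presumably the trained WaveGlow model used by train_vqvae.sh
```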
Note: The Python scripts need to be run with the -m command line flag and without the .py extension (e.g. python -m app.sub1.mod1) due to the relative imports used across the sub-packages.
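For instance, the compatible WaveGlow training script mentioned above would presumably be launched like this, from inside the speech-conversion directory (any script-specific flags are omitted here):

```bash
cd speech-conversion
# invoke as a module (no .py extension) so the relative imports resolve
python -m waveglow.train_melgan_comapt
```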
Inference can be done with the following scripts:
inference_melgan.sh
inference_vqvae.sh