A simple, performant re-implementation of AutoVC trained on VCTK.
The original author's repo has not released models that produce conversions of the same quality as those presented in the demo. In this repo I aim to get as close as possible to the demo performance and to release the model publicly for anyone to use.
I use the model definition provided by the original author, but swap in the HiFi-GAN vocoder and its associated mel-spectrogram transform. Concretely, the sample rate is set at 16kHz, as in the original model, and the number of training steps is increased drastically from that stated in the paper -- from 100k steps to 2.3 million steps. The speaker embedding network is also pretrained on a larger external dataset.
Otherwise, all hyperparameters are the same as those from the paper, the original author's repo, or the GitHub issues on that repo, where appropriate. The three model components are as follows:
- AutoVC -- trained and loaded in this repo.
- Speaker embedding network -- obtained from the pretrained simple speaker embedding repo using torch hub.
- HiFi-GAN vocoder -- using a pretrained model obtained from the original paper author.
To use the pretrained models, no dependencies aside from pytorch, librosa==0.9.2, scipy, and numpy are required.
The models use torch hub, making loading exceedingly simple:
Step 1: load all the models
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Load the pretrained autovc model:
autovc = torch.hub.load('RF5/simple-autovc', 'autovc').to(device)
autovc.eval()
# Load the pretrained hifigan model:
hifigan = torch.hub.load('RF5/simple-autovc', 'hifigan').to(device)
hifigan.eval()
# Load speaker embedding model:
sse = torch.hub.load('RF5/simple-speaker-embedding', 'gru_embedder').to(device)
sse.eval()
Step 2: do inference on some utterances of your choice
# Get mel spectrogram
mel = autovc.mspec_from_file('example/source_uttr.flac')
# or autovc.mspec_from_numpy(numpy array, sampling rate) if you have a numpy array
# Get embedding for source speaker
sse_src_mel = sse.melspec_from_file('example/source_uttr.flac')
with torch.no_grad():
    src_embedding = sse(sse_src_mel[None].to(device))
# Get embedding for target speaker
sse_trg_mel = sse.melspec_from_file('example/target_uttr.flac')
with torch.no_grad():
    trg_embedding = sse(sse_trg_mel[None].to(device))
# Do the actual voice conversion!
with torch.no_grad():
    spec_padded, len_pad = autovc.pad_mspec(mel)
    x_src = spec_padded.to(device)[None]
    s_src = src_embedding.to(device)
    s_trg = trg_embedding.to(device)
    x_identic, x_identic_psnt, _ = autovc(x_src, s_src, s_trg)
    if len_pad == 0: x_trg = x_identic_psnt[0, 0, :, :]
    else: x_trg = x_identic_psnt[0, 0, :-len_pad, :]
# x_trg is now the converted spectrogram!
Step 3: vocode the output spectrogram:
# Make a vocode function
@torch.no_grad()
def vocode(spec):
    # denormalize the mel-spectrogram back to the vocoder's expected scale
    spec = autovc.denormalize_mel(spec)
    _m = spec.T[None]
    waveform = hifigan(_m.to(device))[0]
    return waveform.squeeze()
converted_waveform = vocode(x_trg) # output waveform
# Save the converted waveform to a file
import soundfile as sf
sf.write('converted_uttr.flac', converted_waveform.cpu().numpy(), 16000)
Doing this for the example utterances in the example/ folder yields the following:
- Source utterance: raw 48kHz (source_uttr.mp4); vocoded 16kHz (in.mp4)
- Reference style utterance: raw 48kHz (target_uttr.mp4); vocoded 16kHz (ref.mp4)
- Converted output utterance: vocoded 16kHz (converted_uttr.mp4)
Note as well that the input or reference utterance may be from speakers unseen during training, or indeed any audio file if you are feeling very brave.
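As a convenience, the whole pipeline above can be wrapped into a single helper. This is a minimal sketch that only reuses the calls shown in Steps 1-3, assuming `autovc`, `sse`, `hifigan`, and `device` are already loaded; the `convert` function and its arguments are illustrative names, not part of the repo's API:

```python
import torch
import soundfile as sf

@torch.no_grad()
def convert(src_path, trg_path, out_path='converted.flac'):
    # Content comes from the source utterance, speaker identity from the target utterance.
    mel = autovc.mspec_from_file(src_path)
    s_src = sse(sse.melspec_from_file(src_path)[None].to(device))
    s_trg = sse(sse.melspec_from_file(trg_path)[None].to(device))

    # Pad, convert, and strip the padding again.
    spec_padded, len_pad = autovc.pad_mspec(mel)
    _, x_identic_psnt, _ = autovc(spec_padded[None].to(device), s_src, s_trg)
    x_trg = x_identic_psnt[0, 0] if len_pad == 0 else x_identic_psnt[0, 0, :-len_pad]

    # Vocode the denormalized spectrogram and save the 16kHz waveform.
    waveform = hifigan(autovc.denormalize_mel(x_trg).T[None].to(device))[0].squeeze()
    sf.write(out_path, waveform.cpu().numpy(), 16000)

convert('example/source_uttr.flac', 'example/target_uttr.flac')
```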
To train the model, simply set the root data directory in hp.py and run train.py with the arguments best suited to your use case. Note that train.py is currently set up to load data in a VCTK-style folder format, so you may need to rework it if you use a different dataset. You can save time during training by pre-computing the mel-spectrograms from the waveforms using spec_utils.py, in which case just pass the precomputed mel-spectrogram folder to train.py as the appropriate argument.
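To see what that pre-computation boils down to, the sketch below reuses `autovc.mspec_from_file` from the loaded model to write one `.npy` spectrogram per utterance. It is illustrative only: spec_utils.py is the canonical script, and the input/output paths and file extension here are assumptions, not the repo's actual layout.

```python
from pathlib import Path
import numpy as np

wav_root = Path('path/to/VCTK/wavs')     # hypothetical input root of waveforms
spec_root = Path('path/to/precomputed')  # hypothetical output root for spectrograms

for wav in wav_root.rglob('*.flac'):     # adjust the extension to your dataset
    mel = autovc.mspec_from_file(str(wav))   # mel-spectrogram tensor, as in the inference example
    out = (spec_root / wav.relative_to(wav_root)).with_suffix('.npy')
    out.parent.mkdir(parents=True, exist_ok=True)
    np.save(out, mel.cpu().numpy())
```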
Please see the details in the simple speaker embedding repo if you wish to train the speaker embedding network further, but it is already quite good and works well even on several unseen languages.
Please see the instructions in the HiFi-GAN repo on how to fine-tune the vocoder. To do this, you need to generate reconstructed AutoVC spectrogram outputs and pair them with the ground-truth waveforms; HiFi-GAN fine-tuning then uses these teacher-forced spectrograms so that the vocoder adapts better to AutoVC's output. Remember to set the sampling rate to 16kHz for this step, as the default for HiFi-GAN is 22.05kHz.
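To illustrate the data-generation step, the sketch below produces a reconstructed spectrogram by passing an utterance's own speaker embedding as both source and target, so the output can be paired with the original 16kHz waveform. The file names are hypothetical, and the exact naming and folder layout the HiFi-GAN fine-tuning scripts expect should be taken from the HiFi-GAN repo:

```python
import torch
import numpy as np

@torch.no_grad()
def reconstruct_mel(wav_path):
    # Same speaker embedding as source and target -> a reconstruction of the input,
    # which can be paired with the ground-truth waveform for vocoder fine-tuning.
    mel = autovc.mspec_from_file(wav_path)
    emb = sse(sse.melspec_from_file(wav_path)[None].to(device))
    spec_padded, len_pad = autovc.pad_mspec(mel)
    _, x_psnt, _ = autovc(spec_padded[None].to(device), emb, emb)
    out = x_psnt[0, 0] if len_pad == 0 else x_psnt[0, 0, :-len_pad]
    # Denormalize so the spectrogram is in the vocoder's expected scale, as in vocode() above.
    return autovc.denormalize_mel(out).cpu()

# Hypothetical example: save next to the ground-truth 16kHz waveform for fine-tuning.
np.save('p225_001.npy', reconstruct_mel('p225_001.flac').numpy())
```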