Torch implementation of NANSY, Neural Analysis and Synthesis, arXiv:2110.14513
- Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations, Choi et al., 2021. [arXiv:2110.14513]
Tested with Python 3.7.9 in a conda environment.
Initialize the submodule and apply the patch.

```bash
git submodule update --init
cd hifi-gan; patch -p0 < ../hifi-gan-diff
```
Download the LibriTTS [openslr:60], LibriSpeech [openslr:12], and VCTK [official] datasets.
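For reference, here is a minimal Python sketch for fetching and extracting one LibriTTS split from OpenSLR; the URL and output directory are assumptions, so verify them and repeat for whichever splits you train on.

```python
import tarfile
import urllib.request
from pathlib import Path

# Assumed OpenSLR location of the LibriTTS train-clean-100 split; verify before use.
URL = "https://www.openslr.org/resources/60/train-clean-100.tar.gz"
OUT = Path("./datasets")

OUT.mkdir(parents=True, exist_ok=True)
archive = OUT / "train-clean-100.tar.gz"
if not archive.exists():
    # download once; delete the archive to force a re-download
    urllib.request.urlretrieve(URL, archive)
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(OUT)  # unpacks under ./datasets
```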
Dump the datasets for training.

```bash
python -m speechset.utils.dump \
    --out-dir ./datasets/dumped
```
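Before training, it can help to sanity-check that the dump produced files; a small sketch (the `.npy` extension is an assumption about speechset's dump format, adjust the pattern as needed):

```python
from pathlib import Path

dumped = Path("./datasets/dumped")
# count dumped items; the extension is an assumption about the dump format
files = list(dumped.rglob("*.npy"))
print(f"{len(files)} dumped files under {dumped}")
```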
To train the model, run train.py:

```bash
python train.py \
    --data-dir ./datasets/dumped
```
To resume training from a previous checkpoint, pass --load-epoch:

```bash
python train.py \
    --data-dir ./datasets/dumped \
    --load-epoch 20 \
    --config ./ckpt/t1.json
```
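Under the hood, resuming follows the standard PyTorch checkpoint pattern; a hedged sketch (the checkpoint path and the "model"/"optim"/"epoch" keys are assumptions, not this repo's actual layout):

```python
import torch
from torch import nn

# stand-in model and optimizer, for illustration only
model = nn.Linear(4, 4)
optim = torch.optim.Adam(model.parameters())

ckpt = torch.load("./ckpt/t1/t1_20.ckpt", map_location="cpu")  # assumed path
model.load_state_dict(ckpt["model"])     # restore weights (assumed key)
optim.load_state_dict(ckpt["optim"])     # restore optimizer state (assumed key)
start_epoch = ckpt.get("epoch", 20) + 1  # continue from the next epoch
```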
Checkpoints are written to TrainConfig.ckpt, and TensorBoard summaries to TrainConfig.log.

```bash
tensorboard --logdir ./log
```
To run inference, use inference.py:

```bash
python inference.py \
    --ckpt ./ckpt/libri100_73.ckpt \
    --hifi-ckpt ./ckpt/hifigan/g_02500000 \
    --hifi-config ./ckpt/hifigan/config.json \
    --context ./sample1.wav \
    --identity ./sample2.wav
```
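Here --context presumably provides the source speech whose linguistic content is kept, while --identity provides the target speaker's voice. If you prepare inputs yourself, keep them mono and at the model's sample rate; a torchaudio sketch (22050 Hz is an assumed rate, check the training config):

```python
import torchaudio
import torchaudio.functional as F

TARGET_SR = 22050  # assumed sample rate; match the model's config

def load_mono(path: str):
    # load, downmix to mono, and resample to the target rate
    wav, sr = torchaudio.load(path)
    wav = wav.mean(dim=0, keepdim=True)
    return F.resample(wav, sr, TARGET_SR)

context = load_mono("./sample1.wav")
identity = load_mono("./sample2.wav")
```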
[TODO] Pretrained checkpoints will be released on the releases page.
To use a pretrained model, download the files and unzip them. The following is a sample script:

```python
import torch

from nansy import Nansy

# load the checkpoint on CPU and restore the model
ckpt = torch.load('t1_200.ckpt', map_location='cpu')
nansy = Nansy.load(ckpt)
nansy.eval()
```
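For inference it also helps to pick a device and disable autograd; a small continuation of the snippet above, assuming Nansy is a standard torch module:

```python
# run on GPU when available; inference needs no gradients
device = 'cuda' if torch.cuda.is_available() else 'cpu'
nansy = nansy.to(device)
with torch.inference_mode():
    ...  # call the model or inference.py's helpers here
```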