This project is named Grad-SVC, or GVC for short. Its core technology is diffusion, but it differs considerably from other diffusion-based SVC models. The code is adapted from Grad-TTS and whisper-vits-svc, so the features from whisper-vits-svc are used in this project. Diff-VC (Diffusion-Based Any-to-Any Voice Conversion) is a follow-up of Grad-TTS (Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech).
The framework of grad-svc-v1
The framework of grad-svc-v2 & v3, encoder:768->512, diffusion:64->96
Elysia_Grad_SVC.mp4
- Such beautiful code from Grad-TTS: easy to read
- Multi-speaker based on speaker encoder
- No speaker leakage, based on Perturbation & Instance Normalization & GRL
- No electronic sound
- Integrated DPM Solver-k for fewer steps
- Integrated Fast Maximum Likelihood Sampling Scheme for fewer steps
- Conditional Flow Matching (V3), first used in SVC
- Rectified Flow Matching (TODO)
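For readers unfamiliar with conditional flow matching, the following is a minimal sketch of the optimal-transport CFM training target in the form used by Matcha-TTS (listed among the references). Whether Grad-SVC V3 uses exactly this form is an assumption; `cfm_pair` and `SIGMA_MIN` are illustrative names, not identifiers from this repository.

```python
import random

SIGMA_MIN = 1e-4  # small terminal noise level; illustrative value (assumption)


def cfm_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return (x_t, u_t) for one optimal-transport CFM training example.

    x0: noise sample, x1: data sample (e.g. one mel frame), t in [0, 1].
    x_t lies on a straight path from x0 towards x1; u_t is the constant
    velocity along that path, which the network is trained to predict.
    """
    x_t = [(1 - (1 - sigma_min) * t) * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]                    # stand-in for a data sample
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise of the same shape
x_t, u_t = cfm_pair(x0, x1, t=0.3)
```

Because the path is linear in `t`, the target velocity does not depend on `t`, which is one reason sampling can work with few solver steps.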
- Install project dependencies
pip install -r requirements.txt
- Download the Timbre Encoder: Speaker-Encoder by @mueller91, and put best_model.pth.tar into speaker_pretrain/.
- Download the hubert_soft model, and put hubert-soft-0d54a1f4.pt into hubert_pretrain/.
- Download the pretrained nsf_bigvgan_pretrain_32K.pth, and put it into bigvgan_pretrain/.
Performance bottleneck: the generator and discriminator together are 116 MB, while the generator alone is only 22 MB.
- Download the pretrained model gvc.pretrain.pth, and put it into grad_pretrain/.
python gvc_inference.py --model ./grad_pretrain/gvc.pretrain.pth --spk ./assets/singers/singer0001.npy --wave test.wav
For this pretrained model, temperature is set to temperature=1.015 in gvc_inference.py to get good results.
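To give some intuition for the temperature setting: in Grad-TTS, the noise used as the starting point of reverse diffusion is divided by the temperature, so values slightly above 1 shrink the initial noise. The sketch below assumes gvc_inference.py follows the same convention, which has not been verified here; `sample_prior` is a hypothetical helper, not a function from this repository.

```python
import random
import statistics


def sample_prior(mu, temperature, rng):
    """Draw the starting point of reverse diffusion around the prior mean mu.

    Assumed convention (taken from Grad-TTS): z = mu + noise / temperature.
    """
    return [m + rng.gauss(0.0, 1.0) / temperature for m in mu]


mu = [0.0] * 10000                                  # stand-in for the encoder output
z_ref = sample_prior(mu, 1.0, random.Random(0))     # no temperature scaling
z_gvc = sample_prior(mu, 1.015, random.Random(0))   # same noise, temperature 1.015

# With identical noise, the spread shrinks by a factor of 1 / temperature.
ratio = statistics.pstdev(z_gvc) / statistics.pstdev(z_ref)
print(round(ratio, 4))
```

Under this assumption, a temperature of 1.015 reduces the standard deviation of the starting noise by about 1.5 percent.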
Put the dataset into the data_raw directory following the structure below.
data_raw
├───speaker0
│   ├───000001.wav
│   ├───...
│   └───000xxx.wav
└───speaker1
    ├───000001.wav
    ├───...
    └───000xxx.wav
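Before preprocessing, it can help to confirm that the dataset follows this layout. The following is a minimal sketch using only the standard library; `list_speakers` and `check_layout` are illustrative helpers, not part of this repository, and the checks reflect only what the tree above shows (one directory per speaker, .wav files inside).

```python
from pathlib import Path


def list_speakers(data_raw):
    """Map each speaker directory under data_raw to its sorted .wav file names."""
    root = Path(data_raw)
    speakers = {}
    for spk_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        speakers[spk_dir.name] = sorted(f.name for f in spk_dir.glob("*.wav"))
    return speakers


def check_layout(data_raw):
    """Return a list of problems; an empty list means the layout looks fine."""
    root = Path(data_raw)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    stray = sorted(p.name for p in root.iterdir() if p.is_file())
    if stray:
        problems.append(f"files directly under {root}: {stray}")
    speakers = list_speakers(root)
    if not speakers:
        problems.append("no speaker directories found")
    for name, wavs in speakers.items():
        if not wavs:
            problems.append(f"speaker '{name}' has no .wav files")
    return problems


if __name__ == "__main__":
    for line in check_layout("data_raw") or ["layout OK"]:
        print(line)
```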
After preprocessing, you will get output with the following structure.
data_gvc/
├── waves-16k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── waves-32k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── mel
│   ├── speaker0
│   │   ├── 000001.mel.pt
│   │   └── 000xxx.mel.pt
│   └── speaker1
│       ├── 000001.mel.pt
│       └── 000xxx.mel.pt
├── pitch
│   ├── speaker0
│   │   ├── 000001.pit.npy
│   │   └── 000xxx.pit.npy
│   └── speaker1
│       ├── 000001.pit.npy
│       └── 000xxx.pit.npy
├── hubert
│   ├── speaker0
│   │   ├── 000001.vec.npy
│   │   └── 000xxx.vec.npy
│   └── speaker1
│       ├── 000001.vec.npy
│       └── 000xxx.vec.npy
├── speaker
│   ├── speaker0
│   │   ├── 000001.spk.npy
│   │   └── 000xxx.spk.npy
│   └── speaker1
│       ├── 000001.spk.npy
│       └── 000xxx.spk.npy
└── singer
    ├── speaker0.spk.npy
    └── speaker1.spk.npy
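Once preprocessing has finished, the tree above can be used as a checklist. The sketch below walks waves-16k and reports any feature file that the tree implies but that is missing on disk. `FEATURES` and `missing_features` are illustrative names; the directory names and suffixes are taken from the tree above, and nothing here inspects file contents.

```python
from pathlib import Path

# Feature directory -> file suffix, as shown in the tree above.
FEATURES = {
    "waves-32k": ".wav",
    "mel": ".mel.pt",
    "pitch": ".pit.npy",
    "hubert": ".vec.npy",
    "speaker": ".spk.npy",
}


def missing_features(data_gvc):
    """List expected files (relative to data_gvc) that do not exist on disk."""
    root = Path(data_gvc)
    missing = set()
    for wav in sorted((root / "waves-16k").glob("*/*.wav")):
        speaker, stem = wav.parent.name, wav.stem
        # One feature file per utterance in each feature directory.
        for folder, suffix in FEATURES.items():
            expected = root / folder / speaker / (stem + suffix)
            if not expected.is_file():
                missing.add(expected.relative_to(root).as_posix())
        # One averaged timbre code per speaker.
        singer = root / "singer" / (speaker + ".spk.npy")
        if not singer.is_file():
            missing.add(singer.relative_to(root).as_posix())
    return sorted(missing)
```

Calling `missing_features("data_gvc")` returns an empty list when every utterance has all of its features.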
- Re-sampling
  - Generate audio with a sampling rate of 16000 Hz in ./data_gvc/waves-16k
  python prepare/preprocess_a.py -w ./data_raw -o ./data_gvc/waves-16k -s 16000
  - Generate audio with a sampling rate of 32000 Hz in ./data_gvc/waves-32k
  python prepare/preprocess_a.py -w ./data_raw -o ./data_gvc/waves-32k -s 32000
- Use 16K audio to extract pitch
python prepare/preprocess_f0.py -w data_gvc/waves-16k/ -p data_gvc/pitch
- Use 32k audio to extract mel
python prepare/preprocess_spec.py -w data_gvc/waves-32k/ -s data_gvc/mel
- Use 16K audio to extract hubert
python prepare/preprocess_hubert.py -w data_gvc/waves-16k/ -v data_gvc/hubert
- Use 16k audio to extract timbre code
python prepare/preprocess_speaker.py data_gvc/waves-16k/ data_gvc/speaker
- Extract the average value of the timbre code for inference
python prepare/preprocess_speaker_ave.py data_gvc/speaker/ data_gvc/singer
- Use 32k audio to generate training index
python prepare/preprocess_train.py
- Training file debugging
python prepare/preprocess_zzz.py
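The preprocessing steps above can be chained in a small driver script. The sketch below only reuses the scripts and arguments listed above, in the same order; the names `STEPS`, `build_commands` and `run_all` are illustrative and not part of this repository. It defaults to a dry run that prints each command without executing it.

```python
import subprocess
import sys

# Scripts and arguments copied from the steps above, in the listed order.
STEPS = [
    ["prepare/preprocess_a.py", "-w", "./data_raw", "-o", "./data_gvc/waves-16k", "-s", "16000"],
    ["prepare/preprocess_a.py", "-w", "./data_raw", "-o", "./data_gvc/waves-32k", "-s", "32000"],
    ["prepare/preprocess_f0.py", "-w", "data_gvc/waves-16k/", "-p", "data_gvc/pitch"],
    ["prepare/preprocess_spec.py", "-w", "data_gvc/waves-32k/", "-s", "data_gvc/mel"],
    ["prepare/preprocess_hubert.py", "-w", "data_gvc/waves-16k/", "-v", "data_gvc/hubert"],
    ["prepare/preprocess_speaker.py", "data_gvc/waves-16k/", "data_gvc/speaker"],
    ["prepare/preprocess_speaker_ave.py", "data_gvc/speaker/", "data_gvc/singer"],
    ["prepare/preprocess_train.py"],
    ["prepare/preprocess_zzz.py"],
]


def build_commands(python=sys.executable):
    """Prefix every step with the Python interpreter to use."""
    return [[python, *step] for step in STEPS]


def run_all(dry_run=True):
    """Print each command and, unless dry_run, run it; stop at the first failure."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


if __name__ == "__main__":
    run_all(dry_run=True)
```

Running later steps after an earlier one has failed would operate on incomplete inputs, which is why the sketch stops at the first non-zero exit code.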
- Start training
python gvc_trainer.py
- Resume training
python gvc_trainer.py -p logs/grad_svc/grad_svc_***.pth
- Log visualization
tensorboard --logdir logs/
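When resuming, the checkpoint path has to be filled in by hand. The following sketch picks the most recently modified file matching the pattern from the resume command above. It assumes checkpoints are written to logs/grad_svc with names starting with grad_svc_, as that command suggests; `latest_checkpoint` is a hypothetical helper, and the part of the file name elided as `***` is not interpreted.

```python
from pathlib import Path


def latest_checkpoint(log_dir="logs/grad_svc", pattern="grad_svc_*.pth"):
    """Return the most recently modified checkpoint in log_dir, or None."""
    root = Path(log_dir)
    if not root.is_dir():
        return None
    candidates = list(root.glob(pattern))
    if not candidates:
        return None
    # Use modification time rather than parsing the file name.
    return max(candidates, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    ckpt = latest_checkpoint()
    if ckpt is None:
        print("no checkpoint found; start with: python gvc_trainer.py")
    else:
        print(f"python gvc_trainer.py -p {ckpt.as_posix()}")
```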
- Export inference model
python gvc_export.py --checkpoint_path logs/grad_svc/grad_svc_***.pth
- Inference
python gvc_inference.py --model gvc.pth --spk ./data_gvc/singer/your_singer.spk.npy --wave test.wav --rature 1.015 --shift 0
The temperature, 1.015 here, needs to be adjusted to get good results; the recommended range is (1.001, 1.035).
- Inference step by step
  - Extract hubert content vector
  python hubert/inference.py -w test.wav -v test.vec.npy
  - Extract pitch to the csv text format
  python pitch/inference.py -w test.wav -p test.csv
  - Convert hubert & pitch to wave
  python gvc_inference.py --model gvc.pth --spk ./data_gvc/singer/your_singer.spk.npy --wave test.wav --vec test.vec.npy --pit test.csv --shift 0
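Both inference commands above pass `--shift 0`. The sketch below assumes the value is a pitch shift in semitones applied to the voiced frames of the extracted F0 curve, a common convention in SVC tools that has not been confirmed against gvc_inference.py. `shift_f0` is a hypothetical helper, not a function from this repository.

```python
def shift_f0(f0_hz, semitones):
    """Shift an F0 curve by a number of semitones.

    Unvoiced frames are assumed to be marked with 0 and are left at 0.
    One semitone corresponds to a frequency ratio of 2 ** (1 / 12).
    """
    ratio = 2.0 ** (semitones / 12.0)
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


f0 = [0.0, 220.0, 233.1, 0.0, 246.9]   # Hz; zeros mark unvoiced frames
same = shift_f0(f0, 0)                  # a shift of 0 leaves the pitch unchanged
octave_up = shift_f0(f0, 12)            # +12 semitones doubles every voiced frame
```

Under this assumption, positive values raise the pitch (for example when converting a low source voice to a higher target singer) and negative values lower it.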
https://github.com/huawei-noah/Speech-Backbones/blob/main/Grad-TTS
https://github.com/huawei-noah/Speech-Backbones/tree/main/DiffVC
https://github.com/facebookresearch/speech-resynthesis
https://github.com/cantabile-kwok/VoiceFlow-TTS
https://github.com/shivammehta25/Matcha-TTS
https://github.com/shivammehta25/Diff-TTSG
https://github.com/majidAdibian77/ResGrad
https://github.com/LuChengTHU/dpm-solver
https://github.com/gmltmd789/UnitSpeech
https://github.com/zhenye234/CoMoSpeech
https://github.com/seahore/PPG-GradVC
https://github.com/thuhcsi/LightGrad
https://github.com/lmnt-com/wavegrad
https://github.com/naver-ai/facetts
https://github.com/jaywalnut310/vits
https://github.com/NVIDIA/BigVGAN
https://github.com/bshall/soft-vc
https://github.com/mozilla/TTS
https://github.com/ubisoft/ubisoft-laforge-daft-exprt
https://github.com/yl4579/StyleTTS-VC