- Implementation of "Neural Speech Synthesis with Transformer Network"
- This implementation also prepares the alignments and mel-spectrogram targets used to train FastSpeech
- Download and extract the LJ Speech dataset
- Make a `preprocessed` folder in the LJSpeech directory, and make `char_seq`, `phone_seq`, and `melspectrogram` folders in it
- Set `data_path` in `hparams.py` to the LJSpeech folder
- Using `prepare_data.ipynb`, prepare the melspectrogram and text (converted into indices) tensors (a sketch of this step follows the list)
- `python train.py`
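The actual preprocessing lives in `prepare_data.ipynb` and `hparams.py`; the following is only a rough sketch of what that step amounts to. The symbol table, mel parameters, and function names below are assumptions, not the notebook's real values.

```python
# Sketch of the data-preparation step (the real logic is in prepare_data.ipynb;
# the character set and mel hyperparameters here are illustrative assumptions).
import librosa
import numpy as np
import torch

SYMBOLS = "abcdefghijklmnopqrstuvwxyz'.,?! "          # hypothetical character set
CHAR_TO_ID = {c: i + 1 for i, c in enumerate(SYMBOLS)}  # 0 reserved for padding

def text_to_indices(text: str) -> torch.LongTensor:
    """Convert a transcript into a tensor of symbol indices."""
    return torch.LongTensor([CHAR_TO_ID[c] for c in text.lower() if c in CHAR_TO_ID])

def wav_to_mel(path: str, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    """Load a wav and compute a log-mel spectrogram (parameter values are assumptions)."""
    wav, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return torch.FloatTensor(np.log(np.clip(mel, 1e-5, None)))
```

Each resulting tensor pair would then be saved (e.g. with `torch.save`) into the `char_seq` and `melspectrogram` folders created above.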
- Encoder Alignments (figure)
You can hear the audio samples here
- Unlike the original paper, I didn't use an encoder prenet, following espnet
- I apply an additional "guided attention loss" to the two heads of the last two layers
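Guided attention loss (Tachibana et al., 2017) penalizes attention mass that falls far from the diagonal, encouraging monotonic text-to-speech alignment. The repository's exact weighting and the width value `g` below are assumptions; this is only a minimal sketch of the standard formulation.

```python
import torch

def guided_attention_loss(attn: torch.Tensor, g: float = 0.2) -> torch.Tensor:
    """attn: (T_dec, T_enc) attention weights for one head.
    Penalizes weight far from the diagonal; g (diagonal width) is an assumed value."""
    T_dec, T_enc = attn.shape
    t = torch.arange(T_dec).unsqueeze(1) / T_dec   # normalized decoder positions, (T_dec, 1)
    s = torch.arange(T_enc).unsqueeze(0) / T_enc   # normalized encoder positions, (1, T_enc)
    soft_mask = 1.0 - torch.exp(-((s - t) ** 2) / (2 * g ** 2))  # ~0 on diagonal, ~1 far off it
    return (attn * soft_mask).mean()
```

As stated above, this penalty would be added to the main loss only for the two heads of the last two layers.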
- Batch size is important, so I use gradient accumulation
- You can also use `DataParallel`. Change `n_gpus`, `batch_size`, and `accumulation` appropriately.
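Gradient accumulation makes an effective batch of `batch_size * accumulation` fit in limited memory: gradients from several small batches are summed before a single optimizer step. A minimal sketch, using the `accumulation` hyperparameter named above; the `model`, `loader`, and `criterion` objects are hypothetical.

```python
import torch

def train_one_epoch(model, loader, criterion, optimizer, accumulation: int):
    """One epoch with gradient accumulation: effective batch = batch_size * accumulation."""
    optimizer.zero_grad()
    for step, (text, mel) in enumerate(loader):
        loss = criterion(model(text), mel)
        (loss / accumulation).backward()   # scale so summed grads match one large batch
        if (step + 1) % accumulation == 0:
            optimizer.step()               # update once per `accumulation` mini-batches
            optimizer.zero_grad()
```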
- Dynamic batch
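The README doesn't spell out the dynamic-batching scheme. One common interpretation, sketched here purely as an assumption, is to sort utterances by length and cap each batch by a total-frame budget instead of a fixed sample count, so long and short utterances get different batch sizes.

```python
def make_dynamic_batches(lengths, max_frames=16000):
    """Group sample indices so each batch stays under a frame budget.
    This scheme and the budget value are assumptions, not the repo's actual logic.
    lengths: list of mel-frame counts, one per utterance."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])  # short to long
    batches, batch, frames = [], [], 0
    for i in order:
        if batch and frames + lengths[i] > max_frames:
            batches.append(batch)          # close the batch before exceeding the budget
            batch, frames = [], 0
        batch.append(i)
        frames += lengths[i]
    if batch:
        batches.append(batch)
    return batches
```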
1. For FastSpeech, the generated melspectrograms and attention matrices should be saved for later use.
   1-1. Set `teacher_path` in `hparams.py` and make `alignments` and `targets` directories there.
   1-2. Using `prepare_fastspeech.ipynb`, prepare the alignments and targets.
2. To draw attention plots for every single head, I changed the return values of `torch.nn.functional.multi_head_attention_forward()`:
```python
# before
return attn_output, attn_output_weights.sum(dim=1) / num_heads

# after
return attn_output, attn_output_weights
```
- Among the `num_layers * num_heads` attention matrices, the one with the highest focus rate is saved.
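The focus rate comes from the FastSpeech paper: F = (1/T) Σₜ maxₛ aₜ,ₛ, the average of each decoder step's largest attention weight, so sharp, diagonal heads score highest. A minimal sketch of selecting the best head under that definition (function names are mine, not the repo's):

```python
import torch

def focus_rate(attn: torch.Tensor) -> torch.Tensor:
    """attn: (T_dec, T_enc). Mean over decoder steps of the max attention weight."""
    return attn.max(dim=1).values.mean()

def best_head(attns):
    """attns: iterable of (T_dec, T_enc) matrices, one per (layer, head).
    Returns the matrix with the highest focus rate, to be saved as the alignment."""
    return max(attns, key=lambda a: focus_rate(a).item())
```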
1. NVIDIA/tacotron2: https://github.com/NVIDIA/tacotron2
2. espnet/espnet: https://github.com/espnet/espnet
3. soobinseo/Transformer-TTS: https://github.com/soobinseo/Transformer-TTS