- Implementation of "Neural Speech Synthesis with Transformer Network"
- This implementation also prepares the alignments and mel-spectrogram targets used to train FastSpeech
- Download and extract the LJ Speech dataset
- Make a `preprocessed` folder in the LJSpeech directory, and make `char_seq`, `phone_seq`, and `melspectrogram` folders in it
- Set `data_path` in `hparams.py` to the LJSpeech folder
- Using `prepare_data.ipynb`, prepare the melspectrogram and text (converted into indices) tensors (a sketch of this step follows the list)
- `python train.py`
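The actual preprocessing lives in `prepare_data.ipynb` and `hparams.py`; the following is only a rough sketch of what that step amounts to. The symbol table, mel parameters, and function names below are assumptions, not the notebook's real values.

```python
# Sketch of the data-preparation step (the real logic is in prepare_data.ipynb;
# the character set and mel hyperparameters here are illustrative assumptions).
import librosa
import numpy as np
import torch

SYMBOLS = "abcdefghijklmnopqrstuvwxyz'.,?! "          # hypothetical character set
CHAR_TO_ID = {c: i + 1 for i, c in enumerate(SYMBOLS)}  # 0 reserved for padding

def text_to_indices(text: str) -> torch.LongTensor:
    """Convert a transcript into a tensor of symbol indices."""
    return torch.LongTensor([CHAR_TO_ID[c] for c in text.lower() if c in CHAR_TO_ID])

def wav_to_mel(path: str, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    """Load a wav and compute a log-mel spectrogram (parameter values are assumptions)."""
    wav, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return torch.FloatTensor(np.log(np.clip(mel, 1e-5, None)))
```

Each resulting tensor pair would then be saved (e.g. with `torch.save`) into the `char_seq` and `melspectrogram` folders created above.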
- Encoder Alignments (figure)
You can hear the audio samples here
- Unlike the original paper, I didn't use an encoder prenet, following espnet
- I apply an additional "guided attention loss" to the two heads of the last two layers
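Guided attention loss (Tachibana et al., 2017) penalizes attention mass that falls far from the diagonal, encouraging monotonic text-to-speech alignment. The repository's exact weighting and the width value `g` below are assumptions; this is only a minimal sketch of the standard formulation.

```python
import torch

def guided_attention_loss(attn: torch.Tensor, g: float = 0.2) -> torch.Tensor:
    """attn: (T_dec, T_enc) attention weights for one head.
    Penalizes weight far from the diagonal; g (diagonal width) is an assumed value."""
    T_dec, T_enc = attn.shape
    t = torch.arange(T_dec).unsqueeze(1) / T_dec   # normalized decoder positions, (T_dec, 1)
    s = torch.arange(T_enc).unsqueeze(0) / T_enc   # normalized encoder positions, (1, T_enc)
    soft_mask = 1.0 - torch.exp(-((s - t) ** 2) / (2 * g ** 2))  # ~0 on diagonal, ~1 far off it
    return (attn * soft_mask).mean()
```

As stated above, this penalty would be added to the main loss only for the two heads of the last two layers.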
- Batch size is important, so I use gradient accumulation
- You can also use `DataParallel`. Change `n_gpus`, `batch_size`, and `accumulation` appropriately.
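Gradient accumulation makes an effective batch of `batch_size * accumulation` fit in limited memory: gradients from several small batches are summed before a single optimizer step. A minimal sketch, using the `accumulation` hyperparameter named above; the `model`, `loader`, and `criterion` objects are hypothetical.

```python
import torch

def train_one_epoch(model, loader, criterion, optimizer, accumulation: int):
    """One epoch with gradient accumulation: effective batch = batch_size * accumulation."""
    optimizer.zero_grad()
    for step, (text, mel) in enumerate(loader):
        loss = criterion(model(text), mel)
        (loss / accumulation).backward()   # scale so summed grads match one large batch
        if (step + 1) % accumulation == 0:
            optimizer.step()               # update once per `accumulation` mini-batches
            optimizer.zero_grad()
```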
- Dynamic batch
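The README doesn't spell out the dynamic-batching scheme. One common interpretation, sketched here purely as an assumption, is to sort utterances by length and cap each batch by a total-frame budget instead of a fixed sample count, so long and short utterances get different batch sizes.

```python
def make_dynamic_batches(lengths, max_frames=16000):
    """Group sample indices so each batch stays under a frame budget.
    This scheme and the budget value are assumptions, not the repo's actual logic.
    lengths: list of mel-frame counts, one per utterance."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])  # short to long
    batches, batch, frames = [], [], 0
    for i in order:
        if batch and frames + lengths[i] > max_frames:
            batches.append(batch)          # close the batch before exceeding the budget
            batch, frames = [], 0
        batch.append(i)
        frames += lengths[i]
    if batch:
        batches.append(batch)
    return batches
```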
1. For FastSpeech, the generated melspectrograms and attention matrices should be saved for later use.
   1-1. Set `teacher_path` in `hparams.py` and make `alignments` and `targets` directories there.
   1-2. Using `prepare_fastspeech.ipynb`, prepare the alignments and targets.
2. To draw attention plots for every single head, I changed the return values of `torch.nn.functional.multi_head_attention_forward()`:
```python
# before
return attn_output, attn_output_weights.sum(dim=1) / num_heads

# after
return attn_output, attn_output_weights
```
- Among the `num_layers * num_heads` attention matrices, the one with the highest focus rate is saved.
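The focus rate comes from the FastSpeech paper: F = (1/T) Σₜ maxₛ aₜ,ₛ, the average of each decoder step's largest attention weight, so sharp, diagonal heads score highest. A minimal sketch of selecting the best head under that definition (function names are mine, not the repo's):

```python
import torch

def focus_rate(attn: torch.Tensor) -> torch.Tensor:
    """attn: (T_dec, T_enc). Mean over decoder steps of the max attention weight."""
    return attn.max(dim=1).values.mean()

def best_head(attns):
    """attns: iterable of (T_dec, T_enc) matrices, one per (layer, head).
    Returns the matrix with the highest focus rate, to be saved as the alignment."""
    return max(attns, key=lambda a: focus_rate(a).item())
```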
1. NVIDIA/tacotron2: https://github.com/NVIDIA/tacotron2
2. espnet/espnet: https://github.com/espnet/espnet
3. soobinseo/Transformer-TTS: https://github.com/soobinseo/Transformer-TTS