
The result obtained by eval_model or synthesis is much worse than that obtained during training #201

Open
Eleanor456 opened this issue May 30, 2020 · 8 comments


@Eleanor456

When I generate audio from the checkpoint at 32,000 steps, the output is pure noise, and the alignment plots are always empty, as shown below. How can I get results close to the normal-sounding audio produced during training?

[alignment image: step000034000_text1_multispeaker10_alignment]

@marianbasti

What datasets and presets are you using?

@Eleanor456

> What datasets and presets are you using?

A Chinese dataset with 61 speakers; I modified the preset based on deepvoice3_vctk.json.

@marianbasti

What frontend did you select?
I'm trying to train on Spanish speakers and the results are a little gibberish, but not noise.

@Eleanor456

Eleanor456 commented Jun 1, 2020

> What frontend did you select?
> I'm trying to train on Spanish speakers and the results are a little gibberish, but not noise.

I converted the transcript to pinyin, so I selected the en frontend. I think the bad result may be because I haven't trained for enough epochs.
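As a side note, one failure mode with this pinyin-plus-en-frontend approach is untranslated characters slipping through the conversion. The original comment doesn't say how the pinyin was produced, so this is only a hypothetical sanity check, assuming tone-number pinyin (e.g. "ni3 hao3") and that the en frontend expects plain ASCII:

```python
import re

# Hypothetical check (not part of deepvoice3_pytorch): a pinyin-converted
# transcript for the en frontend should contain only lowercase letters,
# tone digits, spaces, and basic punctuation. Stray Chinese characters or
# full-width punctuation left in the text would be silently mishandled.
VALID = re.compile(r"^[a-z0-9 ,.?!']+$")

def check_transcript(line: str) -> bool:
    """Return True if the pinyin line looks safe for an ASCII frontend."""
    return bool(VALID.match(line))

print(check_transcript("ni3 hao3 shi4 jie4"))  # True
print(check_transcript("ni3 hao3 世界"))        # False: untranslated characters
```

Running a check like this over the whole metadata file before training would rule out transcript problems as the cause of the noise.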

@marianbasti

It shouldn't be so noisy. This is what I get with 40,000 steps on a 13-speaker dataset.
[alignment image: step000040000_text3_multispeaker10_alignment]

es frontend, so no phonetic dictionary.

@Eleanor456

Eleanor456 commented Jun 1, 2020

> It shouldn't be so noisy. This is what I get with 40,000 steps on a 13-speaker dataset.
> [alignment image: step000040000_text3_multispeaker10_alignment]
>
> es frontend, so no phonetic dictionary.

This is the result after training for 61,000 steps with a batch size of 64.
[alignment image]

It is slightly better than before, so I plan to continue training and observe the results.
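Since the discussion turns on reading alignment plots by eye, a rough numeric score can make "slightly better" concrete. The helper below is a hypothetical sketch (not part of deepvoice3_pytorch): it measures how far the attention mass sits from the ideal diagonal of an alignment matrix, where rows are decoder steps and columns are encoder steps. A healthy alignment scores near zero; an empty or noisy one scores high:

```python
def diagonality(align):
    """Mean absolute deviation of attention mass from the ideal diagonal.

    align: list of rows, one per decoder step; each row holds attention
    weights over encoder steps. Lower is better (0.0 = perfectly diagonal).
    """
    n_dec, n_enc = len(align), len(align[0])
    score = 0.0
    for t, row in enumerate(align):
        total = sum(row) or 1.0  # guard against an all-zero row
        # Ideal encoder position for decoder step t on a straight diagonal.
        expected = t / max(n_dec - 1, 1) * (n_enc - 1)
        # Weighted distance of this row's mass from that position.
        score += sum(w * abs(j - expected) for j, w in enumerate(row)) / total
    return score / n_dec

sharp = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # perfectly diagonal attention
flat = [[1 / 3] * 3 for _ in range(3)]     # uniform attention, no alignment
print(diagonality(sharp))  # 0.0
print(diagonality(flat))   # positive: mass spread away from the diagonal
```

Tracking a score like this across checkpoints would show whether the alignment is actually converging, instead of comparing plots visually.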

@marianbasti

Please let me know how well it goes with that batch size.

@JohnHerry

I have the same problem. I am using the MAGICDATA dataset with 1,016 speakers; training for 1,500,000~2,000,000 steps gives good results during training, but inference with these two checkpoints produces bad speech.
@Eleanor456 Is your model good right now?
