
In StyleMelGAN, voice quality decreased during G and D joint training #350

Open
huhuqwaszxedc opened this issue Apr 5, 2022 · 4 comments
Labels
question Further information is requested

Comments

@huhuqwaszxedc

Hi, I have a problem and I hope you can help me. I use the StyleMelGAN generator together with the MelGAN discriminator. When I pretrain the generator alone, voice quality improves, but after joint G/D training voice quality decreases. Is the discriminator network holding the generator back? Also, my utterances are very short: at 16 kHz sampling there are only 2880 sample points per sentence. Could the 4× downsampling pooling layers in the discriminator mislead the generator?
Here is my training loss:

[figure: training loss curves]
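For reference, a rough sketch of the arithmetic behind this concern (plain Python; it assumes three stride-4 downsampling blocks per discriminator and stride-2 average pooling between the multi-scale inputs, as in the MelGAN defaults):

# How many time steps survive the discriminator for a 2880-sample clip?
clip = 2880                       # samples per utterance at 16 kHz (0.18 s)
for scale in range(3):            # three discriminator scales
    steps = clip // (2 ** scale)  # input length after the multi-scale pooling
    for _ in range(3):            # three stride-4 downsampling blocks
        steps //= 4
    print(f"scale {scale}: ~{steps} time steps at the deepest layer")
# -> ~45, ~22, and ~11 steps: very little temporal context for D to judge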

@kan-bayashi kan-bayashi added the question Further information is requested label Apr 5, 2022
@kan-bayashi
Owner

I cannot comment from this figure alone.
Please attach your config and share the details of your dataset.

@huhuqwaszxedc
Author

huhuqwaszxedc commented Apr 6, 2022

Thank you very much for your reply. Based on your source code, I am trying to use the StyleMelGAN generator and the MelGAN multi-scale discriminator for speech packet loss concealment. The dataset consists of 11,000 speech utterances selected from LibriSpeech.

sampling_rate: 16000     # Sampling rate.
fft_size: 1024           # FFT size.
hop_size: 160            # Hop size.
win_length: null         # Window length.
                         # If set to null, it will be the same as fft_size.
window: "hann"           # Window function.
num_mels: 80             # Number of mel basis.
fmin: 80                 # Minimum freq in mel basis calculation.
fmax: 7600               # Maximum frequency in mel basis calculation.
global_gain_scale: 1.0   # Will be multiplied to all of waveform.
trim_silence: true       # Whether to trim the start and end of silence.
trim_threshold_in_db: 20 # Need to tune carefully if the recording is not good.
trim_frame_size: 1024    # Frame size in trimming.
trim_hop_size: 160       # Hop size in trimming.
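For illustration, here is roughly the feature extraction these settings describe, sketched with librosa (my assumption for readability; the repo's own preprocessing script is the reference):

import librosa
import numpy as np

y, sr = librosa.load("sample.wav", sr=16000)   # sampling_rate
# Trim leading/trailing silence (trim_threshold_in_db / trim_frame_size / trim_hop_size).
y, _ = librosa.effects.trim(y, top_db=20, frame_length=1024, hop_length=160)
mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=1024, hop_length=160,        # fft_size / hop_size
    win_length=None, window="hann",    # win_length: null -> same as fft_size
    n_mels=80, fmin=80, fmax=7600,     # num_mels / fmin / fmax
)
logmel = np.log10(np.maximum(mel, 1e-10))  # log compression (my assumption)
print(logmel.shape)                        # (80, n_frames)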

discriminator_type: "MelGANMultiScaleDiscriminator" # Discriminator type.
discriminator_params:
    in_channels: 1                    # Number of input channels.
    out_channels: 1                   # Number of output channels.
    scales: 3                         # Number of multi-scales.
    downsample_pooling: "AvgPool1d"   # Pooling type for the input downsampling.
    downsample_pooling_params:        # Parameters of the above pooling function.
        kernel_size: 4
        stride: 2
        padding: 1
        count_include_pad: False
    kernel_sizes: [5, 3]              # List of kernel size.
    channels: 16                      # Number of channels of the initial conv layer.
    max_downsample_channels: 512      # Maximum number of channels of downsampling layers.
    downsample_scales: [4, 4, 4]      # List of downsampling scales.
    nonlinear_activation: "LeakyReLU" # Nonlinear activation function.
    nonlinear_activation_params:      # Parameters of nonlinear activation function.
        negative_slope: 0.2
    use_weight_norm: True             # Whether to use weight norm.

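As a sanity check of these settings, something like the following shows how short the discriminator feature maps get for a 2880-sample clip. It assumes the parallel_wavegan package is installed and that discriminator_params are forwarded verbatim to the class constructor, as the training script does:

import torch
from parallel_wavegan.models import MelGANMultiScaleDiscriminator

d = MelGANMultiScaleDiscriminator(
    in_channels=1, out_channels=1, scales=3,
    downsample_pooling="AvgPool1d",
    downsample_pooling_params={
        "kernel_size": 4, "stride": 2, "padding": 1, "count_include_pad": False,
    },
    kernel_sizes=[5, 3], channels=16, max_downsample_channels=512,
    downsample_scales=[4, 4, 4],
    nonlinear_activation="LeakyReLU",
    nonlinear_activation_params={"negative_slope": 0.2},
    use_weight_norm=True,
)
outs = d(torch.randn(1, 1, 2880))      # one batch_max_steps-length clip
for i, scale_outs in enumerate(outs):  # one list of feature maps per scale
    print(i, [tuple(o.shape) for o in scale_outs])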
generator_type: "StyleMelGANGenerator" # Generator type.
generator_params:
    in_channels: 128
    aux_channels: 80
    channels: 64
    out_channels: 1
    kernel_size: 9
    dilation: 2
    bias: True
    noise_upsample_scales: [10, 2, 2, 2]
    noise_upsample_activation: "LeakyReLU"
    noise_upsample_activation_params:
        negative_slope: 0.2
    upsample_scales: [5, 1, 2, 1, 2, 2, 2, 2]
    upsample_mode: "nearest"
    gated_function: "softmax"
    use_weight_norm: True

batch_size: 32              # Batch size.
batch_max_steps: 2880       # Length of each audio in batch. Make sure divisible by hop_size.
pin_memory: true            # Whether to pin memory in PyTorch DataLoader.
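A quick consistency check these values should pass (plain Python; the only assumption is that the generator's total upsampling factor must equal hop_size, so that one mel frame becomes hop_size waveform samples):

import math

hop_size = 160
upsample_scales = [5, 1, 2, 1, 2, 2, 2, 2]  # generator_params above
batch_max_steps = 2880

assert math.prod(upsample_scales) == hop_size   # 5*1*2*1*2*2*2*2 = 160
assert batch_max_steps % hop_size == 0          # an integer number of mel frames
print(batch_max_steps // hop_size, "mel frames per training clip")  # -> 18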

stft_loss_params:
    fft_sizes: [1024, 2048, 512]  # List of FFT size for STFT-based loss.
    hop_sizes: [120, 240, 50]     # List of hop size for STFT-based loss
    win_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
    window: "hann_window"         # Window function for STFT-based loss
use_subband_stft_loss: true
subband_stft_loss_params:
    fft_sizes: [384, 683, 171]  # List of FFT size for STFT-based loss.
    hop_sizes: [30, 60, 10]     # List of hop size for STFT-based loss
    win_lengths: [150, 300, 60] # List of window length for STFT-based loss.
    window: "hann_window"       # Window function for STFT-based loss

use_feat_match_loss: false # Whether to use feature matching loss.
lambda_adv: 3            # Loss balancing coefficient for adversarial loss.
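For reference, a minimal sketch of the per-resolution losses that a multi-resolution STFT loss sums over (spectral convergence plus log-magnitude L1), written in plain PyTorch as an illustration rather than the repo's exact implementation:

import torch
import torch.nn.functional as F

def stft_losses(x, y, fft_size=1024, hop_size=120, win_length=600):
    """One resolution of the multi-resolution STFT loss (sketch)."""
    window = torch.hann_window(win_length)
    X = torch.stft(x, fft_size, hop_size, win_length, window, return_complex=True).abs()
    Y = torch.stft(y, fft_size, hop_size, win_length, window, return_complex=True).abs()
    sc = torch.norm(Y - X, p="fro") / torch.norm(Y, p="fro")   # spectral convergence
    mag = F.l1_loss(torch.log(X + 1e-7), torch.log(Y + 1e-7))  # log STFT magnitude
    return sc, mag

# The full loss averages (sc + mag) over the three resolutions in stft_loss_params.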

@kan-bayashi
Owner

  • batch_max_steps seems too short.
  • What is your intention in using a different discriminator? Did you try the default combination? If not, you should try it first.

@huhuqwaszxedc
Author

huhuqwaszxedc commented Apr 12, 2022 via email
