Hi all - I was wondering if anyone has ideas about, or has tried, implementing a VAE with DDSP components. My current setup takes the Autoencoder architecture and modifies the encoder and the loss function to be variational; the decoder and processor group are unchanged from the original autoencoder. (Here's the gin file for my model.)
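To make the setup concrete, here's a minimal sketch of the variational change in TensorFlow/Keras (`VariationalZ`, `z_dims`, and the two `Dense` projections are my own illustrative names and choices, not anything from the ddsp library):

```python
import tensorflow as tf

class VariationalZ(tf.keras.layers.Layer):
  """Projects a deterministic embedding to (mu, log_var) and samples z."""

  def __init__(self, z_dims=16, **kwargs):
    super().__init__(**kwargs)
    self.to_mu = tf.keras.layers.Dense(z_dims)
    self.to_log_var = tf.keras.layers.Dense(z_dims)

  def call(self, z_deterministic):
    mu = self.to_mu(z_deterministic)
    log_var = self.to_log_var(z_deterministic)
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
    eps = tf.random.normal(tf.shape(mu))
    z = mu + tf.exp(0.5 * log_var) * eps
    # Analytic KL(q(z|x) || N(0, I)), summed over latent dims.
    kl = 0.5 * tf.reduce_sum(
        tf.exp(log_var) + tf.square(mu) - 1.0 - log_var, axis=-1)
    return z, kl
```

The KL term then gets added to the spectral reconstruction loss with a weight (so the total objective is the usual reconstruction + KL ELBO form).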
It is able to fit NSynth, using `SpectralLoss` for the reconstruction loss. However, it doesn't seem to be able to generate novel samples: when I supply an f0/loudness, sample z from a Gaussian, and decode, I get almost exactly the same audio each time. Upon inspection, I realized that the decoder is relying almost entirely on the f0 and loudness to reconstruct the inputs, rather than on the latent embedding z (my sampling procedure is sketched below). Do you have any suggestions on getting it to use z more to encode the timbre? (Perhaps the fully unsupervised variant would help, but that seemed not to work as well in the DDSP paper?)
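For reference, my generation test looks roughly like this (a sketch, not exact code: `model`, `f0_hz`, `loudness_db`, and the shapes are placeholders for my trained model and conditioning features, and `decode()` stands in for the decoder + processor-group forward pass):

```python
import tensorflow as tf

# Placeholders: model, f0_hz, loudness_db come from a trained checkpoint
# and a held-out example; shapes below match my encoder's output.
z_dims, n_frames = 16, 1000

for _ in range(4):
  # Draw a fresh latent from the prior and hold it constant over time.
  z = tf.random.normal([1, 1, z_dims])
  z = tf.tile(z, [1, n_frames, 1])
  conditioning = {'f0_hz': f0_hz, 'loudness_db': loudness_db, 'z': z}
  audio = model.decode(conditioning, training=False)
  # Every iteration yields nearly identical audio -> the decoder ignores z.
```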