Replies: 13 comments
-
Hi, I've experimented very briefly with this and tended to get better
results than what you've shown so far ("oooh"s and "ahhh"s), but that was at
16 kHz, which might help it not focus too much on high frequencies. If
you're doing 44100, you'll also want to use much larger FFT sizes for the
loss functions to get the frequency resolution. In general, voice conversion
models also condition on phonemes as well as pitch and loudness to
control what is being said, but the system is not currently set up or
optimized for voice.
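(For concreteness, a rough sketch of the FFT-size point, assuming the multi-scale spectral loss exposed as ddsp.losses.SpectralLoss; the 44.1 kHz sizes below are only a guess at what "much larger" might mean, not a tested config. FFT bin spacing is sample_rate / fft_size, so the same window gets ~2.75x coarser in Hz when you move from 16 kHz to 44.1 kHz.)

```python
import ddsp

# 16 kHz defaults: the largest window gives 16000 / 2048 ~= 7.8 Hz per bin.
loss_16k = ddsp.losses.SpectralLoss(fft_sizes=(2048, 1024, 512, 256, 128, 64))

# At 44.1 kHz the same windows give ~2.75x coarser bins (44100 / 2048 ~= 21.5 Hz),
# so bump every size up (here 4x, the next power of two past 2.75x) to get
# 44100 / 8192 ~= 5.4 Hz for the largest window.
loss_44k = ddsp.losses.SpectralLoss(fft_sizes=(8192, 4096, 2048, 1024, 512, 256))
```

In the gin-based training configs this would presumably just be a larger fft_sizes binding on the spectral loss.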
-
This fork has an adaptation of DDSP to singing voices: https://github.com/gianmarcohutter/ddsp_gm2. It uses conditioning on phonemes.
-
Thanks, @jesseengel. 16k is giving good initial results for resynthesis. May revisit this thread w/ questions and/or results as I attempt to scale up fidelity. Also, appreciate the link @voodoohop. Going to check that out.
-
Hi @jesseengel -- wondering what kind of vocalization you were using that elicited the ooh's and ahh's? I.e., were these things like sung arpeggios and scales, or vocalizations with discrete words / lyrics?
-
Slow, drawn-out singing, which made it mostly vowels.
-
https://github.com/gianmarcohutter/ddsp_gm2
-
So I have a dream of "Karaoke on Steroids" where any amateur singer can sound like a professional using real-time timbre transfer and perhaps some autotune.
-
Just saw that people were talking about my work :) I did try to integrate conditioning on phonemes to better deal with singing voices. The results are intelligible, but not yet satisfactory from a musical point of view, and not really recognizable as the person the model was trained on. I think the approach of using phonemes might be right, but the phoneme detection I used (CMU Sphinx) has a quite high error rate. Unfortunately there aren't many Python-compatible speech recognition libraries out there that expose the phonemes, so I tried to make do with what I found.
With the help of the Z-encoder (and not using the reverb module), the network already does a great job of resynthesizing a voice from the training set (when you let it overfit), which means the harmonic + noise synthesis pipeline is quite capable of synthesizing voices if the encoding is done in a better way. If someone is working on this for research (just making some really good samples instead of making it usable for everyday use), then I recommend looking at backtranslation as done in "Unsupervised Singing Voice Conversion" by E. Nachmani and L. Wolf.
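(To illustrate what "conditioning on phonemes" can look like structurally, here is a hypothetical sketch, not the actual code from the ddsp_gm2 fork: per-frame phoneme IDs from a recognizer are one-hot encoded and stacked next to the usual f0 / loudness conditioning before going into the decoder. The inventory size and frame count are made-up placeholders.)

```python
import numpy as np

N_PHONEMES = 40    # size of the phoneme inventory (placeholder)
TIME_STEPS = 1000  # conditioning frames per example, matching f0 / loudness

def phoneme_one_hot(phoneme_ids, n_phonemes=N_PHONEMES):
    """phoneme_ids: int array of shape [time_steps] from a phoneme recognizer."""
    return np.eye(n_phonemes, dtype=np.float32)[phoneme_ids]

# Dummy inputs just to show the shapes involved.
f0_hz = np.zeros([TIME_STEPS, 1], dtype=np.float32)
loudness_db = np.zeros([TIME_STEPS, 1], dtype=np.float32)
phoneme_ids = np.zeros([TIME_STEPS], dtype=np.int64)

# Decoder conditioning: 2 continuous features + a 40-dim one-hot per frame.
conditioning = np.concatenate(
    [f0_hz, loudness_db, phoneme_one_hot(phoneme_ids)], axis=-1)
print(conditioning.shape)  # (1000, 42)
```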
-
The dictionary in CMU Sphinx is set to English to take into account the commonness of each phoneme, but in practice it does not make much of a difference which language it is (since the error rate is quite high anyway, as mentioned above).
-
Even though the DDSP approach is much, much faster than WaveNet-based approaches, it will be quite hard (if not impossible) to get it to do live performances. I would love to see this happen, but it's a long way to get there. The autotune, on the other hand, is already built in, since the pitch is directly encoded :)
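(Since resynthesis is driven by an explicit f0 curve, that "built-in autotune" can be illustrated by simply snapping the curve to the nearest equal-tempered semitone before decoding. A sketch with made-up numbers; a real implementation would want to smooth note transitions and preserve vibrato.)

```python
import numpy as np

def snap_to_semitones(f0_hz):
    """Quantize an f0 curve (Hz) to the nearest equal-tempered semitone."""
    f0_hz = np.asarray(f0_hz, dtype=np.float64)
    midi = 69.0 + 12.0 * np.log2(np.maximum(f0_hz, 1e-5) / 440.0)  # Hz -> MIDI
    midi_q = np.round(midi)                                         # nearest semitone
    snapped = 440.0 * 2.0 ** ((midi_q - 69.0) / 12.0)               # MIDI -> Hz
    return np.where(f0_hz > 0.0, snapped, 0.0)                      # keep unvoiced frames at 0

print(snap_to_semitones([438.0, 450.0, 261.0]))  # ~[440.0, 440.0, 261.63]
```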
-
Hi, I'm going to move this thread over to Discussions, but just to let you know, we just got an early prototype of a DDSP VST working and should release a first version in early 2021.
-
Wanted to drop a comment here asking if anyone knows the latest on voice-based timbre transfer? I have been playing with the DDSP VST by Magenta, but it still seems to work best with instruments. Is https://github.com/gianmarcohutter/ddsp_gm2 still the best place to start when it comes to timbre transfer on voices?
-
Thanks for the awesome work here. Wondering if you have experimented with doing timbre transfer with a singing voice as the target rather than an instrument? If yes, any general guidance you could offer re: parameterization or other? I've made an attempt with similar parameters from the single instrument model, though the sample rate is significantly higher (44100 vs. 16k) and the preprocessor's time_steps parameter is bumped up to 1200 to account for upsampling to a greater number of samples.
Comparing original audio to resynthesized audio from the training set, the relative pitches are all quite accurate, but the resynthesized audio takes on a very grainy, highly "artificial" quality. Here's an example --
audio_gen.wav.zip <https://github.com/magenta/ddsp/files/5550348/audio_gen.wav.zip>
Thanks.
Luke
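(For anyone repeating this at a different sample rate, the time_steps choice is mostly frame-rate bookkeeping. A sketch below; the 16 kHz numbers and the 4-second example length are assumed defaults from the timbre-transfer setup, not the exact values used above, and the 44.1 kHz line only illustrates why 1200 gives an integer hop where 1000 would not.)

```python
example_secs = 4.0  # assumed training-example length

# Default 16 kHz setup.
n_samples_16k = int(16000 * example_secs)     # 64000 audio samples
time_steps_16k = 1000                         # conditioning frames per example
hop_16k = n_samples_16k // time_steps_16k     # 64 samples/frame -> 250 Hz frame rate

# Same clip length at 44.1 kHz.
n_samples_44k = int(44100 * example_secs)     # 176400 audio samples
time_steps_44k = 1200                         # 1000 would not divide 176400 evenly
hop_44k = n_samples_44k // time_steps_44k     # 147 samples/frame -> 300 Hz frame rate

# Keeping n_samples an exact multiple of time_steps keeps the control-to-audio
# upsampling simple (assumption about how the preprocessor expects its inputs).
assert n_samples_44k == time_steps_44k * hop_44k
```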