Replies: 13 comments
-
Hi, I've experimented very briefly with this and tended to get better
results than what you've shown so far ("oooh"s and "ahhh"s), but that was at
16 kHz, which might help it not focus too much on high frequencies. If
you're doing 44100, you'll also want to use much larger FFT sizes for the
loss functions to get the frequency resolution. In general, voice conversion
models also condition on phonemes as well as pitch and loudness to
control what is being said, but the system is not currently set up or
optimized for voice.
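(For concreteness, a rough sketch of the FFT-size point, assuming the multi-scale spectral loss exposed as ddsp.losses.SpectralLoss; the 44.1 kHz sizes below are only a guess at what "much larger" might mean, not a tested config. FFT bin spacing is sample_rate / fft_size, so the same window gets ~2.75x coarser in Hz when you move from 16 kHz to 44.1 kHz.)

```python
import ddsp

# 16 kHz defaults: the largest window gives 16000 / 2048 ~= 7.8 Hz per bin.
loss_16k = ddsp.losses.SpectralLoss(fft_sizes=(2048, 1024, 512, 256, 128, 64))

# At 44.1 kHz the same windows give ~2.75x coarser bins (44100 / 2048 ~= 21.5 Hz),
# so bump every size up (here 4x, the next power of two past 2.75x) to get
# 44100 / 8192 ~= 5.4 Hz for the largest window.
loss_44k = ddsp.losses.SpectralLoss(fft_sizes=(8192, 4096, 2048, 1024, 512, 256))
```

In the gin-based training configs this would presumably just be a larger fft_sizes binding on the spectral loss.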
-
This fork has an adaptation of DDSP to singing voices: https://github.com/gianmarcohutter/ddsp_gm2. It uses conditioning on phonemes.
-
Thanks, @jesseengel. 16k is giving good initial results for resynthesis. May revisit this thread w/ questions and/or results as I attempt to scale up fidelity. Also, appreciate the link @voodoohop. Going to check that out.
-
Hi @jesseengel -- wondering what kind of vocalization you were using that elicited the ooh's and ahh's? I.e., were these things like sung arpeggios and scales, or vocalizations with discrete words / lyrics?
-
Slow, drawn-out singing, which made it mostly vowels.
-
https://github.com/gianmarcohutter/ddsp_gm2
-
So I have a dream of "Karaoke on Steroids" where any amateur singer can sound like a professional using real-time timbre transfer and perhaps some autotune.
-
Just saw that people were talking about my work :) I did try to integrate conditioning on phonemes to better deal with singing voices. The results are intelligible, but not yet satisfactory from a musical point of view, and not really recognizable as the person the model was trained on. I think the approach of using phonemes might be right, but the phoneme detection I used (CMU Sphinx) has a quite high error rate. Unfortunately there aren't many Python-compatible speech recognition libraries out there that expose the phonemes, so I tried to make do with what I found.
With the help of the Z-encoder (and not using the reverb module), the network already does a great job of resynthesizing a voice from the training set (when you let it overfit), which means the harmonic + noise synthesis pipeline is quite capable of synthesizing voices if the encoding is done in a better way. If someone is working on this for research (just making some really good samples instead of making it usable for everyday use), then I recommend looking at backtranslation as done in "Unsupervised Singing Voice Conversion" by E. Nachmani and L. Wolf.
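(To illustrate what "conditioning on phonemes" can look like structurally, here is a hypothetical sketch, not the actual code from the ddsp_gm2 fork: per-frame phoneme IDs from a recognizer are one-hot encoded and stacked next to the usual f0 / loudness conditioning before going into the decoder. The inventory size and frame count are made-up placeholders.)

```python
import numpy as np

N_PHONEMES = 40    # size of the phoneme inventory (placeholder)
TIME_STEPS = 1000  # conditioning frames per example, matching f0 / loudness

def phoneme_one_hot(phoneme_ids, n_phonemes=N_PHONEMES):
    """phoneme_ids: int array of shape [time_steps] from a phoneme recognizer."""
    return np.eye(n_phonemes, dtype=np.float32)[phoneme_ids]

# Dummy inputs just to show the shapes involved.
f0_hz = np.zeros([TIME_STEPS, 1], dtype=np.float32)
loudness_db = np.zeros([TIME_STEPS, 1], dtype=np.float32)
phoneme_ids = np.zeros([TIME_STEPS], dtype=np.int64)

# Decoder conditioning: 2 continuous features + a 40-dim one-hot per frame.
conditioning = np.concatenate(
    [f0_hz, loudness_db, phoneme_one_hot(phoneme_ids)], axis=-1)
print(conditioning.shape)  # (1000, 42)
```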
-
The dictionary in CMU Sphinx is set to English to take into account the commonness of each phoneme, but in practice it does not make much of a difference which language it is (since the error rate is quite high anyway, as mentioned above).
-
Even though the DDSP approach is much, much faster than WaveNet-based approaches, it will be quite hard (if not impossible) to get it to do live performances. I would love to see this happen, but it's a long way to get there. The autotune, on the other hand, is already built in, since the pitch is directly encoded :)
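(Since resynthesis is driven by an explicit f0 curve, that "built-in autotune" can be illustrated by simply snapping the curve to the nearest equal-tempered semitone before decoding. A sketch with made-up numbers; a real implementation would want to smooth note transitions and preserve vibrato.)

```python
import numpy as np

def snap_to_semitones(f0_hz):
    """Quantize an f0 curve (Hz) to the nearest equal-tempered semitone."""
    f0_hz = np.asarray(f0_hz, dtype=np.float64)
    midi = 69.0 + 12.0 * np.log2(np.maximum(f0_hz, 1e-5) / 440.0)  # Hz -> MIDI
    midi_q = np.round(midi)                                         # nearest semitone
    snapped = 440.0 * 2.0 ** ((midi_q - 69.0) / 12.0)               # MIDI -> Hz
    return np.where(f0_hz > 0.0, snapped, 0.0)                      # keep unvoiced frames at 0

print(snap_to_semitones([438.0, 450.0, 261.0]))  # ~[440.0, 440.0, 261.63]
```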
-
Hi, I'm going to move this thread over to Discussions, but just to let you know, we just got an early prototype of a DDSP VST working and should release a first version in early 2021.
-
Wanted to drop a comment here asking if anyone knows the latest on voice-based timbre transfer? I have been playing with the DDSP VST by Magenta, but it still seems to work best with instruments. Is https://github.com/gianmarcohutter/ddsp_gm2 still the best place to start when it comes to timbre transfer on voices?
-
Thanks for the awesome work here. Wondering if you have experimented with doing timbre transfer with a singing voice as the target rather than an instrument? If yes, any general guidance you could offer re: parameterization or other? I've made an attempt with similar parameters from the single instrument model, though the sample rate is significantly higher (44100 vs. 16k) and the preprocessor's time_steps parameter is bumped up to 1200 to account for upsampling to a greater number of samples.
Comparing original audio to resynthesized audio from the training set, the relative pitches are all quite accurate, but the resynthesized audio takes on a very grainy, highly "artificial" quality. Here's an example --
audio_gen.wav.zip <https://github.com/magenta/ddsp/files/5550348/audio_gen.wav.zip>
Thanks.
Luke
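(For anyone repeating this at a different sample rate, the time_steps choice is mostly frame-rate bookkeeping. A sketch below; the 16 kHz numbers and the 4-second example length are assumed defaults from the timbre-transfer setup, not the exact values used above, and the 44.1 kHz line only illustrates why 1200 gives an integer hop where 1000 would not.)

```python
example_secs = 4.0  # assumed training-example length

# Default 16 kHz setup.
n_samples_16k = int(16000 * example_secs)     # 64000 audio samples
time_steps_16k = 1000                         # conditioning frames per example
hop_16k = n_samples_16k // time_steps_16k     # 64 samples/frame -> 250 Hz frame rate

# Same clip length at 44.1 kHz.
n_samples_44k = int(44100 * example_secs)     # 176400 audio samples
time_steps_44k = 1200                         # 1000 would not divide 176400 evenly
hop_44k = n_samples_44k // time_steps_44k     # 147 samples/frame -> 300 Hz frame rate

# Keeping n_samples an exact multiple of time_steps keeps the control-to-audio
# upsampling simple (assumption about how the preprocessor expects its inputs).
assert n_samples_44k == time_steps_44k * hop_44k
```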