guan-yuan / Awesome-Singing-Voice-Synthesis-and-Singing-Voice-Conversion Public

Notifications You must be signed in to change notification settings
Fork 29
Star 408

A paper and project list about the cutting edge Speech Synthesis, Text-to-Speech (TTS), Singing Voice Synthesis (SVS), Voice Conversion (VC), Singing Voice Conversion (SVC), and related interesting works (such as Music Synthesis, Automatic Music Transcription, Automatic MOS Prediction, SSL-based ASR...etc).

408 stars 29 forks Branches Tags Activity

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 184 Commits
README.md		README.md

Repository files navigation

Awesome Singing Voice Synthesis and Singing Voice Conversion

A paper and project list about the cutting edge Speech Synthesis, Text-to-Speech (TTS), Singing Voice Synthesis (SVS), Voice Conversion (VC), Singing Voice Conversion (SVC), and related interesting works (such as Music Synthesis, Automatic Music Transcription, Automatic MOS Prediction, SSL-based ASR, ...etc).

Welcome to PR or contact me via email (guanyuan@gapp.nthu.edu.tw) for updating papers and works.

Paper List

Journals

IEEE/ACM TASLP, IEEE JSTSP, JSLHR, IEEE TPAMI

Conferences

NeuraIPS, ICLR, ICML, IJAI, AAAI, ACL, NAACL, EMNLP, ISMIR, ACM MM, ICASSP, INTERSPEECH, ICME

Workshops

ASRU, SLT

Singing Voice Conversion (Other Key Words: SVC, Singing Style Transfer)

[2022]

Learn2Sing 2.0: Diffusion and Mutual Information-Based Target Speaker SVS by Learning from Singing Teacher | INTERSPEECH 2022 | ✔️Code | 🎧Demo
A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion | INTERSPEECH 2022 | 🎧Demo
Improving Adversarial Waveform Generation based Singing Voice Conversion with Harmonic Signals | ICASSP 2022 | 🎧Demo

[2021]

DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion | ASRU 2021 | 🎧Demo
Controllable and Interpretable Singing Voice Decomposition via Assem-VC | NeurIPS 2021 Workshop | 🎧Demo
Towards High-fidelity Singing Voice Conversion with Acoustic Reference and Contrastive Predictive Coding | 2021/10 | 🎧Demo
FastSVC: Fast Cross-Domain Singing Voice Conversion with Feature-wise Linear Modulation | ICME 2021 | 🎧Demo
Unsupervised WaveNet-based Singing Voice Conversion Using Pitch Augmentation and Two-phase Approach | 2021/07 | ✔️Code | 🎧Demo

[2020]

Zero-shot Singing Voice Conversion | ISMIR 2020 | 🎧Demo
Phonetic Posteriorgrams based Many-to-Many Singing Voice Conversion via Adversarial Training | 2020/12 | 🎧Demo | Unofficial Code
DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System | INTERSPEECH 2020 | 🎧Demo
Unsupervised Cross-Domain Singing Voice Conversion | INTERSPEECH 2020 | 🎧Demo
PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network | ICASSP 2020 | 🎧Demo
VAW-GAN for Singing Voice Conversion with Non-parallel Training Data | APSIPA 2020 | ✔️Code | 🎧Demo

Dateset

Singing Technique Conversion/Singing Technique Classification

[2022]

Deformable CNN and Imbalance-Aware Feature Learning for Singing Technique Classification | INTERSPEECH 2022

[2021]

Investigating Time-Frequency Representations for Audio Feature Extraction in Singing Technique Classification | APSIPA 2021
Zero-shot Singing Technique Conversion | CMMR 2021

Dateset

VocalSet: A Singing Voice Dataset | ISMIR 2018 | 🔽Apply&Download

Voice Conversion (Other Key Words: VC, Voice Cloning, Voice Style Transfer)

[2022]

Learning Noise-independent Speech Representation for High-quality Voice Conversion for Noisy Target Speakers | INTERSPEECH 2022 | 🎧Demo
Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion | INTERSPEECH 2022 | 🎧Demo
Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme | ICLR 2022 | ✔️Code | 🎧Demo
YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone | ICML 2022 | ✔️Code | 🎧Demo | 🎧Demo | 📝Blog
A Comparative Study of Self-supervised Speech Representation Based Voice Conversion | IEEE JSTSP 2022/07
S3PRL-VC: Open-Source Voice Conversion Framework with Self-Supervised Speech Representations | ICASSP 2022 | ✔️Code
A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion | ICASSP 2022 | ✔️Code | 🎧Demo
Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques | ICASSP 2022 | ✔️Code | 🎧Demo
NVC-Net: End-to-End Adversarial Voice Conversion | ICASSP 2022 | ✔️Code | 🎧Demo
Robust Disentangled Variational Speech Representation Learning for Zero-Shot Voice Conversion | ICASSP 2022 | 🎧Demo
Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features | ICASSP 2022 | 🎧Demo
Toward Degradation-Robust Voice Conversion | ICASSP 2022
DGC-vector: A new speaker embedding for zero-shot voice conversion | ICASSP 2022 | 🎧Demo
End-to-End Zero-Shot Voice Style Transfer with Location-Variable Convolutions | 2022/05 | 🎧Demo

[2021]

On Prosody Modeling for ASR+TTS based Voice Conversion | ASRU 2021 | 🎧Demo
Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations | NeurIPS 2021 | 🎧Demo | Unofficial Code
MediumVC: Any-to-any voice conversion using synthetic specific-speaker speeches as intermedium features | 2021/10 | ✔️Code | 🎧Demo
StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion | INTERSPEECH 2021 Best Paper Award | ✔️Code | 🎧Demo
S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations | INTERSPEECH 2021 | ✔️Code | 🎧Demo
Many-to-Many Voice Conversion based Feature Disentanglement using Variational Autoencoder | INTERSPEECH 2021 | ✔️Code | 🎧Demo
Speech Resynthesis from Discrete Disentangled Self-Supervised Representations | INTERSPEECH 2021 | 🎧Demo
Improving Zero-shot Voice Style Transfer via Disentangled Representation Learning | ICLR 2021
Global Rhythm Style Transfer Without Text Transcriptions | ICML 2021 | ✔️Code
AGAIN-VC: A One-shot Voice Conversion using Activation Guidance and Adaptive Instance Normalization | ICASSP 2021 | ✔️Code | 🎧Demo
Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling | IEEE/ACM TASLP 2021/05 | ✔️Code | 🎧Demo

[2020]

An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning | IEEE/ACM TASLP 2020/11
Unsupervised Speech Decomposition via Triple Information Bottleneck | ICML 2020 | ✔️Code

[2019]

One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization | INTERSPEECH 2019 | ✔️Code
AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss | ICML 2019 | ✔️Code | 🎧Demo

Dateset

Emotional Voice Conversion

[2022]

Disentanglement of Emotional Style and Speaker Identity for Expressive Voice Conversion | INTERSPEECH 2022 | 🎧Demo
Cross-speaker Emotion Transfer Based On Prosody Compensation for End-to-End Speech Synthesis | INTERSPEECH 2022 | 🎧Demo
Emotion Intensity and its Control for Emotional Voice Conversion | IEEE Transactions on Affective Computing 2022/07 | ✔️Code | 🎧Demo
Textless Speech Emotion Conversion using Discrete and Decomposed Representations | 202202 | 🎧Demo

[2021]

Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training | INTERSPEECH 2021 | ✔️Code | 🎧Demo

[2020]

Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion | INTERSPEECH 2020 | ✔️Code | 🎧Demo
Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data | Odyssey 2020 | ✔️Code | 🎧Demo

Dateset

Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset | ICASSP 2021 | 🔽Apply&Download | 🎧Demo

Singing Voice Synthesis (Other Key Words: SVS)

[2022]

Muskits: an End-to-End Music Processing Toolkit for Singing Voice Synthesis | INTERSPEECH 2022 | ✔️Code
SingAug: Data Augmentation for Singing Voice Synthesis with Cycle-consistent Training Strategy | INTERSPEECH 2022 | ✔️Code
WeSinger: Data-augmented Singing Voice Synthesis with Auxiliary Losses | INTERSPEECH 2022 | 🎧Demo
WeSinger 2: Fully Parallel Singing Voice Synthesis via Multi-Singer Conditional Adversarial Training | 2022/08 | 🎧Demo
Deep Learning Approaches in Topics of Singing Information Processing | IEEE/ACM TASLP 2022/07
Learning the Beauty in Songs: Neural Singing Voice Beautifier | ACL 2022 | ✔️Code | 🎧Demo
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism | AAAI 2022 | ✔️Code | 🎧Demo

[2021]

Sinsy: A Deep Neural Network-Based Singing Voice Synthesis System | IEEE/ACM TASLP 2021/08 | ✔️Code

[2020]

HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis | 2020/09 | 🎧Demo | Unofficial Code

Dateset

M4Singer: a Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus | NeurIPS 2022 | 🔽Apply&Download | 🎧Demo
PopCS | AAAI 2022 | 🔽Apply&Download
Opencpop: A High-Quality Open Source Chinese Popular Song Corpus for Singing Voice Synthesis | INTERSPEECH 2022 | 🔽Apply&Download

High-Quality Speech Synthesis (Other Key Words: Text-to-Speech, TTS)

[2022]

ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech | ACM MM 2022 | ✔️Code | 🎧Demo
BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis | ICLR 2022 | ✔️Code | 🎧Demo
FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis | IJCAI 2022 | ✔️Code | 🎧Demo

Vocoder

[2022]

DDSP-based Singing Vocoders: A New Subtractive-based Synthesizer and A Comprehensive Evaluation | ISMIR 2022 | ✔️Code | 🎧Demo
FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis | IJCAI 2022 | ✔️Code | 🎧Demo
BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis | 2022/05 | 🎧Demo

[2021]

Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus | ACM MM 2021 | 🔽Apply&Download | ✔️Code | 🎧Demo
WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis | INTERSPEECH 2021 | 🎧Demo
DiffWave: A Versatile Diffusion Model for Audio Synthesis | ICLR 2021 | ✔️Code | 🎧Demo
WaveGrad: Estimating Gradients for Waveform Generation | ICLR 2021 | 🎧Demo

[2020]

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis | NeurIPS 2020 | ✔️Code | 🎧Demo
Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech | INTERSPEECH 2020 | 🎧Demo
Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram | ICASSP 2020 | 🎧Demo | Unofficial Code

[2019]

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis | NeurIPS 2019 | ✔️Code | 🎧Demo
Towards achieving robust universal neural vocoding | INTERSPEECH 2019 | ✔️Code | 🎧Demo | Unofficial Code

Music Synthesis/Music Synthesis

[2022]

Multi-instrument Music Synthesis with Spectrogram Diffusion | ISMIR 2022 | ✔️Code | 🎧Demo
Musika! Fast Infinite Waveform Music Generation | ISMIR 2022 | ✔️Code | 🎧Demo

Automatic Music Transcription

[2022]

MT3: Multi-Task Multitrack Music Transcription | ICLR 2022 | ✔️Code |

[2021]

Omnizart: A General Toolbox for Automatic Music Transcription | The Open Journal 2021/12 | ✔️Code | 🎧Demo

Self-supervised/Unsupervised ASR

[2022]

UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training | ICASSP 2022 | ✔️Code | ✔️Code
Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition | ICASSP 2022 | ✔️Code | ✔️Code
Pseudo-Labeling for Massively Multilingual Speech Recognition | ICASSP 2022 | ✔️Code | ✔️Code
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing | IEEE JSTSP 2022/06 | ✔️Code | ✔️Code

[2021]

XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale | 2021/12 | ✔️Code | ✔️Code
Simple and Effective Zero-shot Cross-lingual Phoneme Recognition | 2021/09 | ✔️Code | ✔️Code
TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech | IEEE/ACM TASLP 2021/08 | ✔️Code
UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data | ICML 2021 | ✔️Code | ✔️Code | ✔️Code
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units | IEEE/ACM TASLP 2021/06 | ✔️Code | ✔️Code

[2020]

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations | NeurIPS 2020 | ✔️Code | ✔️Code
vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations | ICLR 2020 | ✔️Code | ✔️Code
Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders | ICASSP 2020 | ✔️Code
Unsupervised Cross-lingual Representation Learning for Speech Recognition | 2020/06 | ✔️Code | ✔️Code
fairseq S2T: Fast Speech-to-Text Modeling with fairseq | AACL 2020 | ✔️Code | ✔️Code

[2019]

Representation Learning with Contrastive Predictive Coding | 2019/07 | ✔️Code

Automatic MOS Prediction

[2022]

The VoiceMOS Challenge 2022 | INTERSPEECH 2022

[2021]

Utilizing Self-supervised Representations for MOS Prediction | INTERSPEECH 2021 | ✔️Code

Speech Data Augmentation

[2021]

Data Augmenting Contrastive Learning of Speech Representations in the Time Domain | SLT 2021 | ✔️Code

Speech Insertion

[2022]

RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech Insertion | INTERSPEECH 2022 | 🎧Demo

Speech Enhancement

[2022]

Conditional Diffusion Probabilistic Model for Speech Enhancement | ICASSP 2022 | ✔️Code

[2021]

A Study on Speech Enhancement Based on Diffusion Probabilistic Model | APSIPA 2021

Prosody-Aware

[2022]

Text-Free Prosody-Aware Generative Spoken Language Modeling | ACL 2022 | ✔️Code | 🎧Demo

[2021]

Speech BERT Embedding For Improving Prosody in Neural TTS | ICASSP 2021 | ✔️Code | 🎧Demo

Adversarial Attack

[2021]

Defending Your Voice: Adversarial Attack on Voice Conversion | SLT 2021 | ✔️Code | 🎧Demo

Toolkits

ASR Toolkits

TTS Toolkits

Audio/Music Processing Toolkits

Data Annotation/Alignment/ Toolkits

Other Frameworks and Toolkits

Competitions

References

About

A paper and project list about the cutting edge Speech Synthesis, Text-to-Speech (TTS), Singing Voice Synthesis (SVS), Voice Conversion (VC), Singing Voice Conversion (SVC), and related interesting works (such as Music Synthesis, Automatic Music Transcription, Automatic MOS Prediction, SSL-based ASR...etc).

music text-to-speech speech pytorch tts speech-synthesis music-generation voice-conversion music-synthesis singing-voice automatic-music-transcription singing-synthesis music-transcription diffusion-models singing-voice-synthesis singing-voice-conversion mos-prediction

Report repository

Releases

No releases published

Packages

No packages published