This is the official code implementation of FreeSVC [ICASSP 2025]
FreeSVC is a multilingual singing voice conversion model that converts singing voices across different languages. It leverages an enhanced VITS model integrated with Speaker-invariant Clustering (SPIN) and the ECAPA2 speaker encoder to effectively separate speaker characteristics from linguistic content. Designed for zero-shot learning, FreeSVC supports cross-lingual voice conversion without the need for extensive language-specific training.
- Multilingual Support: Incorporates trainable language embeddings, enabling effective handling of multiple languages without extensive language-specific training.
- Advanced Speaker Encoding: Utilizes the state-of-the-art ECAPA2 speaker encoder to disentangle speaker characteristics from linguistic content, ensuring high-quality voice conversion.
- Zero-Shot Learning: Allows cross-lingual singing voice conversion even with unseen speakers, enhancing versatility and applicability.
- Enhanced VITS Model with SPIN: Improves content representation for more accurate and natural voice conversion.
- Optimized Cross-Language Conversion: Demonstrates the importance of a multilingual content extractor for achieving optimal performance in cross-language voice conversion tasks.
FreeSVC builds upon the VITS architecture, integrating several key components:
- Content Extractor: Utilizes SPIN, an enhanced version of ContentVec based on HuBERT, to extract linguistic content while separating speaker timbre.
- Speaker Encoder: Employs ECAPA2 to capture unique speaker characteristics, ensuring accurate disentanglement from linguistic content.
- Pitch Extractor: Uses RMVPE to robustly extract vocal pitches from polyphonic music, preserving the original melody.
- Language Embeddings: Incorporates trainable language embeddings to condition the model for multilingual training and conversion.
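
Putting these components together, the conversion flow looks roughly like the sketch below. This is a hypothetical illustration only: the function and argument names are placeholders and do not correspond to the repository's actual API.

```python
# Hypothetical sketch of the FreeSVC conversion data flow described above.
# All names (content_extractor, speaker_encoder, ...) are illustrative placeholders.

def convert(source_wav, target_reference_wav, target_lang_id,
            content_extractor, pitch_extractor, speaker_encoder,
            language_embedding, decoder):
    """Convert a source singing voice to the target speaker's timbre."""
    content = content_extractor(source_wav)          # SPIN features: linguistic content, timbre removed
    f0 = pitch_extractor(source_wav)                 # RMVPE pitch contour, preserves the original melody
    speaker = speaker_encoder(target_reference_wav)  # ECAPA2 embedding of the target speaker
    lang = language_embedding(target_lang_id)        # trainable embedding of the target language
    return decoder(content, f0, speaker, lang)       # VITS-based decoder synthesizes the converted audio
```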
Figure 1: Comprehensive diagram of the FreeSVC model illustrating the training and inference procedures.
FreeSVC is trained on a diverse set of speech and singing datasets covering multiple languages:
Dataset | Hours | Speakers | Language | Type |
---|---|---|---|---|
AISHELL-1 | 170h | 214 F, 186 M | Chinese | Speech |
AISHELL-3 | 85h | 176 F, 42 M | Chinese | Speech |
CML-TTS | 3.1k | 231 F, 194 M | 7 Languages | Speech |
HiFiTTS | 292h | 6 F, 4 M | English | Speech |
JVS | 30h | 51 F, 49 M | Japanese | Speech |
LibriTTS-R | 585h | 2,456 | English | Speech |
NUS (NHSS) | 7h | 5 F, 5 M | English | Both |
OpenSinger | 50h | 41 F, 25 M | Chinese | Singing |
Opencpop | 5h | 1 F | Chinese | Singing |
PopBuTFy | 10h, 40h | 12, 22 | Chinese, English | Singing |
POPCS | 5h | 1 F | Chinese | Singing |
VCTK | 44h | 109 | English | Speech |
VocalSet | 10h | 11 F, 9 M | Various | Singing |
- Clone the Repository:

  ```bash
  git clone https://github.com/freds0/free-svc.git
  cd free-svc
  ```
- Create a Docker Image: Build the Docker image using the provided `Dockerfile`:

  ```bash
  docker build -t freesvc .
  ```
- Run the Docker Container: Start the container and mount the current directory:

  ```bash
  docker run -it --rm -v "$(pwd)":/workspace freesvc
  ```
- Prepare the Dataset: Execute the dataset preparation script, replacing `{name}` with the appropriate dataset identifier:

  ```bash
  bash prepare_{name}_dataset.sh
  ```
- Download Required Models (an optional verification sketch follows the setup steps below):
  - WavLM Large Model: download it from the WavLM GitHub repository and place it in `models/wavlm/`.
  - HifiGAN Model: download it from the HifiGAN GitHub repository and place it in `models/hifigan/`.
- Install Dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Train the Model: Run the training script with the appropriate configuration, replacing `{dataset_dir}` with the path to your dataset directory:

  ```bash
  python train.py --config-dir configs --config-name sovits-online_hubert data.dataset_dir={dataset_dir}
  ```
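Before training, it can help to confirm that the models from the Download Required Models step ended up where the setup expects them. The snippet below is an optional convenience sketch, not part of the repository; it only assumes the `models/wavlm/` and `models/hifigan/` directories mentioned above.

```python
from pathlib import Path

# Optional sanity check (not part of the repository): confirm that the downloaded
# WavLM and HifiGAN models were placed in the directories expected by the setup steps.
for model_dir in (Path("models/wavlm"), Path("models/hifigan")):
    files = [p for p in model_dir.iterdir() if p.is_file()] if model_dir.is_dir() else []
    status = f"{len(files)} file(s) found" if files else "missing or empty -- download the model before training"
    print(f"{model_dir}: {status}")
```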
This section explains how to use the FreeSVC model for audio conversion.
```bash
python scripts/inference.py --hpfile path/to/config.yaml \
    --ptfile path/to/checkpoint.pth \
    --input-base-dir path/to/input/directory \
    --metadata-path path/to/metadata.csv \
    --spk-emb-base-dir path/to/speaker/embeddings \
    --out-dir path/to/output_directory \
    [--use-vad] \
    [--use-timestamp] \
    [--concat-audio] \
    [--pitch-factor PITCH_FACTOR]
```
Parameters:
- `--hpfile`: Path to the configuration YAML file
- `--ptfile`: Path to the model checkpoint file
- `--input-base-dir`: Base directory containing source audio files
- `--metadata-path`: Path to the CSV metadata file
- `--spk-emb-base-dir`: Directory containing speaker embeddings
- `--out-dir`: Output directory for converted audio (default: `gen-samples/`)
Optional Parameters:
- `--pitch-predictor`: Pitch predictor model type (default: `rmvpe`)
- `--use-vad`: Enable Voice Activity Detection for better segmentation
- `--use-timestamp`: Add timestamps to output filenames
- `--concat-audio`: Concatenate all converted segments into a single file
- `--pitch-factor`: Pitch modification factor (default: 0.9544)
- `--ignore-metadata-header`: Skip the first row of the metadata CSV (default: True)
Metadata CSV Format:

```csv
source_path|source_lang|source_speaker|target_speaker|target_lang
./audio/source1.wav|en|speaker1|speaker2|ja
./audio/source2.wav|zh|speaker3|speaker4|en
```
Required columns:
- `source_path`: Path to the source audio file (relative to `--input-base-dir`)
- `source_lang`: Source language code (e.g., `en`, `ja`, `zh`)
- `source_speaker`: Source speaker identifier
- `target_speaker`: Target speaker identifier
- `target_lang`: Target language code
- `transcript`: (Optional) Text transcript of the audio
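
The pipe-separated metadata file can also be generated programmatically. Below is a minimal sketch using only the columns documented above; the audio paths and speaker names are hypothetical examples.

```python
import csv

# Minimal sketch: write a pipe-separated metadata file in the format documented above.
# The rows below use hypothetical paths and speaker names.
rows = [
    ("./audio/source1.wav", "en", "speaker1", "speaker2", "ja"),
    ("./audio/source2.wav", "zh", "speaker3", "speaker4", "en"),
]

with open("metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="|")
    # Header row (skipped at inference time when --ignore-metadata-header is True).
    writer.writerow(["source_path", "source_lang", "source_speaker",
                     "target_speaker", "target_lang"])
    writer.writerows(rows)
```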
Output Directory Structure: The converted audio files are organized in the following structure:

```text
output_dir/
└── metadata_name/
    └── source_lang/
        └── target_lang/
            └── source_speaker/
                └── target_speaker/
                    └── converted_audio.wav
```
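
Given that layout, all converted files from a run can be collected with a simple glob. A minimal sketch, assuming the default output directory and the structure shown above:

```python
from pathlib import Path

# Minimal sketch: walk the documented layout
# output_dir/metadata_name/source_lang/target_lang/source_speaker/target_speaker/*.wav
out_dir = Path("gen-samples")  # default --out-dir
for wav in sorted(out_dir.glob("*/*/*/*/*/*.wav")):
    metadata_name, src_lang, tgt_lang, src_spk, tgt_spk = wav.relative_to(out_dir).parts[:5]
    print(f"{src_spk} ({src_lang}) -> {tgt_spk} ({tgt_lang}): {wav.name}")
```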
- Voice Activity Detection (VAD): When VAD is enabled with the `--use-vad` flag, the system performs speech segmentation on the input audio, automatically detecting and isolating speech segments for processing while keeping the non-speech portions intact. Each detected speech segment is converted independently, and the full audio is then reconstructed by concatenating all segments in their original order, which preserves the natural rhythm and timing of the original recording.
- Pitch Adjustment: The `--pitch-factor` parameter gives precise control over pitch. The factor acts as a multiplier on the output pitch, so it can be fine-tuned to achieve the desired pitch characteristics in the converted audio (a small conversion sketch follows these notes).
- Audio Concatenation: The `--concat-audio` option combines multiple conversions into a single audio file. When enabled, all converted segments are merged into one continuous file saved as `all.wav` in the output directory. This is useful when processing several short segments that belong together or when compiling converted audio.
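
Because the pitch factor is described as a multiplier on the output pitch, a musical shift in semitones maps to a factor through the usual equal-temperament relation. The helper below is a small sketch under that assumption; the repository may handle pitch scaling differently internally.

```python
import math

def semitones_to_pitch_factor(semitones: float) -> float:
    """Convert a desired shift in semitones to a multiplicative pitch factor."""
    return 2.0 ** (semitones / 12.0)

def pitch_factor_to_semitones(factor: float) -> float:
    """Inverse mapping: how many semitones a given pitch factor corresponds to."""
    return 12.0 * math.log2(factor)

# Assuming the factor multiplies the extracted F0 directly, the default
# --pitch-factor of 0.9544 corresponds to roughly a -0.8 semitone shift.
print(round(pitch_factor_to_semitones(0.9544), 2))  # ≈ -0.81
print(round(semitones_to_pitch_factor(-2.0), 4))    # ≈ 0.8909 (two semitones down)
```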
The pretrained weights for FreeSVC are available on the Hugging Face Model Hub at alefiury/free-svc.
To use the pretrained models, download the model files from that repository to your local machine.
This project is licensed under the MIT License.
```bibtex
@misc{ferreira2025freesvczeroshotmultilingualsinging,
  title={FreeSVC: Towards Zero-shot Multilingual Singing Voice Conversion},
  author={Alef Iury Siqueira Ferreira and Lucas Rafael Gris and Augusto Seben da Rosa and Frederico Santos de Oliveira and Edresson Casanova and Rafael Teixeira Sousa and Arnaldo Candido Junior and Anderson da Silva Soares and Arlindo Galvão Filho},
  year={2025},
  eprint={2501.05586},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2501.05586},
}
```