RVC (Retrieval‐based Voice Conversion)
RVC enhances TTS by replicating voice characteristics for characters or narrators, adding depth to synthesized speech. It functions as a TTS-to-TTS pipeline and can be used with any TTS engine/model. For optimal performance, it's recommended to use a voice cloning TTS engine like Coqui XTTS with voice samples.
You will first need to select Enable RVC in the Global Settings > RVC Settings tab and click the Update RVC Settings button. AllTalk will then create the necessary folders and download any missing model files required for RVC to work.
- Store voice models in the /models/rvc_voices/{subfolder} directory, each in its own individual subfolder. The rvc_voices folder is created when RVC is enabled in the Gradio interface.
- A voice model typically includes a PTH file and potentially an index file.
- If an index file is present, AllTalk will automatically select and use it.
- If multiple index files are found, none will be used, and a message will be output to the console.
- You can find 100,000+ pre-generated RVC voice models on sites like voice-models.com and Hugging Face.
- AllTalk does not currently support creating RVC voice models. This is on the TODO list (it should be in the next release).
📁 models
└── 📁 rvc_voices
    ├── 📁 voice_model_1
    │   ├── model.pth
    │   └── index.json
    └── 📁 voice_model_2
        ├── model.pth
        └── index.json
The index file helps improve the quality of the generated audio by providing a reference during the conversion process. The FAISS index enables faster and more accurate retrieval of voice characteristics, leading to more natural and high-quality voice synthesis.
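The index-file selection rule described above (use the index if exactly one is present, use none and print a message if several are found) can be sketched as follows. This is a minimal illustration, not AllTalk's actual code: the function name `select_index_file` and the `*index*` filename pattern are assumptions made for the example.

```python
from pathlib import Path
from typing import Optional


def select_index_file(voice_dir: Path, pattern: str = "*index*") -> Optional[Path]:
    """Return the single index file in voice_dir, or None.

    Hypothetical helper: the glob pattern is an assumption, chosen so that
    it matches names such as "index.json" or "added.index".
    """
    candidates = sorted(
        p for p in voice_dir.glob(pattern)
        if p.is_file() and p.suffix != ".pth"  # never mistake the model file for an index
    )
    if len(candidates) == 1:
        return candidates[0]
    if len(candidates) > 1:
        # Mirrors the documented behaviour: with several index files, none is used.
        print(f"Multiple index files found in {voice_dir}; none will be used.")
    return None
```

Keeping a single index file per voice subfolder avoids the ambiguous case entirely.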
AllTalk implements LRU (Least Recently Used) caching for RVC models and embedders to optimize performance. The system caches up to 3 voice models in memory, automatically removing the least recently used model when loading a new one. This means:
- Frequently used voice models stay in memory, reducing load times
- When a fourth model is loaded, the least recently used model is unloaded
- Embedder models (hubert/contentvec) are cached separately
This caching system helps balance memory usage with performance, particularly beneficial when using the same voices repeatedly in a session.
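To make the eviction behaviour concrete, here is a minimal LRU cache sketch with a capacity of three. It is not AllTalk's implementation: the class name `ModelCache` and the `loader` callback are hypothetical, and a real cache would hold loaded model objects rather than strings.

```python
from collections import OrderedDict
from typing import Any, Callable, List


class ModelCache:
    """Tiny LRU cache: keeps at most `capacity` models in memory."""

    def __init__(self, loader: Callable[[str], Any], capacity: int = 3) -> None:
        self._loader = loader          # hypothetical function that loads a model by name
        self._capacity = capacity
        self._models: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, name: str) -> Any:
        if name in self._models:
            # Cache hit: mark this model as the most recently used.
            self._models.move_to_end(name)
            return self._models[name]
        # Cache miss: load the model, then evict the least recently used if over capacity.
        model = self._loader(name)
        self._models[name] = model
        if len(self._models) > self._capacity:
            self._models.popitem(last=False)
        return model

    def cached_names(self) -> List[str]:
        """Names currently cached, ordered from least to most recently used."""
        return list(self._models)


cache = ModelCache(loader=lambda name: f"<model {name}>")
for voice in ["a", "b", "c", "a", "d"]:
    cache.get(voice)
print(cache.cached_names())  # ['c', 'a', 'd']
```

In this run, voice "b" is evicted when "d" is loaded, because "a" was used again after "b" and so counts as more recent.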
- Selects the voice model used for character conversion.
- If "Disabled" is selected, RVC will not be applied to character voices.
- This option is used only if RVC is enabled and no other voice is specified in the API request.
- Selects the voice model used for narrator conversion.
- If "Disabled" is selected, RVC will not be applied to the narrator voice.
- This option is used only if RVC is enabled and no other voice is specified in the API request.
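The precedence between a voice named in an API request and the default voice model can be sketched as below. This is an illustration under assumptions only: the function `resolve_rvc_voice` and its parameter names are hypothetical and are not part of the AllTalk API, and the sketch assumes that a globally disabled RVC means no conversion at all.

```python
from typing import Optional

DISABLED = "Disabled"


def resolve_rvc_voice(
    rvc_enabled: bool,
    request_voice: Optional[str],
    default_voice: str,
) -> Optional[str]:
    """Return the voice model to apply, or None if RVC should be skipped."""
    if not rvc_enabled:
        return None  # assumption: nothing is converted when RVC is off
    # A voice named in the request takes priority over the configured default.
    chosen = request_voice if request_voice else default_voice
    return None if chosen == DISABLED else chosen
```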
- Sets the influence exerted by the index file on the final output.
- A higher value increases the impact of the index, potentially enhancing detail but also increasing the risk of artifacts.
- Sets the pitch of the audio output.
- Increasing the value raises the pitch, while decreasing the value lowers it.
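As a rough guide to what a given pitch value does, the sketch below assumes the setting is expressed in semitones, which is the usual convention in RVC tools but is not stated on this page. The function names are illustrative only.

```python
def pitch_ratio(semitones: float) -> float:
    """Frequency multiplier for a shift of `semitones` (12 semitones = one octave)."""
    return 2.0 ** (semitones / 12.0)


def shift_frequency(f0_hz: float, semitones: float) -> float:
    """Apply a semitone shift to a fundamental frequency in Hz."""
    return f0_hz * pitch_ratio(semitones)
```

Under that assumption, a value of 12 doubles the fundamental frequency (one octave up), and -12 halves it.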
- Substitutes or blends with the volume envelope of the output.
- A ratio closer to 1 means the output envelope is more heavily employed.
- Prevents artifacts in voiceless consonants and breath sounds.
- Higher values (up to 0.5) provide stronger protection but might affect indexing.
- Enables or disables auto-tune for the generated audio.
- Recommended for singing conversions to ensure the output remains in tune.
- If the value is three or greater, median filtering is applied to the extracted pitch results, which can reduce breath noise.
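The effect of median filtering on a pitch track can be illustrated with a small pure-Python sketch. The fixed kernel size of 3 and the edge-padding strategy are assumptions made for this example, and the function names are hypothetical. The sketch shows the technique, not AllTalk's exact code.

```python
from statistics import median
from typing import List, Sequence


def median_filter(values: Sequence[float], kernel_size: int = 3) -> List[float]:
    """Sliding-window median with an odd kernel; edges are padded with the end values."""
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd number")
    if not values:
        return []
    half = kernel_size // 2
    padded = [values[0]] * half + list(values) + [values[-1]] * half
    return [median(padded[i:i + kernel_size]) for i in range(len(values))]


def smooth_pitch(f0: Sequence[float], filter_radius: int) -> List[float]:
    """Apply median filtering only when the setting is three or greater."""
    if filter_radius >= 3:
        return median_filter(f0, kernel_size=3)  # assumed fixed kernel size
    return list(f0)
```

A single-frame spike, such as a breath misread as a high pitch, is replaced by the median of its neighbours, while a steady pitch contour passes through unchanged.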
- Determines the number of training data points used to train the FAISS index.
- Increasing the size may improve the quality of the output but can also increase computation time.
- Different index files have different sizes. This setting limits the maximum amount of the index used.
- Select between different models for learning speaker embedding.
- Options:
  - hubert: Focuses on capturing phonetic and linguistic content.
  - contentvec: Captures more detailed voice characteristics and nuances.
- Splits the audio into chunks for inference to obtain better results in some cases.
- Can improve the quality of conversion, especially for longer audio inputs.
- Choose the algorithm used for extracting the pitch (F0) during audio conversion.
- Options include:
  - crepe: High accuracy, robust against noise.
  - crepe-tiny: Smaller, faster version of crepe with slightly reduced accuracy.
  - dio: Fast, less accurate, suitable for real-time applications.
  - fcpe: Focuses on precise pitch extraction.
  - harvest: Produces smooth and natural pitch contours.
  - hybrid[rmvpe+fcpe]: Combines strengths of rmvpe and fcpe.
  - pm: Robust algorithm with a balance of speed and accuracy.
  - rmvpe: Recommended for most cases, especially in TTS applications.