Guide to Integrating New TTS Engines into AllTalk
This guide describes how to integrate a new Text-to-Speech (TTS) engine into the AllTalk framework. The integration process involves creating and modifying several files that work together to provide a consistent interface between AllTalk and the new TTS engine. In this guide, I will use [engine_name] as the placeholder for the TTS engine you are integrating. Use exactly the same capitalization of [engine_name] throughout your code and folder names, as this is important. For example, if [engine_name] = xtts, it must be "xtts" everywhere, not "XTTS" or "Xtts".
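The naming rule above can be checked mechanically. The following is a minimal sketch, assuming the engine folder lives under system/tts_engines/ and the settings page is called [engine_name]_settings_page.py; the function name check_engine_name is hypothetical and not part of AllTalk.

```python
import tempfile
from pathlib import Path


def check_engine_name(engine_name, tts_engines_dir):
    """Return a list of naming problems; an empty list means the names line up."""
    # Compare against the actual directory listing so the check stays
    # case-sensitive even on case-insensitive filesystems (Windows, macOS).
    folder_names = [p.name for p in tts_engines_dir.iterdir() if p.is_dir()]
    if engine_name not in folder_names:
        return [f"no folder named exactly '{engine_name}'"]
    file_names = [p.name for p in (tts_engines_dir / engine_name).iterdir()]
    expected_page = f"{engine_name}_settings_page.py"
    if expected_page not in file_names:
        return [f"missing '{expected_page}'"]
    return []


# Demo on a throwaway folder that mimics system/tts_engines/
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "xtts").mkdir()
(demo_dir / "xtts" / "xtts_settings_page.py").write_text("")

print(check_engine_name("xtts", demo_dir))  # consistent spelling
print(check_engine_name("XTTS", demo_dir))  # capitalisation mismatch is reported
```

Comparing against the directory listing, rather than calling exists(), is a deliberate choice: on Windows a lookup for "XTTS" would succeed even though the folder is called "xtts".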
This guide and the template files may seem overwhelming at first glance; however, they have been designed to be as simple as possible to work with. The guide is long, but it is meant as a reference to return to whenever needed. Additionally, the template files for adding a new engine contain instructions throughout, with indicators showing where you should or shouldn't change code and what that code would need to be.
💡 Tip: I highly suspect you will be able to copy/paste this help guide, the files you need to update & the new TTS engine's GitHub page into ChatGPT or similar and it will be able to help you through the entire process from start to finish.
💡 Tip: If at any time you are uncertain what data a specific function should be returning, you can always check the API guides on the GitHub Wiki or even better, the function from another existing AllTalk TTS engine.
📁 alltalk_tts/
├── 📁 .GitHub/
├── 📁 alltalk_environment/ # AllTalk's Python environment folder
├── 📁 finetune/
├── 📁 models/ # 🚨 TTS Engines model files are stored in here
│ ├── 📁 f5tts/
│ ├── 📁 piper/
│ ├── 📁 rvc_base/
│ ├── 📁 rvc_voices/
│ ├── 📁 xtts/
│ ├── 📁 vits/
│ ├── 📁 [engine_name]/ # 🚨 Your new engine name's model files folder
│ └── etc.../
├── 📁 system/
│ ├── 📁 .....
│ ├── 📁 requirements/ # Requirement files
│ ├── 📁 TGWUI Extension/
│ └── 📁 tts_engines/ # Individual TTS engine's core code
│ ├── 📁 f5tts/
│ ├── 📁 parler/
│ ├── 📁 piper/
│ ├── 📁 rvc/
│ ├── 📁 template-tts-engine/ # 🚨Template code for adding a new TTS engine
│ │ ├── model_engine.py
│ │ ├── model_settings.json
│ │ ├── help_content.py
│ │ ├── [engine_name]_settings_page.py
│ │ └── available_models.json
│ ├── 📁 vits/
│ ├── 📁 xtts/
│ ├── 🗎 tts_engines.json # TTS engine configuration file
│ └── 🗎 new_engines.json # New TTS engine configuration file
├── 📁 voices/ # Audio samples for voice cloning engines are stored in here.
├── 📁 outputs/ # TTS output audio files
├── 🗎 confignew.json
├── 🗎 etc...
├── 🗎 script.py # Main start-up script
└── 🗎 tts_server.py # Engine management script
- You will copy the template-tts-engine folder to a new folder inside tts_engines/[engine_name]
- You will change the [engine_name] of [engine_name]_settings_page to match your new TTS engine name
- You will update:
  - model_engine.py, adding code to find models, voices, generate TTS, handle loading of models etc.
  - model_settings.json to store the settings about that TTS engine
  - available_models.json to store lists of all models or voice models that can be downloaded from the Gradio UI
  - [engine_name]_settings_page & help_content.py to present the TTS engine's UI settings, model or voice model downloader, help sections etc. to the Gradio UI
  - new_engines.json to import the engine on AllTalk's next start-up

💡 Tip: Most of the code and setup inside model_engine.py, [engine_name]_settings_page & help_content.py is pre-built and ready to go.
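The first two steps (copying the template folder and renaming the settings page) can also be scripted. This is only a sketch under assumptions: the helper name create_engine_from_template is hypothetical, and the demo uses a stand-in template file called template_settings_page.py, so check the real file name in your copy of template-tts-engine.

```python
import shutil
import tempfile
from pathlib import Path


def create_engine_from_template(tts_engines_dir, engine_name):
    """Copy template-tts-engine to tts_engines/<engine_name> and rename the settings page."""
    template_dir = tts_engines_dir / "template-tts-engine"
    engine_dir = tts_engines_dir / engine_name
    if engine_dir.exists():
        raise FileExistsError(f"{engine_dir} already exists")
    shutil.copytree(template_dir, engine_dir)
    # Rename whatever settings page the template ships with.
    for page in list(engine_dir.glob("*_settings_page.py")):
        page.rename(engine_dir / f"{engine_name}_settings_page.py")
    return engine_dir


# Demo: build a fake template folder, then create a hypothetical "myengine".
demo_engines = Path(tempfile.mkdtemp())
template = demo_engines / "template-tts-engine"
template.mkdir()
for name in ("model_engine.py", "model_settings.json", "help_content.py",
             "available_models.json", "template_settings_page.py"):
    (template / name).write_text("")

new_dir = create_engine_from_template(demo_engines, "myengine")
print(sorted(p.name for p in new_dir.iterdir()))
```

The function names inside the settings page still have to be renamed by hand; this sketch only handles the folder and file names.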
- Core AllTalk Server (tts_server.py)
  - Acts as the main interface between the web UI/API and TTS engines
  - Loads in the selected TTS engine as a Class
  - Handles routing of TTS requests to the appropriate engine
  - Manages voice generation queues and system settings
  - You DO NOT need to touch or alter this file
  - Location: /alltalk_tts/
- Engine Layer (model_engine.py)
  - Individual engine implementation, the one that is imported as a Class by tts_server.py
  - Handles model loading, unloading, and voice generation
  - Provides a standardized interface for the core server
  - The pre-existing functions/variables within the file need to be there; do not remove them
  - You can add any helper functions you want to the script, e.g. if your generate TTS function needs 22050 Hz WAV files, you could create a helper function that checks incoming files and downsamples 44100 Hz WAV files as needed
  - tts_server.py looks for and works with these pre-existing functions/variables
  - You will be working on this file
  - Location: /system/tts_engines/[engine_name]/model_engine.py
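As an illustration of the kind of helper function mentioned for model_engine.py, here is a minimal sketch that only checks whether a WAV file needs resampling. It uses Python's standard wave module, which reads uncompressed PCM WAV files; the names wav_sample_rate and needs_resample are hypothetical, and the 22050 Hz default is simply the figure from the example above.

```python
import tempfile
import wave
from pathlib import Path


def wav_sample_rate(path):
    """Read the sample rate (Hz) from a PCM WAV file header."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate()


def needs_resample(path, target_rate=22050):
    """True when the file's sample rate differs from what the engine expects."""
    return wav_sample_rate(path) != target_rate


# Demo: write 0.01 s of 16-bit mono silence at 44100 Hz, then check it.
demo_wav = Path(tempfile.mkdtemp()) / "voice.wav"
with wave.open(str(demo_wav), "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(44100)
    wav_file.writeframes(b"\x00\x00" * 441)

print(wav_sample_rate(demo_wav), needs_resample(demo_wav))
```

The actual resampling is left out on purpose; it is better done with an audio library that applies proper filtering than by dropping samples.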
- Engine Settings JSON (model_settings.json)
  - Stores a group of settings that model_engine.py & [engine_name]_settings_page.py need to know about the engine
  - You can extend this JSON file if needed to store your own specific model settings that the model engine and settings page can use, but don't remove the pre-existing settings, just update them as necessary
  - You will be working on this file
  - Location: /system/tts_engines/[engine_name]/model_settings.json
- Engine's downloadable Models/Voices (available_models.json)
  - Stores a list of all known models or voice models that can be downloaded
  - These known models/voices should be from a reputable source
  - It's up to you how you want to structure this file. AI systems can help you design/build it
  - This will be used by [engine_name]_settings_page.py for its Gradio interface downloads section
  - You will be working on this file
  - Location: /system/tts_engines/[engine_name]/available_models.json
- Engine's Gradio UI settings page & its help file for the expandable accordions ([engine_name]_settings_page & help_content.py)
  - Is automatically found & imported into the Gradio interface as long as the filename matches [engine_name]; remember to use the same capitalization throughout
  - The built-in default engine settings page is controlled/configured by what is found in the model_settings.json file
  - You will have to rename some of the function names in this file to def [engine_name]_function_name or the Gradio import will fail
  - You will be building code here to create your alltalk_tts/models/[engine_name]/ folder
  - You will be building code here to download model files into alltalk_tts/models/[engine_name]/
  - The locations you specify in the code should be the same locations used in model_engine.py
  - You may have to create other tabs/code in here for other potential features you want presented to the user
  - Some of the existing markdown help in help_content.py should remain to build the UI help accordions
  - Add your own markdown sections to help_content.py for any engine-specific help you want to add
  - You will be working on these files
  - Location: /system/tts_engines/[engine_name]/[engine_name]_settings_page
  - Location: /system/tts_engines/[engine_name]/help_content.py
- Auto add a new TTS engine to AllTalk (new_engines.json)
  - When people update (git pull) AllTalk, new_engines.json is updated along with any new engine code (the files above)
  - When AllTalk starts, any new TTS engine and its specified default model are merged from new_engines.json into the user's tts_engines.json if the listed TTS engine & its default model don't exist there yet
  - You will be working on this file
  - Location: /system/tts_engines/new_engines.json
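To make the merge behaviour described for new_engines.json more concrete, here is a rough sketch of "add the engine only if it is not already listed". The schema is an assumption: this guide does not show the real layout of tts_engines.json or new_engines.json, so the engines_available list and the name / selected_model keys are hypothetical, as is the function merge_new_engines.

```python
import copy


def merge_new_engines(tts_engines, new_engines):
    """Return a copy of tts_engines with any not-yet-known engines appended."""
    merged = copy.deepcopy(tts_engines)
    known = {entry["name"] for entry in merged["engines_available"]}
    for entry in new_engines.get("engines_available", []):
        if entry["name"] not in known:
            merged["engines_available"].append(copy.deepcopy(entry))
            known.add(entry["name"])
    return merged


# Hypothetical file contents, already parsed with json.load().
user_config = {"engines_available": [{"name": "piper", "selected_model": "piper"}]}
shipped = {"engines_available": [
    {"name": "piper", "selected_model": "some-other-default"},
    {"name": "myengine", "selected_model": "model_v1"},
]}

merged = merge_new_engines(user_config, shipped)
print([entry["name"] for entry in merged["engines_available"]])
```

Existing entries are left untouched, so a user's current model selection is not overwritten by the shipped defaults, and running the merge twice changes nothing.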
When integrating a new TTS engine, we need to:
- Maintain consistent behavior with other engines
- Provide proper model and resource management
- Handle errors and edge cases gracefully
- Support features like low VRAM mode when applicable
- Provide clear user feedback and debugging information
Integrating a new TTS engine into AllTalk involves modifying several key components of the system. Before you start, it's important to consider a few key questions to help guide your integration and avoid pitfalls later on. This section will help you think through critical aspects related to naming conventions, file structures, installation methods, dependencies, and more.
- Consistency: Determine a clear, consistent name for the TTS engine that will be used throughout the codebase, folder names, and configuration files. Once chosen, this name must remain consistent in capitalization and format in all code, paths, and settings files.
- Uniqueness: Make sure the name is unique within AllTalk. Avoid using names that may overlap with existing engines or internal system names to avoid confusion and potential conflicts.
- AI Model vs. Voice Model Files: Understand how the TTS engine handles models:
  - Does it use a large AI model that can perform zero-shot voice cloning from an audio sample (e.g., many modern AI-driven TTS systems)?
  - Or does it rely on individual pre-trained voice model files, where each model represents a specific voice and language?
- Storage Strategy:
  - If it uses individual voice model files, decide how you want these models to be structured in the AllTalk system. These voice models need to be stored in the /alltalk_tts/models/[engine_name]/ folder, and the structure must make it easy for users to navigate/manage. Usually individual folders below the [engine_name] folder are the way to go.
  - In available_models.json, how will these models be listed? Perhaps the files' naming convention allows you to group files within your code for downloads, e.g. maybe all English voice model files are name_en_file.pth, with the en meaning English.
- Single Voice vs. Voice Packs:
  - If the engine uses individual voice model files, decide how users will download them.
  - Consider providing users with voice packs, such as "All English Voices," to make the model download process more convenient. This can be especially beneficial if the TTS engine provides multiple models for different languages or accents.
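The idea of grouping voice files into packs by language can be sketched as follows. It assumes the name_en_file.pth naming convention from the example above, where the second underscore-separated field is the language code; the function name group_by_language and the sample file names are made up for illustration.

```python
from collections import defaultdict


def group_by_language(file_names):
    """Group voice model files by the language code in name_<lang>_file.ext."""
    packs = defaultdict(list)
    for file_name in file_names:
        stem = file_name.rsplit(".", 1)[0]
        parts = stem.split("_")
        # Files that do not follow the convention go into an "unknown" pack.
        language = parts[1] if len(parts) >= 3 else "unknown"
        packs[language].append(file_name)
    return dict(packs)


voice_files = ["amy_en_file.pth", "brian_en_file.pth", "carla_es_file.pth", "misc.pth"]
packs = group_by_language(voice_files)
print(packs)
```

Each key could then back one download option in the Gradio UI, such as an "All English Voices" pack built from packs["en"].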
- How Will the Engine Be Installed?
  - Consider the method by which the TTS engine will be integrated into the current Python environment.
- Installation Options:
  - Is there a simple pip install command available? If so, this is often the easiest and most reliable way to manage dependencies.
  - Does the TTS engine require cloning a repository and installing manually (e.g., using git+https://github.com/...)? This approach requires additional checks and version control.
- Location of Installation Files: Decide whether to install dependencies into your TTS engine directory or into the alltalk shared python environment.
- There can be situations, like with Piper TTS where you need 2x different methods for Windows vs Linux. With Piper, Windows uses code under the Engine folder and Linux pip installs to the alltalk Python environment directly.
- The above situation for Piper also meant the generation code had to determine the OS it was generating TTS on to use the correct method.
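A per-OS decision like the Piper one can be kept in a single small function so the generation code only has to ask which method to use. This is a sketch, not Piper's actual integration code; the function name and the "bundled" / "pip" labels are hypothetical.

```python
import platform


def pick_piper_method(system_name=None):
    """Choose how to run the engine based on the operating system."""
    system_name = system_name or platform.system()
    if system_name == "Windows":
        return "bundled"   # run the code shipped inside the engine folder
    return "pip"           # use the package installed into the Python environment


print(pick_piper_method())
```

Accepting the system name as an optional argument makes both branches easy to exercise on one machine.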
- On-Demand Installation:
  - Consider installing required packages on first use of the TTS engine (as AllTalk does for F5-TTS). You can add a try/except block at the top of model_engine.py to install any missing dependencies dynamically.
  - This approach is beneficial because it reduces the initial setup overhead for AllTalk and ensures users only install what they need.
- Potential User Experience: Keep in mind that installing dependencies on the fly might lead to a slight delay when the engine is used for the first time, so it may be helpful to inform users if an installation is taking place.
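The on-demand installation pattern can be sketched with the standard library alone. This is an illustration rather than AllTalk's own installer: ensure_module and pip_install are hypothetical names, and unlike an install-and-restart approach this version simply retries the import in the same process.

```python
import importlib
import subprocess
import sys


def pip_install(package):
    """Install a package into the Python environment that is running this script."""
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])


def ensure_module(module_name, package=None, installer=pip_install):
    """Import module_name, installing its package first if the import fails."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        package = package or module_name
        print(f"{module_name} is missing, installing {package}. This may take a while...")
        installer(package)
        importlib.invalidate_caches()
        return importlib.import_module(module_name)


# Demo with a stand-in installer so nothing is really downloaded.
requested = []
json_module = ensure_module("json")  # already importable, no install needed
try:
    ensure_module("alltalk_demo_missing_module", "demo-package", installer=requested.append)
except ImportError:
    print("still missing after the (fake) install:", requested)
```

Using sys.executable ensures pip installs into the environment that is actually running AllTalk, and the printed message covers the point about telling users that an installation is in progress.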
- Shared vs. Conflicting Dependencies:
  - Determine whether adding the new TTS engine introduces any dependencies that might conflict with those used by other TTS engines already integrated into AllTalk.
  - Does It Really Matter?: In many cases, differences in requirements can be tolerated without significant impact, but for critical packages, conflicts could cause instability. At the very least, note any dependency conflicts.
- UI Customization:
  - Consider how much customization is required for the user interface ([engine_name]_settings_page.py and help_content.py). The complexity of the UI depends on whether you need advanced controls for the engine or if the default UI settings page is sufficient.
  - User Experience: Plan how to present features in a user-friendly way. If the TTS engine has many configurations or advanced options, make sure they are organized logically, possibly using tabs or collapsible sections in Gradio.
- Graceful Failure: Consider how to handle errors gracefully, particularly during model loading or voice generation.
  - Providing descriptive error messages should your code fail is beneficial. Much of this should already be covered in the template code.
- Debug Logging: Add logging to model_engine.py and other relevant files. This will help users troubleshoot and provide meaningful feedback if they encounter issues during the integration or usage of the TTS engine.
  - The debug options list is available here. You would typically add debug_func, debug_tts and debug_tts_variables to your engine and use the print_message function, which colour codes the output automatically and checks whether debug printing is currently switched on.
- Licensing: In model_settings.json you can link to the original TTS engine developer and also note any licensing information if necessary.
- Community Contribution: If you plan on sharing this integration with the AllTalk community, consider writing clear documentation on how your engine works, any special features it has, and instructions for other users to set it up.
import torch
import logging
from pathlib import Path
from fastapi import HTTPException
# Engine-specific imports (example from F5-TTS)
from f5_tts.model import CFM, DiT
from f5_tts.model.utils import get_tokenizer, convert_char_to_pinyin
from vocos import Vocos
The tts_class contains several critical sections that must be implemented:
- Initialization
def __init__(self):
    # Base variables (DO NOT MODIFY)
    self.branding = None
    self.device = "cuda" if torch.cuda.is_available() else "cpu"
    # ... other base variables ...
    # Engine-specific parameters
    # Example from F5-TTS:
    self.target_sample_rate = 24000
    self.n_mel_channels = 100
    # ... other engine parameters ...
- Model Management Functions: these core functions must be implemented for all engines:
async def setup(self):
    """Initial model setup and loading"""

async def handle_lowvram_change(self):
    """Handle moving model between CPU/GPU for low VRAM mode"""

async def handle_deepspeed_change(self, value):
    """DeepSpeed integration if supported"""

def scan_models_folder(self):
    """Scan for available models"""

def voices_file_list(self):
    """List available voices/samples"""

async def generate_tts(self, text, voice, language, temperature,
                       repetition_penalty, speed, pitch,
                       output_file, streaming):
    """Main TTS generation function"""
- scan_models_folder()
  - Must return a dictionary of available models
  - Handle the "No Models Found" case
  - Example structure:
{
    "model_name": "engine_name - model_name",
    "No Models Found": "No Models Found"  # If no models available
}
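A possible shape for scan_models_folder(), written as a standalone function so it can be run on its own, is sketched below. It assumes one sub-folder per model under models/[engine_name]/ and follows the example structure above for keys and values; in the real engine this would be a method that reads its paths from self.

```python
import tempfile
from pathlib import Path


def scan_models_folder(models_dir, engine_name):
    """Map each model sub-folder to a display name, or report that none exist."""
    models = {}
    if models_dir.is_dir():
        for folder in sorted(models_dir.iterdir()):
            if folder.is_dir():
                models[folder.name] = f"{engine_name} - {folder.name}"
    return models or {"No Models Found": "No Models Found"}


# Demo with a throwaway models/[engine_name]/ folder.
demo_models = Path(tempfile.mkdtemp())
print(scan_models_folder(demo_models, "myengine"))  # empty folder
(demo_models / "model_v1").mkdir()
(demo_models / "model_v2").mkdir()
found = scan_models_folder(demo_models, "myengine")
print(found)
```

Returning the "No Models Found" entry for both an empty and a missing folder keeps the caller from having to handle two different failure cases.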
- voices_file_list()
  - Return list of available voices
  - Handle voice file validation
  - Example from F5-TTS with reference text:
def voices_file_list(self):
    voices = []
    directory = self.main_dir / "voices"

    def has_reference_text(wav_path):
        text_path = wav_path.with_suffix('.reference.txt')
        return text_path.exists()

    # Scan for valid voice files
    for f in directory.glob("*.wav"):
        if has_reference_text(f):
            voices.append(f.name)
    return voices if voices else ["No Voices Found"]
- generate_tts()
  - Core generation function
  - Must handle all parameters regardless of engine support
  - Include proper error handling
  - Handle streaming if supported
  - Example error handling:
if not self.is_tts_model_loaded:
    raise HTTPException(status_code=400, detail="No TTS model loaded")
{
"model_details": {
"manufacturer_name": "Engine Name",
"manufacturer_website": "https://...",
"model_description": "Detailed description..."
},
"model_capabilties": {
"audio_format": "wav",
"deepspeed_capable": false,
"generationspeed_capable": true,
// ... other capabilities ...
},
"settings": {
"def_character_voice": "default.wav",
// ... other settings ...
}
}
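One way the engine code might consume these settings is to look up a capability flag before honouring a request parameter. The sketch below is illustrative only: is_capable is a hypothetical helper, the settings text is a cut-down copy of the structure above (spelling of model_capabilties kept as shown), and the "// ..." placeholder lines are omitted because standard JSON parsers reject comments.

```python
import json

# Trimmed-down settings without the "// ..." placeholders, which json.loads() would reject.
SETTINGS_TEXT = """
{
    "model_capabilties": {
        "audio_format": "wav",
        "deepspeed_capable": false,
        "generationspeed_capable": true
    },
    "settings": {"def_character_voice": "default.wav"}
}
"""


def is_capable(settings, capability):
    """Return True only when the capability flag is present and truthy."""
    return bool(settings.get("model_capabilties", {}).get(capability, False))


settings = json.loads(SETTINGS_TEXT)
requested_speed = 1.5
# Fall back to normal speed when the engine cannot change generation speed.
speed = requested_speed if is_capable(settings, "generationspeed_capable") else 1.0
print(speed, is_capable(settings, "deepspeed_capable"))
```

Treating a missing flag as "not capable" means an engine that has not declared a capability simply ignores the matching parameter instead of failing.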
- Required Engine Files
  - Identify core model files (weights, configs)
  - Identify required supporting files (vocoder, tokenizer)
  - Determine Python package dependencies
  - Example from F5-TTS:
try:
    from f5_tts.model import CFM, DiT
    from vocos import Vocos
except ImportError:
    install_and_restart()  # Custom installation function
- Model File Structure
models/
└── [engine_name]/
    └── [model_version]/
        ├── model.safetensors/pth/onnx
        ├── config.json/yaml
        └── supporting_files/
- Voice File Organization
voices/
├── voice1.wav
├── voice1.reference.txt  # If reference text needed
└── subfolders/           # Optional organization
    ├── voice2.wav
    └── voice2.reference.txt
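- Voice Validation
  - Check file format compatibility
  - Verify required companion files
  - Example validation:
def validate_voice_file(voice_path, needs_reference_text=False):
    """Return True if the voice file and any required companion file exist."""
    if not voice_path.exists():
        return False
    if voice_path.suffix != '.wav':
        return False
    if needs_reference_text:
        # Engines such as F5-TTS expect voice1.reference.txt next to voice1.wav
        if not voice_path.with_suffix('.reference.txt').exists():
            return False
    return True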
The [engine_name]_settings_page.py file should implement:
- Basic Functions
def engine_name_voices_file_list():
    """List available voices"""

def engine_name_model_update_settings(...):
    """Update engine settings"""

def engine_name_model_alltalk_settings(model_config_data):
    """Main settings page implementation"""
- UI Components
  - Model selection
  - Voice management
  - Engine-specific settings
  - Help documentation

Example:
with gr.Blocks() as app:
    with gr.Tab("Default Settings"):
        # Basic settings
        with gr.Row():
            lowvram_enabled_gr = gr.Radio(...)
            speed_slider = gr.Slider(...)
    with gr.Tab("Reference Text Manager"):
        # Voice management
        with gr.Row():
            file_list = gr.Dropdown(...)
            text_editor = gr.Textbox(...)
- Help Documentation: include comprehensive help in Markdown format:
gr.Markdown("""
### 🟧 Engine Name Help
Detailed explanation of:
- Model locations
- Voice requirements
- Best practices
- Troubleshooting
""")
{
"first_start_model": "model_v1",
"models": [
{
"model_name": "model_v1",
"folder_path": "model_v1",
"files_to_download": {
"model.file": "https://url/to/file",
"config.file": "https://url/to/config",
"subfolder/file": "https://url/to/subfile"
}
}
]
}
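- File Management
def download_model(model_name, force_download=False):
    # Find model in config (None if the name is not listed)
    selected_model = next(
        (model for model in available_models["models"]
         if model["model_name"] == model_name),
        None,
    )
    if selected_model is None:
        raise ValueError(f"Unknown model: {model_name}")
    # Setup paths
    base_path = main_dir / "models" / "engine_name"
    model_path = base_path / selected_model["folder_path"]
    # Download files, skipping ones already present unless force_download is set
    for file_name, url in selected_model["files_to_download"].items():
        target = model_path / file_name
        if target.exists() and not force_download:
            continue
        # Entries such as "subfolder/file" need their parent folder created first
        target.parent.mkdir(parents=True, exist_ok=True)
        download_file(url, target)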
- Progress Tracking
import requests
from tqdm import tqdm

def download_file(url, path):
    response = requests.get(url, stream=True)
    response.raise_for_status()  # Stop early on HTTP errors instead of saving an error page
    total_size = int(response.headers.get('content-length', 0))
    with tqdm(total=total_size, unit='iB', unit_scale=True) as pbar:
        with open(path, 'wb') as f:
            for data in response.iter_content(1024):
                pbar.update(len(data))
                f.write(data)
- Debug Flags
self.debug_tts = configfile_data.get("debugging").get("debug_tts")
self.debug_tts_variables = configfile_data.get("debugging").get("debug_tts_variables")
- Debug Print System
def debug_print(self, message, type="debug"):
    if self.debug_tts:
        prefix = {
            "debug": "\033[94mDebug",
            "warning": "\033[93mWarning",
            "error": "\033[91mError"
        }.get(type, "\033[94mDebug")
        print(f"[{self.branding}ENG] {prefix}: {message}\033[0m")
- Key Debug Points
async def api_manual_load_model(self, model_name):
    try:
        self.debug_print(f"Loading model: {model_name}")
        self.debug_print(f"Device: {self.device}")
        if self.device == "cuda":
            self.debug_print("CUDA Memory before load: "
                             f"{torch.cuda.memory_allocated()/1024**2:.2f}MB")
        # Model loading code...
        if self.device == "cuda":
            self.debug_print("CUDA Memory after load: "
                             f"{torch.cuda.memory_allocated()/1024**2:.2f}MB")
    except Exception as e:
        self.debug_print(f"Error loading model: {str(e)}", "error")
        raise
- Model Loading Errors
async def handle_model_load_error(self, error):
    if "CUDA out of memory" in str(error):
        message = ("CUDA out of memory. Try enabling Low VRAM mode "
                   "or using a smaller model.")
    elif "No such file" in str(error):
        message = "Model files missing. Please download the model first."
    else:
        message = f"Unknown error loading model: {str(error)}"
    self.debug_print(message, "error")
    raise HTTPException(status_code=500, detail=message)
- Voice File Validation
def validate_voice_requirements(self, voice_path):
    errors = []
    if not voice_path.exists():
        errors.append(f"Voice file not found: {voice_path}")
    if voice_path.suffix != '.wav':
        errors.append("Voice file must be WAV format")
    if self.needs_reference_text:
        ref_text = voice_path.with_suffix('.reference.txt')
        if not ref_text.exists():
            errors.append("Missing reference text file")
    if errors:
        error_msg = "\n".join(errors)
        self.debug_print(error_msg, "error")
        raise ValueError(error_msg)
- Device Tracking
class DeviceManager:
    def __init__(self, engine):
        self.engine = engine
        self.current_device = "cuda" if torch.cuda.is_available() else "cpu"

    async def ensure_on_device(self, target_device):
        if self.current_device != target_device:
            await self.move_to_device(target_device)

    async def move_to_device(self, target_device):
        if not hasattr(self.engine, 'model'):
            return
        # Convert precision as needed
        if target_device == "cuda":
            self.engine.model = self.engine.model.half().to(target_device)
        else:
            self.engine.model = self.engine.model.float().to(target_device)
        self.current_device = target_device
- Generation With Low VRAM
async def generate_tts(self, text, voice, language, temperature,
                       repetition_penalty, speed, pitch,
                       output_file, streaming):
    try:
        if self.lowvram_enabled:
            # Move to GPU for generation
            await self.handle_lowvram_change()
        # Generate TTS...
    finally:
        if self.lowvram_enabled and not self.tts_narrator_generatingtts:
            # Move back to CPU unless more narrator text coming
            await self.handle_lowvram_change()
- Model Management
async def test_model_lifecycle():
    engine = tts_class()
    # Test initialization
    assert engine.is_tts_model_loaded == False
    # Test model loading
    await engine.setup()
    assert engine.is_tts_model_loaded == True
    # Test model unloading
    await engine.unload_model()
    assert engine.is_tts_model_loaded == False
- Voice Generation
import os

async def test_voice_generation():
    engine = tts_class()
    await engine.setup()
    test_cases = [
        ("Hello world", "voice1.wav", "en"),
        ("Multiple words test", "voice2.wav", "en"),
        # Add more test cases...
    ]
    for text, voice, language in test_cases:
        output_file = f"test_{voice}.wav"
        await engine.generate_tts(
            text=text,
            voice=voice,
            language=language,
            temperature=0.7,
            repetition_penalty=1.0,
            speed=1.0,
            pitch=0,
            output_file=output_file,
            streaming=False
        )
        assert os.path.exists(output_file)
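- Batch Processing
import re

def chunk_text(self, text, max_chars=135):
    """Split long text into manageable chunks (the limit is measured in UTF-8 bytes)"""
    chunks = []
    # First alternative: ASCII punctuation followed by whitespace.
    # Second alternative: full-width (CJK) punctuation, which has no trailing space.
    sentences = re.split(r"(?<=[;:,.!?])\s+|(?<=[;:,。!?])", text)
    current_chunk = ""
    for sentence in sentences:
        if len(current_chunk.encode("utf-8")) + len(sentence.encode("utf-8")) <= max_chars:
            current_chunk += sentence + " "
        else:
            # Avoid emitting an empty chunk when the very first sentence is too long
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
            current_chunk = sentence + " "
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    return chunks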
- Cross-Fade Implementation
import numpy as np

def apply_crossfade(self, audio_segments, fade_duration, sample_rate):
    """Smoothly join audio segments"""
    if fade_duration <= 0:
        return np.concatenate(audio_segments)
    final_wave = audio_segments[0]
    for next_segment in audio_segments[1:]:
        fade_samples = int(fade_duration * sample_rate)
        fade_samples = min(fade_samples, len(final_wave), len(next_segment))
        if fade_samples <= 0:
            # Nothing to overlap (e.g. an empty segment). Without this guard,
            # final_wave[:-0] below would discard all audio collected so far.
            final_wave = np.concatenate([final_wave, next_segment])
            continue
        fade_out = np.linspace(1, 0, fade_samples)
        fade_in = np.linspace(0, 1, fade_samples)
        overlap_end = final_wave[-fade_samples:] * fade_out
        overlap_start = next_segment[:fade_samples] * fade_in
        final_wave = np.concatenate([
            final_wave[:-fade_samples],
            overlap_end + overlap_start,
            next_segment[fade_samples:]
        ])
    return final_wave
import os

class ResourceManager:
    def __init__(self):
        self.temp_files = []

    def register_temp_file(self, path):
        self.temp_files.append(path)

    def cleanup(self):
        for path in self.temp_files:
            try:
                if os.path.exists(path):
                    os.remove(path)
            except Exception as e:
                print(f"Failed to remove temp file {path}: {e}")
        self.temp_files.clear()
- Audio Normalization
def normalize_audio(self, audio_data, target_db=-23):
    """Normalize audio to target dB"""
    rms = np.sqrt(np.mean(np.square(audio_data)))
    target_rms = 10 ** (target_db / 20)
    gain = target_rms / (rms + 1e-8)
    return audio_data * gain
- Sample Rate Conversion
import torchaudio

def ensure_sample_rate(self, audio_data, source_rate, target_rate):
    """Convert audio to target sample rate"""
    if source_rate == target_rate:
        return audio_data
    resampler = torchaudio.transforms.Resample(
        source_rate, target_rate
    )
    return resampler(audio_data)
import time

class ProgressTracker:
    def __init__(self, total_steps):
        self.total = total_steps
        self.current = 0
        self.start_time = time.time()

    def update(self, steps=1):
        self.current += steps
        elapsed = time.time() - self.start_time
        # Guard against division by zero when no step has completed yet
        eta = (elapsed / self.current) * (self.total - self.current) if self.current else 0
        return {
            "progress": self.current / self.total * 100,
            "elapsed": elapsed,
            "eta": eta
        }
- Function Documentation Template
def function_name(self, param1, param2):
    """
    Brief description of function purpose.

    Args:
        param1 (type): Description of param1
        param2 (type): Description of param2

    Returns:
        type: Description of return value

    Raises:
        ErrorType: Description of when this error occurs
    """
- Class Documentation Template
class ClassName:
    """
    Brief description of class purpose.

    Attributes:
        attr1 (type): Description of attr1
        attr2 (type): Description of attr2

    Methods:
        method1: Brief description
        method2: Brief description
    """
- Settings Page Help Format
gr.Markdown("""
# Engine Name Help
## Model Installation
1. Download the models using the Models tab
2. Place voice samples in the voices folder
3. Configure voice settings as needed
## Voice Requirements
- Format: WAV files
- Duration: Recommended 5-15 seconds
- Quality: Clear speech, minimal background noise
## Troubleshooting
Common issues and solutions...
## Best Practices
Tips for optimal results...
""")
def check_version_compatibility():
    """Check compatibility with AllTalk version"""
    min_version = "2.0.0"
    current = get_alltalk_version()
    if parse_version(current) < parse_version(min_version):
        raise CompatibilityError(
            f"This engine requires AllTalk {min_version} or higher"
        )
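The version check above relies on helpers that are not defined in this guide (get_alltalk_version, parse_version, CompatibilityError). A self-contained sketch of the same idea follows, assuming plain numeric "major.minor.patch" version strings; the helper definitions here are stand-ins, and the current version is passed in as an argument because how AllTalk exposes its version is not covered here.

```python
class CompatibilityError(RuntimeError):
    """Raised when the running AllTalk version is too old for this engine."""


def parse_version(version):
    """Turn '2.10.0' into (2, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def check_version_compatibility(current, min_version="2.0.0"):
    if parse_version(current) < parse_version(min_version):
        raise CompatibilityError(
            f"This engine requires AllTalk {min_version} or higher (found {current})"
        )


check_version_compatibility("2.0.1")  # passes silently
try:
    check_version_compatibility("1.9.9")
    too_old_rejected = False
except CompatibilityError as error:
    too_old_rejected = True
    print(error)
```

Comparing tuples of integers matters because plain string comparison would rank "2.10.0" below "2.9.1".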
- Model Updates
async def update_model_files(self):
    """Update model files while preserving settings"""
    # Backup current settings
    settings_backup = self.get_current_settings()
    # Update model files
    await self.download_latest_models()
    # Restore settings
    self.restore_settings(settings_backup)
- Configuration Updates
def update_config_structure():
    """Update config files to latest format"""
    for config_file in CONFIG_FILES:
        current = load_config(config_file)
        updated = migrate_config(current)
        save_config(config_file, updated)