TTS Generator

erew123 edited this page Sep 30, 2024 · 2 revisions

The AllTalk TTS Generator is a powerful solution for converting large volumes of text into speech using the voice of your choice. Whether you're creating audiobooks, generating voice content, or simply want to hear text read aloud, the TTS Generator is equipped to handle it efficiently.

Accessing the TTS Generator

You can open the TTS Generator from a link in either the Gradio interface or the HTML interface. While AllTalk is running, it is typically available at http://127.0.0.1:7851/static/tts_generator/tts_generator.html.

Performance Recommendations

  • DeepSpeed is highly recommended to speed up generation (with TTS engines that support it, such as XTTS).
  • Low VRAM is best turned off, and your LLM should be unloaded from GPU VRAM.
  • Use the No Playback option for very large generations (15,000 words or more) to reduce memory overhead.
  • When exporting to WAV, splitting the export into smaller groups reduces memory overhead, which helps on low-memory systems.

Estimated Throughput

Performance will vary by system, but as a reference:

  • 58,000 word document
  • DeepSpeed enabled, Low VRAM disabled
  • Splitting size 2
  • Nvidia RTX 4070
  • Result: ~1,000 words per minute (58 minutes total)
  • Exporting to combined WAVs: 2-3 minutes
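To plan a job, the reference figure above can be turned into a rough time estimate. The sketch below is simple arithmetic, not part of AllTalk. The function name is hypothetical, and the default of 1,000 words per minute is only the reference rate quoted above, so measure your own system's rate on a short sample first.

```python
def estimate_generation_minutes(word_count: int, words_per_minute: float = 1000.0) -> float:
    """Rough generation time in minutes for a document of `word_count` words.

    The default rate is the reference figure quoted above (RTX 4070,
    DeepSpeed enabled); it is an assumption, not a guarantee.
    """
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return word_count / words_per_minute


if __name__ == "__main__":
    # The 58,000 word reference document at the reference rate
    print(estimate_generation_minutes(58_000))
```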

Quick Start Guide

  1. Text Input: Enter your text in the 'Text Input' box.
  2. Generate TTS: Click to start the text-to-speech conversion.
  3. Pause/Resume: Pause or resume playback of generated or streamed audio.
  4. Stop Playback: Stops current audio playback (does not stop text generation).

Note: The TTS server remains busy until the process is complete. Plan your generation requests accordingly.

Customization and Preferences

  • Character Voice: Select the voice for your text.
  • RVC Voice: Choose a Retrieval-based Voice Conversion voice for additional modification, or leave Disabled to bypass RVC.
  • RVC Pitch: Adjust the pitch of the RVC voice (-24 to +24).
  • Language: Choose the language of your text. Not all TTS Engines support all languages.
  • Chunk Sizes: Set the size of text chunks for generation (smaller sizes recommended for better quality, depending on TTS engine used).
  • Custom File Name: Set a name for your output files so they are easy to identify in the Outputs folder.

Interface Features

  • Dark/Light Mode: Switch between visual themes.
  • Word Count and Generation Queue: Monitor progress of your generation.

TTS Generation Modes

Wav Chunks

  • Ideal for audiobooks or long-term storage.
  • Breaks text into manageable WAV files.
  • Allows editing and regeneration of specific portions.
  • Playback options: "In Browser", "On Server", or "No Playback".
  • Only "In Browser" playback populates the Generated TTS List.

Streaming

  • For immediate playback without saving.
  • Plays through your browser.
  • Cannot be stopped once generation begins.

Memory Management for Large Generations

When working with extensive text-to-speech generations, it's important to consider the memory limitations of web browsers and your system:

  1. Browser Memory Usage: Web browsers need to store a catalogue of all generated TTS files and audio in memory. This can become resource-intensive for very large generations.

  2. System RAM Considerations: Systems with 16GB of RAM or less may struggle with very large generations (20,000 words or more) due to the memory requirements of storing this catalogue.

To manage these limitations effectively:

  • Use the No Playback option to minimize browser memory usage during generation.
  • For very large texts (20,000+ words), consider breaking them into smaller blocks of 5,000 to 10,000 words each.
  • Generate these smaller blocks separately, then join the resulting audio files using external audio editing software such as Audacity.
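If you prepare the text outside the browser, splitting it into blocks can be scripted. The following is a minimal sketch, not an AllTalk feature: the function name is hypothetical and the 5,000 word default is taken from the lower end of the range suggested above. It splits purely on whitespace, so paragraph breaks are not preserved and a block may end mid-sentence.

```python
from typing import List


def split_into_blocks(text: str, max_words: int = 5000) -> List[str]:
    """Split text into consecutive blocks of at most `max_words` words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    # Step through the word list in strides of max_words
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

Each returned block can then be pasted into the Text Input box and generated on its own.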

Playback and List Management

  • Play List: Start playback from the beginning.
  • Stop Playback: Halt audio at any time.
  • Custom Start: Begin playback from a specific ID.
  • Regeneration and Editing: Edit text or regenerate specific chunks.
  • Export/Import List: Save as JSON for backup or future editing.

Exporting Options

Export to WAV

  • Combines all generated TTS into a single WAV file.
  • File size limit: 1GB per exported file. Multiple file exports can be merged with Audacity.
  • Choose the number of files per export block (recommended: 500 or fewer).
  • Smaller export batches reduce memory requirements, which helps on systems with 8GB or 16GB of RAM.
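As an alternative to merging exports by hand in Audacity, uncompressed WAV files can be joined with Python's standard `wave` module. This is a sketch under assumptions, not something AllTalk provides: the function name is hypothetical, and it assumes every input is PCM WAV with the same channel count, sample width, and sample rate. It reads one input file at a time into memory.

```python
import wave
from pathlib import Path
from typing import Sequence


def concatenate_wavs(inputs: Sequence[Path], output: Path) -> int:
    """Join WAV files that share one audio format; return frames written."""
    if not inputs:
        raise ValueError("no input files given")
    # Use the first file's format as the reference for all others
    with wave.open(str(inputs[0]), "rb") as first:
        ref = first.getparams()
    total_frames = 0
    with wave.open(str(output), "wb") as out:
        out.setparams(ref)  # frame count in the header is corrected on close
        for path in inputs:
            with wave.open(str(path), "rb") as src:
                p = src.getparams()
                if (p.nchannels, p.sampwidth, p.framerate) != (
                    ref.nchannels, ref.sampwidth, ref.framerate
                ):
                    raise ValueError(f"{path} has a different audio format")
                out.writeframes(src.readframes(p.nframes))
                total_frames += p.nframes
    return total_frames
```

Pass the exported files in the order they should play, for example `concatenate_wavs(sorted(Path("exports").glob("*.wav")), Path("book.wav"))`, where the `exports` folder and file names are placeholders.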

Exporting and Importing JSON

The JSON export and import feature is a crucial tool for managing your TTS generation projects:

  • Purpose: Save your work, transfer between sessions, or create backups.
  • Content: Stores the catalogue of your TTS generations, including text and file references.
  • Limitation: Does not include the actual generated audio files.

Important: The exported JSON relies on the audio files in your outputs folder. If you delete these files, the JSON will no longer be able to access the generated audio.

Export SRT

Generates an SRT subtitle file whose timestamps match your exported WAV file. This is useful if you want to use the generated audio in a video with on-screen subtitles.
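For readers unfamiliar with the format, an SRT file is a sequence of numbered entries, each with a `start --> end` timestamp line in `HH:MM:SS,mmm` form followed by the subtitle text. The sketch below shows how such entries could be built from per-chunk durations. It is illustrative only and is not AllTalk's export code; the function names and the `(text, duration)` input shape are assumptions.

```python
from typing import Iterable, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def build_srt(chunks: Iterable[Tuple[str, float]]) -> str:
    """Build SRT text from (subtitle text, duration in seconds) pairs.

    Each entry starts where the previous one ended, mirroring audio
    chunks played back to back.
    """
    entries = []
    start = 0.0
    for index, (text, duration) in enumerate(chunks, start=1):
        end = start + duration
        entries.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
        start = end
    # Entries are separated by a blank line
    return "\n".join(entries)
```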

Analyzing Generated TTS

The Analyze TTS feature scans WAV files and compares original text with generated TTS, flagging inconsistencies.

  • Uses the Whisper large-v2 AI engine (2.5GB download on first use).
  • Adjustable accuracy percentage (96-98% recommended for best results).
  • Test with small text samples (10-20 lines) to understand detection patterns.
  • Results viewable in terminal/command prompt window.
  • Usable on systems without an Nvidia GPU, but may be slow when running on CPU.

Accuracy Considerations

The analyzer attempts to identify similar-sounding words (e.g., "their" vs "there") to reduce false positives. Setting the accuracy percentage higher may increase the number of unwanted detections.
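To see why a stricter percentage flags more lines, it helps to picture the comparison as a similarity score checked against a threshold. The sketch below uses Python's standard `difflib` for the score. It is not AllTalk's actual analyzer: it assumes the transcription already exists as text, does not handle similar-sounding words, and all function names and the 97% default are hypothetical choices within the recommended range.

```python
import re
from difflib import SequenceMatcher
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, strip punctuation, and split into words."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower()).split()


def similarity_percent(original: str, transcribed: str) -> float:
    """Word-level similarity between two texts, from 0 to 100."""
    matcher = SequenceMatcher(None, normalize(original), normalize(transcribed))
    return 100.0 * matcher.ratio()


def needs_review(original: str, transcribed: str, threshold: float = 97.0) -> bool:
    """Flag a chunk when its similarity falls below the threshold."""
    return similarity_percent(original, transcribed) < threshold
```

With this kind of check, a single substituted word in a short line drops the score well below 97%, which is why testing on 10-20 lines first gives a feel for what gets flagged.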

Tips for Correct Pronunciation

Adding Pauses

Use semi-colons (;) and colons (:) to create pauses, similar to periods (.).

Handling Acronyms

For correct pronunciation of acronyms like "ChatGPT":

  • Chat G P T.
  • Chat G,P,T.
  • Chat G.P.T.
  • Chat G-P-T.
  • Chat gee pee tea (phonetic approach)
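If a text contains many acronyms, the letter-separated forms above can be produced with a small preprocessing step before pasting the text into the generator. This is a hedged sketch, not an AllTalk feature: the function name is hypothetical, and the rule (any run of two or more capitals is treated as an acronym) will also split words written in all caps and camel-case names such as "iPhone".

```python
import re


def spell_out_acronyms(text: str, separator: str = " ") -> str:
    """Separate the letters of acronyms, e.g. "ChatGPT" -> "Chat G P T"."""
    # Insert a space where a lowercase letter is directly followed by a capital
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    # Put the separator between the letters of every run of 2+ capitals
    return re.sub(r"[A-Z]{2,}", lambda m: separator.join(m.group()), spaced)
```

Changing `separator` to `","`, `"."`, or `"-"` gives the other variants listed above; which one sounds best depends on the TTS engine and voice.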

Best Practices

  • Keep text chunks under 250 characters for smooth generation.
  • The generator remembers settings between sessions.
  • Export to JSON regularly for backup.
  • Test voice and settings with smaller text portions initially.
  • For audiobooks, consider exporting in sections and combining with external software such as Audacity.
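To check in advance how a text would break down under the 250 character guideline, sentences can be grouped into chunks with a short script. The sketch below is an illustration only: the generator's own chunking is controlled by the Chunk Sizes setting and may work differently. The function name is hypothetical, and a single sentence longer than the limit is kept whole rather than cut.

```python
import re
from typing import List


def chunk_text(text: str, max_chars: int = 250) -> List[str]:
    """Group sentences into chunks of at most `max_chars` characters."""
    # Split after sentence-ending or pause punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?;:])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence  # an over-long single sentence stays whole
    if current:
        chunks.append(current)
    return chunks
```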