Skip to content

This project includes a Python script for fine-tuning a text-to-speech (TTS) model. The script utilizes custom datasets and use CUDA for accelerated training.

Notifications You must be signed in to change notification settings

MrZeroX1/Renan-Platform

 
 

Repository files navigation

English Arabic

Fine-Tuning TTS Model For Arabic Language

Team Members

Features

  • High-performance Deep Learning xtts model for Text2Speech tasks.
    • Speaker Encoder to compute speaker embeddings efficiently.
  • Fast and efficient model training.
  • Detailed training logs on the terminal and Tensorboard.
  • Support for Multi-speaker TTS.
  • Efficient, flexible, lightweight but feature complete Trainer API.
  • Released and ready-to-use models.
  • Utilities to use and test your models.
  • Modular (but not too much) code base enabling easy implementation of new ideas.

Prequests

  • 3.9.x <= Python < 3.12
  • CUDA >= 11.8

Dataset We Use in This Project

As a team, we use the SADA2022 dataset from Sadaia, which includes a wide range of Arabic dialects. Below are the steps to clean the SADA dataset to make it ready for training:

  1. Download SADA2022 Dataset and unzip it.
  2. Use the Cleaning.ipynb file from the Prepare_your_Data folder. Manually remove " and متحدث from the data (use Ctrl + H to replace the text).
  3. Use audio_segment.py from the Prepare_your_Data folder to segment the audio data you just cleaned and save it in a directory.
  4. Use split_data.ipynb from the Prepare_your_Data folder to split the data into 70% training and 30% testing sets.

Installation

Follow these steps for installation:

  1. Ensure that CUDA is installed
  2. Clone the repository:
git clone https://github.com/Haurrus/xtts-trainer-no-ui-auto
  1. Navigate into the directory:
cd xtts-trainer-no-ui-auto
  1. Create a virtual environment:

    • On Terminal
    python -m venv myenv
    • On Anaconda
    conda create --name myenv python=3.11.9
  2. Activate the virtual environment:

    • On Anaconda :
    conda activate myenv
    • On Windows use :
    venv\scripts\activate
    • On linux use:
    source venv\bin\activate
  3. Ensure that you install CUDA Toolkit 12.4 from their official site CUDA 12.4 (I choose 12.4 not 12.5 because the PyTorch only support 12.4 for now)

  4. Install PyTorch and torchaudio with pip command :

    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
  5. Install all dependencies from requirements.txt :

    pip install -r requirements.txt

During the implementation

This is a Python script for fine-tuning a text-to-speech (TTS) model for xTTSv2. The script utilizes custom datasets and use CUDA for accelerated training.

To use the script, you need to specify two JSON files: args.json and datasets.json.

args.json

{
  "num_epochs": 0,
  "batch_size": 3,
  "grad_acumm": 84,
  "max_audio_length": 15,
  "language": "ar",
  "version": "",
  "json_file": "",
  "custom_model": ""
}

This file should contain the following key parameters:

  • num_epochs: Number of epochs for training, if set to 0 it will auto calculate it.
  • batch_size: Batch size for training.
  • grad_acumm: Gradient accumulation steps.
  • max_audio_length: max audio duration of wavs used to train.
  • language: language used to train the model.
  • version: by default main from xTTSv2
  • json_file: by default main from xTTSv2
  • custom_model: by default main from xTTSv2

datasets.json

[
  {
    "path": "path/to/dataset1",
    "activate": true
  },
  {
    "path": "path/to/dataset2",
    "activate": false
  }
]

This file should list the datasets to be used with paths and activation flags.

Running the Script

Execute the script with the following command:

python xtts_finetune_no_ui_auto.py --args_json path/to/args.json --datasets_json path/to/datasets.json

If you running the script with the same folder for args.json & datasets.json, run this command:

python xtts_finetune_no_ui_auto.py --args_json args.json --datasets_json datasets.json

Error handling

This section addresses some of the errors encountered while trying to execute the code.

torch.OutOfMemoryError

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB. GPU 0 has a total capacity of 6.00 GiB of which 0 bytes is free. Of the allocated memory 8.38 GiB is allocated by PyTorch, and 471.68 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management https://pytorch.org/docs/stable/notes/cuda.html#environment-variables

Potential Causes:
  • Lack of Memory: The minimum requirement for running the code is 16 GB of RAM. Upgrade your device or find another device with sufficient memory.

  • Lack of GPU: Ensure that you have a GPU with at least 4 GB of VRAM, as deep learning training requires significant GPU resources.

Access is deniend

[WinError 5] Access is denied:'path\\to\\dataset\\run'

Remove the run folder from the dataset.

OSError: [WinError 1455]

OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading ".root\.conda\envs\py11\Lib\site-packages\torch\lib\cufft64_11.dll" or one of its dependencies.

This error is similar to torch.OutOfMemoryError and can be resolved by addressing memory issues.

Future Warning

These warnings can be safely ignored.

What is the Output of the Training

PS C:\xtts-trainer-no-ui-auto> python xtts_finetune_no_ui_auto.py --args_json args.json --datasets_json datasets.json

Checking dataset in path: C:\xtts-trainer-no-ui-auto\outwaves
Looking for dataset at: C:\xtts-trainer-no-ui-auto\outwaves
 > Loading custom model: C:\xtts-trainer-no-ui-auto\models\main\model.pth

>> DVAE weights restored from: C:\xtts-trainer-no-ui-auto\models\main\dvae.pth
 | > Found 25786 files in 

 > Training Environment:
 | > Backend: Torch
 | > Mixed precision: False
 | > Precision: float32
 | > Num. of CPUs: 12
 | > Num. of GPUs: 1
 | > Num. of Torch Threads: 1
 | > Torch seed: 1
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 | > Torch TF32 MatMul: False
 > Start Tensorboard: tensorboard --logdir=C:\xtts-trainer-no-ui-auto\outwaves\run\training\GPT_XTTS_FT-August-05-2024_05+14PM-8025ef4

 > Model has 518442047 parameters

 > EPOCH: 0/50
 --> C:\xtts-trainer-no-ui-auto\outwaves\run\training\GPT_XTTS_FT-August-05-2024_05+14PM-8025ef4
 > Filtering invalid eval samples!!
 > Total eval samples after filtering: 40

 > EVALUATION 


  --> EVAL PERFORMANCE
     | > avg_loader_time: 0.18054273189642486 (+0)
     | > avg_loss_text_ce: 0.033357221107834435 (+0)
     | > avg_loss_mel_ce: 4.670955278934577 (+0)
     | > avg_loss: 4.704312489582942 (+0)


 > EPOCH: 1/100
 --> C:\xtts-trainer-no-ui-auto\outwaves\run\training\GPT_XTTS_FT-August-05-2024_05+14PM-8025ef4
 > Sampling by language: dict_keys(['ar'])

 > TRAINING (2024-08-05 17:17:08) 
 --> TIME: 2024-08-03 02:51:47 -- STEP: 0/14 -- GLOBAL_STEP: 0�[0m
     | > loss_text_ce: 0.0332101508975029  (0.0332101508975029)
     | > loss_mel_ce: 4.7347731590271  (4.7347731590271)
     | > loss: 0.05676170811057091  (0.05676170811057091)
     | > current_lr: 5e-06 
     | > step_time: 0.9667  (0.9666781425476074)
     | > loader_time: 79.179  (79.178950548172)


  > EVALUATION


--> EVAL PERFORMANCE
     | > avg_loader_time: 0.028109293717604417 (+0.0017166871290940494)
     | > avg_loss_text_ce: 0.032951588957355574 (-7.596425712108612e-05)
     | > avg_loss_mel_ce: 4.6023083833547735 (-0.07368043752817055)
     | > avg_loss: 4.635259958413931 (-0.07375643803523246)

 > BEST MODEL : C:\xtts-trainer-no-ui-auto\outwaves\run\training\GPT_XTTS_FT-August-05-2024_05+14PM-8025ef4\best_model_14.pth

 > EPOCH: 2/100
 -->  C:\xtts-trainer-no-ui-auto\outwaves\run\training\GPT_XTTS_FT-August-05-2024_05+14PM-8025ef4

TRAINING (2024-08-03 02:52:07)

Warning : This project is not developed by us; we are merely using it to train the model for Najdi and Khaliji dialects support.

All thanks to Haurrus/xtts-trainer-no-ui-auto for the works he done on this project.

About

This project includes a Python script for fine-tuning a text-to-speech (TTS) model. The script utilizes custom datasets and use CUDA for accelerated training.

Topics

Resources

Stars

Watchers

Forks

Languages

  • Python 66.9%
  • Jupyter Notebook 33.1%