This repository hosts the code and resources for the Voice Cloning project, which aims to create a voice cloning model using deep learning techniques. The project focuses on developing a custom Text-to-Speech (TTS) model that can generate natural-sounding speech for a variety of speakers by training on text and corresponding audio datasets.
- Introduction
- Problem Statement
- Methodology
- Requirements
- Model Architecture
- Training
- Results
- Future Work
- References
Voice cloning is the replication of a person’s voice using deep neural networks. The goal of this project is to build a customized TTS model capable of generating natural speech, trained on speaker-specific data so that the generated voice closely resembles the target speaker.
The project involves creating an AI system that can mimic a person's voice by analyzing audio recordings and their text transcripts. The model learns directly from (text, audio) pairs, with the goal of high-quality speech synthesis that captures speaker nuances, intonation, and style.
The project uses a deep learning approach that involves:
- Text Encoding: Processing input text to generate linguistic and phonetic representations.
- Feature Extraction from Mel-Spectrogram: Converting audio into Mel-spectrograms, which serve as intermediate representations in the pipeline.
- Audio Generation: Converting Mel-spectrograms back to audio with the Griffin-Lim algorithm, or with a vocoder for higher-quality output (see the sketch after this list).
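The sketch below illustrates the two audio-side steps with Librosa. The parameter values (sample rate, FFT window, hop length, and number of Mel bands) are illustrative placeholders, not the exact settings used in this project.

```python
import librosa

# Illustrative audio parameters (assumed, not the project's exact values).
SR = 22050        # sample rate
N_FFT = 1024      # FFT window size
HOP_LENGTH = 256  # hop between analysis frames
N_MELS = 80       # number of Mel bands

def audio_to_mel(path):
    """Load a waveform and convert it to a log-scaled Mel-spectrogram."""
    wav, _ = librosa.load(path, sr=SR)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=SR, n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=N_MELS)
    return librosa.power_to_db(mel)  # shape: (N_MELS, frames)

def mel_to_audio(log_mel):
    """Invert a log Mel-spectrogram back to a waveform with Griffin-Lim."""
    mel = librosa.db_to_power(log_mel)
    return librosa.feature.inverse.mel_to_audio(
        mel, sr=SR, n_fft=N_FFT, hop_length=HOP_LENGTH)
```

Griffin-Lim estimates the missing phase iteratively, which is simple but can leave audible artifacts; swapping `mel_to_audio` for a neural vocoder is the higher-quality path.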
The model architecture includes convolutional and recurrent neural networks, particularly leveraging:
- A Text Encoder for linguistic features (a code sketch follows this list)
- A Feature Extractor for Mel-spectrograms
- Concatenation and Convolutional Layers to merge features and produce the final spectrogram
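A minimal Keras sketch of such a text encoder is shown below. The vocabulary size, embedding width, kernel size, and layer counts are assumptions for illustration, not the project's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_text_encoder(vocab_size=64, embed_dim=256):
    """Map a sequence of character IDs to a sequence of linguistic features."""
    char_ids = layers.Input(shape=(None,), dtype="int32", name="char_ids")
    x = layers.Embedding(vocab_size, embed_dim)(char_ids)
    for _ in range(3):  # stack of 1-D convolutions over the character axis
        x = layers.Conv1D(embed_dim, kernel_size=5,
                          padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
    # Bi-directional LSTM summarises left and right context at each position.
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    return tf.keras.Model(char_ids, x, name="text_encoder")
```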
To run this project, the following dependencies are required:
- Python 3.8+
- TensorFlow
- Keras
- Librosa
- NumPy
- Matplotlib
The model is trained on audio recordings paired with their text transcripts.
The model architecture follows a three-stage approach:
- Text Encoder: Encodes the input text with 1-D convolutional layers and a bi-directional LSTM to capture linguistic features.
- Feature Extraction from Mel-Spectrogram: Converts the audio input to Mel-spectrograms using parameters such as the sample rate, FFT window size, and number of Mel bands.
- Concatenation and Audio Generation: Combines the text and audio features, generating Mel-spectrograms that are then converted to audio using Griffin-Lim or a vocoder (see the sketch after this list).
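To make the concatenation stage concrete, the sketch below shows one way the two feature streams could be merged and decoded into Mel frames. It assumes both streams have already been aligned to the same number of time frames, and all layer sizes are illustrative placeholders rather than the project's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

N_MELS = 80  # number of Mel bands, matching the audio sketch above (assumed)

def build_synthesizer(text_dim=256, n_mels=N_MELS):
    """Concatenate text and audio features and decode them into Mel frames."""
    text_feats = layers.Input(shape=(None, text_dim), name="text_features")
    mel_feats = layers.Input(shape=(None, n_mels), name="mel_features")
    # Merge the two streams along the feature (channel) axis.
    x = layers.Concatenate(axis=-1)([text_feats, mel_feats])
    for filters in (256, 256):  # convolutional decoder layers
        x = layers.Conv1D(filters, kernel_size=5,
                          padding="same", activation="relu")(x)
    mel_out = layers.Conv1D(n_mels, kernel_size=1, name="mel_out")(x)
    return tf.keras.Model([text_feats, mel_feats], mel_out, name="synthesizer")
```

The predicted Mel-spectrogram can then be turned into a waveform with the Griffin-Lim routine from the audio sketch above, or with a neural vocoder for higher quality.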
Training uses the Adam optimizer, with Mean Squared Error (MSE) and Mean Absolute Error (MAE) as the metrics for evaluating model performance.
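A minimal sketch of that training setup is given below. The learning rate, epoch count, and the `train_ds`/`val_ds` dataset objects are hypothetical placeholders rather than the project's actual settings.

```python
import tensorflow as tf

# Reuses build_synthesizer from the architecture sketch above.
model = build_synthesizer()
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # assumed learning rate
    loss="mse",       # MSE between predicted and target Mel frames
    metrics=["mae"],  # MAE tracked as an additional metric
)
# model.fit(train_ds, validation_data=val_ds, epochs=100)  # placeholder datasets
```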
The model achieved a Mean Opinion Score (MOS) of 3.2 ± 0.2, demonstrating close-to-human naturalness for single-speaker synthesis. Improvements in model performance can be achieved with increased computing resources and larger datasets.
Current limitations:
- Limited speaker generalization due to the small, single-speaker dataset.
- Noise in the output caused by errors in phase reconstruction (e.g., from Griffin-Lim).
Future improvements could include:
- Training on a larger, multi-speaker dataset to enhance generalization.
- Employing state-of-the-art neural vocoders for higher audio quality.
- Optimizing the architecture to reduce computational load and improve synthesis speed.
This project builds upon the following works: