This project was developed as part of the Techolution AI Done Right Hackathon. This repository contains the code and instructions for creating a near real-time digital twin using advanced voice cloning and realistic video generation techniques. Our solution aims to tackle the challenge of generating a digital clone of a person, combining their voice, expressions, and speech in a lifelike manner.
The challenge presented in the hackathon involves creating AI models capable of the following:
- Advanced Neural Architectures: Utilizing state-of-the-art deep learning techniques, including recurrent neural networks (RNNs), convolutional neural networks (CNNs), and generative adversarial networks (GANs), for voice cloning and spoof video generation.
- Expressiveness: Developing models that can faithfully capture a wide range of emotions, accents, and speaking styles, allowing for expressive voice cloning and fluid video generation from a 2D image.
- Naturalness: Ensuring that the generated voice clones sound natural and human-like, while focusing on accurate lip sync and realistic video corresponding to the cloned audio.
- Robustness: Enhancing the robustness of the AI models to perform well with limited training data and in challenging acoustic environments. For video, the goal is to minimize traces of fake elements.
- Real-Time Nature: Creating an ensemble of voice cloning and spoof video generation models that operate in near real-time, making them suitable for conversational AI applications.
We approached this problem by leveraging two separate components:
- Voice Cloning and Text-to-Speech (TTS): We use the Tortoise-TTS repository for both voice cloning and generating speech from a text prompt. This component lets users upload a sample audio file for voice cloning and specify a text prompt to be spoken in the cloned voice.
- Realistic Video Generation: For generating realistic lip-synced video, we use the SadTalker repository. It takes an input image and the audio file produced in the voice-cloning step, and generates a lip-synced video. A minimal usage sketch of both components follows this list.
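The sketch below shows how the two stages could be chained, assuming the Tortoise-TTS Python API and SadTalker's `inference.py` CLI as documented in their upstream repositories. File paths, the chosen preset, and the exact CLI flags are illustrative assumptions and may need adjusting to the versions used in the notebooks.

```python
# Sketch of the two-stage pipeline (assumptions: Tortoise-TTS Python API, SadTalker CLI).
import subprocess
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

# Stage 1: clone the voice from a short reference clip and speak the text prompt.
tts = TextToSpeech()
reference_clips = [load_audio("sample.wav", 22050)]   # 10-15 s reference recording
speech = tts.tts_with_preset(
    "Hello, this is my digital twin speaking.",
    voice_samples=reference_clips,
    preset="fast",                                     # trades some quality for latency
)
torchaudio.save("cloned.wav", speech.squeeze(0).cpu(), 24000)

# Stage 2: drive SadTalker with the cloned audio and a single face image.
subprocess.run(
    [
        "python", "inference.py",                      # run from the SadTalker repo root
        "--driven_audio", "cloned.wav",
        "--source_image", "face.png",
        "--result_dir", "results",
    ],
    check=True,
)
```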
To accommodate the processing requirements and avoid crashes, we ran each component in its own Google Colab instance. Each instance exposes its model through a Flask API, and ngrok provides publicly accessible URLs that the Streamlit application can call. A minimal sketch of this setup follows.
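This sketch shows how a Colab notebook could expose its model behind a public URL with Flask and pyngrok. The endpoint name `/clone`, the `run_tortoise` helper, and the request field names are hypothetical illustrations, not the exact code in `TorTTS_API.ipynb` or `Vid_API.ipynb`.

```python
# Assumed setup: Flask serves the model, pyngrok tunnels the local port to a public URL.
from flask import Flask, request, send_file
from pyngrok import ngrok

app = Flask(__name__)

@app.route("/clone", methods=["POST"])
def clone():
    sample = request.files["voice_sample"]        # uploaded .wav reference clip
    text = request.form["text"]                   # text prompt to synthesize
    sample.save("sample.wav")
    out_path = run_tortoise("sample.wav", text)   # hypothetical helper wrapping Tortoise-TTS
    return send_file(out_path, mimetype="audio/wav")

public_url = ngrok.connect(5000).public_url       # paste this URL into app.py
print("ngrok URL:", public_url)
app.run(port=5000)
```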
To run the complete system, follow these steps:
1. Upload `TorTTS_API.ipynb` to one Colab instance and `Vid_API.ipynb` to another Colab instance.
2. Configure the ngrok APIs in both instances.
3. Enter the ngrok URLs generated in step 2 into the `app.py` file (a hypothetical sketch of this wiring follows the list).
4. Run the Streamlit application using the command: `streamlit run app.py`.
5. In the Streamlit application, users can perform the following steps:
   - Upload a sample audio file (.wav) of 10 to 15 seconds for voice cloning.
   - Specify a text prompt for generating speech.
   - Upload an image (.png) of the person whose voice is being cloned; this image drives the video generation.
   - The system will generate a lip-synced video from the provided audio and image.
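The snippet below is a hypothetical sketch of how `app.py` could wire the two ngrok endpoints together. The URL placeholders, endpoint paths, and field names are assumptions; replace them with the URLs printed by your Colab instances and the routes actually defined there.

```python
# Assumed wiring between the Streamlit front end and the two Colab-hosted APIs.
import requests
import streamlit as st

TTS_URL = "https://<tortoise-ngrok-id>.ngrok.io/clone"      # from the TorTTS_API.ipynb instance
VIDEO_URL = "https://<sadtalker-ngrok-id>.ngrok.io/animate"  # from the Vid_API.ipynb instance

st.title("Near Real-Time Digital Twin")
voice_sample = st.file_uploader("Voice sample (.wav, 10-15 s)", type=["wav"])
text_prompt = st.text_input("Text to be spoken")
face_image = st.file_uploader("Face image (.png)", type=["png"])

if st.button("Generate") and voice_sample and text_prompt and face_image:
    # Step 1: clone the voice and synthesize the prompt on the TTS instance.
    audio = requests.post(
        TTS_URL,
        files={"voice_sample": voice_sample},
        data={"text": text_prompt},
    ).content

    # Step 2: send the synthesized audio plus the face image to the video instance.
    video = requests.post(
        VIDEO_URL,
        files={"audio": ("cloned.wav", audio, "audio/wav"),
               "image": face_image},
    ).content

    st.video(video)
```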
Demo video: `Demo.mp4`
- If you have access to powerful GPUs, you can combine `TorTTS_API.ipynb` and `Vid_API.ipynb` into a single Colab instance, or run them on your local machine, for faster processing.
- Note that the Streamlit application is designed for local use, so users will need to clone this repository and run the application on their own machines.