The Synthetic Data Pipeline processes large raw-text datasets into trainable sound tokens. It leverages:
- WhisperSpeech to convert text into speech.
- Encodec to convert that speech into trainable sound tokens.
Because WhisperSpeech and Encodec do not support batch processing, the pipeline processes data in parallel across multiple GPUs, with multiple processes per GPU.
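A minimal sketch of that fan-out, using hypothetical names (`run_parallel`, `worker_fn`); the actual orchestration lives in `synthetic_data_pipeline.py`:

```python
import torch.multiprocessing as mp

def run_parallel(chunks, devices, worker_fn):
    """Fan dataset chunks out across GPUs, several processes per GPU.

    The dataset would be split into len(devices) * num_procs_per_device
    chunks, since neither the TTS nor the tokenization step is batched.
    """
    mp.set_start_method("spawn", force=True)  # required for CUDA workers
    procs = []
    for i, chunk in enumerate(chunks):
        device = devices[i % len(devices)]  # round-robin over GPUs
        p = mp.Process(target=worker_fn, args=(chunk, f"cuda:{device}"))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()  # wait for every worker to finish
```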
Key features:
- Multi-GPU processing with multiple processes per GPU
- Text-to-speech conversion using WhisperSpeech
- Audio tokenization using EncodecModel
- CSV output for easy integration and recovery
- Configurable batch processing
- Progress tracking and logging
Hardware requirements:
- Multiple CUDA-capable GPUs (tested with 4 GPUs)
- Sufficient RAM to handle large datasets
- Fast storage for input/output operations
Key Python dependencies (the full list is in `requirements.txt`):
- torch
- pyarrow
- datasets
- WhisperSpeech
- encodec
To set up the pipeline:
- Move to the directory:

  ```bash
  cd synthetic-data
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
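As a quick sanity check (not part of the pipeline itself), you can confirm that PyTorch sees the GPUs your configuration will reference:

```python
import torch

# The pipeline expects one or more CUDA devices (e.g. devices: [0, 1, 2, 3]).
assert torch.cuda.is_available(), "No CUDA device found"
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} -> {torch.cuda.get_device_name(i)}")
```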
The project is laid out as follows:

```
synthetic-data-pipeline/
├── synthetic_data_pipeline.py
├── audio_tokenizer.py
├── tts_processor.py
├── requirements.txt
├── README.md
└── <save_dir>/
    ├── audio_tokens_*.csv
    └── failed_indices_*.json
```
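The per-process CSV shards and failure logs in `<save_dir>` can be combined after a run. A minimal sketch, assuming pandas is available and `save_dir` matches your configuration:

```python
import glob
import json
import pandas as pd

save_dir = "output"  # hypothetical; use the save_dir from your config

# Each worker process writes its own audio_tokens_*.csv shard.
shards = glob.glob(f"{save_dir}/audio_tokens_*.csv")
tokens = pd.concat((pd.read_csv(p) for p in shards), ignore_index=True)
print(f"Loaded {len(tokens)} tokenized samples from {len(shards)} shards")

# Failed sample indices are recorded per process for recovery.
failed = []
for path in glob.glob(f"{save_dir}/failed_indices_*.json"):
    with open(path) as f:
        failed.extend(json.load(f))
print(f"{len(failed)} samples failed and can be re-queued")
```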
Edit the `synthetic_generation_cfg.yaml` file to configure the following parameters:
```yaml
# Dataset configuration
dataset:
  name: # Dataset name from Hugging Face
  split: # Dataset split to use
  remaining_indices_file: # File to store remaining indices (List[int])

# Processing configuration
processing:
  devices: # List of GPUs to use
  num_procs_per_device: # Number of processes per GPU
  save_dir: # Directory to save processed data
  save_batch_size: # Batch size for saving processed data, per process
  max_retries: # Maximum number of retries for processing a sample
  sample_rate: # Sample rate for audio data
  speaker: # Speaker class to use for speech generation
  format: # Format for saving processed data

test_mode: # Whether to run in test mode
test:
  num_samples: # Number of samples to process
  devices: # List of GPUs to use
  num_procs_per_device: # Number of processes per GPU
  save_dir: # Directory to save processed data
  save_batch_size: # Batch size for saving processed data, per process
  max_retries: # Maximum number of retries for processing a sample
  sample_rate: # Sample rate for audio data
  speaker: # Speaker class to use for speech generation

# Logging configuration
logging:
  log_file: # Log file name
  console_level: # Console log level
  file_level: # File log level

upload_to_s3: # Whether to upload processed data to S3
s3:
  save_dir: # Directory with the processed data to upload
  bucket_name: # S3 bucket name
  s3_folder: # S3 folder name
```
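For illustration, a filled-in configuration might look like this; the dataset name, paths, and bucket below are placeholders, not values from this project:

```yaml
dataset:
  name: wikitext            # placeholder Hugging Face dataset
  split: train
  remaining_indices_file: remaining_indices.json

processing:
  devices: [0, 1, 2, 3]
  num_procs_per_device: 2
  save_dir: ./output
  save_batch_size: 100
  max_retries: 3
  sample_rate: 24000        # matches Encodec's 24 kHz model
  speaker: default_speaker
  format: csv

test_mode: false
test:
  num_samples: 100
  devices: [0]
  num_procs_per_device: 1
  save_dir: ./test_output
  save_batch_size: 10
  max_retries: 3
  sample_rate: 24000
  speaker: default_speaker

logging:
  log_file: pipeline.log
  console_level: INFO
  file_level: DEBUG

upload_to_s3: false
s3:
  save_dir: ./output
  bucket_name: my-bucket    # placeholder
  s3_folder: synthetic-data
```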
The currently supported speakers for text-to-speech conversion are:
- default_speaker
- speaker_5304
The currently supported formats for saving processed data are:
- parquet
- csv
To run the full pipeline:
```bash
python synthetic_data_pipeline.py --config_file synthetic_generation_cfg.yaml
```
To run the pipeline in test mode with a smaller dataset, modify the `test_mode` parameter in the configuration file:

```yaml
...
test_mode: true
...
```
The `TTSProcessor` class uses WhisperSpeech to convert text prompts into audio waveforms. It is initialized with a specific device (GPU) and can generate audio from text inputs.
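For reference, WhisperSpeech can also be driven directly; a minimal sketch following the WhisperSpeech README (the checkpoint reference is an assumption here, not this project's choice, and the exact wiring is in `tts_processor.py`):

```python
from whisperspeech.pipeline import Pipeline

# Build a WhisperSpeech pipeline with an assumed checkpoint reference.
pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')

# Generate a waveform tensor from text...
audio = pipe.generate("Hello, this is a synthetic sample.")

# ...or write the result straight to a file.
pipe.generate_to_file("sample.wav", "Hello, this is a synthetic sample.")
```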
The `AudioTokenizer` class uses `EncodecModel` to process audio waveforms into tokenized representations. It can also decode audio tokens back into waveforms if needed.
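The underlying calls look roughly like this; a sketch of the public `encodec` API (assuming torchaudio is installed), not the exact `audio_tokenizer.py` code:

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load the 24 kHz Encodec model; the bandwidth controls how many
# codebooks are used per frame.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# Resample/remix a waveform to the model's expected format.
wav, sr = torchaudio.load("sample.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

# Encode to discrete tokens: a [batch, n_codebooks, n_frames] tensor.
with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))
codes = torch.cat([codebook for codebook, _ in frames], dim=-1)

# Tokens can be decoded back to a waveform if needed.
with torch.no_grad():
    reconstructed = model.decode(frames)
```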
The overall processing flow is shown below:

```mermaid
graph TD
    A[Load Dataset] --> B[Split into Chunks]
    B --> C[Process Chunk]
    subgraph PROC ["Process x N"]
        C --> D[Text to Speech Conversion]
        D --> E[Audio Tokenization]
        E --> F[Save to CSV]
        C --> G[Error Handling]
        G -->|Retry| D
        G -->|Max Retries Exceeded| H[Log Failed Indices]
    end
    F --> I[Combine Results]
    H --> J[Aggregate Failed Indices]
    classDef PROC padding-left:23em;
    class PROC PROC;
```
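The per-process loop sketched in the diagram amounts to roughly the following; the helper names here are hypothetical, and the real logic lives in `synthetic_data_pipeline.py`:

```python
import csv
import json

def process_chunk(chunk, tts, tokenizer, save_path, max_retries=3):
    """Convert each sample's text to speech, tokenize it, and save it;
    samples that keep failing are recorded for later recovery."""
    failed_indices = []
    with open(save_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["index", "tokens"])
        for index, text in chunk:  # chunk: list of (dataset_index, prompt)
            for attempt in range(max_retries):
                try:
                    audio = tts.generate(text)        # text -> waveform
                    tokens = tokenizer.encode(audio)  # waveform -> tokens
                    writer.writerow([index, json.dumps(tokens)])
                    break
                except Exception:
                    if attempt == max_retries - 1:
                        failed_indices.append(index)  # max retries exceeded
    return failed_indices
```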
This pipeline is built on top of the WhisperSpeech and Encodec libraries. We would like to thank the developers of these libraries for their contributions to the field of audio processing.