An advanced Automatic Speech Recognition (ASR) system for Chinese (Traditional) and Taiwanese, leveraging the power of OpenAI's Whisper model. This project supports full fine-tuning, Parameter-Efficient Fine-Tuning (PEFT), and streaming inference, optimized for T4 GPUs.
- 🎙️ Fine-tuning of Whisper models on Chinese/Taiwanese data
- 🚀 PEFT methods support (e.g., LoRA) for efficient fine-tuning
- 🔄 Batch and streaming inference capabilities
- 🖥️ User-friendly Gradio web interface
- ⚡ Optimized performance on T4 GPUs
ChineseTaiwaneseWhisper/
├── scripts/
│ ├── gradio_interface.py
│ ├── infer.py
│ └── train.py
├── src/
│ ├── config/
│ ├── crawler/
│ ├── data/
│ ├── models/
│ ├── trainers/
│ └── inference/
├── tests/
├── requirements.txt
├── setup.py
└── README.md
- Clone the repository:
  git clone https://github.com/sandy1990418/ChineseTaiwaneseWhisper.git
  cd ChineseTaiwaneseWhisper
- Set up a virtual environment:
  python -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
- Install dependencies:
  pip install -r requirements.txt
python scripts/train.py --model_name_or_path "openai/whisper-small" \
--language "chinese" \
--dataset_name "common_voice_13_train" \
--youtube_data_dir "./youtube_data" \
--output_dir "./whisper-finetuned-zh-tw" \
--num_train_epochs 3 \
--per_device_train_batch_size 16 \
--learning_rate 1e-5 \
--fp16 \
                       --use_timestamps False
python scripts/train.py --model_name_or_path "openai/whisper-small" \
--language "chinese" \
--use_peft \
--peft_method "lora" \
                       --dataset_name "common_voice_13_train, YOUR_CUSTOM_DATASET" \
--output_dir "Checkpoint_Path" \
--num_train_epochs 10 \
--per_device_train_batch_size 4 \
--learning_rate 1e-5 \
                       --fp16 \
                       --use_timestamps True
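As a rough illustration of what the `--use_peft --peft_method "lora"` path does, the sketch below wraps Whisper with LoRA adapters via the Hugging Face `peft` library. The rank, alpha, and target modules here are illustrative assumptions; the values actually used by this project live in `src/config/train_config.py`.

```python
# Minimal sketch of wrapping Whisper with LoRA via the `peft` library.
# Hyperparameters (r, alpha, target modules) are illustrative assumptions,
# not necessarily the values used in src/config/train_config.py.
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

lora_config = LoraConfig(
    r=8,                                   # low-rank dimension
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections in Whisper
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only a small fraction is trainable
```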
Argument | Description | Default |
---|---|---|
`--model_name_or_path` | Path or name of the pre-trained model | Required |
`--language` | Language for fine-tuning (e.g., "chinese", "taiwanese") | Required |
`--dataset_name` | Name of the dataset to use | Required |
`--dataset_config_names` | Configuration name for the dataset | Required |
`--youtube_data_dir` | Directory containing YouTube data | Optional |
`--output_dir` | Directory to save the fine-tuned model | Required |
`--num_train_epochs` | Number of training epochs | 3 |
`--per_device_train_batch_size` | Batch size per GPU/CPU for training | 16 |
`--learning_rate` | Initial learning rate | 3e-5 |
`--fp16` | Use mixed precision training | False |
`--use_timestamps` | Include timestamp information in training | False |
`--use_peft` | Use Parameter-Efficient Fine-Tuning | False |
`--peft_method` | PEFT method to use (e.g., "lora") | None |
Launch the interactive web interface:
python scripts/gradio_interface.py
Access the interface at http://127.0.0.1:7860 (the default URL).
Note: For streaming mode, use Chrome instead of Safari to avoid CPU memory issues.
python scripts/infer.py --model_path openai/whisper-small \
--audio_files audio.wav \
--mode batch \
--use_timestamps False
Argument | Description | Default |
---|---|---|
`--model_path` | Path to the fine-tuned model | Required |
`--audio_files` | Path(s) to audio file(s) for transcription | Required |
`--mode` | Inference mode ("batch" or "stream") | "batch" |
`--use_timestamps` | Include timestamps in transcription | False |
`--device` | Device to use for inference (e.g., "cuda", "cpu") | "cuda" if available, else "cpu" |
`--output_dir` | Directory to save transcription results | "output" |
`--use_peft` | Use PEFT model for inference | False |
`--language` | Language of the audio (e.g., "chinese", "taiwanese") | "chinese" |
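If you prefer to call a fine-tuned checkpoint directly from Python instead of `scripts/infer.py`, the snippet below is a minimal sketch using the 🤗 Transformers `pipeline` API; the checkpoint path, chunk length, and language settings are assumptions to adapt to your setup.

```python
# Minimal sketch of batch transcription with the 🤗 Transformers pipeline.
# The checkpoint path, chunk length, and language are assumptions to adapt.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="./whisper-finetuned-zh-tw",      # or "openai/whisper-small"
    device=0 if torch.cuda.is_available() else -1,
    chunk_length_s=30,                      # Whisper operates on 30 s windows
)

results = asr(
    ["audio1.wav", "audio2.wav"],           # one or more audio files
    generate_kwargs={"language": "chinese", "task": "transcribe"},
    return_timestamps=True,                 # set False to drop timestamps
)
print(results)
```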
Collect YouTube data:
python src/crawler/youtube_crawler.py \
--playlist_urls "YOUTUBE_PLAYLIST_URL" \
--output_dir ./youtube_data \
--dataset_name youtube_asr_dataset \
--file_prefix language_prefix
Argument | Description | Default |
---|---|---|
`--playlist_urls` | YouTube playlist URL(s) to crawl | Required |
`--output_dir` | Directory to save audio files and dataset | "./output" |
`--dataset_name` | Name of the output dataset file | "youtube_dataset" |
`--file_prefix` | Prefix for audio and subtitle files | "youtube" |
- 📊 Use different datasets by modifying the `dataset_name` parameter
- 🛠️ Adjust PEFT methods via `peft_method` and the configurations in `src/config/train_config.py`
- 🔬 Optimize inference by modifying `ChineseTaiwaneseASRInference` in `src/inference/flexible_inference.py`
Run tests with pytest:
pytest tests/
For detailed output:
pytest -v tests/
Check test coverage:
pip install pytest-cov
pytest --cov=src tests/
On a T4 GPU, without any acceleration methods:
- Inference speed: 1:24 (1 minute of processing time can transcribe 24 minutes of audio)
This baseline gives you an idea of the default performance. Depending on your specific needs, you may want to optimize further or use acceleration techniques.
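To reproduce a number like this on your own hardware, you can measure the real-time factor directly. The helper below is a rough sketch (not part of the repo) that assumes `soundfile` is installed and that `asr` is any callable transcriber, such as a Transformers pipeline.

```python
# Rough real-time-factor benchmark (a sketch, not part of the repo).
# Assumes `soundfile` is installed and `asr` is any callable transcriber,
# e.g. a Transformers pipeline.
import time
import soundfile as sf

def real_time_factor(asr, audio_path: str) -> float:
    info = sf.info(audio_path)
    audio_seconds = info.frames / info.samplerate

    start = time.perf_counter()
    asr(audio_path)                          # run one full transcription
    elapsed = time.perf_counter() - start

    # 24.0 would mean 1 minute of compute transcribes ~24 minutes of audio
    return audio_seconds / elapsed
```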
To address memory issues or improve performance on T4 GPUs:
- 📉 Reduce batch size (`--per_device_train_batch_size`)
  - Decreases memory usage but may increase processing time
- 🔽 Use a smaller Whisper model (e.g., "openai/whisper-tiny")
  - Faster inference but potentially lower accuracy
- 📈 Increase gradient accumulation steps (`--gradient_accumulation_steps`)
  - Simulates larger batch sizes without increasing memory usage
- 🔀 Enable mixed precision training (`--fp16`)
  - Speeds up computation and reduces memory usage with minimal impact on accuracy
For further performance improvements:
- 🚀 Use PEFT methods like LoRA
  - Significantly reduces memory usage and training time
- ⚡ Implement quantization (e.g., int8)
  - Dramatically reduces model size and increases inference speed
- 🖥️ Utilize multi-GPU setups if available
  - Distributes computation for faster processing
Note: The actual performance may vary depending on your specific hardware, audio complexity, and chosen optimization techniques. Always benchmark your specific use case.
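As a concrete illustration of the precision-related tips above, the sketch below loads Whisper in fp16 and, optionally, in int8. It is not tied to this repo's scripts; the int8 path assumes `bitsandbytes` and `accelerate` are installed.

```python
# Sketch: loading Whisper in reduced precision to cut memory use on a T4.
# Not tied to this repo's scripts; the int8 path assumes `bitsandbytes`
# and `accelerate` are installed.
import torch
from transformers import BitsAndBytesConfig, WhisperForConditionalGeneration

# fp16: roughly halves GPU memory with minimal accuracy impact
model_fp16 = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-small",
    torch_dtype=torch.float16,
).to("cuda")

# int8 quantization: smaller still, at some accuracy/latency trade-off
model_int8 = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-small",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```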
- Set Up: We prepare our system to listen and transcribe.
- Listen: We constantly listen for incoming audio.
- Check: When we get audio, we check if it contains speech.
- Process:
  - If there's speech, we transcribe it.
  - If not, we skip that part.
- Share: We immediately share what we found, whether it's words or silence.
- Repeat: We keep doing this until there's no more audio.
- Finish: When the audio ends, we wrap everything up and provide the final transcript.
graph TD
A[Start] --> B[Set Up System]
B --> C{Listen for Audio}
C -->|Audio Received| D[Check for Speech]
D -->|Speech Found| E[Transcribe Audio]
D -->|No Speech| F[Skip Transcription]
E --> G[Output Result]
F --> G
G --> C
C -->|No More Audio| H[Finish Up]
H --> I[End]
graph TD
A[Start] --> B[Initialize Audio Stream]
B --> C[Initialize ASR Model]
C --> D[Initialize VAD Model]
D --> E[Initialize Audio Buffer]
E --> F[Initialize ThreadPoolExecutor]
F --> G{Receive Audio Chunk}
G -->|Yes| H[Add to Audio Buffer]
H --> I{Buffer Full?}
I -->|No| G
I -->|Yes| J[Submit Chunk to ThreadPool]
J --> K[Apply VAD]
K --> L{Speech Detected?}
L -->|No| O[Slide Buffer]
L -->|Yes| M[Process Audio Chunk]
M --> N[Generate Partial Transcription]
N --> O
O --> G
G -->|No| P[Process Remaining Audio]
P --> Q[Finalize Transcription]
Q --> R[End]
subgraph "Parallel Processing"
J
K
L
M
N
end
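The loop sketched below mirrors the diagrams above in plain Python. The function names `vad_has_speech` and `transcribe_chunk`, the chunk size, and the simple buffer handling are illustrative assumptions, not the actual API of `ChineseTaiwaneseASRInference` in `src/inference/flexible_inference.py`.

```python
# Illustrative sketch of the streaming loop shown in the diagrams above.
# `vad_has_speech` and `transcribe_chunk` stand in for the real VAD and
# ASR calls; names, chunk size, and buffer handling are assumptions.
import numpy as np

CHUNK_SECONDS = 5
SAMPLE_RATE = 16_000
BUFFER_SIZE = CHUNK_SECONDS * SAMPLE_RATE

def stream_transcribe(audio_chunks, vad_has_speech, transcribe_chunk):
    buffer = np.zeros(0, dtype=np.float32)
    for chunk in audio_chunks:                      # "Listen"
        buffer = np.concatenate([buffer, chunk])
        if len(buffer) < BUFFER_SIZE:               # "Buffer full?"
            continue
        window, buffer = buffer[:BUFFER_SIZE], buffer[BUFFER_SIZE:]
        if vad_has_speech(window):                  # "Check for speech"
            yield transcribe_chunk(window)          # "Transcribe" and "Share"
        else:
            yield ""                                # silence: nothing to report
    if len(buffer) and vad_has_speech(buffer):      # "Finish up"
        yield transcribe_chunk(buffer)
```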
The Chinese/Taiwanese Whisper ASR project uses a specific format for its datasets to ensure compatibility with the training and inference scripts. The format can include or exclude timestamps, depending on the configuration.
Each item in the dataset represents an audio file and its corresponding transcription:
{
"audio": {
"path": "path/to/audio/file.wav",
"sampling_rate": 16000
},
"sentence": "The transcription of the audio in Chinese or Taiwanese.",
"language": "zh-TW", # or "zh-CN" for Mandarin, "nan" for Taiwanese, etc.
"duration": 10.5 # Duration of the audio in seconds
}
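A dataset of such items can be loaded with 🤗 Datasets roughly as follows; the file name `my_dataset.json` is a placeholder, and the audio files referenced by `path` must exist on disk.

```python
# Sketch: loading items in the format above with 🤗 Datasets.
# "my_dataset.json" is a placeholder; the referenced audio files must exist.
from datasets import Audio, load_dataset

ds = load_dataset("json", data_files="my_dataset.json", split="train")

# Keep only the file path, then let the Audio feature decode and resample
# the waveform to the 16 kHz that Whisper expects.
ds = ds.map(lambda ex: {"audio_path": ex["audio"]["path"]}, remove_columns=["audio"])
ds = ds.cast_column("audio_path", Audio(sampling_rate=16_000))

print(ds[0]["sentence"])
print(ds[0]["audio_path"]["array"].shape, ds[0]["audio_path"]["sampling_rate"])
```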
Use `dataset_info.json` to specify which dataset to use. Its structure is shown below:
{
"common_voice_13_train": {
"hf_hub_url": "mozilla-foundation/common_voice_13_0",
"columns": {
"audio": "audio",
"target": "sentence",
"language": "chinese"
},
"dataset_kwargs": {
"split": "train"
},
"dataset_args": [
"zh-TW"
]
},
# If you have a custom dataset that you want to train, please use the following format in the dataset_info.json
"YOUR_CUSTOM_DATASET": {
"file_name": "YOUR_CUSTOM_DATASET.json",
"columns": {
"audio": "audio_path",
"target": "timestamp", # change to `sentence` if you want to train a model without timestamp
"language": "YOUR_DATASET_LANGUEGE" # please check which languages are be used in Huggingface.
},
"dataset_kwargs": {
"split": "train"
}
},
}
labels:
<|startoftranscript|><|zh|><|transcribe|><|notimestamps|>地圖炮<|endoftext|>
In this example:
- `<|startoftranscript|>`: Marks the beginning of the transcription
- `<|zh|>`: Indicates the language (Chinese)
- `<|transcribe|>`: Denotes that this is a transcription task
- `<|notimestamps|>`: Indicates that no timestamps are included
- `地圖炮`: The actual transcription
- `<|endoftext|>`: Marks the end of the transcription
labels:
<|startoftranscript|><|zh|><|transcribe|><|0.00|>而對樓市成交抑制作用最大的限購<|6.00|><|endoftext|>
In this example:
- `<|startoftranscript|>`, `<|zh|>`, and `<|transcribe|>`: Same as above
- `<|0.00|>`: Timestamp indicating the start of the transcription (0 seconds)
- `而對樓市成交抑制作用最大的限購`: The actual transcription
- `<|6.00|>`: Timestamp indicating the end of the transcription (6 seconds)
- `<|endoftext|>`: Marks the end of the transcription
- The choice between using timestamps or not should be consistent throughout your dataset and should match the `use_timestamps` parameter in your training and inference scripts.
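For reference, label strings in exactly this shape are what the Whisper tokenizer builds around a transcription when configured for Chinese without timestamps. The sketch below is illustrative; the model name and language settings are assumptions.

```python
# Sketch: the label strings above are what WhisperTokenizer builds around
# a transcription when configured for Chinese without timestamps.
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-small", language="chinese", task="transcribe"
)

label_ids = tokenizer("地圖炮").input_ids
print(tokenizer.decode(label_ids, skip_special_tokens=False))
# Roughly: <|startoftranscript|><|zh|><|transcribe|><|notimestamps|>地圖炮<|endoftext|>
```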
If you're preparing your own dataset:
- Organize your audio files and transcriptions.
- Ensure each transcription includes the appropriate tokens (`<|startoftranscript|>`, `<|zh|>`, etc.).
- If using timestamps, include them in the format `<|seconds.decimals|>` before each segment of transcription.
- Use `<|notimestamps|>` if not including timestamp information.
- Always end the transcription with `<|endoftext|>`.
By following this format, you ensure that your dataset is compatible with the Chinese/Taiwanese Whisper ASR system, allowing for efficient training and accurate inference.
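If you script this preparation, a minimal sketch for writing custom entries in the item format shown above might look like the following; the file name and field values are placeholders.

```python
# Sketch: writing custom entries in the item format shown above.
# The file name and field values are placeholders.
import json

entries = [
    {
        "audio": {"path": "clips/sample_0001.wav", "sampling_rate": 16000},
        "sentence": "而對樓市成交抑制作用最大的限購",
        "language": "zh-TW",
        "duration": 6.0,
    },
]

with open("YOUR_CUSTOM_DATASET.json", "w", encoding="utf-8") as f:
    json.dump(entries, f, ensure_ascii=False, indent=2)
```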
- Development Mode:
  fastapi dev api_main.py
- Production Mode:
  fastapi run api_main.py

The API will be accessible at http://0.0.0.0:8000 by default.
- Build and start the Docker container:
bash app/docker.sh
Access the Swagger UI documentation at http://localhost:8000/docs when the server is running.
- Health Check:
  curl -k http://localhost:8000/health
- Transcribe Audio:
  curl -k -X POST -H "Content-Type: multipart/form-data" -F "file=@/path/to/your/audio/file.wav" http://localhost:8000/transcribe
  Replace `/path/to/your/audio/file.wav` with the actual path to your audio file.
- List All Transcriptions:
  curl -k http://localhost:8000/transcriptions
- Get a Specific Transcription:
  curl -k http://localhost:8000/transcription/{transcription_id}
  Replace `{transcription_id}` with the actual UUID of the transcription.
- Delete a Transcription:
  curl -k -X DELETE http://localhost:8000/transcription/{transcription_id}
  Replace `{transcription_id}` with the UUID of the transcription you want to delete.
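The same endpoints can also be exercised from Python with `requests`; the base URL and file path below are placeholders for your deployment.

```python
# Sketch: calling the same endpoints as the curl examples with `requests`.
# The base URL and file path are placeholders for your deployment.
import requests

BASE_URL = "http://localhost:8000"

# Health check
print(requests.get(f"{BASE_URL}/health").json())

# Upload an audio file for transcription (multipart/form-data, like curl -F)
with open("audio.wav", "rb") as f:
    response = requests.post(f"{BASE_URL}/transcribe", files={"file": f})
print(response.json())

# List all stored transcriptions
print(requests.get(f"{BASE_URL}/transcriptions").json())
```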
This project is licensed under the MIT License. See the LICENSE file for details.
- OpenAI for the Whisper model
- Hugging Face for the Transformers library
- Mozilla Common Voice for the dataset