LingWav2Vec2 is a novel approach to Vietnamese mispronunciation detection that combines a pre-trained wav2vec 2.0 model with a linguistic encoder. The project achieved the top rank in the Vietnamese Mispronunciation Detection (VMD) challenge at VLSP 2023.
- Improve Vietnamese mispronunciation detection and diagnosis (MD&D)
- Address challenges in mispronunciation detection due to limited training data
- Leverage both acoustic and linguistic information for a balanced approach
- Combines wav2vec 2.0 with a linguistic encoder
- Processes raw audio input
- Utilizes canonical phoneme information
- Only 4.3M additional parameters on top of wav2vec 2.0
- Achieved the top rank on the VLSP 2023 private test leaderboard
- F1-score of 59.68%, a 9.72% improvement over the previous state of the art
- Outperformed more complex models (e.g., TextGateContrast) with fewer parameters
- Balanced use of canonical linguistic information (27.63% relative difference in accuracy)
- Leaving the wav2vec 2.0 CNN layers unfrozen (fine-tuning them) yielded the best results
- SpecAugment with specific parameters achieved the best F1-score
- Linguistic Encoder significantly boosted performance
- Explore MD&D-specific data augmentation
- Investigate impact of pitch information on Vietnamese mispronunciation detection
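
For readers unfamiliar with SpecAugment, which the list above names as the augmentation that gave the best F1-score, the following is a minimal, generic sketch of its masking step in plain Python. The function name `spec_augment`, the mask counts, and the mask widths are illustrative assumptions. They are not the parameters used in this project, and this README does not state where in the model the masking is applied.

```python
import random
from typing import List, Optional


def spec_augment(
    spec: List[List[float]],
    num_freq_masks: int = 2,    # hypothetical default, not the project's setting
    max_freq_width: int = 4,    # hypothetical default
    num_time_masks: int = 2,    # hypothetical default
    max_time_width: int = 10,   # hypothetical default
    rng: Optional[random.Random] = None,
) -> List[List[float]]:
    """Return a copy of `spec` (frames x feature bins) with random bands zeroed.

    Frequency masks zero a contiguous range of bins in every frame;
    time masks zero a contiguous range of whole frames.
    """
    if rng is None:
        rng = random.Random()

    out = [row[:] for row in spec]  # copy so the input is left untouched
    n_frames = len(out)
    n_bins = len(out[0]) if out else 0

    # Frequency masking: same bin range is zeroed across all frames.
    for _ in range(num_freq_masks):
        width = rng.randint(0, min(max_freq_width, n_bins))
        start = rng.randint(0, n_bins - width)
        for row in out:
            for b in range(start, start + width):
                row[b] = 0.0

    # Time masking: a run of consecutive frames is zeroed entirely.
    for _ in range(num_time_masks):
        width = rng.randint(0, min(max_time_width, n_frames))
        start = rng.randint(0, n_frames - width)
        for t in range(start, start + width):
            out[t] = [0.0] * n_bins

    return out


# Usage: mask a toy 20-frame, 8-bin feature matrix with a fixed seed.
features = [[1.0] * 8 for _ in range(20)]
augmented = spec_augment(features, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the masking reproducible, which makes it easier to compare augmentation settings across training runs.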
If you use this work, please cite our paper.
For questions or collaborations, please contact:
- Tuan Nguyen, Institute for Infocomm Research (I²R), A*STAR, Singapore (nvatuan3@gmail.com)
- Huy Dat Tran, Institute for Infocomm Research (I²R), A*STAR, Singapore
This work will be presented as a poster at INTERSPEECH 2024.