Seminar:
- CTC and RNN-T Hybrid model: Notebook
-
General:
- depthwise separable convolutions explained, with a beautiful visualization, here for the Conv2d case (a minimal PyTorch sketch follows this list)
- Comparing End-to-End Speech Recognition Architectures in 2021 - a blog post comparing CTC, LAS, and RNN-T models
- Whisper paper - a large-scale ASR model from OpenAI with a sophisticated attention-based decoder, trained on 680k hours of weakly supervised multilingual and multitask data; released in 2022
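
To make the depthwise separable idea concrete, here is a minimal PyTorch sketch (channel sizes are made up for illustration): a depthwise Conv2d with `groups=in_channels`, followed by a 1x1 pointwise Conv2d that mixes channels.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        # Depthwise: each input channel is convolved with its own k x k kernel.
        self.depthwise = nn.Conv2d(
            in_channels, in_channels, kernel_size,
            stride=stride, padding=kernel_size // 2, groups=in_channels,
        )
        # Pointwise: a 1x1 conv mixes information across channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 80, 100)  # (batch, channels, freq, time)
y = DepthwiseSeparableConv2d(32, 64, stride=2)(x)
print(y.shape)  # torch.Size([1, 64, 40, 50])
```

The factorization replaces one k·k·C_in·C_out kernel with k·k·C_in + C_in·C_out parameters, which is where the compute savings come from.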
-
LAS:
-
RNN-T:
- Sequence-to-Sequence Learning with Transducers - a gentle introduction to the RNN-T architecture (a toy sketch of the three components follows this list)
- the original RNN-T paper
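
A toy sketch of the three RNN-T components (all sizes, the single-layer LSTMs, and the omitted start-of-sequence handling are simplifying assumptions): the joiner combines every (time, label) pair of encoder and prediction-network states into a lattice of logits.

```python
import torch
import torch.nn as nn

class TinyTransducer(nn.Module):
    def __init__(self, num_mels=80, vocab_size=29, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(num_mels, hidden, batch_first=True)   # acoustic model
        self.embed = nn.Embedding(vocab_size, hidden)
        self.predictor = nn.LSTM(hidden, hidden, batch_first=True)   # "LM" over labels
        self.joiner = nn.Linear(hidden, vocab_size + 1)              # +1 for blank

    def forward(self, feats, labels):
        f, _ = self.encoder(feats)                 # (B, T, H)
        g, _ = self.predictor(self.embed(labels))  # (B, U, H)
        # Broadcast-add into a (B, T, U, H) lattice, then project to logits.
        joint = torch.tanh(f.unsqueeze(2) + g.unsqueeze(1))
        return self.joiner(joint)                  # (B, T, U, V+1)

feats = torch.randn(2, 50, 80)           # (batch, frames, mels)
labels = torch.randint(1, 29, (2, 10))   # (batch, label_len)
print(TinyTransducer()(feats, labels).shape)  # torch.Size([2, 50, 10, 30])
```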
-
RNN-T optimizations:
- Fast Conformer paper - fast Conv2d subsampling with depthwise separable convolutions, an 8x time reduction, and smaller convolution kernel sizes (a subsampling sketch follows this list)
- multi-blank transducers paper ("Multi-blank Transducers for Speech Recognition") - add "big blank" tokens to the vocabulary that each cover several frames; predicting them during long pauses saves computation at decoding time (see the decoding sketch below)
- token-and-duration transducer (TDT) paper - instead of blank or big-blank tokens, jointly predict each token together with its duration (NVIDIA uses this tuned decoder in its biggest model, Parakeet-TDT 1.1B; see the joiner sketch below)
- RNN-T with stateless prediction network paper - replace the LSTM prediction network with embeddings from a simple lookup table (e.g. torch.nn.Embedding; see the sketch below)
- more about prediction network architectures here
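
A hedged sketch of Fast-Conformer-style subsampling (channel counts, kernel sizes, and the plain ReLU stem are assumptions; the paper's exact stem differs): three stride-2 depthwise separable layers compound to an 8x time reduction.

```python
import torch
import torch.nn as nn

def dw_separable(cin, cout, stride):
    # Depthwise 3x3 conv followed by a 1x1 pointwise conv.
    return nn.Sequential(
        nn.Conv2d(cin, cin, 3, stride=stride, padding=1, groups=cin),
        nn.Conv2d(cin, cout, 1),
        nn.ReLU(),
    )

# Three stride-2 stages: 2 * 2 * 2 = 8x reduction along time (and frequency).
subsampler = nn.Sequential(
    dw_separable(1, 256, stride=2),
    dw_separable(256, 256, stride=2),
    dw_separable(256, 256, stride=2),
)

spec = torch.randn(4, 1, 80, 1600)  # (batch, 1, mels, frames)
print(subsampler(spec).shape)       # torch.Size([4, 256, 10, 200]) -- 8x fewer frames
```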
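
A hypothetical greedy-decoding sketch of the multi-blank idea (the token ids, the `BIG_BLANKS` durations, and the `model.step` interface are all invented for illustration): a big blank advances time by several frames at once, so long silences cost fewer decoding steps.

```python
BLANK = 0
BIG_BLANKS = {1: 2, 2: 4, 3: 8}  # token id -> frames it consumes (hypothetical)

def greedy_decode(model, frames):
    t, hyp, state = 0, [], None
    while t < len(frames):
        logits, state = model.step(frames[t], hyp, state)  # hypothetical interface
        token = int(logits.argmax())
        if token == BLANK:
            t += 1                   # standard blank: advance one frame
        elif token in BIG_BLANKS:
            t += BIG_BLANKS[token]   # big blank: skip several frames at once
        else:
            hyp.append(token)        # emit a label, stay on the same frame
    return hyp
```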
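
A rough sketch of a token-and-duration joiner head (all sizes assumed): on top of the usual joint representation, one head predicts the token and a second head predicts how many frames that token advances.

```python
import torch
import torch.nn as nn

class TDTJoiner(nn.Module):
    def __init__(self, hidden=256, vocab_size=1024, max_duration=4):
        super().__init__()
        self.token_head = nn.Linear(hidden, vocab_size + 1)       # +1 for blank
        self.duration_head = nn.Linear(hidden, max_duration + 1)  # advance 0..max frames

    def forward(self, joint):  # joint: (B, T, U, H) lattice from the joiner
        return self.token_head(joint), self.duration_head(joint)

joint = torch.randn(2, 50, 10, 256)
token_logits, duration_logits = TDTJoiner()(joint)
print(token_logits.shape, duration_logits.shape)
# torch.Size([2, 50, 10, 1025]) torch.Size([2, 50, 10, 5])
```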
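
A minimal sketch of a stateless prediction network (the sizes and the two-label context are assumptions): the LSTM is replaced by a `torch.nn.Embedding` lookup over the last few emitted labels.

```python
import torch
import torch.nn as nn

class StatelessPredictor(nn.Module):
    def __init__(self, vocab_size=1024, hidden=256, context=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.proj = nn.Linear(context * hidden, hidden)
        self.context = context

    def forward(self, labels):  # labels: (B, U)
        # Left-pad so every position sees exactly `context` previous labels.
        padded = nn.functional.pad(labels, (self.context - 1, 0))
        windows = padded.unfold(1, self.context, 1)        # (B, U, context)
        return self.proj(self.embed(windows).flatten(-2))  # (B, U, hidden)

labels = torch.randint(0, 1024, (2, 10))
print(StatelessPredictor()(labels).shape)  # torch.Size([2, 10, 256])
```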