- Speaker diarization labels who said what in a transcript (e.g. Speaker A, Speaker B …). It is essential for conversation transcripts like meetings or podcasts.
- tinydiarize aims to be a minimal, interpretable extension of OpenAI's Whisper models that adds speaker diarization with few extra dependencies (inspired by minGPT).
- This uses a finetuned model that adds special tokens to mark speaker changes [1,2,3,4]. It can use both voice and semantic context to tell speakers apart, which is a unique benefit of this approach.
- You can refer to tdrz_dev for a detailed analysis of performance. Note that this is intended to be a prototype/proof-of-concept.
- Experimental support is also added to whisper.cpp, so this can run on consumer hardware like MacBooks and iPhones. Only a tiny change to the original inference code is needed (<50 lines), enabling simple and cheap speaker segmentation compared with conventional approaches.
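For intuition, here is a minimal sketch of how a transcript containing speaker-change markers could be post-processed into Speaker A/B labels. This is not the actual tinydiarize implementation; the `<|speakerturn|>` marker string and the `label_speakers` helper are illustrative assumptions.

```python
# Illustrative sketch: turn a flat transcript with speaker-turn markers
# into "Speaker A/B" labelled segments. The marker string below is an
# assumption, not necessarily the exact token tinydiarize emits.
TURN_TOKEN = "<|speakerturn|>"

def label_speakers(text, token=TURN_TOKEN):
    """Split on the turn marker and assign alternating speaker labels."""
    parts = [p.strip() for p in text.split(token)]
    return [(f"Speaker {chr(ord('A') + i % 26)}", part)
            for i, part in enumerate(parts)]

turns = label_speakers("okay let's get started <|speakerturn|> sounds good")
```

Note that this only segments locally: a speaker who talks, stops, and talks again gets a fresh label. Mapping turns back to consistent identities is the global clustering step discussed in the roadmap.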
*(Demo video: demo_video-trim.mp4)*
You can try it out on other such gems from YouTube using this notebook.
Install ffmpeg following the original repo, then run:

```shell
pip install -e .
whisper --model small.en-tdrz AUDIO
```

The only change is the `small.en-tdrz` model instead of `small.en`. That's it! 🎉
- Finetuned checkpoint for the `small.en-tdrz` model (located here) and example inference code (relevant edits in [#4] [#11]). This has the same dependencies as the original whisper repo.
- Tools for comparison and analysis (under /tdrz_dev):
  - A scoring tool to measure and compare accuracy on your own data in an easy-to-interpret way.
  - A reference script to run and compare various diarization pipelines.
  - A Jupyter notebook to compare and understand performance in detail.
- See Roadmap for more info.
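As a rough illustration of one metric the scoring tool reports, here is a minimal word-error-rate sketch. The real scoring uses the more sophisticated revdotcom/fstalign; the `wer` helper below is a hypothetical, simplified stand-in.

```python
# Minimal WER: Levenshtein edit distance over words, normalized by the
# reference length. Illustrative only; fstalign-based scoring also handles
# alignment details this sketch ignores.
def wer(ref_words, hyp_words):
    d = list(range(len(hyp_words) + 1))  # DP row: empty ref vs hyp prefixes
    for i, r in enumerate(ref_words, 1):
        prev, d[0] = d[0], i  # prev holds the diagonal cell
        for j, h in enumerate(hyp_words, 1):
            cur = min(d[j] + 1,          # deletion
                      d[j - 1] + 1,      # insertion
                      prev + (r != h))   # substitution (0 if words match)
            prev, d[j] = d[j], cur
    return d[-1] / len(ref_words)
```

For example, `wer("a b c".split(), "a x c".split())` is one substitution over three reference words, i.e. 1/3.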
We aim to demonstrate a starting point enabling anyone (or even OpenAI themselves!) to improve performance and extend support (multilingual, speech translation etc.).
| metric | small.en | small.en-tdrz |
|---|---|---|
| spk_turn_precision | - | 97.7 |
| spk_turn_recall | - | 70.8 |
| wer_overall | 11.0 | 10.3 |
| wer_speaker_switch | 15.0 | 15.5 |
On a (tiny) benchmark set of 3 earnings calls, tdrz achieves near-perfect speaker-turn precision at fairly decent recall, while retaining a WER similar to the original model. Not too shabby for a tiny finetuning setup, and <10% extra inference cost!
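For reference, speaker-turn precision/recall can be defined by matching predicted turn boundaries to reference boundaries within a small time tolerance. This is a toy sketch of that idea; the exact definitions used in tdrz_dev may differ.

```python
# Toy sketch: score predicted speaker-turn boundaries (in seconds) against
# reference boundaries, counting a boundary as correct if one on the other
# side lies within `tol` seconds. Illustrative only.
def turn_precision_recall(pred, ref, tol=0.25):
    hit_pred = sum(any(abs(p - r) <= tol for r in ref) for p in pred)
    hit_ref = sum(any(abs(p - r) <= tol for p in pred) for r in ref)
    precision = hit_pred / len(pred) if pred else 0.0
    recall = hit_ref / len(ref) if ref else 0.0
    return precision, recall
```

High precision with lower recall, as in the table above, means predicted turns are almost always real ones, but some real turns go undetected.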
Refer to tdrz_dev for details on performance analysis and comparisons.
- Whisper `small.en` checkpoints were finetuned on ~100 hours of AMI meetings using HuggingFace Transformers and Datasets.
- With some tricks, this could be done relatively cheaply: just 30 minutes of training on a single GPU starts to produce decent results. Tiny indeed 😊.
- We used helpful tools from pyannote (the OG open-source diarization toolkit) for finetuning data preparation, and also to analyze performance.
- We make use of the excellent open-source revdotcom/fstalign tool for scoring and analysis.
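To illustrate the idea behind the data preparation, a hedged sketch of how a finetuning target transcript could be built from diarization annotations. The real pipeline uses pyannote tooling; the `make_target` helper and the marker string are illustrative assumptions.

```python
# Sketch of target-transcript construction for finetuning: insert a
# speaker-turn marker wherever the speaker changes between words.
# The marker string and this helper are assumptions for illustration.
TURN_TOKEN = "<|speakerturn|>"

def make_target(words):
    """words: list of (speaker_id, word) pairs in time order."""
    out, prev = [], None
    for spk, word in words:
        if prev is not None and spk != prev:
            out.append(TURN_TOKEN)
        out.append(word)
        prev = spk
    return " ".join(out)
```

The model is then finetuned to emit this marker as an extra special token, alongside the ordinary transcription targets.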
Note that this is still an early proof-of-concept, and there are a few things to be aware of:
- Only the `small.en` English model has been finetuned.
- Word error rate (WER) is close to the original model's, although this has not yet been extensively tested. Ad-hoc inspection does show some differences in timestamp behavior (longer segments) and deletion errors. See the notebook under tdrz_dev for details.
- Given a pretty tiny finetuning setup, there's likely a lot of room for further accuracy improvements.
- Only local diarization (segmentation into speaker turns) is handled so far. Extension with global diarization (speaker clustering) is planned for later.
- Stuff is still hacky and subject to change, so hold your horses for now! 🐎
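To make the local-vs-global distinction above concrete, here is a toy sketch of the clustering step that global diarization would add: grouping per-turn speaker embeddings so a returning speaker keeps the same label. Pure-Python greedy matching for illustration only; a real system would use learned speaker embeddings and something like NME-SC.

```python
# Toy global-diarization sketch: greedily assign each turn's embedding to
# the first earlier turn ("exemplar") it is similar enough to, otherwise
# start a new speaker. Embeddings and threshold are illustrative.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def cluster_turns(embeddings, threshold=0.8):
    exemplars, labels = [], []
    for e in embeddings:
        sims = [cosine(e, x) for x in exemplars]
        if sims and max(sims) >= threshold:
            labels.append(sims.index(max(sims)))
        else:
            exemplars.append(e)  # new speaker
            labels.append(len(exemplars) - 1)
    return labels
```

With per-turn labels like these, "Speaker A … Speaker B … Speaker A" becomes possible, instead of every turn boundary introducing a new anonymous speaker.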
- inference code & demo
- scoring and analysis tools
- whisper.cpp integration
- reproducible dataprep + finetuning*
- blog post explainer*
- HuggingFace integration
- better LoRA-based `small.en` checkpoint
- possibly clustering with NME-SC?
- possibly `large-v2` checkpoint?
\* refers to the current state of the repo; please see #14 for an update on plans. TL;DR: things have had to be put on pause :/
[1] Joint Speech Recognition and Speaker Diarization via Sequence Transduction
[2] Serialized Output Training for End-to-End Overlapped Speech Recognition
[3] Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection
[4] Adapting Multi-Lingual ASR Models for Handling Multiple Talkers
For information on the underlying Whisper model, please refer to the original documentation (release 20230308).
Code and model weights are released under the MIT License. See LICENSE for further details.
If you use this in your research, please cite this work as:
```bibtex
@software{mahajan2023tinydiarize,
  author = {Mahajan, Akash},
  month = {08},
  title = {tinydiarize: Minimal extension of Whisper for speaker segmentation with special tokens},
  url = {https://github.com/akashmjn/tinyDiarize},
  year = {2023}
}
```