
Fast/low compute speaker diarization #1494

Closed
yehiaabdelm opened this issue Oct 10, 2023 · 3 comments

Comments

@yehiaabdelm

I'm trying to add diarization to this repo https://github.com/collabora/WhisperLive, which does transcription and also runs a VAD model before passing audio data to the transcriber. I have it working; however, the VAD model and the diarization model both run on the CPU, so they slow each other down. This greatly degrades transcription quality and adds enough latency that the output is no longer realtime. I was wondering if there is some way to speed things up. I was thinking of storing speaker embeddings and only processing the last n seconds, for example. Right now I am processing the whole audio stream every time, so it gets slower as time goes on. Any suggestions are appreciated.
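One way to sketch the "store speaker embeddings and only process the last n seconds" idea (purely illustrative, not from the issue: `SpeakerRegistry` is a hypothetical helper, and the plain-list embeddings stand in for vectors that would really come from a speaker-embedding model such as pyannote's): diarize only the newest chunk, embed each local speaker, then match those embeddings against running centroids so labels stay consistent across chunks.

```python
import math


class SpeakerRegistry:
    """Map per-chunk speaker embeddings to stable global labels.

    Hypothetical sketch: embeddings are plain lists of floats here; in
    practice they would come from a speaker-embedding model.
    """

    def __init__(self, threshold=0.7):
        self.centroids = {}          # global label -> (centroid, count)
        self.threshold = threshold   # cosine similarity needed for a match

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    def assign(self, embedding):
        # Find the closest stored centroid.
        best, best_sim = None, -1.0
        for label, (centroid, _) in self.centroids.items():
            sim = self._cosine(embedding, centroid)
            if sim > best_sim:
                best, best_sim = label, sim
        if best is not None and best_sim >= self.threshold:
            # Known speaker: update the centroid with a running mean.
            centroid, count = self.centroids[best]
            updated = [(c * count + e) / (count + 1)
                       for c, e in zip(centroid, embedding)]
            self.centroids[best] = (updated, count + 1)
            return best
        # Unseen speaker: register a new global label.
        label = f"SPEAKER_{len(self.centroids):02d}"
        self.centroids[label] = (list(embedding), 1)
        return label
```

With something like this, each incoming chunk only needs diarization and embedding on its own few seconds; the registry keeps labels consistent across chunks, so the full stream never has to be reprocessed.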

# diarization.py
from pyannote.audio import Pipeline
import torch
from intervaltree import IntervalTree


from dotenv import load_dotenv, find_dotenv
import os

load_dotenv(find_dotenv())
HUGGINGFACE_ACCESS_TOKEN = os.environ["HUGGINGFACE_ACCESS_TOKEN"]


class Diarization:
    def __init__(self):
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.pipeline = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.0",
            use_auth_token=HUGGINGFACE_ACCESS_TOKEN).to(device)

    def transform_diarization_output(self, diarization):
        # itertracks() yields (segment, track_id) by default; pass
        # yield_label=True to also get the speaker label.
        segments = []
        for segment, _, speaker in diarization.itertracks(yield_label=True):
            segments.append({"start": segment.start,
                             "end": segment.end, "speaker": speaker})
        return segments

    def process(self, waveform, sample_rate):
        # convert samples to tensor
        audio_tensor = torch.tensor(waveform, dtype=torch.float32).unsqueeze(0)
        # run diarization model on tensor
        diarization = self.pipeline(
            {"waveform": audio_tensor, "sample_rate": sample_rate})
        # convert output to list of dicts
        diarization = self.transform_diarization_output(diarization)
        return diarization

    def join_transcript_with_diarization(self, transcript, diarization):
        diarization_tree = IntervalTree()
        # Add diarization to interval tree
        for dia in diarization:
            diarization_tree.addi(dia['start'], dia['end'], dia['speaker'])

        joined = []
        for seg in transcript:
            interval_start = seg['start']
            interval_end = seg['end']
            # Get overlapping diarization
            overlaps = diarization_tree[interval_start:interval_end]
            speakers = {overlap.data for overlap in overlaps}
            # Add to result
            joined.append({
                'start': interval_start,
                'end': interval_end,
                'speakers': list(speakers),
                'text': seg['text']
            })

        return joined
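Since a transcript segment can overlap several diarization turns, `join_transcript_with_diarization` returns a set of speakers per segment. If a single label per segment is preferred, a small follow-up step (a sketch, not part of the original code; `dominant_speaker` is a hypothetical helper) can pick whichever speaker covers the most of the segment:

```python
def dominant_speaker(seg_start, seg_end, diarization):
    """Return the speaker whose turns cover most of [seg_start, seg_end].

    `diarization` is a list of {"start", "end", "speaker"} dicts like the
    output of transform_diarization_output; returns None if nothing overlaps.
    """
    totals = {}
    for turn in diarization:
        # Overlap between the transcript segment and this speaker turn.
        overlap = min(seg_end, turn["end"]) - max(seg_start, turn["start"])
        if overlap > 0:
            totals[turn["speaker"]] = totals.get(turn["speaker"], 0.0) + overlap
    return max(totals, key=totals.get) if totals else None
```

Summing overlap per speaker (rather than taking the first hit) makes the attribution robust to brief interjections inside a segment.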
@github-actions

Thank you for your issue.
We found the following entry in the FAQ which you may find helpful:

Feel free to close this issue if you found an answer in the FAQ.

If your issue is a feature request, please read this first and update your request accordingly, if needed.

If your issue is a bug report, please provide a minimum reproducible example as a link to a self-contained Google Colab notebook containing everything needed to reproduce the bug:

  • installation
  • data preparation
  • model download
  • etc.

Providing an MRE will increase your chance of getting an answer from the community (either maintainers or other power users).

Companies relying on pyannote.audio in production may contact me via email regarding:

  • paid scientific consulting around speaker diarization and speech processing in general;
  • custom models and tailored features (via the local tech transfer office).

This is an automated reply, generated by FAQtory

@hbredin
Member

hbredin commented Oct 10, 2023

If you are looking for streaming diarization, you might want to have a look at @juanmc2005's diart toolkit, which is (in part) based on pyannote.

@yehiaabdelm
Author

Will take a look, thanks.
