[WIP] improve Binarize() performance #1721

Draft: wants to merge 2 commits into base: develop

benniekiss
Contributor

While processing long audio files in the SpeakerDiarization pipeline, I noticed that the to_annotation() method was slow. I traced the slowdown to pyannote.audio.utils.signal.Binarize.__call__(), which loops over a numpy array that can become quite large.

In my tests, the original implementation took about 60 seconds for a 9-hour audio file. With the new implementation, it takes about 0.5 seconds.

I've only tested this with the SpeakerDiarization pipeline, but the new implementation returns the same results as the original.
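To illustrate the general idea of replacing a per-frame loop with array operations, here is a minimal sketch. It is not the code from this PR and not the pyannote.audio implementation: the function name active_regions and the single-threshold setup are assumptions made for the example. It finds contiguous active regions in a 1-D score array by locating rising and falling edges with np.diff.

```python
import numpy as np


def active_regions(scores, threshold=0.5):
    """Return (start, end) frame-index pairs where scores > threshold.

    `end` is exclusive. Hypothetical helper for illustration only.
    """
    active = np.asarray(scores) > threshold
    # Pad with False on both sides so every region has a rising and a falling edge.
    padded = np.concatenate(([False], active, [False]))
    edges = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(edges == 1)   # inactive -> active transitions
    ends = np.flatnonzero(edges == -1)    # active -> inactive transitions
    return list(zip(starts.tolist(), ends.tolist()))


regions = active_regions([0.1, 0.9, 0.8, 0.2, 0.7])
print(regions)
```

The work per column is a handful of array passes rather than one Python-level iteration per frame, which is where a speedup on long inputs would be expected to come from.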


benniekiss commented Jul 14, 2024

Fixed an off-by-one error in the new method.

I also made a Google Colab notebook showcasing the improvements: https://colab.research.google.com/drive/1Me3GgQUPXxjuEn06DNVco_GIxlUoYPTE?usp=sharing

In summary, the new method gives a slight speedup for fully synthetic data, a roughly 2x speedup for discrete (0s and 1s) synthetic data, and an almost 100x speedup for real data in the SpeakerDiarization pipeline.

The notebook also lets you extend the real data sample to any length: in the TEST WITH REAL DATA section, set AUDIO_LENGTH to the desired number of hours.

| Data Type | Original Method (HH:MM:SS.mmm) | V2 Method (HH:MM:SS.mmm) |
| --- | --- | --- |
| Synthetic = np.random.randn(100000, 50) | 00:00:08.781 | 00:00:07.972 |
| Synthetic Discrete = np.random.randint(0, 2, size=(100000, 50)) | 00:00:19.085 | 00:00:10.724 |
| Real Data - huggingface datasets (01:02:27.300 long audio) | 00:00:00.755 | 00:00:00.008 |

EDIT: I realized that I had not tested this with different onset and offset values when initializing the Binarize class. After doing so, I found that the two implementations do not produce the same results. I will keep working on this to see whether there is a way to make any improvements.
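For context on why distinct onset and offset values are harder to vectorize: with two thresholds, the state at each frame depends on the previous state (hysteresis), so a single comparison per frame is no longer enough. The sketch below is one possible way to express that with array operations, checked against a plain loop. It is an assumption-laden illustration, not the code in this PR and not the library's implementation: the function names are hypothetical, it assumes onset >= offset, and it produces only a per-frame boolean state, leaving out any other processing the real class performs.

```python
import numpy as np


def hysteresis_loop(scores, onset=0.5, offset=0.5):
    """Reference: frame-by-frame hysteresis thresholding (hypothetical helper)."""
    state = np.zeros(len(scores), dtype=bool)
    active = False
    for t, y in enumerate(scores):
        if active:
            if y < offset:      # drop below offset: switch off
                active = False
        elif y > onset:         # rise above onset: switch on
            active = True
        state[t] = active
    return state


def hysteresis_vectorized(scores, onset=0.5, offset=0.5):
    """Same result without a Python loop. Assumes onset >= offset."""
    scores = np.asarray(scores, dtype=float)
    high = scores > onset       # frames that force the state on
    low = scores < offset       # frames that force the state off
    decisive = high | low
    # Index of the most recent decisive frame at or before each frame (-1 if none yet).
    idx = np.where(decisive, np.arange(len(scores)), -1)
    last = np.maximum.accumulate(idx)
    # Frames between the thresholds inherit the state of the last decisive frame.
    return np.where(last >= 0, high[np.maximum(last, 0)], False)


rng = np.random.default_rng(0)
scores = rng.random(10_000)
same = np.array_equal(
    hysteresis_loop(scores, onset=0.7, offset=0.3),
    hysteresis_vectorized(scores, onset=0.7, offset=0.3),
)
print(same)
```

Comparing a candidate implementation against a simple loop over several onset/offset pairs, as done here with one pair, is a cheap way to catch the kind of mismatch described above.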

benniekiss changed the title from "improve Binarize() performance" to "[WIP] improve Binarize() performance" on Jul 14, 2024
benniekiss marked this pull request as draft on July 31, 2024 21:12
* vectorized operations instead of nested loops