Upgrade to Silero-Vad V5 #884

Merged: 11 commits, Jul 1, 2024

Contributor

@hoonlight hoonlight commented Jun 28, 2024

https://github.com/snakers4/silero-vad/releases/tag/v5.0

  • The V5 model only works with a fixed-size window, so the window_size_samples parameter is removed and its value is fixed at 512.
  • Use a single state variable instead of the existing h and c variables.
  • The internal logic changed slightly: some context (part of the previous chunk) is now passed along with the current chunk.
  • Change the dimension of the state variable from 64 to 128.
  • Replace the ONNX file with the V5 version.
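
For readers following the list above, here is a minimal sketch of the chunking loop it describes: fixed 512-sample windows, one recurrent state array, and a slice of the previous chunk prepended as context. This is not the code from this PR. `dummy_model` is a hypothetical stand-in for the ONNX session call, and the context length of 64 samples and the `(2, 1, 128)` state layout are assumptions; the description only states the window size (512) and the state dimension (128).

```python
import numpy as np

WINDOW = 512     # fixed window size, as stated in the PR description
CONTEXT = 64     # assumption: samples carried over from the previous chunk
STATE_DIM = 128  # state dimension, as stated in the PR description


def dummy_model(x, state):
    """Hypothetical stand-in for the ONNX call: returns (probability, state)."""
    prob = float(np.clip(np.abs(x).mean(), 0.0, 1.0))
    return prob, state  # a real model would return an updated state


def speech_probs(audio, model=dummy_model):
    """Run the model over fixed-size windows, carrying context and state."""
    # Zero-pad the tail so the audio splits into whole windows.
    pad = (-len(audio)) % WINDOW
    audio = np.pad(np.asarray(audio, dtype=np.float32), (0, pad))

    state = np.zeros((2, 1, STATE_DIM), dtype=np.float32)  # assumed layout
    context = np.zeros(CONTEXT, dtype=np.float32)
    probs = []
    for start in range(0, len(audio), WINDOW):
        chunk = audio[start:start + WINDOW]
        # Prepend the tail of the previous chunk; shape (1, CONTEXT + WINDOW).
        x = np.concatenate([context, chunk])[np.newaxis, :]
        prob, state = model(x, state)
        probs.append(prob)
        context = chunk[-CONTEXT:]
    return probs


probs = speech_probs(np.ones(1000))
print(len(probs))
```

With a real model, `dummy_model` would be replaced by a function that feeds `x` and `state` to the V5 ONNX file and returns the new state on every step.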

@trungkienbkhn
Collaborator

@hoonlight, thanks for quickly adapting Silero-Vad V5 for fw after this model was released. Have you run benchmarks for it yet?

@hoonlight
Contributor Author

hoonlight commented Jun 28, 2024

> @hoonlight, thanks for quickly adapting Silero-Vad V5 for fw after this model was released. Have you run benchmarks for it yet?

No, I haven't run the benchmarks yet.
I found it a little while ago, thanks.

However, I won't have access to a GPU for a while, so I should be able to run the benchmarks in a couple of weeks.

@Purfview
Contributor

Purfview commented Jun 28, 2024

Thanks for adapting v5. I was just thinking of doing it myself, so it's good that I noticed your PR. 😄

Btw, regarding the GPU, see #499 (comment)

@trungkienbkhn
Collaborator

For information, I ran benchmarks on an H100 GPU with the large-v3 model.
The results are below:

1. Speed benchmark:
Processing audio with duration 13:19.231s
Detected language 'fr' with probability 1.00

| System | Min execution time |
| --- | --- |
| Faster-Whisper | 41.413s |
| FW with SILERO VAD V5 | 39.529s |

2. WER benchmark:
Dataset: librispeech_asr
Number of audio used for evaluation: 500

| System | WER |
| --- | --- |
| Faster-Whisper | 3.139 |
| FW with SILERO VAD V5 | 2.815 |

3. Memory benchmark:
GPU name: NVIDIA H100 PCIe
GPU device index: 0

| System | Maximum increase of RAM | Maximum GPU memory usage | Maximum GPU power usage |
| --- | --- | --- | --- |
| Faster-Whisper | 1222 MiB | 5107MiB / 81559MiB | 145W / 350W |
| FW with SILERO VAD V5 | 1225 MiB | 5107MiB / 81559MiB | 149W / 350W |

The results look good to me. Speed has improved a bit, as described in the VAD V5 model release:

> 3x faster inference for TorchScript, 10% faster inference for ONNX;
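
As a quick sanity check on the numbers above, the relative changes between the two rows can be worked out directly. This is only arithmetic on the reported figures; the helper name `relative_change` is made up for illustration. Note that the quoted release figure refers to VAD inference alone, while the timings above are for the whole transcription run, so the two percentages are not directly comparable.

```python
def relative_change(baseline, candidate):
    """Return the reduction of candidate relative to baseline, in percent."""
    return (baseline - candidate) / baseline * 100.0


# Figures copied from the speed and WER benchmarks above.
speed_gain = relative_change(41.413, 39.529)  # min execution time, seconds
wer_gain = relative_change(3.139, 2.815)      # WER on librispeech_asr

print(f"execution time reduced by {speed_gain:.1f}%")
print(f"WER reduced by {wer_gain:.1f}%")
```

Both rows come from a single run on one audio file (speed) and 500 samples (WER), so small differences should be read with some caution.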

@trungkienbkhn trungkienbkhn merged commit 8d400e9 into SYSTRAN:master Jul 1, 2024
3 checks passed
@hoonlight hoonlight deleted the silero-vad-v5 branch July 1, 2024 10:45
shinlw added a commit to shinlw/faster-whisper that referenced this pull request Sep 6, 2024