Strange results for short audio clips #1
First, I am very glad you enjoy our work! Do you have more details on your chunking algorithm, perhaps, so I can fully reproduce your issue? I transcribed the file directly using:

from faster_whisper import WhisperModel

model = WhisperModel("nyrahealth/faster_CrisperWhisper", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    'Obama Convincingly Wins Debate In First 10 Minutes.mp3',
    temperature=0,
    beam_size=1,
    language='en',
    word_timestamps=True,
    without_timestamps=True
)
for segment in segments:
    print(segment)

This gives an output that is a verbatim transcript of the spoken words, as expected. Since the model is trained to transcribe the primary speaker, the first few words of the interviewer are omitted. Timings look reasonable.
Here is my guess: our model was fine-tuned on segments shorter than 30 seconds, without predicting the timestamp of the end of the segment. I think the long-form transcription algorithm of faster-whisper relies on that segment timestamp prediction (the actual timestamp tokens) of the Whisper model, so possibly something like this goes wrong in your algorithm with our model. But it's hard to investigate without your full code.
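If that guess is right, one workaround is to bypass the long-form algorithm entirely: cut the audio into chunks shorter than 30 seconds, transcribe each chunk on its own, and shift the word timings back into the full-file timeline. A rough sketch of that idea (this is not code from the thread; the file name and chunk length are placeholders, and a VAD-based splitter would avoid cutting words at chunk boundaries):

from faster_whisper import WhisperModel, decode_audio

model = WhisperModel("nyrahealth/faster_CrisperWhisper", device="cuda", compute_type="float16")

SAMPLE_RATE = 16000
CHUNK_SECONDS = 25  # stay below the 30 s segments the model was fine-tuned on

audio = decode_audio("input.mp3", sampling_rate=SAMPLE_RATE)  # placeholder file name
chunk_samples = CHUNK_SECONDS * SAMPLE_RATE

for offset in range(0, len(audio), chunk_samples):
    chunk = audio[offset : offset + chunk_samples]
    segments, _ = model.transcribe(
        chunk,
        temperature=0,
        beam_size=1,
        language='en',
        word_timestamps=True,
        without_timestamps=True,
    )
    shift = offset / SAMPLE_RATE  # move chunk-relative timings back to the full file
    for segment in segments:
        for word in segment.words:
            print(f"{word.start + shift:.2f}-{word.end + shift:.2f}:{word.word}")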
Thanks for the help. I've tried to put together a full reproducible example here. It doesn't include the rolling window (for the given file it just keeps concatenating the audio data onto the array) but it produces the same outputs and errors.
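For readers who cannot open the linked example, a minimal sketch of the growing-buffer loop described above might look roughly like this (the file name is a placeholder and the details differ from the actual reproducible example; the 15-second step and the transcribe parameters follow the issue description below):

import numpy as np
from faster_whisper import WhisperModel, decode_audio

model = WhisperModel("nyrahealth/faster_CrisperWhisper", device="cuda", compute_type="float16")

SAMPLE_RATE = 16000
STEP_SECONDS = 15  # feed the audio in 15 s increments, as in the issue

audio = decode_audio("debate.mp3", sampling_rate=SAMPLE_RATE)  # placeholder file name
step = STEP_SECONDS * SAMPLE_RATE

buffer = np.zeros(0, dtype=np.float32)
for offset in range(0, len(audio), step):
    # keep concatenating new audio onto the array and re-transcribe the whole buffer
    buffer = np.concatenate([buffer, audio[offset : offset + step]])
    segments, info = model.transcribe(
        buffer,
        initial_prompt="",
        max_new_tokens=224,
        beam_size=1,
        language='en',
        word_timestamps=True,
        without_timestamps=True,
    )
    for segment in segments:
        print(segment)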
Hi @LaurinmyReha, I just tried transcribing a Japanese audio file, but the resulting segments do not contain any text; the text attribute is empty. Might this be because of the language? Thanks.
Always hard to tell without the audio, but I would not expect great things from this model in languages other than German or English, unfortunately, since those are the only languages we trained on after retokenization.
Got it. Thanks!
Do you have a suggestion for predicting the end of a segment so that it doesn't include pauses or breathing? Would post-processing with forced alignment be the best way?
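One possible post-processing sketch (an illustration, not an answer from the maintainers): with word_timestamps=True, faster-whisper already aligns words via cross-attention, so a segment's span can simply be clamped to its first and last word instead of the raw segment boundaries, which drops leading and trailing pauses or breaths:

def tighten_segment_bounds(segments):
    # Clamp each segment's span to its first/last word so that leading and
    # trailing silence, pauses, or breaths are not included.
    for segment in segments:
        words = segment.words or []
        if words:
            yield words[0].start, words[-1].end, segment.text
        else:
            yield segment.start, segment.end, segment.text

# usage, assuming `segments` comes from model.transcribe(..., word_timestamps=True)
for start, end, text in tighten_segment_bounds(segments):
    print(f"[{start:.2f} -> {end:.2f}] {text}")

A heavier alternative would be re-aligning the transcript with an external forced aligner, but clamping to the word timings may already be enough.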
Hi there,
First off, amazing job on your paper/the model! It looks super promising.
I'm working on a project where I'm attempting to do live streaming transcription with Whisper. One of the challenges there is the inaccurate timestamps, since ideally you'd have something like a rolling time window while doing the transcriptions. Verbatim transcriptions would of course also be a very nice improvement. So of course I rushed to try out your code :).
When using your model like so with faster-whisper:

import os
from faster_whisper import WhisperModel

# base_dir and cache_dir are defined elsewhere in the script and point to a local model cache
model_dir = os.path.join(base_dir, "models", "cache", "crisper_whisper")
model = WhisperModel(model_dir, device="cuda", compute_type="float16", download_root=cache_dir)
segments, info = model.transcribe(
    audio_array,
    initial_prompt="",
    max_new_tokens=224,
    beam_size=1,
    language='en',
    word_timestamps=True,
    without_timestamps=True
)
for segment in segments:
    print(segment)
I'm getting rather strange results for short clips. For reference, I'm streaming this YouTube video in 15-second rolling windows:
https://www.youtube.com/watch?v=kYnNSORARFk
And I'm getting results like:
"Like,Governor,I'm,I'm,I."
"You.The.The.Cold.War.War."
I'm also getting errors like:
File "c:\Users\tjong\Desktop\Audio_transcription_dev\whisper_streaming\test.py", line 375, in transcribe
for segment in segments:
File "C:\Users\tjong\Desktop\Audio_transcription_dev.venv\Lib\site-packages\faster_whisper\transcribe.py", line 1309, in generate_segments
self.add_word_timestamps(
File "C:\Users\tjong\Desktop\Audio_transcription_dev.venv\Lib\site-packages\faster_whisper\transcribe.py", line 1648, in add_word_timestamps
median_duration, max_duration = median_max_durations[segment_idx]
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
IndexError: list index out of range
The normal faster-whisper implementation performs fine (aside from the inaccuracies in the timestamps). I know that timestamp accuracy in faster-whisper isn't guaranteed, but I thought I'd give it a shot just to see how it compares to the base version. However, I'm not getting as far as comparing timestamps because the output itself seems to be struggling. Do you have any idea what could be causing this?