Strange results for short audio clips #1

Open

tjongsma opened this issue Sep 3, 2024 · 6 comments

@tjongsma

tjongsma commented Sep 3, 2024

Hi there,

First off, amazing job on your paper/the model! It looks super promising.

I'm working on a project where I'm attempting to do live streaming transcription with Whisper. One of the challenges there is the inaccurate timestamps, as ideally you'd have something like a rolling time window while doing the transcriptions. Verbatim transcription would of course also be a very nice improvement. So of course I rushed to try out your code :).

When using your model with faster-whisper like so:

    import os
    from faster_whisper import WhisperModel

    # base_dir and cache_dir are defined elsewhere in my script
    model_dir = os.path.join(base_dir, "models", "cache", "crisper_whisper")
    model = WhisperModel(model_dir, device="cuda", compute_type="float16", download_root=cache_dir)

    segments, info = model.transcribe(
        audio_array,
        initial_prompt="",
        max_new_tokens=224,
        beam_size=1,
        language='en',
        word_timestamps=True,
        without_timestamps=True,
    )
    for segment in segments:
        print(segment)

I'm getting rather strange results for short clips. For reference, I'm streaming this YouTube video in 15-second rolling windows:
https://www.youtube.com/watch?v=kYnNSORARFk
And I'm getting results like:
"Like,Governor,I'm,I'm,I."
"You.The.The.Cold.War.War."

I'm also getting errors like:
File "c:\Users\tjong\Desktop\Audio_transcription_dev\whisper_streaming\test.py", line 375, in transcribe
for segment in segments:
File "C:\Users\tjong\Desktop\Audio_transcription_dev.venv\Lib\site-packages\faster_whisper\transcribe.py", line 1309, in generate_segments
self.add_word_timestamps(
File "C:\Users\tjong\Desktop\Audio_transcription_dev.venv\Lib\site-packages\faster_whisper\transcribe.py", line 1648, in add_word_timestamps
median_duration, max_duration = median_max_durations[segment_idx]
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
IndexError: list index out of range

The normal faster-whisper implementation performs fine (aside from the timestamp inaccuracies). I know that timestamp accuracy with faster-whisper isn't guaranteed, but I thought I'd give it a shot just to see how it compares to the base version. However, I'm not even getting to the timestamp comparison because the output itself seems to be struggling. Do you have any idea what could be causing this?
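
To take my streaming loop out of the equation, the short-clip path can be reproduced in isolation by feeding the model a short slice directly. A minimal sketch, assuming audio_array is float32 mono audio at 16 kHz as above; the 5-second length is just an arbitrary choice to stay in the failing regime:

    SAMPLE_RATE = 16000
    short_clip = audio_array[: 5 * SAMPLE_RATE]  # first ~5 seconds of the window

    segments, info = model.transcribe(
        short_clip,
        beam_size=1,
        language='en',
        word_timestamps=True,
        without_timestamps=True,
    )
    for segment in segments:
        print(segment)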

@LaurinmyReha
Contributor

LaurinmyReha commented Sep 6, 2024

First off, I am very glad you enjoy our work!

Do you perhaps have more details on your chunking algorithm so I can fully reproduce your issue?

When directly transcribing the audio using the faster_CrisperWhisper model, I used the following code:

from faster_whisper import WhisperModel

model = WhisperModel("nyrahealth/faster_CrisperWhisper", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    'Obama Convincingly Wins Debate In First 10 Minutes.mp3', 
    temperature=0,
    beam_size=1,
    language='en',
    word_timestamps=True,
    without_timestamps=True
)

for segment in segments:
    print(segment)

This results in the following output, which, as expected, is a verbatim transcript of the spoken words. Since the model is trained to transcribe the primary speaker, the first few words of the interviewer are omitted. Timings look reasonable:

Segment(id=1, seek=2932, start=0.0, end=29.32, text=" To give the president a chance [UH] Governor Romney, I'm glad that you recognize that Al Qaeda is a threat because a few months ago when you were asked what's the biggest geopolitical threat facing America, you said Russia, not Al Qaeda. You said Russia in the 1980s are now calling to ask for their foreign policy back because you know the Cold War has been over for 20 years. But governor, you know, when it comes to our foreign policy, you seem to want to import the foreign policies", tokens=[220, 1407, 220, 976, 220, 264, 220, 3868, 220, 64, 220, 2931, 220, 35007, 220, 38, 670, 6051, 220, 10141, 2397, 11, 220, 40, 478, 220, 5404, 220, 300, 220, 291, 220, 265, 46521, 77, 1125, 220, 300, 220, 967, 220, 48, 43973, 220, 271, 220, 64, 220, 4734, 220, 570, 220, 64, 220, 1326, 220, 2493, 220, 2057, 220, 562, 220, 291, 220, 645, 220, 2351, 220, 437, 311, 220, 264, 220, 65, 6249, 377, 220, 46615, 804, 220, 4734, 220, 7170, 220, 22597, 2262, 11, 220, 291, 220, 848, 220, 6797, 11, 220, 572, 83, 220, 967, 220, 48, 43973, 13, 220, 509, 220, 848, 220, 6797, 220, 259, 220, 264, 220, 13626, 82, 220, 366, 220, 572, 86, 220, 66, 24021, 220, 281, 220, 1029, 220, 337, 220, 641, 220, 726, 20350, 220, 714, 75, 2632, 220, 4231, 547, 220, 570, 220, 291, 220, 74, 572, 86, 220, 264, 220, 16918, 220, 3630, 220, 575, 220, 668, 220, 670, 220, 337, 220, 945, 220, 924, 13, 220, 583, 220, 27526, 6051, 11, 220, 291, 220, 74, 572, 86, 11, 220, 562, 220, 270, 220, 598, 3813, 220, 281, 220, 396, 220, 726, 20350, 220, 714, 75, 2632, 11, 220, 291, 220, 1643, 220, 281, 220, 528, 220, 281, 220, 332, 1515, 83, 220, 264, 220, 726, 20350, 220, 714, 1050, 530], temperature=0, avg_logprob=-0.04097238487667507, compression_ratio=1.6758620689655173, no_speech_prob=3.993511199951172e-06, words=[Word(start=0.0, end=0.12, word=' To', probability=0.39923095703125), Word(start=0.12, end=0.4, word=' give', probability=0.97802734375), Word(start=0.4, end=0.62, word=' the', probability=0.990234375), Word(start=0.62, end=0.94, word=' president', probability=0.96484375), Word(start=0.94, end=1.14, word=' a', probability=0.98291015625), Word(start=1.14, end=1.32, word=' chance', probability=0.99462890625), Word(start=1.32, end=1.34, word=' [UH]', probability=0.64404296875), Word(start=1.34, end=1.68, word=' Governor', probability=0.9609375), Word(start=1.68, end=1.96, word=' Romney,', probability=0.9720052083333334), Word(start=2.04, end=2.14, word=" I'm", probability=0.9998372395833334), Word(start=2.14, end=2.52, word=' glad', probability=1.0), Word(start=2.52, end=2.82, word=' that', probability=0.985595703125), Word(start=2.82, end=3.02, word=' you', probability=1.0), Word(start=3.02, end=3.56, word=' recognize', probability=0.99501953125), Word(start=3.56, end=3.7, word=' that', probability=0.998779296875), Word(start=3.7, end=3.8, word=' Al', probability=0.844482421875), Word(start=3.8, end=4.02, word=' Qaeda', probability=0.7745768229166666), Word(start=4.02, end=4.2, word=' is', probability=0.97705078125), Word(start=4.2, end=4.22, word=' a', probability=0.999267578125), Word(start=4.22, end=4.58, word=' threat', probability=0.999755859375), Word(start=4.58, end=5.08, word=' because', probability=0.912353515625), Word(start=5.08, end=5.18, word=' a', probability=0.996826171875), Word(start=5.18, end=5.34, word=' few', probability=1.0), Word(start=5.34, end=5.6, word=' months', probability=0.999755859375), Word(start=5.6, end=5.84, word=' ago', probability=1.0), Word(start=5.84, 
end=6.02, word=' when', probability=0.953369140625), Word(start=6.02, end=6.12, word=' you', probability=0.999755859375), Word(start=6.12, end=6.22, word=' were', probability=0.999755859375), Word(start=6.22, end=6.38, word=' asked', probability=0.99951171875), Word(start=6.38, end=6.68, word=" what's", probability=0.9755859375), Word(start=6.68, end=7.1, word=' the', probability=0.9990234375), Word(start=7.1, end=7.46, word=' biggest', probability=0.999755859375), Word(start=7.46, end=8.08, word=' geopolitical', probability=0.99560546875), Word(start=8.08, end=8.34, word=' threat', probability=0.99951171875), Word(start=8.34, end=8.68, word=' facing', probability=0.99951171875), Word(start=8.68, end=9.02, word=' America,', probability=0.9991861979166666), Word(start=9.04, end=9.12, word=' you', probability=0.999755859375), Word(start=9.12, end=9.32, word=' said', probability=1.0), Word(start=9.32, end=9.7, word=' Russia,', probability=0.997802734375), Word(start=9.7, end=10.76, word=' not', probability=0.9889322916666666), Word(start=10.76, end=10.88, word=' Al', probability=0.958251953125), Word(start=10.88, end=11.16, word=' Qaeda.', probability=0.99267578125), Word(start=11.24, end=11.66, word=' You', probability=0.993896484375), Word(start=11.66, end=11.9, word=' said', probability=1.0), Word(start=11.9, end=12.3, word=' Russia', probability=0.99853515625), Word(start=12.3, end=12.8, word=' in', probability=0.87158203125), Word(start=12.8, end=12.9, word=' the', probability=0.999755859375), Word(start=12.9, end=13.5, word=' 1980s', probability=0.98681640625), Word(start=13.5, end=13.58, word=' are', probability=0.968505859375), Word(start=13.58, end=13.94, word=' now', probability=1.0), Word(start=13.94, end=14.68, word=' calling', probability=0.9996744791666666), Word(start=14.68, end=14.8, word=' to', probability=0.987548828125), Word(start=14.8, end=14.92, word=' ask', probability=0.999755859375), Word(start=14.92, end=15.04, word=' for', probability=1.0), Word(start=15.04, end=15.18, word=' their', probability=0.986328125), Word(start=15.18, end=15.42, word=' foreign', probability=0.9934895833333334), Word(start=15.42, end=15.7, word=' policy', probability=1.0), Word(start=15.7, end=16.1, word=' back', probability=0.994140625), Word(start=16.1, end=16.72, word=' because', probability=0.970703125), Word(start=16.72, end=17.22, word=' you', probability=0.879638671875), Word(start=17.22, end=17.42, word=' know', probability=0.9998779296875), Word(start=17.42, end=17.54, word=' the', probability=0.83154296875), Word(start=17.54, end=17.78, word=' Cold', probability=0.87255859375), Word(start=17.78, end=17.96, word=' War', probability=1.0), Word(start=17.96, end=18.06, word=' has', probability=0.9443359375), Word(start=18.06, end=18.18, word=' been', probability=1.0), Word(start=18.18, end=18.42, word=' over', probability=1.0), Word(start=18.42, end=18.6, word=' for', probability=0.99951171875), Word(start=18.6, end=18.86, word=' 20', probability=0.9990234375), Word(start=18.86, end=19.24, word=' years.', probability=1.0), Word(start=19.74, end=20.22, word=' But', probability=1.0), Word(start=20.22, end=20.86, word=' governor,', probability=0.8761393229166666), Word(start=20.86, end=21.16, word=' you', probability=0.997314453125), Word(start=21.16, end=21.34, word=' know,', probability=0.9998779296875), Word(start=21.34, end=21.44, word=' when', probability=0.99951171875), Word(start=21.44, end=21.58, word=' it', probability=1.0), Word(start=21.58, end=21.78, word=' comes', 
probability=1.0), Word(start=21.78, end=22.14, word=' to', probability=0.999755859375), Word(start=22.14, end=22.76, word=' our', probability=0.99853515625), Word(start=22.76, end=23.04, word=' foreign', probability=0.99951171875), Word(start=23.04, end=23.52, word=' policy,', probability=1.0), Word(start=23.56, end=24.06, word=' you', probability=1.0), Word(start=24.06, end=24.28, word=' seem', probability=0.99853515625), Word(start=24.28, end=24.38, word=' to', probability=1.0), Word(start=24.38, end=24.54, word=' want', probability=0.999267578125), Word(start=24.54, end=24.66, word=' to', probability=0.999755859375), Word(start=24.66, end=25.12, word=' import', probability=0.9991455078125), Word(start=25.12, end=25.46, word=' the', probability=0.98974609375), Word(start=25.46, end=25.74, word=' foreign', probability=0.9988606770833334), Word(start=25.74, end=29.32, word=' policies', probability=0.9722900390625)])
Segment(id=2, seek=5922, start=29.32, end=59.22, text="of the 1950s and the economic policies of the 1920s. Every time you've offered an opinion, you've been wrong. You said we should have gone into Iraq despite the fact that there were no weapons of mass destruction. You said that we should still have troops in Iraq to this day. You indicated that [UH] we shouldn't be passing [UH] nuclear [UH] treaties with Russia despite the fact that 71 senators, democrats and republicans, voted for it. You", tokens=[295, 220, 264, 220, 18141, 82, 220, 293, 220, 264, 220, 4836, 220, 714, 1050, 530, 220, 295, 220, 264, 220, 22003, 82, 13, 220, 2048, 220, 565, 220, 291, 600, 220, 766, 4073, 220, 282, 220, 4800, 11, 220, 291, 600, 220, 668, 220, 2085, 13, 220, 509, 220, 848, 220, 321, 220, 820, 220, 362, 220, 2780, 220, 666, 220, 11818, 220, 67, 7089, 642, 220, 264, 220, 1186, 220, 300, 220, 456, 220, 645, 220, 572, 220, 7463, 82, 220, 295, 220, 2758, 220, 13563, 13, 220, 509, 220, 848, 220, 300, 220, 321, 220, 820, 220, 920, 220, 362, 220, 11522, 220, 259, 220, 11818, 220, 281, 220, 341, 220, 1120, 88, 13, 220, 509, 220, 16176, 220, 300, 220, 35007, 220, 321, 220, 4659, 380, 220, 312, 220, 8437, 220, 35007, 220, 8179, 220, 35007, 220, 48552, 220, 365, 220, 6797, 220, 67, 7089, 642, 220, 264, 220, 1186, 220, 300, 220, 29985, 220, 32221, 11, 220, 47665, 220, 293, 220, 1085, 84, 888, 34332, 11, 220, 1650, 14727, 220, 337, 220, 270, 13, 220, 509], temperature=0, avg_logprob=-0.030384839297487184, compression_ratio=1.6844106463878328, no_speech_prob=3.039836883544922e-06, words=[Word(start=29.32, end=29.32, word='of', probability=0.0328369140625), Word(start=29.32, end=29.42, word=' the', probability=0.939208984375), Word(start=29.42, end=30.22, word=' 1950s', probability=0.9275716145833334), Word(start=30.22, end=30.52, word=' and', probability=0.98193359375), Word(start=30.52, end=30.6, word=' the', probability=0.9755859375), Word(start=30.6, end=31.0, word=' economic', probability=0.99658203125), Word(start=31.0, end=31.36, word=' policies', probability=0.985595703125), Word(start=31.36, end=31.48, word=' of', probability=0.818603515625), Word(start=31.48, end=31.52, word=' the', probability=0.9990234375), Word(start=31.52, end=32.22, word=' 1920s.', probability=0.9791666666666666), Word(start=32.22, end=32.74, word=' Every', probability=0.99755859375), Word(start=32.74, end=32.98, word=' time', probability=0.99853515625), Word(start=32.98, end=33.16, word=" you've", probability=0.9934895833333334), Word(start=33.16, end=33.4, word=' offered', probability=0.99951171875), Word(start=33.4, end=33.5, word=' an', probability=0.999755859375), Word(start=33.5, end=33.92, word=' opinion,', probability=1.0), Word(start=33.92, end=34.96, word=" you've", probability=0.9972330729166666), Word(start=34.96, end=35.06, word=' been', probability=0.999755859375), Word(start=35.06, end=35.38, word=' wrong.', probability=0.99951171875), Word(start=35.56, end=36.08, word=' You', probability=0.9990234375), Word(start=36.08, end=36.3, word=' said', probability=0.999267578125), Word(start=36.3, end=36.38, word=' we', probability=0.9931640625), Word(start=36.38, end=36.56, word=' should', probability=0.999755859375), Word(start=36.56, end=36.66, word=' have', probability=0.92578125), Word(start=36.66, end=36.86, word=' gone', probability=1.0), Word(start=36.86, end=37.08, word=' into', probability=0.998779296875), Word(start=37.08, end=37.48, word=' Iraq', probability=0.998291015625), Word(start=37.48, end=38.72, word=' despite', 
probability=0.87322998046875), Word(start=38.72, end=38.84, word=' the', probability=0.999755859375), Word(start=38.84, end=39.06, word=' fact', probability=1.0), Word(start=39.06, end=39.18, word=' that', probability=0.996337890625), Word(start=39.18, end=39.26, word=' there', probability=0.999267578125), Word(start=39.26, end=39.38, word=' were', probability=0.99853515625), Word(start=39.38, end=39.52, word=' no', probability=0.99951171875), Word(start=39.52, end=40.4, word=' weapons', probability=0.97265625), Word(start=40.4, end=40.48, word=' of', probability=0.99658203125), Word(start=40.48, end=40.68, word=' mass', probability=0.999755859375), Word(start=40.68, end=41.18, word=' destruction.', probability=0.99951171875), Word(start=41.7, end=42.22, word=' You', probability=0.998779296875), Word(start=42.22, end=42.54, word=' said', probability=0.999755859375), Word(start=42.54, end=42.9, word=' that', probability=0.997802734375), Word(start=42.9, end=43.02, word=' we', probability=0.999755859375), Word(start=43.02, end=43.22, word=' should', probability=0.999755859375), Word(start=43.22, end=43.48, word=' still', probability=0.999755859375), Word(start=43.48, end=43.68, word=' have', probability=0.999755859375), Word(start=43.68, end=44.0, word=' troops', probability=0.999755859375), Word(start=44.0, end=44.08, word=' in', probability=0.999755859375), Word(start=44.08, end=44.38, word=' Iraq', probability=0.99951171875), Word(start=44.38, end=44.64, word=' to', probability=0.747802734375), Word(start=44.64, end=44.9, word=' this', probability=1.0), Word(start=44.9, end=45.24, word=' day.', probability=0.9973958333333334), Word(start=46.18, end=46.7, word=' You', probability=0.998779296875), Word(start=46.7, end=47.26, word=' indicated', probability=0.99951171875), Word(start=47.26, end=47.84, word=' that', probability=0.99951171875), Word(start=47.84, end=48.52, word=' [UH]', probability=0.897216796875), Word(start=48.52, end=48.7, word=' we', probability=0.990966796875), Word(start=48.7, end=49.08, word=" shouldn't", probability=0.9964192708333334), Word(start=49.08, end=49.44, word=' be', probability=1.0), Word(start=49.44, end=50.22, word=' passing', probability=0.999755859375), Word(start=50.22, end=50.72, word=' [UH]', probability=0.890869140625), Word(start=50.72, end=51.54, word=' nuclear', probability=0.996826171875), Word(start=51.54, end=51.94, word=' [UH]', probability=0.99609375), Word(start=51.94, end=52.58, word=' treaties', probability=0.9931640625), Word(start=52.58, end=52.78, word=' with', probability=0.99951171875), Word(start=52.78, end=53.12, word=' Russia', probability=0.99951171875), Word(start=53.12, end=53.62, word=' despite', probability=0.85980224609375), Word(start=53.62, end=53.72, word=' the', probability=1.0), Word(start=53.72, end=53.92, word=' fact', probability=1.0), Word(start=53.92, end=54.08, word=' that', probability=1.0), Word(start=54.08, end=54.78, word=' 71', probability=0.933837890625), Word(start=54.78, end=55.8, word=' senators,', probability=0.984619140625), Word(start=55.8, end=56.66, word=' democrats', probability=0.768310546875), Word(start=56.66, end=56.88, word=' and', probability=0.97119140625), Word(start=56.88, end=57.46, word=' republicans,', probability=0.9955078125), Word(start=57.48, end=57.96, word=' voted', probability=0.9998372395833334), Word(start=57.96, end=58.16, word=' for', probability=1.0), Word(start=58.16, end=58.34, word=' it.', probability=0.9794921875), Word(start=58.7, end=59.22, word=' You', 
probability=0.99560546875)])
Segment(id=3, seek=7348, start=59.22, end=73.48, text='said that first, we should not have a timeline in Afghanistan. Then you said we should. Now you say maybe or it depends [UH] which means not only were you wrong, but you were also confusing and sending mixed messages both to our troops and our allies.', tokens=[848, 220, 300, 220, 700, 11, 220, 321, 220, 820, 220, 572, 83, 220, 362, 220, 64, 220, 12933, 220, 259, 220, 13658, 13, 220, 1396, 220, 291, 220, 848, 220, 321, 220, 820, 13, 220, 883, 86, 220, 291, 220, 601, 88, 220, 463, 88, 312, 220, 284, 220, 270, 220, 5946, 220, 35007, 220, 597, 220, 914, 82, 220, 572, 83, 220, 787, 220, 645, 220, 291, 220, 2085, 11, 220, 457, 220, 291, 220, 645, 220, 611, 220, 13181, 220, 293, 220, 2845, 278, 220, 7467, 220, 7897, 220, 748, 258, 220, 281, 220, 396, 220, 11522, 220, 293, 220, 396, 220, 14719, 13], temperature=0, avg_logprob=-0.049297530893926265, compression_ratio=1.4593023255813953, no_speech_prob=3.814697265625e-06, words=[Word(start=59.22, end=59.32, word='said', probability=0.0005269050598144531), Word(start=59.32, end=59.86, word=' that', probability=0.995361328125), Word(start=59.86, end=60.78, word=' first,', probability=0.920166015625), Word(start=60.8, end=61.02, word=' we', probability=0.992919921875), Word(start=61.02, end=61.34, word=' should', probability=0.945556640625), Word(start=61.34, end=62.0, word=' not', probability=0.9680989583333334), Word(start=62.0, end=62.18, word=' have', probability=0.9990234375), Word(start=62.18, end=62.26, word=' a', probability=0.998291015625), Word(start=62.26, end=62.66, word=' timeline', probability=0.975341796875), Word(start=62.66, end=62.76, word=' in', probability=0.998291015625), Word(start=62.76, end=63.32, word=' Afghanistan.', probability=0.999755859375), Word(start=63.36, end=63.46, word=' Then', probability=0.992431640625), Word(start=63.46, end=63.58, word=' you', probability=0.997802734375), Word(start=63.58, end=63.9, word=' said', probability=1.0), Word(start=63.9, end=64.48, word=' we', probability=0.957275390625), Word(start=64.48, end=64.82, word=' should.', probability=1.0), Word(start=64.86, end=65.22, word=' Now', probability=0.9991861979166666), Word(start=65.22, end=65.26, word=' you', probability=0.981201171875), Word(start=65.26, end=65.5, word=' say', probability=0.99951171875), Word(start=65.5, end=65.84, word=' maybe', probability=0.9796142578125), Word(start=65.84, end=66.54, word=' or', probability=0.959228515625), Word(start=66.54, end=66.62, word=' it', probability=0.8525390625), Word(start=66.62, end=67.08, word=' depends', probability=0.999755859375), Word(start=67.08, end=67.84, word=' [UH]', probability=0.9951171875), Word(start=67.84, end=68.06, word=' which', probability=0.906982421875), Word(start=68.06, end=68.32, word=' means', probability=0.9998372395833334), Word(start=68.32, end=68.5, word=' not', probability=0.9996744791666666), Word(start=68.5, end=68.8, word=' only', probability=1.0), Word(start=68.8, end=69.6, word=' were', probability=0.991455078125), Word(start=69.6, end=69.76, word=' you', probability=1.0), Word(start=69.76, end=70.04, word=' wrong,', probability=1.0), Word(start=70.04, end=70.16, word=' but', probability=0.999755859375), Word(start=70.16, end=70.24, word=' you', probability=0.99951171875), Word(start=70.24, end=70.3, word=' were', probability=0.688720703125), Word(start=70.3, end=70.52, word=' also', probability=0.999755859375), Word(start=70.52, end=71.0, word=' confusing', probability=0.9990234375), Word(start=71.0, 
end=71.08, word=' and', probability=0.8369140625), Word(start=71.08, end=71.4, word=' sending', probability=0.9990234375), Word(start=71.4, end=71.64, word=' mixed', probability=0.98388671875), Word(start=71.64, end=72.08, word=' messages', probability=0.999755859375), Word(start=72.08, end=72.3, word=' both', probability=0.9847005208333334), Word(start=72.3, end=72.4, word=' to', probability=1.0), Word(start=72.4, end=72.5, word=' our', probability=1.0), Word(start=72.5, end=72.82, word=' troops', probability=0.999755859375), Word(start=72.82, end=72.98, word=' and', probability=0.999755859375), Word(start=72.98, end=73.12, word=' our', probability=0.999755859375), Word(start=73.12, end=73.48, word=' allies.', probability=0.99951171875)])

Here is my guess:

Our model was fine-tuned on segments of less than 30 seconds, without predicting the timestamp token for the end of the segment. I think the long-form transcription algorithm of faster-whisper relies on that segment timestamp prediction (the actual timestamp tokens) of the Whisper model, so possibly something like this goes wrong in your algorithm with our model. But it's hard to investigate without your full code.
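
If that is the cause, one way to sidestep faster-whisper's seek-based long-form loop entirely would be to chunk the audio to under 30 seconds yourself and transcribe each chunk independently, offsetting the word timestamps onto the global timeline. A rough sketch, not tested against your setup; the 25-second chunk length is an illustrative assumption, not a tuned value:

    import numpy as np
    from faster_whisper import WhisperModel

    SAMPLE_RATE = 16000
    CHUNK_SECONDS = 25  # stay safely under the 30-second training segment length

    model = WhisperModel("nyrahealth/faster_CrisperWhisper", device="cuda", compute_type="float16")

    def transcribe_chunked(audio):
        """Transcribe fixed-size chunks independently and offset word timestamps."""
        words = []
        chunk_size = CHUNK_SECONDS * SAMPLE_RATE
        for start in range(0, len(audio), chunk_size):
            chunk = audio[start : start + chunk_size]
            offset = start / SAMPLE_RATE
            segments, _ = model.transcribe(
                chunk,
                beam_size=1,
                language='en',
                word_timestamps=True,
                without_timestamps=True,
            )
            for segment in segments:
                for word in segment.words:
                    # shift chunk-local timestamps onto the global timeline
                    words.append((word.start + offset, word.end + offset, word.word))
        return words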

@tjongsma
Author

tjongsma commented Sep 6, 2024

Thanks for the help, I've tried to put together a fully reproducible example below. It doesn't include the rolling window (it will just keep concatenating the incoming audio data onto the array) but it gives the same outputs and errors.

import argparse
import os
import sys
from time import sleep, time
from datetime import datetime, timedelta
from queue import Queue
from sys import platform
import speech_recognition as sr
from rich.progress import Progress, TimeElapsedColumn, BarColumn, TextColumn

import numpy as np
from faster_whisper import WhisperModel

# Run on GPU with FP16
base_dir = os.path.dirname(os.path.abspath(sys.argv[0]))
model_dir = os.path.join(base_dir, "models", "cache", "crisper_whisper")
model = WhisperModel(model_dir, device="cuda", compute_type="float16")


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--energy_threshold",
        default=400,
        help="Energy level for mic to detect.",
        type=int,
    )
    parser.add_argument(
        "--record_timeout",
        default=2,
        help="How real time the recording is in seconds.",
        type=float,
    )
    parser.add_argument(
        "--phrase_timeout",
        default=20,
        help="How much empty space between recordings before we consider it a new line in the transcription.",
        type=float,
    )
    # Needed on Linux: referenced by the microphone-selection branch below.
    parser.add_argument(
        "--default_microphone",
        default="pulse",
        help="Default microphone name (Linux only); use 'list' to view available microphones.",
        type=str,
    )

    args = parser.parse_args()

    # The last time a recording was retrieved from the queue.
    phrase_time = None
    # Current raw audio bytes.
    last_sample = bytes()
    # Thread safe Queue for passing data from the threaded recording callback.
    data_queue = Queue()
    # We use SpeechRecognizer to record our audio because it has a nice feature where it can detect when speech ends.
    recorder = sr.Recognizer()
    recorder.energy_threshold = args.energy_threshold
    # Definitely do this, dynamic energy compensation lowers the energy threshold dramatically to a point where the SpeechRecognizer never stops recording.
    recorder.dynamic_energy_threshold = False
    
    sampling_rate = 16000

    # Important for linux users.
    # Prevents permanent application hang and crash by using the wrong Microphone
    if "linux" in platform:
        mic_name = args.default_microphone
        if not mic_name or mic_name == "list":
            print("Available microphone devices are: ")
            for index, name in enumerate(sr.Microphone.list_microphone_names()):
                print(f'Microphone with name "{name}" found')
            return
        else:
            for index, name in enumerate(sr.Microphone.list_microphone_names()):
                if mic_name in name:
                    source = sr.Microphone(
                        sample_rate=sampling_rate, device_index=index
                    )
                    break
    else:
        source = sr.Microphone(sample_rate=sampling_rate)

    record_timeout = args.record_timeout
    phrase_timeout = args.phrase_timeout

    transcription = [""]

    with source:
        recorder.adjust_for_ambient_noise(source)

    def record_callback(_, audio: sr.AudioData) -> None:
        """
        Threaded callback function to receive audio data when recordings finish.
        audio: An AudioData containing the recorded bytes.
        """
        # Grab the raw bytes and push it into the thread safe queue.
        data = audio.get_raw_data()
        data_queue.put(data)

    # Create a background thread that will pass us raw audio bytes.
    # We could do this manually but SpeechRecognizer provides a nice helper.
    recorder.listen_in_background(
        source, record_callback, phrase_time_limit=record_timeout
    )

    # Cue the user that we're ready to go.
    print("Model loaded.\n")

    # List to store complete chunks of audio data
    audio_chunks = []
    total_transcription_time = 0.0
    transcription_count = 0

    while True:
        try:
            now = datetime.utcnow()
            # Pull raw recorded audio from the queue.
            if not data_queue.empty():
                phrase_complete = False
                # If enough time has passed between recordings, consider the phrase complete.
                # Clear the current working audio buffer to start over with the new data.
                if phrase_time and now - phrase_time > timedelta(seconds=phrase_timeout):
                    audio_chunks = []
                    phrase_complete = True
                # This is the last time we received new audio data from the queue.
                phrase_time = now

                # Concatenate our current audio data with the latest audio data.
                while not data_queue.empty():
                    data = data_queue.get()
                    audio_chunks.append(data)
                total_length = sum(len(chunk) for chunk in audio_chunks)
                print(total_length/1000)
                if not audio_chunks:
                    last_sample = bytes()
                else:
                    last_sample = b''.join(audio_chunks)
                audio_array = np.frombuffer(last_sample, dtype=np.int16).astype(np.float32)
                audio_array /= 32768.0  # Normalize to range [-1.0, 1.0]

                start_time = time()

                # Read the transcription.
                with Progress(
                    TextColumn("🤗 [progress.description]{task.description}"),
                    BarColumn(style="yellow1", pulse_style="white"),
                    TimeElapsedColumn(),
                ) as progress:
                    progress.add_task("[yellow]Transcribing...", total=None)
                    # These settings are being optimized for CrisperWhisper; please refer to other files for normal settings
                    segments, info = model.transcribe(
                        audio_array,
                        initial_prompt="",
                        max_new_tokens=224,
                        beam_size=1,
                        language='en',
                        word_timestamps=True,
                        without_timestamps=True,
                    )
                    # Initialize an empty string to store the full transcript
                    full_transcript = ""
                    # Iterate over the segments and concatenate the text
                    for segment in segments:
                        full_transcript += segment.text + " "  # Add a space between segments

                # Calculate elapsed time
                elapsed_time = time() - start_time
                # Update the total transcription time and count
                total_transcription_time += elapsed_time
                transcription_count += 1
                average_transcription_time = total_transcription_time / transcription_count
                # If we detected a pause between recordings, add a new item to our transcription.
                # Otherwise edit the existing one.
                if phrase_complete:
                    transcription.append(full_transcript)
                else:
                    transcription[-1] = full_transcript

                # Clear the console to reprint the updated transcription.
                os.system("cls" if os.name == "nt" else "clear")
                for line in transcription:
                    print(line)
                audio_length_seconds = len(last_sample) / (sampling_rate * 2)
                print(f"Time to transcribe: {elapsed_time:.2f} seconds, total sample length {audio_length_seconds:.2f} seconds")
                print(f"Average time to transcribe: {average_transcription_time:.2f} seconds")
                # Flush stdout.
                print("", end="", flush=True)
                # Infinite loops are bad for processors, must sleep.
                sleep(0.25)
        except KeyboardInterrupt:
            break

    print("\n\nTranscription:")
    for line in transcription:
        print(line)


if __name__ == "__main__":
    main()

@dgoryeo

dgoryeo commented Sep 9, 2024

Hi @LaurinmyReha, I just tried transcribing a Japanese audio clip, but the resulting segments do not contain any text -- the text attribute is empty. Might this be because of the language? Thanks.

@LaurinmyReha
Contributor

Always hard to tell without the audio, but unfortunately I would not expect great things from this model in languages other than German or English, since those are the only languages we trained on after retokenization.

@dgoryeo

dgoryeo commented Sep 9, 2024

Got it. Thanks!

@TechInterMezzo

> Our model was fine-tuned on segments of less than 30 seconds, without predicting the timestamp token for the end of the segment.

Do you have a suggestion for predicting the end of the segment so that it doesn't include pauses or breathing? Would post-processing with forced alignment be the best way?
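
In case it helps frame the question: short of full forced alignment, a crude post-processing option would be to walk each word's end timestamp backwards past trailing low-energy audio (pauses, breaths). A minimal energy-based sketch; the frame size and RMS threshold here are arbitrary assumptions, not validated values:

    import numpy as np

    def tighten_end(audio, end_sample, sample_rate=16000, frame_ms=20, rms_threshold=0.01):
        """Walk backwards from a word/segment end (in samples) and return a
        tightened end time (in seconds) that excludes trailing low-energy audio."""
        frame = int(sample_rate * frame_ms / 1000)
        end = end_sample
        while end - frame > 0:
            rms = np.sqrt(np.mean(audio[end - frame : end] ** 2))
            if rms >= rms_threshold:
                break  # found the last energetic frame; stop trimming
            end -= frame
        return end / sample_rate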
