Whisper.jl is a Julia package for automatic speech recognition, based on OpenAI's Whisper model. This package wraps the whisper.cpp code, which is a C/C++ implementation of the model. It uses the model weights which were published by OpenAI. The weights are downloaded on demand.
The transcribe
function is the simplest way to run speech recognition on an
audio signal. It takes as input the model name, and a float32 array with the sound signal.
The signal is expected to be sampled at 16kHz.
using Whisper, LibSndFile, FileIO, SampledSignals
s = load("/path/to/audio_file.ogg")
# Whisper expects 16kHz sample rate and Float32 data
sout = SampleBuf(Float32, 16000, round(Int, length(s)*(16000/samplerate(s))), nchannels(s))
write(SampleBufSink(sout), SampleBufSource(s)) # Resample
if nchannels(sout) == 1
data = sout.data
elseif nchannels(sout) == 2
sd = sout.data
data = [sd[i,1] + sd[i,2] for i in 1:size(sd)[1]] #convert stereo to mono
end
result = transcribe("base.en", data)
For more control, the whisper.cpp C interface is available within the Whisper.LibWhisper
module.
The following models are available, and should be referenced by name. The models with the .en
suffix are specialised for the english language.
Model | Disk | Mem |
---|---|---|
tiny / tiny.en | 75 MB | ~125 MB |
base / base.en | 142 MB | ~210 MB |
small / small.en | 466 MB | ~600 MB |
medium / medium.en | 1.5 GB | ~1.7 GB |
large-v1 / large.en | 2.9 GB | ~3.3 GB |
The biggest missing functionality is a streaming interface.