Support the Whisper model in onnxruntime-genai #699

Merged: 46 commits into main from baijumeswani/whisper on Sep 17, 2024

Conversation

@baijumeswani (Contributor) commented on Jul 15, 2024:

Many changes in this PR are from @RyanUnderhill's whisper branch.

This pull request introduces support for running the openai/whisper model. In particular, it adds the following:

  • Whisper preprocessing: Given an audio file, the preprocessing stage creates the log-mel spectrogram and the decoder input ids needed for model execution (see the front-end sketch after this list).

  • Whisper model execution: Given the log-mel spectrogram and the decoder input ids, the Whisper ONNX models can be executed. Model execution is split into three phases:

    • EncoderDecoderInit: The corresponding ONNX model contains the Encoder and the Decoder with the Attention operator. This model is run only for the first token generation.
    • DecoderInit: This phase manages the transition from the first token generation to the second, i.e., from managing the inputs/outputs of the EncoderDecoderInit ONNX model to managing the inputs/outputs of the Decoder ONNX model.
    • Decoder: This phase manages all remaining token-generation steps. The corresponding ONNX model contains only the decoder logic with the DecoderMaskedMultiHeadAttention operator.

    The model execution also manages the model outputs: the logits and, optionally, the cross_qk buffers required for computing word-level timestamps.

  • The Python, C, and C++ APIs for loading audio, preprocessing the audio files, and executing the Whisper model (see the usage sketch below).
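For context, the mel front-end in the first bullet follows the parameters openai/whisper uses (16 kHz audio, n_fft=400, hop length 160, 80 mel bins, log10 compression with clamping and rescaling), and the decoder input ids are the usual Whisper prompt tokens (<|startoftranscript|>, a language token, a task token, and optionally <|notimestamps|>). The PR implements this preprocessing natively inside onnxruntime-genai; the snippet below is only an illustrative sketch of that kind of front-end using librosa, not the code added here.

```python
import numpy as np
import librosa

def log_mel_spectrogram(audio_path: str) -> np.ndarray:
    """Whisper-style log-mel features: 16 kHz, n_fft=400, hop=160, 80 mel bins."""
    audio, _ = librosa.load(audio_path, sr=16000)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=16000, n_fft=400, hop_length=160, n_mels=80, power=2.0
    )
    log_spec = np.log10(np.maximum(mel, 1e-10))
    # Clamp values more than 8 log10 units below the peak, then rescale,
    # mirroring the normalization in openai/whisper.
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
    return (log_spec + 4.0) / 4.0
```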

With this pull request, onnxruntime-genai can execute the openai/whisper model on the CPU (fp32) and CUDA (fp16 and fp32) EPs with batch_size >= 1 and beam_size >= 1.
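For reference, the end-to-end Python flow looks roughly like the sketch below. It is modeled on the general onnxruntime-genai generation loop and the examples/python/whisper.py example touched in this PR; the processor/audio names (create_multimodal_processor, og.Audios.open, set_inputs, decode) are assumptions about the API surface at the time of this PR, not its authoritative form.

```python
import onnxruntime_genai as og

# Placeholder path; the model directory holds the exported Whisper ONNX models.
model = og.Model("path/to/whisper-onnx-model")
processor = model.create_multimodal_processor()  # name assumed
audios = og.Audios.open("speech.wav")            # name assumed

# Preprocessing: builds the log-mel spectrogram and decoder input ids.
inputs = processor("<|startoftranscript|>", audios=audios)

params = og.GeneratorParams(model)
params.set_search_options(max_length=448, num_beams=1)
params.set_inputs(inputs)  # name assumed

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()

print(processor.decode(generator.get_sequence(0)))  # decode() assumed
```

Setting num_beams > 1 or loading multiple audio files would exercise the batch_size >= 1 / beam_size >= 1 support mentioned above.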

Changes still required that are not part of this pull request:

  • Making the cross_qk buffers available as outputs for the CPU EP.
  • Refining the user API.
  • Adding a C# API to load multiple audio files for batch size > 1.
  • Adding a C# example.
  • Splitting the phases into two (Encoder and Decoder) to avoid duplicating the decoder weights in EncoderDecoderInit.
  • Adding an end-to-end example with word-level timestamps.

@baijumeswani baijumeswani force-pushed the baijumeswani/whisper branch 3 times, most recently from 37c5310 to aea7898 on July 24, 2024 23:49
examples/python/whisper.py — flagged issue fixed
@baijumeswani baijumeswani requested a review from a team as a code owner September 4, 2024 19:41
src/models/model.h — review comment resolved (outdated)
@yufenglee (Member) commented:

We need to add some tests in a following PR.

@yufenglee (Member) commented:

And add a description of how to create the model.

(In reply to: 2347714980)

baijumeswani and others added 2 commits September 16, 2024 15:05
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
@baijumeswani baijumeswani merged commit 01f259f into main Sep 17, 2024
13 checks passed
@baijumeswani baijumeswani deleted the baijumeswani/whisper branch September 17, 2024 00:01