Aciddelgado/continuous #867

aciddelgado · 2024-09-03T17:47:41Z

No description provided.

Results are validated with model-generate.py by using an int4 quantized model as the original model's assistant. The output sequence is the same, and increased tokens per second (tps) is observed.

NOTE: Only MHA decoder-only models, batch size 1, CPU, and greedy search (select top) are supported in this initial version. GQA needs microsoft/onnxruntime#21523 to support seqlen > 1 in token phase.

* Updated builder.py to produce an MHA graph that supports seqlen > 1 in token phase.
* Introduced speculative decoding, currently through a separate Generator class. This could potentially be merged with the existing Generator at either the API level or the implementation level.
* Extended various components to support speculative search. Previously most methods were hardcoded assuming seqlen == 1 for token phase.
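For reviewers unfamiliar with the technique, here is a minimal sketch of greedy speculative decoding at batch size 1. It is not the code in this PR and does not use the onnxruntime-genai API: `GreedyModel`, `speculative_greedy`, `plain_greedy`, and the two toy models are hypothetical names made up for illustration. The point it tries to show is why the target model must accept seqlen > 1 in token phase (it verifies several drafted tokens in one pass) and why the output matches plain greedy search.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: given a token sequence, return the greedy
# next-token prediction at every position (out[i] follows tokens[: i + 1]).
GreedyModel = Callable[[List[int]], List[int]]


def plain_greedy(model: GreedyModel, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one token per model call (seqlen == 1 in token phase)."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens)[-1])
    return tokens


def speculative_greedy(target: GreedyModel, draft: GreedyModel,
                       prompt: List[int], max_new_tokens: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. The cheap draft (assistant) model proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(tokens))):
            proposal.append(draft(tokens + proposal)[-1])
        # 2. The target model scores all proposed tokens in one pass (seqlen > 1).
        preds = target(tokens + proposal)
        base = len(tokens) - 1  # preds[base + i] is the target's pick for proposal[i]
        # 3. Accept the longest prefix on which both models agree.
        accepted = 0
        while accepted < len(proposal) and preds[base + accepted] == proposal[accepted]:
            accepted += 1
        tokens.extend(proposal[:accepted])
        # 4. On a mismatch, drop the rejected tail (the "rewind") and keep the
        #    target's own token for that position instead.
        if accepted < len(proposal):
            tokens.append(preds[base + accepted])
    return tokens


# Toy models for illustration only: the target counts upward modulo 10; the
# draft agrees except that it wrongly predicts 0 after the token 4.
def toy_target(tokens: List[int]) -> List[int]:
    return [(t + 1) % 10 for t in tokens]


def toy_draft(tokens: List[int]) -> List[int]:
    return [0 if t == 4 else (t + 1) % 10 for t in tokens]


if __name__ == "__main__":
    print(speculative_greedy(toy_target, toy_draft, prompt=[0], max_new_tokens=8))
    print(plain_greedy(toy_target, prompt=[0], max_new_tokens=8))
```

Because every accepted token is one the target would have chosen itself, the two functions return the same sequence; any speedup comes only from the target running fewer, wider passes. In a real implementation, step 4 would also need to truncate the KV cache to the accepted length, which this sketch sidesteps by recomputing from the full token list.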

BowenBao and others added 15 commits August 2, 2024 16:53

merge main

dec83aa

make build

396f17a

remove unnecessary

0dd4572

so ryan can see changes

1029333

decoder only cpu greedy works

3e3d56a

clean up comments

9f7d0e0

cuda working i think

b80878e

move input ids back to where they were

9e03c4f

fix batch_size > 1

fc3a0d3

working on rewind

0d971fd

b size 1 cpu reverse working

2fed10c

rewind working on cuda

1c86984

small stuff

60c42c6

merge main and remove batch_size duplication

baf605b
