[SpecInfer] Update RequestManager #1096

Merged: 43 commits merged into inference from update_rm_backup on Sep 25, 2023

zwang86 (Collaborator) commented Sep 5, 2023

Description of changes:

Related Issues:

Linked Issues:

  • Issue #

Issues closed by this PR:

  • Closes #

zwang86 marked this pull request as ready for review on September 5, 2023, 04:46
jiazhihao added the inference label (Features and fixes related to the inference project.) on Sep 5, 2023
zwang86 (Collaborator, Author) commented Sep 11, 2023

It seems we still have some issues with longer prompt sets in the spec_beam_attention kernel. I am getting a CUDA error with this prompt set:

["Write a detailed product description for a food chopper tool that lets you chop fruits and vegetables.",
    "Write a short blog post (500 words) about the best dog toys for new dog owners.",
    "ChatGPT is rewriting Genesis.",
    "Please write the evolution of humans by natural selection in the form of a recipe."]
 ** On entry to GEMM_EX  parameter number 18 had an illegal value
 ** On entry to GEMM_EX  parameter number 18 had an illegal value
Cuda failure: 7
/home/zeyuwang/FlexFlow/src/ops/spec_inc_multihead_self_attention.cu:316
Aborting...
spec_infer: /home/zeyuwang/FlexFlow/src/ops/spec_inc_multihead_self_attention.cu:316: void FlexFlow::Kernels::SpecIncMultiHeadAttention::compute_attention_kernel(const FlexFlow::SpecIncMultiHeadSelfAttentionMeta*, const FlexFlow::BeamSearchBatchConfig*, int, DT*, const DT*, const DT*, cudaStream_t) [with DT = __half; cudaStream_t = CUstream_st*]: Assertion `false' failed.
Aborted (core dumped)

zwang86 (Collaborator, Author) commented Sep 11, 2023

The issue can be solved by applying a change similar to the one we discussed above.
Because attention computation is skipped for the pending request, this assertion no longer holds, but the outputs seem reasonable with the assertion commented out. @xinhaoc Do you think we need to make any modifications to the attention kernel?
Note: although the code is located in spec_inc_multihead_self_attention, the function is actually called from verify_inc_multihead_self_attention.
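The thread does not quote the assertion itself. As a minimal C++ sketch, assuming it is a token-count consistency check over the batch, the code below illustrates why skipping pending requests can break a check that only counts computed tokens, while a check based on an offset that also advances past skipped requests still holds. `RequestSlot`, `run_attention`, and the `pending` flag are hypothetical names for illustration, not FlexFlow's actual API or kernel logic.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-request batch slot; illustrative only, not FlexFlow's API.
struct RequestSlot {
  int num_tokens;  // tokens this request occupies in the packed batch
  bool pending;    // assumption: attention is skipped for pending requests
};

// Hypothetical per-request attention loop. The offset advances for every
// request so each one reads its own slice of the packed buffer, but the
// per-request GEMMs are only launched for non-pending requests.
// Returns the number of tokens whose attention was actually computed.
int run_attention(const std::vector<RequestSlot> &batch, int total_tokens) {
  int tokens_previous_requests = 0;  // offset into the packed token buffer
  int tokens_computed = 0;
  for (const RequestSlot &r : batch) {
    if (!r.pending && r.num_tokens > 0) {
      // ... per-request GEMMs would be launched here ...
      tokens_computed += r.num_tokens;
    }
    tokens_previous_requests += r.num_tokens;
  }
  // A check such as `tokens_computed == total_tokens` fails as soon as one
  // request is pending; this one still holds because the offset also
  // counts the tokens of skipped requests.
  assert(tokens_previous_requests == total_tokens);
  return tokens_computed;
}
```

For a batch of three requests with 3, 2 (pending), and 4 tokens, `run_attention(batch, 9)` returns 7, so an assertion comparing computed tokens against the batch total (7 vs. 9) would fire even though every non-pending request was processed.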

jiazhihao enabled auto-merge (squash) on September 25, 2023, 15:53
jiazhihao merged commit 0a56d01 into inference on Sep 25, 2023
38 of 39 checks passed
zwang86 deleted the update_rm_backup branch on September 28, 2023, 23:29