[GPU] Optimize RMS Stack Size for Better Performance #26515

Merged
merged 11 commits into from
Sep 11, 2024

Contributor

@zaixing-wang zaixing-wang commented Sep 10, 2024

Details:

  • Since stack_size can be calculated, we can use a smaller stack size than the previous value of 33 to achieve better performance.
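
As a rough illustration of deriving the stack size from the problem size, here is a minimal C++ sketch. The helper name `ComputeStackSize`, the ceil-division rule, and the work-group size used in the note below are assumptions made for this example; the formula actually used in the kernel selector may differ.

```cpp
#include <cassert>
#include <cstddef>

// Value the PR description says was used before this change.
constexpr std::size_t kOldFixedStackSize = 33;

// Hypothetical helper (not from the PR): number of private-memory slots one
// work-item needs when `data_size` elements are split evenly across `lws`
// work-items. Illustrative rule only.
inline std::size_t ComputeStackSize(std::size_t data_size, std::size_t lws) {
    assert(lws > 0);
    return (data_size + lws - 1) / lws;  // ceil(data_size / lws)
}
```

Under these assumptions, `ComputeStackSize(896, 128)` yields 7 slots per work-item, well below the previous 33, so less private memory is reserved per work-item.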

Performance:

  • qwen2-0.5B’s throughput increased from 40.2 qps to 41.8 qps.
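
For reference, the relative gain implied by these two figures works out to roughly 4%:

```latex
\frac{41.8 - 40.2}{40.2} = \frac{1.6}{40.2} \approx 0.0398 \approx 4.0\%
```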

Tickets:

@zaixing-wang zaixing-wang requested review from a team as code owners September 10, 2024 08:46
@github-actions github-actions bot added the category: GPU OpenVINO GPU plugin label Sep 10, 2024
@zaixing-wang zaixing-wang changed the title [GPU] optimize RMS kernel [GPU] optimize RMS Stack Size for Better Performance Sep 10, 2024
@@ -120,7 +122,7 @@ RMSKernelBase::DispatchData RMSKernelBfyxOpt::SetDefault(const rms_params& param

dispatchData.itemsNum = dispatchData.dataSize;
Contributor

@dnkurek dnkurek Sep 10, 2024

Hint: maybe check whether dispatchData.dataSize is actually constant and doesn't change while executing the LLM. If so, we know beforehand what dispatchData.itemsNum will be in every execution. In other words, maybe the only thing that changes is dataCount.

Contributor Author

In Qwen, dataSize is always equal to 896.
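
To illustrate the distinction discussed in this thread, the following C++ sketch splits a tensor shape into a per-row size and a row count. It assumes RMS normalization runs over the innermost axis only. The struct `RmsSizes`, the function `SplitShape`, and the example shapes are hypothetical and are not taken from the plugin code.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the two quantities discussed in the review.
struct RmsSizes {
    std::size_t data_size;   // elements normalized together (innermost axis)
    std::size_t data_count;  // number of independent rows to normalize
};

// Assumes normalization over the last axis only.
inline RmsSizes SplitShape(const std::vector<std::size_t>& shape) {
    assert(!shape.empty());
    const std::size_t data_size = shape.back();
    // Product of all leading dimensions; 1 if there are none.
    const std::size_t data_count = std::accumulate(
        shape.begin(), shape.end() - 1, std::size_t{1},
        std::multiplies<std::size_t>());
    return {data_size, data_count};
}
```

With an illustrative shape of `{1, 32, 896}` this gives `data_size == 896` and `data_count == 32`; with `{1, 1, 896}` only `data_count` changes (to 1), which matches the idea that the row length stays fixed while the number of rows varies between executions.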

Contributor

@dnkurek dnkurek Sep 10, 2024

I see, so that looks like a possible optimization: you now know exactly how much private memory you will need and can make better use of the GPU's resources.

Contributor

You would also know the LWS (local work size), so you could use reqd_work_group_size.
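
As a sketch of what this suggestion could look like, the C++ helper below emits an OpenCL C kernel header that carries the standard `reqd_work_group_size` attribute once the local work size is known on the host. The helper name `MakeKernelHeader`, the kernel name `rms_example`, and its argument list are placeholders invented for illustration; this is not how the plugin generates its kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: builds an OpenCL C kernel header with a fixed
// work-group size along dimension 0. Names and arguments are placeholders.
inline std::string MakeKernelHeader(std::size_t lws_x) {
    assert(lws_x > 0);
    return "__kernel __attribute__((reqd_work_group_size(" +
           std::to_string(lws_x) + ", 1, 1)))\n"
           "void rms_example(__global const float* input, __global float* output)";
}
```

A kernel declared this way must be enqueued with exactly that local size, which lets the compiler assume it when allocating resources.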

Contributor

@dnkurek dnkurek Sep 10, 2024

Also, 896 is relatively small. It makes sense that reducing the stack size would improve performance, since you are basically freeing up resources that were not being used.

@zaixing-wang zaixing-wang changed the title [GPU] optimize RMS Stack Size for Better Performance [GPU] Optimize RMS Stack Size for Better Performance Sep 10, 2024
@vladimir-paramuzov vladimir-paramuzov added this to the 2024.5 milestone Sep 11, 2024
@vladimir-paramuzov vladimir-paramuzov added this pull request to the merge queue Sep 11, 2024
@dnkurek dnkurek removed this pull request from the merge queue due to a manual request Sep 11, 2024
@dnkurek dnkurek added this pull request to the merge queue Sep 11, 2024
Merged via the queue into openvinotoolkit:master with commit 90d1219 Sep 11, 2024
135 checks passed
Labels
category: GPU OpenVINO GPU plugin
3 participants