To profile the Intel GPU kernels with the SYCL event profiling time stamp instead of barrier time diff. #1136
Conversation
I think we can enable Triton-benchmark CI, especially on:
workflow_dispatch:
inputs:
runner_label:
description: Runner label, keep empty for default
type: string
default: ""
pull_request:
branches:
- llvm-target
paths:
- 'benchmarks/**'
schedule:
    - cron: "5 23 * * *"
We need Pavel to review the side effect of running the micro-benchmark for each PR. And let's use a separate PR to review that.
@@ -130,16 +113,23 @@ def benchmark(M, N, provider):
     x = torch.randn(M, N, device='xpu', dtype=torch.bfloat16)
     quantiles = [0.5, 0.2, 0.8]
     if provider == 'torch-native':
-        ms, min_ms, max_ms = triton.testing.do_bench(lambda: torch.softmax(x, axis=-1), quantiles=quantiles, warmup=10,
+        ms, min_ms, max_ms = benchmark_suit.do_bench(lambda: torch.softmax(x, axis=-1), quantiles=quantiles, warmup=10,
@chengjunlu why can't we use the same do_bench as Triton uses to compute the benchmark timing? What is the difference between triton.testing.do_bench and benchmark_suit.do_bench? Note that the former now uses event profiling via torch.xpu.Event.
torch.xpu.Event is based on SYCL barriers, and we measure the approximate kernel time by taking the difference between the start barrier timestamp and the end barrier timestamp.
That method is less accurate now that the SYCL runtime uses the immediate command list by default to reduce latency. In our experiments, we found there could be a 10% difference between the approximate time and the actual kernel time.
In the micro-benchmark tools we use the SYCL event profiling time returned by the SYCL launcher interface, which is the actual kernel time, as the short-term solution.
In the long term, I think we need to support Triton Proton to profile the Triton XPU kernel.
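For context, here is a minimal standalone sketch (not the repository's actual launcher code) of the SYCL event profiling approach described above, in the spirit of the commented-out profiling code in the diff below. The kernel, buffer size, and variable names are illustrative only; the point is that command_start/command_end come from the kernel's own event rather than from surrounding barrier timestamps.

```cpp
// Minimal sketch: timing a kernel with SYCL event profiling timestamps
// instead of a barrier time diff. Assumes a SYCL 2020 compiler (e.g. DPC++).
#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
  // Profiling must be enabled on the queue for the timestamps to be valid.
  sycl::queue q{sycl::gpu_selector_v, sycl::property::queue::enable_profiling{}};

  constexpr size_t n = 1 << 20;
  float *data = sycl::malloc_device<float>(n, q);

  // Submit an arbitrary kernel and keep the event it returns.
  sycl::event e = q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    data[i] = static_cast<float>(i[0]) * 2.0f;
  });
  e.wait();

  // command_start/command_end are device-side timestamps (nanoseconds) of the
  // kernel execution itself, so their difference is the actual kernel time.
  const auto start = e.get_profiling_info<sycl::info::event_profiling::command_start>();
  const auto end = e.get_profiling_info<sycl::info::event_profiling::command_end>();
  std::printf("kernel time: %.3f us\n", (end - start) / 1000.0);

  sycl::free(data, q);
  return 0;
}
```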
@@ -92,7 +92,8 @@ sycl::event softmax_forward(void *input, void *output, sycl::queue &queue) {
     //                       sycl::info::event_profiling::command_start>()) /
     //     (1000.0f * 1000.0f * 1000.f);

-    // printf("M: %d, Data_type_in: %d, Bandwidth: GB/S: %f \n", mat_m,
+    // printf("M: %d, N: %d Data_type_in: %d, Bandwidth: GB/S: %f \n", mat_m,
@chengjunlu why do we keep commented-out lines?
It is more accurate to use the SYCL event profiling timestamp now that the immediate command list has been enabled.
To use the SYCL event profiling to capture the Intel GPU kernel time.