
Profile the Intel GPU kernels with the SYCL event profiling timestamp instead of the barrier time diff #1136

Merged — 1 commit merged into llvm-target on May 17, 2024

Conversation

chengjunlu
Contributor

It is more accurate to use the SYCL event profiling timestamp now that the immediate command list has been enabled.

This change uses SYCL event profiling to capture the Intel GPU kernel time.

@ESI-SYD
Contributor

ESI-SYD commented May 16, 2024

I think we can enable the Triton-benchmark CI, especially when benchmark files change, to ensure basic functionality.
https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/.github/workflows/triton-benchmarks.yml#L3

on:
  workflow_dispatch:
    inputs:
      runner_label:
        description: Runner label, keep empty for default
        type: string
        default: ""
  pull_request:
    branches:
      - llvm-target
    paths:
      - 'benchmarks/**'  
  schedule:
    - cron: "5 23 * * *"

@chengjunlu
Contributor Author

> I think we can enable Triton-benchmark CI, especially benchmarks files changed, to ensure the basic function. https://github.com/intel/intel-xpu-backend-for-triton/blob/llvm-target/.github/workflows/triton-benchmarks.yml#L3

We need Pavel to review the side effects of running the micro-benchmark for each PR. Let's use a separate PR for that review.

@chengjunlu chengjunlu force-pushed the chengjun/llvm-target-update-microbench branch 2 times, most recently from 7073778 to 81e6d06 Compare May 16, 2024 08:39
benchmarks/xetla_benchmark/fused_softmax.py (outdated, resolved)
@chengjunlu chengjunlu force-pushed the chengjun/llvm-target-update-microbench branch from 81e6d06 to 65c3602 Compare May 16, 2024 08:52
@chengjunlu chengjunlu merged commit 5e2256f into llvm-target May 17, 2024
2 checks passed
@etiotto etiotto deleted the chengjun/llvm-target-update-microbench branch May 17, 2024 15:42
@@ -130,16 +113,23 @@ def benchmark(M, N, provider):
x = torch.randn(M, N, device='xpu', dtype=torch.bfloat16)
quantiles = [0.5, 0.2, 0.8]
if provider == 'torch-native':
-        ms, min_ms, max_ms = triton.testing.do_bench(lambda: torch.softmax(x, axis=-1), quantiles=quantiles, warmup=10,
+        ms, min_ms, max_ms = benchmark_suit.do_bench(lambda: torch.softmax(x, axis=-1), quantiles=quantiles, warmup=10,
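For context, a do_bench-style helper runs the kernel repeatedly and reports the requested quantiles of the measured times. A minimal host-side sketch (a hypothetical simplified helper for illustration: the real helpers time the kernel with device events, not host wall-clock):

```python
import time


def do_bench_sketch(fn, n_repeat=50, quantiles=(0.5, 0.2, 0.8)):
    """Hypothetical simplified do_bench-style helper.

    Times `fn` on the host wall-clock and returns the requested
    quantiles of the samples, in milliseconds.
    """
    times_ms = []
    for _ in range(n_repeat):
        t0 = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - t0) * 1e3)
    times_ms.sort()

    def quantile(p):
        # nearest-rank quantile over the sorted samples
        idx = min(int(p * (len(times_ms) - 1) + 0.5), len(times_ms) - 1)
        return times_ms[idx]

    return tuple(quantile(p) for p in quantiles)
```

With the default `quantiles=[0.5, 0.2, 0.8]` used in the benchmark above, `ms` is the median, `min_ms` the 20th percentile, and `max_ms` the 80th percentile of the measured runs.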
Contributor

@chengjunlu why can't we use the same do_bench as Triton uses to compute the benchmark timing?
What is the difference between triton.testing.do_bench and benchmark_suit.do_bench? Note that the former now uses event profiling via torch.xpu.Event.

@chengjunlu
Contributor Author
May 18, 2024

The torch.xpu.Event is based on a SYCL barrier, and we measure the approximate kernel time by diffing the start and end barrier timestamps.
That method became less accurate once the SYCL runtime defaulted to the immediate command list to reduce latency. In experiments, we found the approximate time could differ from the actual kernel time by about 10%.

In the micro-benchmark tools, for the short term, we use the SYCL event profiling time returned by the SYCL launcher interface, which is the actual kernel time.

For the long term, I think we should support Triton Proton to profile Triton XPU kernels.
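To illustrate the distinction, here is a hypothetical host-only sketch (not the SYCL API): the barrier-diff measurement brackets submission latency together with the kernel, while event profiling timestamps the kernel's own start and end, analogous to SYCL's command_start/command_end profiling info.

```python
import time


def fake_kernel(duration_s=0.005):
    # stand-in for a GPU kernel: busy-wait for a fixed duration
    deadline = time.perf_counter() + duration_s
    while time.perf_counter() < deadline:
        pass


def submission_latency(duration_s=0.002):
    # stand-in for host-side launch overhead before the kernel starts
    time.sleep(duration_s)


# Barrier-diff style: timestamps taken around the whole submission,
# so launch overhead is counted as kernel time.
t0 = time.perf_counter()
submission_latency()
fake_kernel()
barrier_diff_ms = (time.perf_counter() - t0) * 1e3

# Event-profiling style: timestamps taken at the kernel's own
# start and end (analogous to command_start/command_end).
submission_latency()
t_start = time.perf_counter()
fake_kernel()
t_end = time.perf_counter()
event_ms = (t_end - t_start) * 1e3

# The barrier diff over-reports by roughly the submission latency.
assert barrier_diff_ms > event_ms
```

With the immediate command list, the gap between the two measurements shrinks in absolute terms but remains a systematic over-report in the barrier-diff approach, which matches the roughly 10% discrepancy observed above.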

@@ -92,7 +92,8 @@ sycl::event softmax_forward(void *input, void *output, sycl::queue &queue) {
// sycl::info::event_profiling::command_start>()) /
// (1000.0f * 1000.0f * 1000.f);

-  // printf("M: %d, Data_type_in: %d, Bandwidth: GB/S: %f \n", mat_m,
+  // printf("M: %d, N: %d Data_type_in: %d, Bandwidth: GB/S: %f \n", mat_m,
Contributor

@chengjunlu why do we keep commented out lines?

Successfully merging this pull request may close these issues.

[Profiler] Use Kineto to profile the Triton XPU kernel's accurate execution time.
3 participants