To profile the Intel GPU kernels with the SYCL event profiling time stamp instead of barrier time diff. #1136
Conversation
I think we can enable Triton-benchmark CI, especially on:
workflow_dispatch:
inputs:
runner_label:
description: Runner label, keep empty for default
type: string
default: ""
pull_request:
branches:
- llvm-target
paths:
- 'benchmarks/**'
schedule:
    - cron: "5 23 * * *"
We need Pavel to review the side effect of running the micro-benchmark for each PR. And let's use a separate PR to review that.
@@ -130,16 +113,23 @@ def benchmark(M, N, provider):
     x = torch.randn(M, N, device='xpu', dtype=torch.bfloat16)
     quantiles = [0.5, 0.2, 0.8]
     if provider == 'torch-native':
-        ms, min_ms, max_ms = triton.testing.do_bench(lambda: torch.softmax(x, axis=-1), quantiles=quantiles, warmup=10,
+        ms, min_ms, max_ms = benchmark_suit.do_bench(lambda: torch.softmax(x, axis=-1), quantiles=quantiles, warmup=10,
@chengjunlu why can't we use the same do_bench as Triton uses to compute the benchmark timing? What is the difference between triton.testing.do_bench and benchmark_suit.do_bench? Note that the former now uses event profiling via torch.xpu.Event.
torch.xpu.Event is based on SYCL barriers, and we measure the approximate kernel time by taking the difference between the start barrier timestamp and the end barrier timestamp.
That method is less accurate now that the SYCL runtime uses the immediate command list by default to reduce latency. In our experiments, we found there could be a 10% difference between the approximate time and the actual kernel time.
In the micro-benchmark tools we use the SYCL event profiling time returned by the SYCL launcher interface, which is the actual kernel time, as the short-term solution.
In the long term, I think we need to support Triton Proton to profile the Triton XPU kernel.
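For context, here is a minimal standalone sketch (not the repository's actual launcher code) of the SYCL event profiling approach described above, in the spirit of the commented-out profiling code in the diff below. The kernel, buffer size, and variable names are illustrative only; the point is that command_start/command_end come from the kernel's own event rather than from surrounding barrier timestamps.

```cpp
// Minimal sketch: timing a kernel with SYCL event profiling timestamps
// instead of a barrier time diff. Assumes a SYCL 2020 compiler (e.g. DPC++).
#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
  // Profiling must be enabled on the queue for the timestamps to be valid.
  sycl::queue q{sycl::gpu_selector_v, sycl::property::queue::enable_profiling{}};

  constexpr size_t n = 1 << 20;
  float *data = sycl::malloc_device<float>(n, q);

  // Submit an arbitrary kernel and keep the event it returns.
  sycl::event e = q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    data[i] = static_cast<float>(i[0]) * 2.0f;
  });
  e.wait();

  // command_start/command_end are device-side timestamps (nanoseconds) of the
  // kernel execution itself, so their difference is the actual kernel time.
  const auto start = e.get_profiling_info<sycl::info::event_profiling::command_start>();
  const auto end = e.get_profiling_info<sycl::info::event_profiling::command_end>();
  std::printf("kernel time: %.3f us\n", (end - start) / 1000.0);

  sycl::free(data, q);
  return 0;
}
```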
@@ -92,7 +92,8 @@ sycl::event softmax_forward(void *input, void *output, sycl::queue &queue) {
     //                       sycl::info::event_profiling::command_start>()) /
     //     (1000.0f * 1000.0f * 1000.f);

-    // printf("M: %d, Data_type_in: %d, Bandwidth: GB/S: %f \n", mat_m,
+    // printf("M: %d, N: %d Data_type_in: %d, Bandwidth: GB/S: %f \n", mat_m,
@chengjunlu why do we keep commented-out lines?
It is more accurate to use the SYCL event profiling timestamp now that the immediate command list has been enabled.
To use the SYCL event profiling to capture the Intel GPU kernel time.