[L0 v2] implement deferred kernel deallocation #2451

igchor · 2024-12-11T23:45:01Z

No description provided.

github-actions · 2024-12-12T19:07:37Z

Compute Benchmarks level_zero_v2 run (with params: --compare baseline-v2):
https://github.com/oneapi-src/unified-runtime/actions/runs/12303108388

github-actions · 2024-12-12T19:45:05Z

Compute Benchmarks level_zero_v2 run (--compare baseline-v2):
https://github.com/oneapi-src/unified-runtime/actions/runs/12303108388
Job status: success. Test status: success.

Summary

No diffs to calculate performance change

(result is better)

Performance change in benchmark groups

Relative perf in group api (11): cannot calculate

Benchmark	This PR	baseline	baseline-v2
api_overhead_benchmark_l0 SubmitKernel out of order	11.689 μs	15.106 μs	11.480000 μs
api_overhead_benchmark_sycl SubmitKernel out of order	24.790 μs	26.622 μs	21.754000 μs
api_overhead_benchmark_sycl SubmitKernel in order	22.172 μs	24.625 μs	22.118000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	1.850000 μs	2.438 μs	1.902 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	2.144 μs	1.660000 μs	1.862 μs
api_overhead_benchmark_ur SubmitKernel out of order CPU count	95094.000 instr	101653.000 instr	94784.000000 instr
api_overhead_benchmark_ur SubmitKernel out of order	16.050 μs	18.566 μs	13.468000 μs
api_overhead_benchmark_ur SubmitKernel in order CPU count	95094.000 instr	106771.000 instr	94784.000000 instr
api_overhead_benchmark_ur SubmitKernel in order	16.559 μs	16.326000 μs	16.789 μs
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count	96602.000000 instr	-	-
api_overhead_benchmark_ur SubmitKernel in order with measure completion	20.410000 μs	-	-
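
The group summary above reports "cannot calculate", so a per-row ratio has to be worked out by hand. A minimal sketch of that arithmetic follows, using three rows copied from the table above. It assumes that lower is better for these time metrics and that relative performance is defined as reference time divided by this PR's time; that definition and the helper names `speedup` and `geomean` are illustrative assumptions, not taken from the benchmark scripts.

```python
from math import prod

# (benchmark, this PR, baseline, baseline-v2), all in microseconds,
# copied from the api group table above.
rows = [
    ("api_overhead_benchmark_l0 SubmitKernel out of order", 11.689, 15.106, 11.480),
    ("api_overhead_benchmark_sycl SubmitKernel out of order", 24.790, 26.622, 21.754),
    ("api_overhead_benchmark_sycl SubmitKernel in order", 22.172, 24.625, 22.118),
]


def speedup(pr_time, ref_time):
    """Assumed definition: a ratio above 1 means this PR is faster than the reference."""
    return ref_time / pr_time


def geomean(values):
    """Geometric mean, a common way to aggregate ratios across benchmarks."""
    values = list(values)
    return prod(values) ** (1.0 / len(values))


for name, pr, base, base_v2 in rows:
    print(f"{name}: {speedup(pr, base):.3f}x vs baseline, "
          f"{speedup(pr, base_v2):.3f}x vs baseline-v2")

print(f"geomean vs baseline:    {geomean(speedup(pr, b) for _, pr, b, _ in rows):.3f}x")
print(f"geomean vs baseline-v2: {geomean(speedup(pr, b2) for _, pr, _, b2 in rows):.3f}x")
```

Rows with a `-` in any column (for example the "with measure completion" entries) have no reference value and would need to be skipped before computing a ratio.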

Relative perf in group memory (4): cannot calculate

Benchmark	This PR	baseline	baseline-v2
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	204.241 μs	251.714 μs	200.797000 μs
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	85.519000 μs	133.281 μs	86.622 μs
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	6.082 μs	5.537000 μs	6.070 μs
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240	2.934 GB/s	3.178000 GB/s	2.967 GB/s

Relative perf in group miscellaneous (1): cannot calculate

Benchmark	This PR	baseline	baseline-v2
miscellaneous_benchmark_sycl VectorSum	803.250 bw GB/s	802.226 bw GB/s	729.022000 bw GB/s

Relative perf in group multithread (10): cannot calculate

Benchmark	This PR	baseline	baseline-v2
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1	3580.598 μs	6962.444 μs	3554.575000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1	8176.248000 μs	17543.043 μs	8376.730 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1	26293.733000 μs	48014.641 μs	26351.892 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1	1131.730 μs	2046.635 μs	1095.315000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1	4525.514 μs	7363.204 μs	4489.265000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1	6769.547 μs	8577.774 μs	6697.709000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1	26547.173 μs	25823.197000 μs	27030.041 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1	1124.053 μs	1182.072 μs	1112.265000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events	29470.166 μs	41125.560 μs	28897.706000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events	114922.030 μs	110418.895000 μs	114981.529 μs

Relative perf in group Velocity-Bench (9): cannot calculate

Benchmark	This PR	baseline	baseline-v2
Velocity-Bench Hashtable	381.957 M keys/sec	379.484 M keys/sec	384.922230 M keys/sec
Velocity-Bench Bitcracker	35.228 s	35.201 s	35.145600 s
Velocity-Bench CudaSift	202.851000 ms	203.857 ms	-
Velocity-Bench Easywave	234.000000 ms	241.000 ms	236.000 ms
Velocity-Bench QuickSilver	121.340 MMS/CTT	118.600 MMS/CTT	121.360000 MMS/CTT
Velocity-Bench Sobel Filter	514.208 ms	530.614 ms	513.216000 ms
Velocity-Bench dl-cifar	17.457400 s	24.483 s	17.690 s
Velocity-Bench dl-mnist	2.700 s	2.730 s	2.690000 s
Velocity-Bench svm	-	0.135800 s	-

Relative perf in group Runtime (8): cannot calculate

Benchmark	This PR	baseline	baseline-v2
Runtime_IndependentDAGTaskThroughput_SingleTask	175.898 ms	260.763 ms	173.622000 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor	183.271000 ms	269.665 ms	187.100 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor	185.821 ms	272.768 ms	184.117000 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor	183.160 ms	273.458 ms	181.955000 ms
Runtime_DAGTaskThroughput_SingleTask	1289.630000 ms	1640.062 ms	1292.755 ms
Runtime_DAGTaskThroughput_BasicParallelFor	1371.226 ms	1708.507 ms	1369.885000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor	1351.879000 ms	1719.016 ms	1352.726 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor	1318.376000 ms	1651.204 ms	1336.026 ms

Relative perf in group MicroBench (14): cannot calculate

Benchmark	This PR	baseline	baseline-v2
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous	4.454 ms	4.383 ms	4.313000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous	4.474 ms	4.402000 ms	4.473 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous	4.342000 ms	4.564 ms	4.468 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous	3.647000 ms	4.582 ms	3.724 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous	618.094 ms	618.085 ms	618.046000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous	618.095000 ms	618.138 ms	618.138 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided	4.371000 ms	4.376 ms	4.439 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Strided	4.584 ms	4.543 ms	4.491000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided	4.554 ms	4.492 ms	4.430000 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided	3.754000 ms	4.614 ms	3.864 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided	617.434 ms	617.458 ms	617.433000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided	617.448 ms	617.469 ms	617.384000 ms
MicroBench_LocalMem_int32_4096	29.913 ms	29.912000 ms	29.916 ms
MicroBench_LocalMem_fp32_4096	29.926 ms	29.820000 ms	29.887 ms

Relative perf in group Pattern (10): cannot calculate

Benchmark	This PR	baseline	baseline-v2
Pattern_Reduction_NDRange_int32	17.032 ms	16.581000 ms	16.662 ms
Pattern_Reduction_Hierarchical_int32	17.045 ms	16.994 ms	16.977000 ms
Pattern_SegmentedReduction_NDRange_int16	2.251000 ms	2.267 ms	2.254 ms
Pattern_SegmentedReduction_NDRange_int32	2.166000 ms	2.170 ms	2.168 ms
Pattern_SegmentedReduction_NDRange_int64	2.347 ms	2.343000 ms	2.345 ms
Pattern_SegmentedReduction_NDRange_fp32	2.163000 ms	2.173 ms	2.163 ms
Pattern_SegmentedReduction_Hierarchical_int16	11.800 ms	11.809 ms	11.796000 ms
Pattern_SegmentedReduction_Hierarchical_int32	11.600 ms	11.592000 ms	11.599 ms
Pattern_SegmentedReduction_Hierarchical_int64	11.782 ms	11.779000 ms	11.788 ms
Pattern_SegmentedReduction_Hierarchical_fp32	11.587000 ms	11.596 ms	11.602 ms

Relative perf in group ScalarProduct (6): cannot calculate

Benchmark	This PR	baseline	baseline-v2
ScalarProduct_NDRange_int32	3.959 ms	3.888000 ms	3.955 ms
ScalarProduct_NDRange_int64	5.605 ms	5.454000 ms	5.610 ms
ScalarProduct_NDRange_fp32	3.873 ms	3.752000 ms	3.889 ms
ScalarProduct_Hierarchical_int32	10.333 ms	10.332 ms	10.320000 ms
ScalarProduct_Hierarchical_int64	11.378 ms	11.327000 ms	11.360 ms
ScalarProduct_Hierarchical_fp32	10.006 ms	9.960000 ms	9.969 ms

Relative perf in group USM (7): cannot calculate

Benchmark	This PR	baseline	baseline-v2
USM_Allocation_latency_fp32_host	37.507 ms	37.576 ms	37.372000 ms
USM_Allocation_latency_fp32_shared	0.063000 ms	0.064 ms	0.067 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch	1.334000 ms	1.674 ms	1.335 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch	1.039 ms	1.057 ms	1.017000 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch	1.585 ms	1.817 ms	1.568000 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch	1.192 ms	1.196 ms	1.168000 ms
USM_Allocation_latency_fp32_device	-	0.066000 ms	-

Relative perf in group VectorAddition (3): cannot calculate

Benchmark	This PR	baseline	baseline-v2
VectorAddition_int32	1.684 ms	1.597000 ms	1.604 ms
VectorAddition_int64	3.125000 ms	3.128 ms	3.229 ms
VectorAddition_fp32	1.491000 ms	1.566 ms	1.659 ms

Relative perf in group Polybench (3): cannot calculate

Benchmark	This PR	baseline	baseline-v2
Polybench_2mm	1.226 ms	1.222000 ms	1.223 ms
Polybench_3mm	1.815 ms	1.728000 ms	1.803 ms
Polybench_Atax	6.845000 ms	6.865 ms	6.875 ms

Relative perf in group Kmeans (1): cannot calculate

Benchmark	This PR	baseline	baseline-v2
Kmeans_fp32	16.056000 ms	16.056 ms	16.057 ms

Relative perf in group LinearRegressionCoeff (1): cannot calculate

Benchmark	This PR	baseline	baseline-v2
LinearRegressionCoeff_fp32	687.151000 ms	844.076 ms	717.509 ms

Relative perf in group MolecularDynamics (1): cannot calculate

Benchmark	This PR	baseline	baseline-v2
MolecularDynamics	0.030 ms	0.031 ms	0.029000 ms

Relative perf in group llama.cpp (6): cannot calculate

Benchmark	This PR	baseline	baseline-v2
llama.cpp Prompt Processing Batched 128	865.683142 token/s	840.657 token/s	809.359 token/s
llama.cpp Text Generation Batched 128	65.184 token/s	62.642 token/s	65.422755 token/s
llama.cpp Prompt Processing Batched 256	939.670724 token/s	894.023 token/s	938.272 token/s
llama.cpp Text Generation Batched 256	65.131 token/s	62.623 token/s	65.353065 token/s
llama.cpp Prompt Processing Batched 512	488.138258 token/s	455.270 token/s	483.743 token/s
llama.cpp Text Generation Batched 512	65.271 token/s	62.636 token/s	65.393527 token/s

Relative perf in group alloc/max (20): cannot calculate

Benchmark	This PR	baseline	baseline-v2
alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:4 glibc	2696.670 ns	2403.900000 ns	-
alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:1 glibc	699.753000 ns	707.017 ns	-
alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:4 glibc	1241.320000 ns	1271.710 ns	-
alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:1 glibc	752.731 ns	745.948000 ns	-
alloc/max_allocs:10000/pre_allocs:0/min size:8/max size:65536/granularity:8/iterations:200000/threads:4 glibc	883.178 ns	863.918000 ns	-
alloc/max_allocs:10000/pre_allocs:0/min size:8/max size:65536/granularity:8/iterations:200000/threads:1 glibc	174.332 ns	174.292000 ns	-
alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:4 os_provider	2164.190 ns	2162.040000 ns	-
alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:1 os_provider	187.841 ns	186.889000 ns	-
alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:4 os_provider	1942.740 ns	1897.660000 ns	-
alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:1 os_provider	192.017 ns	191.382000 ns	-
alloc/max_allocs:1000/pre_allocs:0/size:4096/iterations:200000/threads:4 proxy_pool<os_provider>	4682.930 ns	4305.000000 ns	-
alloc/max_allocs:1000/pre_allocs:0/size:4096/iterations:200000/threads:1 proxy_pool<os_provider>	270.297 ns	263.433000 ns	-
alloc/max_allocs:1000/pre_allocs:100000/size:4096/iterations:200000/threads:4 proxy_pool<os_provider>	4634.400 ns	3865.570000 ns	-
alloc/max_allocs:1000/pre_allocs:100000/size:4096/iterations:200000/threads:1 proxy_pool<os_provider>	302.373 ns	301.460000 ns	-
alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:4 scalable_pool<os_provider>	281.535 ns	269.766000 ns	-
alloc/max_allocs:10000/pre_allocs:0/size:4096/iterations:200000/threads:1 scalable_pool<os_provider>	216.620 ns	214.796000 ns	-
alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:4 scalable_pool<os_provider>	245.713000 ns	261.526 ns	-
alloc/max_allocs:10000/pre_allocs:100000/size:4096/iterations:200000/threads:1 scalable_pool<os_provider>	209.871 ns	206.096000 ns	-
alloc/max_allocs:10000/pre_allocs:0/min size:8/max size:65536/granularity:8/iterations:200000/threads:4 scalable_pool<os_provider>	1010.610 ns	1002.850000 ns	-
alloc/max_allocs:10000/pre_allocs:0/min size:8/max size:65536/granularity:8/iterations:200000/threads:1 scalable_pool<os_provider>	964.081000 ns	967.182 ns	-

Relative perf in group multiple (12): cannot calculate

Benchmark	This PR	baseline	baseline-v2
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:4 glibc	32015.000000 ns	32646.200 ns	-
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:1 glibc	4170.530 ns	4138.910000 ns	-
multiple_malloc_free/max_allocs:10000/min size:8/max size:65536/granularity:8/iterations:2000/threads:4 glibc	138847.000 ns	137197.000000 ns	-
multiple_malloc_free/max_allocs:10000/min size:8/max size:65536/granularity:8/iterations:2000/threads:1 glibc	30802.100000 ns	30826.900 ns	-
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:4 proxy_pool<os_provider>	1187820.000000 ns	1189740.000 ns	-
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:1 proxy_pool<os_provider>	164323.000 ns	158272.000000 ns	-
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:4 os_provider	1190450.000 ns	1179140.000000 ns	-
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:1 os_provider	144905.000 ns	140583.000000 ns	-
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:4 scalable_pool<os_provider>	45963.500 ns	41640.200000 ns	-
multiple_malloc_free/max_allocs:10000/size:4096/iterations:2000/threads:1 scalable_pool<os_provider>	15157.000 ns	14614.500000 ns	-
multiple_malloc_free/max_allocs:10000/min size:8/max size:65536/granularity:8/iterations:2000/threads:4 scalable_pool<os_provider>	72189.100 ns	70492.600000 ns	-
multiple_malloc_free/max_allocs:10000/min size:8/max size:65536/granularity:8/iterations:2000/threads:1 scalable_pool<os_provider>	25804.100 ns	25434.100000 ns	-

Output:

---------> BitCracker: BitLocker password cracking tool <---------

==================================
Retrieving Info

Reading hash file "/home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/img_win8_user_hash.txt"

              Attack

================================================
Type of attack: User Password
Psw per thread: 1
max_num_pswd_per_read: 60000
Dictionary: /home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/user_passwords_60000.txt
MAC Comparison (-m): Yes

Iter: 1, num passwords read: 60000
Kernel execution:
Effective passwords: 60000
Passwords Range:
npknpByH7N2m3OnLNH1X9DJxLrzIFWk
.....
dL_7uuf3QCz-c6K3xDu0

================================================
Bitcracker attack completed
Total passwords evaluated: 60000
Password not found!

time to subtract from total: 0.00408108 s
bitcracker - total time for whole calculation: 35.2284 s

Velocity-Bench CudaSift

Environment Variables:

Command:

/home/pmdk/bench_workdir/cudaSift/cudaSift

Output:

UNKN:

UNKN: ==================================================
UNKN: User input parameters:
UNKN: Trace: ../../inputData
UNKN: ==================================================
UNKN:

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1229 1264 33.3695% 1 2