f16 convolution gives the same performance as f32 #1130

alvoron · 2024-07-31T15:53:39Z

ACL 24.07
ACL build command:

scons neon=1 opencl=0 openmp=1 cppthreads=0 os=linux data_layout_support=all arch=arm64-v8.2-a build=native --jobs=64 build=native --silent fixed_format_kernels=True Werror=0

benchdnn build command:

ACL_ROOT_DIR=$PWD/../ComputeLibrary cmake -B build -DCMAKE_BUILD_TYPE=Release -DDNNL_USE_ACL=ON -DCMAKE_RULE_MESSAGES=OFF -DACL_LIBRARY=$PWD/../ComputeLibrary/build/libarm_compute.so -DACL_CORE_LIBRARY=$PWD/../ComputeLibrary/build/libarm_compute.so -DACL_GRAPH_LIBRARY=$PWD/../ComputeLibrary/build/libarm_compute_graph.so -DDNNL_CPU_RUNTIME=OMP
cmake --build build --target benchdnn --parallel $(nproc)

Reproducer commands:

taskset -c 0 ./benchdnn --max-ms-per-prb=10e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_I --alg=direct --dt=f16:f16:f16 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user mb1_ic4oc12_ih70oh70kh3sh1dh0ph1_iw70ow70kw3sw1dw0pw1
taskset -c 0 ./benchdnn --max-ms-per-prb=10e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_I --alg=direct --dt=f32:f32:f32 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user mb1_ic4oc12_ih70oh70kh3sh1dh0ph1_iw70ow70kw3sw1dw0pw1

The reproducer uses the NHWC layout recommended by ACL.
taskset pins the process to a single core to force single-threaded mode and avoid threading issues.

On Ampere, the 1st command (f16 convolution) takes 0.267766 ms and the 2nd one (f32 convolution) takes 0.273554 ms, so f16 is only about 2% faster.
I'd expect noticeably better f16 convolution performance.

If the reproducer command is run with DNNL_VERBOSE=1, we observe 2 convolutions in the f16 case:

onednn_verbose,primitive,exec,cpu,convolution,indirect_gemm:acl,forward_inference,src_f16::blocked:acdb::f0 wei_f16:ap:blocked:Acdb8a::f0 bia_undef::undef::: dst_f16::blocked:acdb::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic4oc12_ih70oh70kh3sh1dh0ph1_iw70ow70kw3sw1dw0pw1,0.501953
onednn_verbose,primitive,exec,cpu,convolution,indirect_gemm:acl,forward_inference,src_f32:a:blocked:acdb::f0 wei_f32:a:blocked:Acdb4a::f0 bia_undef::undef::: dst_f32:a:blocked:acdb::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic4oc12_ih70oh70kh3sh1dh0ph1_iw70ow70kw3sw1dw0pw1,0.444824

and only 1 convolution in the f32 case:

onednn_verbose,primitive,exec,cpu,convolution,indirect_gemm:acl,forward_inference,src_f32::blocked:acdb::f0 wei_f32:a:blocked:Acdb4a::f0 bia_undef::undef::: dst_f32::blocked:acdb::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic4oc12_ih70oh70kh3sh1dh0ph1_iw70ow70kw3sw1dw0pw1,0.112061

The purpose of the 2nd convolution in the f16 case is not clear (moreover, it is an f32 convolution). This is probably an issue with the ACL integration in oneDNN rather than with ACL itself.
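
To make the comparison easier to repeat, here is a minimal Python sketch that groups the executed convolution primitives in a DNNL_VERBOSE=1 log by source data type. The helper name `summarize_conv_exec` is hypothetical, and the parsing assumes the comma-separated layout of the lines quoted above, with the execution time as the last field. Other oneDNN versions may print a different layout.

```python
import re
from collections import defaultdict


def summarize_conv_exec(log_text):
    """Group executed convolution primitives by source data type.

    Returns a dict mapping the src data type (e.g. "f16") to the list of
    execution times reported on the matching verbose lines.
    Assumption: the time is the last comma-separated field, as in the
    log lines quoted in this issue.
    """
    times = defaultdict(list)
    for line in log_text.splitlines():
        fields = line.strip().split(",")
        # Keep only lines that report an executed convolution primitive.
        if fields[0] != "onednn_verbose":
            continue
        if "exec" not in fields or "convolution" not in fields:
            continue
        match = re.search(r"src_([a-z0-9]+)", line)
        if match is None:
            continue
        times[match.group(1)].append(float(fields[-1]))
    return dict(times)


# The two lines observed for the f16 reproducer, copied from above.
f16_run_log = """\
onednn_verbose,primitive,exec,cpu,convolution,indirect_gemm:acl,forward_inference,src_f16::blocked:acdb::f0 wei_f16:ap:blocked:Acdb8a::f0 bia_undef::undef::: dst_f16::blocked:acdb::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic4oc12_ih70oh70kh3sh1dh0ph1_iw70ow70kw3sw1dw0pw1,0.501953
onednn_verbose,primitive,exec,cpu,convolution,indirect_gemm:acl,forward_inference,src_f32:a:blocked:acdb::f0 wei_f32:a:blocked:Acdb4a::f0 bia_undef::undef::: dst_f32:a:blocked:acdb::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic4oc12_ih70oh70kh3sh1dh0ph1_iw70ow70kw3sw1dw0pw1,0.444824
"""

for dtype, values in summarize_conv_exec(f16_run_log).items():
    print(dtype, len(values), sum(values))
```

Run on the f16 log, the script lists an f32 entry next to the f16 one, which is the unexpected extra convolution described above.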


alvoron · 2024-08-28T12:14:37Z

@morgolock I double-checked the issue description, and I don't think I can provide a standalone ACL reproducer.
Perhaps this issue needs to be reviewed from the oneDNN integration point of view, since oneDNN calls 2 convolution primitives in the f16 case and only 1 primitive in the f32 case.
So this is probably not an ACL issue, but an issue with the ACL integration into oneDNN.
Should we ask Milos to take a look at this?

morgolock added Help wanted Question Performance labels Aug 4, 2024
