f16 convolution gives the same performance as f32 #1130

Open
alvoron opened this issue Jul 31, 2024 · 1 comment

alvoron commented Jul 31, 2024

ACL 24.07
ACL build command:

scons neon=1 opencl=0 openmp=1 cppthreads=0 os=linux data_layout_support=all arch=arm64-v8.2-a build=native --jobs=64 build=native --silent fixed_format_kernels=True Werror=0 

benchdnn build command:

ACL_ROOT_DIR=$PWD/../ComputeLibrary cmake -B build -DCMAKE_BUILD_TYPE=Release -DDNNL_USE_ACL=ON -DCMAKE_RULE_MESSAGES=OFF -DACL_LIBRARY=$PWD/../ComputeLibrary/build/libarm_compute.so -DACL_CORE_LIBRARY=$PWD/../ComputeLibrary/build/libarm_compute.so -DACL_GRAPH_LIBRARY=$PWD/../ComputeLibrary/build/libarm_compute_graph.so -DDNNL_CPU_RUNTIME=OMP
cmake --build build --target benchdnn --parallel $(nproc)

Reproducer commands:

taskset -c 0 ./benchdnn --max-ms-per-prb=10e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_I --alg=direct --dt=f16:f16:f16 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user mb1_ic4oc12_ih70oh70kh3sh1dh0ph1_iw70ow70kw3sw1dw0pw1
taskset -c 0 ./benchdnn --max-ms-per-prb=10e3 --mode=P --conv --reset --allow-enum-tags-only=0 --engine=cpu --dir=FWD_I --alg=direct --dt=f32:f32:f32 --stag=acdb --wtag=any --dtag=acdb --attr-scratchpad=user mb1_ic4oc12_ih70oh70kh3sh1dh0ph1_iw70ow70kw3sw1dw0pw1

The reproducer uses the NHWC layout recommended by ACL.
taskset pins the run to a single core to force single-threaded mode and avoid threading issues.

On Ampere, the first command (f16 convolution) takes 0.267766 ms and the second (f32 convolution) takes 0.273554 ms.
I'd expect f16 convolution to perform better than this.
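
To put the gap in numbers, here is a minimal sketch that computes the f16 speedup from the two timings reported above. The variable names are only illustrative; the timings are the ones quoted in this issue.

```python
# Timings reported above (milliseconds, single core on Ampere).
f16_ms = 0.267766
f32_ms = 0.273554

# Speedup of f16 over f32: values close to 1.0 mean no real gain.
speedup = f32_ms / f16_ms
gain_percent = (speedup - 1.0) * 100.0

print(f"f16 speedup over f32: {speedup:.3f}x ({gain_percent:.1f}% faster)")
```

The result is only a couple of percent, which is effectively the same performance.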

When the reproducer is run with DNNL_VERBOSE=1, two convolutions appear in the f16 case:

onednn_verbose,primitive,exec,cpu,convolution,indirect_gemm:acl,forward_inference,src_f16::blocked:acdb::f0 wei_f16:ap:blocked:Acdb8a::f0 bia_undef::undef::: dst_f16::blocked:acdb::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic4oc12_ih70oh70kh3sh1dh0ph1_iw70ow70kw3sw1dw0pw1,0.501953
onednn_verbose,primitive,exec,cpu,convolution,indirect_gemm:acl,forward_inference,src_f32:a:blocked:acdb::f0 wei_f32:a:blocked:Acdb4a::f0 bia_undef::undef::: dst_f32:a:blocked:acdb::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic4oc12_ih70oh70kh3sh1dh0ph1_iw70ow70kw3sw1dw0pw1,0.444824

and one convolution in the f32 case:

onednn_verbose,primitive,exec,cpu,convolution,indirect_gemm:acl,forward_inference,src_f32::blocked:acdb::f0 wei_f32:a:blocked:Acdb4a::f0 bia_undef::undef::: dst_f32::blocked:acdb::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic4oc12_ih70oh70kh3sh1dh0ph1_iw70ow70kw3sw1dw0pw1,0.112061

The purpose of the second convolution in the f16 case is unclear (moreover, it is an f32 convolution). This is probably an issue in the ACL integration in oneDNN rather than in ACL itself.
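
For anyone who wants to check this on their own logs, below is a rough sketch that counts the executed primitives in DNNL_VERBOSE output and lists the source data type and time of each. It assumes the comma-separated field layout of the lines quoted above (other oneDNN versions may add or reorder fields), and `parse_exec_lines` is a hypothetical helper name, not part of oneDNN or benchdnn.

```python
import re

# The two lines printed for the f16 reproducer, copied from the output above.
VERBOSE_F16_RUN = """\
onednn_verbose,primitive,exec,cpu,convolution,indirect_gemm:acl,forward_inference,src_f16::blocked:acdb::f0 wei_f16:ap:blocked:Acdb8a::f0 bia_undef::undef::: dst_f16::blocked:acdb::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic4oc12_ih70oh70kh3sh1dh0ph1_iw70ow70kw3sw1dw0pw1,0.501953
onednn_verbose,primitive,exec,cpu,convolution,indirect_gemm:acl,forward_inference,src_f32:a:blocked:acdb::f0 wei_f32:a:blocked:Acdb4a::f0 bia_undef::undef::: dst_f32:a:blocked:acdb::f0,attr-scratchpad:user,alg:convolution_direct,mb1_ic4oc12_ih70oh70kh3sh1dh0ph1_iw70ow70kw3sw1dw0pw1,0.444824
"""


def parse_exec_lines(text):
    """Return one record per 'exec' line (hypothetical helper).

    Assumed layout: field 2 is the event, field 4 the primitive kind,
    field 5 the implementation, and the last field the time in ms.
    """
    records = []
    for line in text.splitlines():
        fields = line.strip().split(",")
        if len(fields) < 8 or fields[0] != "onednn_verbose" or fields[2] != "exec":
            continue
        # The source data type is encoded as "src_<dt>:" in the descriptor field.
        match = re.search(r"src_([a-z0-9]+):", line)
        records.append({
            "kind": fields[4],
            "impl": fields[5],
            "src_dt": match.group(1) if match else None,
            "ms": float(fields[-1]),
        })
    return records


records = parse_exec_lines(VERBOSE_F16_RUN)
for rec in records:
    print(rec["kind"], rec["impl"], rec["src_dt"], rec["ms"])
print("primitives executed:", len(records))
```

On the f16 log this reports two convolution executions, one with f16 source data and one with f32, which matches the observation above.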

alvoron commented Aug 28, 2024

@morgolock I double-checked the issue description, and I don't think I can provide a standalone ACL reproducer.
Perhaps this issue should be reviewed from the oneDNN integration point of view, since oneDNN calls two convolution primitives in the f16 case and only one in the f32 case.
So it is probably not an ACL issue but an issue with the ACL integration into oneDNN.
Should we ask Milos to take a look at this?
