generic @llvm.ssub.sat optimizes less well than target-specific @llvm.aarch64.neon.sqsub #94463

Open
folkertdev opened this issue Jun 5, 2024 · 1 comment

@folkertdev

The generic @llvm.ssub.sat.v2i64 intrinsic optimizes less well than the target-specific @llvm.aarch64.neon.sqsub.v2i64 intrinsic.

This godbolt link shows the issue: https://godbolt.org/z/4qEe3xM9v

We see two functions that use these two saturating subtractions but are otherwise the same. They generate very similar initial LLVM IR; the main difference is how the saturating subtraction itself is invoked (the target-specific version goes through an out-of-line wrapper, which needs a few extra allocas, while the generic intrinsic is called directly).

define void @specific(ptr dead_on_unwind noalias nocapture noundef writable sret([16 x i8]) align 16 dereferenceable(16) %_0, ptr noalias nocapture noundef align 16 dereferenceable(16) %a, ptr noalias nocapture noundef align 16 dereferenceable(16) %b, ptr noalias nocapture noundef align 16 dereferenceable(16) %c) unnamed_addr {
start:
  %0 = alloca [16 x i8], align 16
  %1 = alloca [16 x i8], align 16
  %2 = alloca [16 x i8], align 16
  %3 = alloca [16 x i8], align 16
  %4 = alloca [16 x i8], align 16
  call void @llvm.lifetime.start.p0(i64 16, ptr %4)
  %5 = load <4 x i32>, ptr %b, align 16
  store <4 x i32> %5, ptr %3, align 16
  %6 = load <4 x i32>, ptr %c, align 16
  store <4 x i32> %6, ptr %2, align 16
  call void @core::core_arch::aarch64::neon::generated::vqdmull_high_laneq_s32::h66aa645ca0aefe90(ptr noalias nocapture noundef sret([16 x i8]) align 16 dereferenceable(16) %4, ptr noalias nocapture noundef align 16 dereferenceable(16) %3, ptr noalias nocapture noundef align 16 dereferenceable(16) %2)
  %_4 = load <2 x i64>, ptr %4, align 16
  call void @llvm.lifetime.end.p0(i64 16, ptr %4)
  %7 = load <2 x i64>, ptr %a, align 16
  store <2 x i64> %7, ptr %1, align 16
  store <2 x i64> %_4, ptr %0, align 16
  ; after InlinerPass this call becomes a call to `@llvm.aarch64.neon.sqsub.v2i64` 
  call void @core::core_arch::arm_shared::neon::generated::vqsubq_s64::h1887dd6c0650937c(ptr noalias nocapture noundef sret([16 x i8]) align 16 dereferenceable(16) %_0, ptr noalias nocapture noundef align 16 dereferenceable(16) %1, ptr noalias nocapture noundef align 16 dereferenceable(16) %0)
  ret void
}

define void @generic(ptr dead_on_unwind noalias nocapture noundef writable sret([16 x i8]) align 16 dereferenceable(16) %_0, ptr noalias nocapture noundef align 16 dereferenceable(16) %a, ptr noalias nocapture noundef align 16 dereferenceable(16) %b, ptr noalias nocapture noundef align 16 dereferenceable(16) %c) unnamed_addr {
start:
  %0 = alloca [16 x i8], align 16
  %1 = alloca [16 x i8], align 16
  %2 = alloca [16 x i8], align 16
  call void @llvm.lifetime.start.p0(i64 16, ptr %2)
  %3 = load <4 x i32>, ptr %b, align 16
  store <4 x i32> %3, ptr %1, align 16
  %4 = load <4 x i32>, ptr %c, align 16
  store <4 x i32> %4, ptr %0, align 16
  call void @core::core_arch::aarch64::neon::generated::vqdmull_high_laneq_s32::h66aa645ca0aefe90(ptr noalias nocapture noundef sret([16 x i8]) align 16 dereferenceable(16) %2, ptr noalias nocapture noundef align 16 dereferenceable(16) %1, ptr noalias nocapture noundef align 16 dereferenceable(16) %0)
  %_4 = load <2 x i64>, ptr %2, align 16
  call void @llvm.lifetime.end.p0(i64 16, ptr %2)
  %5 = load <2 x i64>, ptr %a, align 16
  %6 = call <2 x i64> @llvm.ssub.sat.v2i64(<2 x i64> %5, <2 x i64> %_4)
  store <2 x i64> %6, ptr %_0, align 16
  ret void
}

As a user, I expect the generic variant to eventually be lowered to the specific one.

When the intrinsics are used on their own, this is in fact the case: https://godbolt.org/z/ErjETo3bh. Both functions emit the sqsub instruction. This logic appears to be implemented here.
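For instance, a standalone use like this (a hand-written sketch, not taken from that godbolt link) already selects sqsub on aarch64:

declare <2 x i64> @llvm.ssub.sat.v2i64(<2 x i64>, <2 x i64>)

; selects a single sqsub instruction on aarch64
define <2 x i64> @standalone_generic(<2 x i64> %a, <2 x i64> %b) {
  %r = call <2 x i64> @llvm.ssub.sat.v2i64(<2 x i64> %a, <2 x i64> %b)
  ret <2 x i64> %r
}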

But in my example, there is an optimization (fusing the multiply and the saturating subtract into a single sqdmlsl2) that @llvm.aarch64.neon.sqsub.v2i64 participates in but the generic @llvm.ssub.sat.v2i64 does not:

specific:
        ldr     q0, [x1]
        ldr     q1, [x2]
        ldr     q2, [x0]
        sqdmlsl2        v2.2d, v0.4s, v1.s[1]
        str     q2, [x8]
        ret

generic:
        ldr     q0, [x1]
        ldr     q1, [x2]
        sqdmull2        v0.2d, v0.4s, v1.s[1]
        ldr     q1, [x0]
        sqsub   v0.2d, v1.2d, v0.2d
        str     q0, [x8]
        ret

This is unexpected, and it suggests that combines like this one are missed whenever the generic SIMD intrinsics are used (at least for Neon). My intuition is that the generic intrinsic is lowered to the target-specific operation too late to participate in these backend-specific optimizations; lowering it earlier should let it benefit from them.
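For reference, I would expect a stripped-down IR reproducer along these lines (hand-written and untested; it uses the non-lane sqdmull intrinsic for brevity) to show the same difference without the Rust wrapper noise:

declare <2 x i64> @llvm.aarch64.neon.sqdmull.v2i64(<2 x i32>, <2 x i32>)
declare <2 x i64> @llvm.aarch64.neon.sqsub.v2i64(<2 x i64>, <2 x i64>)
declare <2 x i64> @llvm.ssub.sat.v2i64(<2 x i64>, <2 x i64>)

; expected to fold into a single sqdmlsl
define <2 x i64> @specific_min(<2 x i64> %acc, <2 x i32> %a, <2 x i32> %b) {
  %mul = call <2 x i64> @llvm.aarch64.neon.sqdmull.v2i64(<2 x i32> %a, <2 x i32> %b)
  %res = call <2 x i64> @llvm.aarch64.neon.sqsub.v2i64(<2 x i64> %acc, <2 x i64> %mul)
  ret <2 x i64> %res
}

; expected to stay as sqdmull + sqsub, as in the output above
define <2 x i64> @generic_min(<2 x i64> %acc, <2 x i32> %a, <2 x i32> %b) {
  %mul = call <2 x i64> @llvm.aarch64.neon.sqdmull.v2i64(<2 x i32> %a, <2 x i32> %b)
  %res = call <2 x i64> @llvm.ssub.sat.v2i64(<2 x i64> %acc, <2 x i64> %mul)
  ret <2 x i64> %res
}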

@llvmbot
Member

llvmbot commented Jun 5, 2024

@llvm/issue-subscribers-backend-aarch64

Author: Folkert de Vries (folkertdev)
