Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid OpenCL version and non-cpp source language lead to crash while generating device-dependent OpenCL binaries from a valid SPIR-V input #2515

Open
VyacheslavLevytskyy opened this issue Apr 19, 2024 · 9 comments

Comments

@VyacheslavLevytskyy
Copy link
Contributor

VyacheslavLevytskyy commented Apr 19, 2024

At the moment, generation of device-dependent OpenCL program binary from a valid SPIR-V input crashes in two slightly different, but related cases:

  • when Translator generates a wrong OpenCL version in LLVM IR output during SPIR-V -> LLVM IR translation, and
  • when SPIR-V input defines OpenCL_C source language rather than OpenCL_CPP.
  1. Let's consider a simple reproducer, referred further as ocl-ver-crash.ll:
target datalayout = "e-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-n8:16:32:64"
target triple = "spir64-unknown-unknown"

define weak_odr dso_local spir_kernel void @foo() {
entry:
  %myval.i = alloca float, align 4
  %myval.ascast.i = addrspacecast ptr %myval.i to ptr addrspace(4)
  store float 0.000000e+00, ptr %myval.i, align 4
  %call.i.i.i = call spir_func float @_Z21__spirv_AtomicFMaxEXT(ptr addrspace(4) %myval.ascast.i, i32 1, i32 896, float 1.230000e+02)
  ret void
}

declare dso_local spir_func float @_Z21__spirv_AtomicFMaxEXT(ptr addrspace(4), i32, i32, float)

After creating SPIR-V by llvm-as ocl-ver-crash.ll -o - | llvm-spirv --spirv-ext=+SPV_EXT_shader_atomic_float_min_max -o ocl-ver-crash.spv and running AOT by opencl-aot ocl-ver-crash.spv --device=cpu --cmd=build we observe a crash of the build followed by this confusing diagnostics:

Failed to build device program
CompilerException Failed to lookup symbol foo
JIT session error: Symbols not found: [ _Z10atomic_maxPU3AS4Vff ]
Failed to materialize symbols: { (main, { foo }) }

Let's look into LLVM IR that is produced from the problematic SPIRV (llvm-spirv ocl-ver-crash.spv -r -o - | llvm-dis -o ocl-ver-crash-to-run.ll):

; ModuleID = '<stdin>'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024"
target triple = "spir64-unknown-unknown"

; Function Attrs: nounwind
define spir_kernel void @foo() #0 !kernel_arg_addr_space !3 !kernel_arg_access_qual !3 !kernel_arg_type !3 !kernel_arg_type_qual !3 !kernel_arg_base_type !3 !spirv.ParameterDecorations !3 {
entry:
  %myval.i = alloca float, align 4
  %myval.ascast.i = addrspacecast ptr %myval.i to ptr addrspace(4)
  store float 0.000000e+00, ptr %myval.i, align 4
  %call.i.i.i = call spir_func float @_Z10atomic_maxPU3AS4Vff(ptr addrspace(4) %myval.ascast.i, float 1.230000e+02) #0
  ret void
}

; Function Attrs: nounwind
declare spir_func float @_Z10atomic_maxPU3AS4Vff(ptr addrspace(4), float) #0

attributes #0 = { nounwind }

!spirv.MemoryModel = !{!0}
!opencl.enable.FP_CONTRACT = !{}
!spirv.Source = !{!1}
!opencl.spir.version = !{!2}
!opencl.ocl.version = !{!1}
!opencl.used.extensions = !{!3}
!opencl.used.optional.core.features = !{!3}
!spirv.Generator = !{!4}

!0 = !{i32 2, i32 2}
!1 = !{i32 0, i32 0}
!2 = !{i32 1, i32 2}
!3 = !{}
!4 = !{i16 6, i16 14}

There is a correct declaration and call to @_Z10atomic_maxPU3AS4Vff, however, generation of the device-dependent OpenCL program binary fails due to the wrong OpenCL version:

!opencl.ocl.version = !{!1}
!1 = !{i32 0, i32 0}

As a side note, this invalid version would not be a problem for AOT/JIT when using the atomic buildin with global address space, but for generic address space evidently there are requirements to the version.

We can see from ocl-ver-crash.spv contents why Translator creates a wrong OpenCL version:

...
OpSource Unknown 0
...

Such an input along with a logic of SPIRVToLLVM::transSourceLanguage():

addOCLVersionMetadata(Context, M, kSPIR2MD::OCLVer, Major, Minor);

lead to 0.0 OpenCL version.

This is the first part of the issue. It has a clear path forward how to fix it, by not allowing producing invalid OpenCL version in SPIRVToLLVM::transSourceLanguage().

  1. Let's set source language version (1.0) inside SPIRV file as the following and create a new SPIRV input ocl-ver-crash-ver1.0.spv:
OpSource OpenCL_C 100000

Translator generates a valid OpenCL 1.0 version this time, and refers to OpenCL_C source language (that is 3 in !1 = !{i32 3, i32 100000}):

...
  %call.i.i.i = call spir_func float @_Z10atomic_maxPU3AS4Vff(ptr addrspace(4) %myval.ascast.i, float 1.230000e+02) #0
...
declare spir_func float @_Z10atomic_maxPU3AS4Vff(ptr addrspace(4), float) #0
...
!spirv.Source = !{!1}
!1 = !{i32 3, i32 100000}
!opencl.ocl.version = !{!3}
!3 = !{i32 1, i32 0}

Running opencl-aot ocl-ver-crash-ver1.0.spv --device=cpu --cmd=build we get the same diagnostic, as before, no changes to better:

CompilerException Failed to lookup symbol foo
JIT session error: Symbols not found: [ _Z10atomic_maxPU3AS4Vff ]
Failed to materialize symbols: { (main, { foo }) }

Let's get back to SPIRV input and set OpSource using CPP rather than C source language, and create a new SPIRV input ocl-ver-crash-ver1.0CPP.spv:

OpSource OpenCL_CPP 100000

Output LLVM IR after llvm-spirv ocl-ver-crash-ver1.0CPP.spv -r -o - | llvm-dis -o ocl-ver-crash-ver1.0CPP.ll is:

!spirv.Source = !{!1}
!opencl.ocl.version = !{!3}
!1 = !{i32 4, i32 100000}
!3 = !{i32 1, i32 0}

The only change is a reference to CPP language (value 4):

git diff ocl-ver-crash-ver1.0.ll ocl-ver-crash-ver1.0CPP.ll
diff --git a/ocl-ver-crash-ver1.0.ll b/ocl-ver-crash-ver1.0CPP.ll
index 458526c..c00c6e2 100644
--- a/ocl-ver-crash-ver1.0.ll
+++ b/ocl-ver-crash-ver1.0CPP.ll

 !0 = !{i32 2, i32 2}
-!1 = !{i32 3, i32 100000}
+!1 = !{i32 4, i32 100000}
 !2 = !{i32 1, i32 2}

Let's try the new version of SPIRV input (ocl-ver-crash-ver1.0CPP.spv). Now opencl-aot ocl-ver-crash-ver1.0CPP.spv --device=cpu --cmd=build executes successfully, the crash has gone.

This chain of input SPIRV transformation shows a dependency of the issue on C/C++ source language, independent from OpenCL version issue, suggests C on mangling vs. C++ mangling, and it also may (or may not) be related to #2513.

Note, that there is no user-defined mangling neither in the input LLVM IR, where builtin is referred to as @_Z21__spirv_AtomicFMaxEXT, not in generated SPIR-V code, where it's transformed to OpAtomicFMaxEXT. This points out to the second issue in Translator that needs further investigation and maybe further discussion.

FYI @MrSidims @svenvh @asudarsa

@asudarsa
Copy link
Contributor

Hi @VyacheslavLevytskyy

Thanks for opening this issue. I think this documentation in SPIR-V spec needs to be considered in this discussion.
https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#OpSource

It says:
OpSource - Document what source language and text this module was translated from. This has no semantic impact and can safely be removed from a module.

Thanks

@VyacheslavLevytskyy
Copy link
Contributor Author

Thank you @asudarsa , a good point! For me it sounds like an argument that a difference just in OpSource parameters (or absence of OpSource) can't be tolerated as a reason of crash of device-dependent binaries building. In other words, Translator has to account for the attached reproducer in a way that guarantees successful building of the reproducer with all mentioned in the description options for OpSource arguments: without a version (or without OpSource itself), with a version but C language instead of CPP, and with a version and CPP source language.

@MrSidims
Copy link
Contributor

MrSidims commented Apr 19, 2024

Thanks for the report. Lets go step by step.

  1. I agree with Arvind, OpSource must not affect semantics of the program and used for debug info only. Yet right now it affects translation flow, otherwise why we would need to commit 80dfd86 ? So we have to either change this logic or live with it a while longer.
  2. "JIT session error: Symbols not found: [ _Z10atomic_maxPU3AS4Vff ]" so this error comes from some device compiler, right? So some device compiler relies on the metadata to (I assume) to define which builtin library to load and link. If we use another device compiler, would we see the same behavior or not? Either way it's also a good idea to talk with some device compiler developers and clarify if my theory is correct.
  3. While OpSource doesn't carry semantic information, I'm not so sure about the metadata, https://registry.khronos.org/SPIR/specs/spir_spec-2.0.pdf doesn't say it's explicitly, also it doesn't say explicitly if it's obligatory metadata to be generated by frontend or not (at least I can't find such wording). Lets assume the worst case scenario and that is obligatory metadata. If so, is the translator obliged to generate it using OpenCL version and if yes, what is the default value? I don't have a strong opinion about that.

While I was writing the comment Slava has posted one more, so:

can't be tolerated as a reason of crash of device-dependent binaries building

Not a crash, but unresolved symbol after supposed linking with builtin library, which is failure in device's compiler, and following that:

Translator has to account for the attached reproducer in a way that guarantees successful building of the reproducer with all mentioned in the description

How? We don't know the target device and with a further extend its compiler and this compiler's implementation details (dependency on opencl.version metadata is an implementation detail).

I agree, that if we generate opencl.version metadata, then it's value should be not 0.0, but something more valid. What I'm not sure about is that if we must to generate it and what would be the default value. Probably better examination of https://registry.khronos.org/SPIR/specs/spir_spec-2.0.pdf is required. @svenvh may be you have some ideas?

@svenvh
Copy link
Member

svenvh commented Apr 19, 2024

I wonder if we should avoid relying on metadata at all, and instead tell the translator through a command line option what to do? See also #792.

@MrSidims
Copy link
Contributor

MrSidims commented Apr 19, 2024

we should avoid relying on metadata at all

Definitely! It should be indeed looking like OpenCL's -cl-std option. Yet what is unclear to me is that if in this very case SPIRVToOCL makes something funny, or is it just about device's compiler way to link builtin libraries? For instance _Z10atomic_maxPU3AS4Vff suggests, that it uses Generic pointer, which was introduced in OpenCL 2.0 and hence without the metadata it might so happen, that 1.0 library is linked, which wouldn't have AS4 mangling. Probably this option may also control this metadata generation (if it's obligatory, yet after reading spir spec I'm unsure).

@asudarsa
Copy link
Contributor

asudarsa commented Apr 19, 2024

I agree with Arvind, OpSource must not affect semantics of the program and used for debug info only. Yet right now it affects translation flow, otherwise why we would need to commit 80dfd86 ? So we have to either change this logic or live with it a while longer.

One quick pointer here. 80dfd86 does not rely on OpSource. It relies on metadata provided in LLVM IR. I think we should avoid relying on OpSource during reverse translation. But we can rely on LLVM metadata for forward translation. If we stop relying on metadata during forward translation, we have to revert/fix the entire OCLToSPIRV flow which is predicated by checking if source is OpenCL C.

@asudarsa
Copy link
Contributor

I agree with Arvind, OpSource must not affect semantics of the program and used for debug info only. Yet right now it affects translation flow, otherwise why we would need to commit 80dfd86 ? So we have to either change this logic or live with it a while longer.

One quick pointer here. 80dfd86 does not rely on OpSource. It relies on metadata provided in LLVM IR. I think we should avoid relying on OpSource during reverse translation. But we can rely on LLVM metadata for forward translation. If we stop relying on metadata during forward translation, we have to revert/fix the entire OCLToSPIRV flow which is predicated by checking if source is OpenCL C.

+1 for using -cl-std like option instead of metadata.

@VyacheslavLevytskyy
Copy link
Contributor Author

I agree with Arvind, OpSource must not affect semantics of the program and used for debug info only. Yet right now it affects translation flow, otherwise why we would need to commit 80dfd86 ? So we have to either change this logic or live with it a while longer.

One quick pointer here. 80dfd86 does not rely on OpSource. It relies on metadata provided in LLVM IR. I think we should avoid relying on OpSource during reverse translation. But we can rely on LLVM metadata for forward translation. If we stop relying on metadata during forward translation, we have to revert/fix the entire OCLToSPIRV flow which is predicated by checking if source is OpenCL C.

+1 for using -cl-std like option instead of metadata.

Yes, I'd also vote for this option.

@haonanya
Copy link

opencl-aot ocl-ver-crash.spv --device=cpu --cmd=build doesn't offer openCL version, by default it's CL1.2. When translate spirv to llvm IR, llvm-spirv -r --spirv-target-env=CL1.2 ocl-ver-crash.spv is performed. c++filt _Z10atomic_maxPU3AS4Vff gets atomic_max(float volatile AS4*, float) which is not found in https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/opencl-c.h. opencl-aot ocl-ver-crash.spv --device=cpu --cmd=build --bo="-cl-std=CL2.0" doesn't crash. https://github.com/KhronosGroup/SPIRV-LLVM-Translator/blob/main/test/extensions/EXT/SPV_EXT_shader_atomic_float_/AtomicFMaxEXT.ll#L8 has output for different spirv-target-env.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants