[WIP][LLPC] Always move non-uniform descriptor loads inside the waterfall loop #2859

Closed
wants to merge 2 commits into from

Conversation

kmitropoulou
Contributor

@kmitropoulou kmitropoulou commented Dec 1, 2023

Currently, we bail out of scalarization if one of the operands of the image call is uniform. With this patch, scalarization is enabled for just the non-uniform operands. To do this, I refactored createWaterfallLoop().

@amdvlk-admin
Collaborator

Test summary for commit fe0df0b

CTS tests (Failed: 0/138378)
  • Built with version 1.3.5.2
  • Ubuntu navi3x, Srdcvk
    • Passed: 35162/69163 (50.8%)
    • Failed: 0/69163 (0.0%)
    • Not Supported: 34001/69163 (49.2%)
    • Warnings: 0/69163 (0.0%)
  • Ubuntu navi2x, Srdcvk
    • Passed: 35242/69215 (50.9%)
    • Failed: 0/69215 (0.0%)
    • Not Supported: 33973/69215 (49.1%)
    • Warnings: 0/69215 (0.0%)

@amdvlk-admin
Collaborator

aaa28a5 Jenkins build error.
/jenkins/workspace/vulkan/sanitized-opensource/Github-PR/llpc-github-pr/driver_build/drivers/llpc/lgc/builder/BuilderImpl.cpp:722:30: error: ‘get32BitNonUniformIndex’ was not declared in this scope; did you mean ‘traceNonUniformIndex’?
722 | Value *new32BitValue = get32BitNonUniformIndex(nonUniformIndex);
| ^~~~~~~~~~~~~~~~~~~~~~~
| traceNonUniformIndex
/jenkins/workspace/vulkan/sanitized-opensource/Github-PR/llpc-github-pr/driver_build/drivers/llpc/lgc/builder/BuilderImpl.cpp:729:19: error: ‘getSharedIndex’ was not declared in this scope; did you mean ‘sharedIndex’?
729 | sharedIndex = getSharedIndex(nonUniformIndices, nonUniformIndex32BitVal, traceNonUniformIndex, nonUniformInst);
| ^~~~~~~~~~~~~~
| sharedIndex
/jenkins/workspace/vulkan/sanitized-opensource/Github-PR/llpc-github-pr/driver_build/drivers/llpc/lgc/builder/BuilderImpl.cpp: At global scope:
/jenkins/workspace/vulkan/sanitized-opensource/Github-PR/llpc-github-pr/driver_build/drivers/llpc/lgc/builder/BuilderImpl.cpp:641:13: error: ‘bool instructionsEqual(llvm::Instruction*, llvm::Instruction*)’ defined but not used [-Werror=unused-function]
641 | static bool instructionsEqual(Instruction *lhs, Instruction *rhs) {
| ^~~~~~~~~~~~~~~~~
cc1plus: some warnings being treated as errors
[136/322] Building CXX object compiler/llpc/llvm/tools/Continuations/CMakeFiles/LLVMContinuations.dir/lib/RegisterBuffer.cpp.o
/jenkins/workspace/vulkan/sanitized-opensource/Github-PR/llpc-github-pr/driver_build/drivers/llpc/shared/continuations/lib/RegisterBuffer.cpp: In member function ‘llvm::Value* llvm::RegisterBufferPass::computeMemAddr(llvm::IRBuilder<>&, llvm::Value*)’:

@amdvlk-admin
Collaborator

Test summary for commit f1f1070

CTS tests (Failed: 0/138378)
  • Built with version 1.3.5.2
  • Ubuntu navi3x, Srdcvk
    • Passed: 35162/69163 (50.8%)
    • Failed: 0/69163 (0.0%)
    • Not Supported: 34001/69163 (49.2%)
    • Warnings: 0/69163 (0.0%)
  • Ubuntu navi2x, Srdcvk
    • Passed: 35242/69215 (50.9%)
    • Failed: 0/69215 (0.0%)
    • Not Supported: 33973/69215 (49.1%)
    • Warnings: 0/69215 (0.0%)

@piotrAMD
Contributor

piotrAMD commented Dec 5, 2023

I understand this PR supersedes #2759 as the first commit (195b936) is the same as in #2759. Can you describe what changes are being made in the other one (f1f1070)?

@kmitropoulou
Contributor Author

> I understand this PR supersedes #2759 as the first commit (195b936) is the same as in #2759. Can you describe what changes are being made in the other one (f1f1070)?

The first patch (#2759) has an initial implementation for the scalarization of descriptor loads. This patch enables the scalarization of the non-uniform descriptor loads even if one of the other operands is uniform.
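
The difference between the two behaviours described above can be sketched in plain C++. This is only an illustration under stated assumptions: `Operand`, `canScalarizeAllOrNothing`, and `selectOperandsToScalarize` are hypothetical names that do not appear in LLPC, and uniformity is modelled as a simple flag rather than derived from the IR.

```cpp
#include <vector>

// Hypothetical stand-in for an image-call operand; not an LLPC type.
struct Operand {
  unsigned index;  // position in the call's argument list
  bool isUniform;  // true if the value is the same across all lanes
};

// Behaviour before this PR, as described in the thread: give up on
// scalarization as soon as any operand is uniform.
bool canScalarizeAllOrNothing(const std::vector<Operand> &operands) {
  for (const Operand &op : operands)
    if (op.isUniform)
      return false;
  return !operands.empty();
}

// Behaviour this PR aims for: keep going, but scalarize only the
// operands that are actually non-uniform.
std::vector<unsigned> selectOperandsToScalarize(const std::vector<Operand> &operands) {
  std::vector<unsigned> selected;
  for (const Operand &op : operands)
    if (!op.isUniform)
      selected.push_back(op.index);
  return selected;
}
```

With one non-uniform and one uniform operand, the first function rejects the whole call, while the second returns just the non-uniform operand's index.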

@kmitropoulou
Contributor Author

ping

@amdvlk-admin
Collaborator

Test summary for commit bd31f59

CTS tests (Failed: 0/138378)
  • Built with version 1.3.5.2
  • Ubuntu navi3x, Srdcvk
    • Passed: 35162/69163 (50.8%)
    • Failed: 0/69163 (0.0%)
    • Not Supported: 34001/69163 (49.2%)
    • Warnings: 0/69163 (0.0%)
  • Ubuntu navi2x, Srdcvk
    • Passed: 35241/69215 (50.9%)
    • Failed: 0/69215 (0.0%)
    • Not Supported: 33974/69215 (49.1%)
    • Warnings: 0/69215 (0.0%)

@kmitropoulou
Contributor Author

ping

// @param nonUniformIndex : the non-uniform index of the non-uniform operand of the image call
// @return : the 32-bit value of the nonUniformIndex
Value *get32BitNonUniformIndex(Value *nonUniformIndex) {
if (nonUniformIndex->getType()->isIntegerTy(64)) {
Contributor

The only invocation of get32BitNonUniformIndex is guarded with the same check nonUniformIndex->getType()->isIntegerTy(64). Either remove it there or keep it and add an assert here.

DenseMap<Value *, Value *> nonUniformIndex32BitVal;
for (auto nonUniformIndex : nonUniformIndices) {
// Start the waterfall loop using the waterfall index.
if (nonUniformIndex->getType()->isIntegerTy(64)) {
Contributor

Can nonUniformIndex be a 16-bit value? If so, then it needs to be handled too.

Contributor Author

@kmitropoulou kmitropoulou Dec 15, 2023

I do not know. The original code does not take it into consideration. @nhaehnle @perlfu : What do you think about this?
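
The question is left open in this thread. If narrower indices can occur, one possible approach is to normalize every index to 32 bits regardless of its original width. The sketch below models that on plain integers; `to32BitIndex` is a hypothetical helper, and zero-extension is an assumption (a signed index would need sign-extension instead). In LLVM itself, `IRBuilder::CreateZExtOrTrunc` performs the analogous operation on IR values.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bring an index of any bit width to 32 bits.
// Assumes the index is unsigned, so narrower values are zero-extended.
uint32_t to32BitIndex(uint64_t rawBits, unsigned bitWidth) {
  assert(bitWidth >= 1 && bitWidth <= 64);
  // Discard any bits above the declared width (zero-extension).
  if (bitWidth < 64)
    rawBits &= (uint64_t{1} << bitWidth) - 1;
  // Truncates 64-bit values; leaves 16- and 32-bit values unchanged.
  return static_cast<uint32_t>(rawBits);
}
```

A single helper like this also avoids checking the width once at the call site and again inside the callee.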

// @param nonUniformInst : image call
// @return : the 32-bit value of the nonUniformIndex
Value *getSharedIndex(ArrayRef<Value *> nonUniformIndices, DenseMap<Value *, Value *> &nonUniformIndex32BitVal,
TraceNonUniformIndex &traceNonUniformIndex, Instruction *nonUniformInst) {
Contributor

Unused nonUniformInst?
(However the function declares a local variable nonUniformInstr).

Contributor Author

@kmitropoulou kmitropoulou Dec 15, 2023

Thank you for spotting it. Now, I renamed this function to calculateSharedIndex() and it is a method of TraceNonUniformIndex. As a result, I do not need to pass all these arguments.

@@ -688,109 +741,95 @@ Instruction *BuilderImpl::createWaterfallLoop(Instruction *nonUniformInst, Array
SmallVector<Value *, 2> nonUniformIndices;
// Maps the nonUniformIndex that is returned by traceNonUniformIndex() to the nonUniformInst.
DenseMap<Value *, std::pair<Value *, unsigned>> nonUniformIndexImageCallOperand;
Contributor

Is the nonUniformIndexImageCallOperand map actually being used?

Contributor Author

Nope. Good catch :)

@@ -688,109 +741,95 @@ Instruction *BuilderImpl::createWaterfallLoop(Instruction *nonUniformInst, Array
SmallVector<Value *, 2> nonUniformIndices;
// Maps the nonUniformIndex that is returned by traceNonUniformIndex() to the nonUniformInst.
DenseMap<Value *, std::pair<Value *, unsigned>> nonUniformIndexImageCallOperand;
TraceNonUniformIndex traceNonUniformIndex(nonUniformInst, scalarizeDescriptorLoads, 64);
TraceNonUniformIndex traceNonUniformIndex(nonUniformInst, scalarizeDescriptorLoads);
DenseMap<unsigned, Value *> operandIdxnonUniformIndex;
Contributor

@piotrAMD piotrAMD Dec 13, 2023

Nit: I think the naming convention is operandIdxNonUniformIndex, although I dislike having both Idx and Index at the same time.

Contributor Author

Done :)


if (origNonUniformVal == nonUniformImageCallOperand)
continue;
for (unsigned operandIdx : operandIdxs) {
Contributor

Why does it need to loop through operandIdxs rather than nonUniformIndices?

Contributor Author

The amdgcn.waterfall.begin intrinsic takes the nonUniformIndex as a parameter. The amdgcn.waterfall.readfirstlane intrinsic takes two parameters:

  1. The first argument is the amdgcn.waterfall.begin call.
  2. With scalarization, the second argument is the nonUniformIndex. If there is no nonUniformIndex, or if scalarization is disabled, the second argument is the operand of the non-uniform instruction. The latter is calculated with the help of operandIdxs.
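
As a rough mental model of what a waterfall loop computes at run time, the following is a CPU-side simulation in plain C++. It is not LLPC code and does not use the intrinsics themselves: the function name `waterfall`, the lane model, and the callback signature are all assumptions made for illustration. Each iteration takes the index held by the first still-active lane (the "read first lane" step), runs the body once for every lane that shares that index, and retires those lanes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Simulates a waterfall loop over one wave. laneIndex[i] is the (possibly
// divergent) index held by lane i. body is invoked once per distinct index
// with the list of lanes that share it. Returns the number of iterations.
int waterfall(const std::vector<uint32_t> &laneIndex,
              const std::function<void(uint32_t, const std::vector<size_t> &)> &body) {
  assert(body && "a loop body is required");
  std::vector<bool> active(laneIndex.size(), true);
  int iterations = 0;
  for (;;) {
    // Find the first lane that has not been handled yet.
    size_t first = 0;
    while (first < active.size() && !active[first])
      ++first;
    if (first == active.size())
      break;  // every lane has been handled

    // Its index is now uniform for this iteration.
    const uint32_t uniformIndex = laneIndex[first];

    // Collect and retire all lanes that hold the same index.
    std::vector<size_t> lanes;
    for (size_t lane = first; lane < laneIndex.size(); ++lane) {
      if (active[lane] && laneIndex[lane] == uniformIndex) {
        lanes.push_back(lane);
        active[lane] = false;
      }
    }
    body(uniformIndex, lanes);
    ++iterations;
  }
  return iterations;
}
```

For lane indices {3, 7, 3, 7, 1} the loop runs three times: once for lanes 0 and 2, once for lanes 1 and 3, and once for lane 4. When all lanes hold the same index, it runs exactly once.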

attributes #2 = { nounwind memory(none) }
attributes #3 = { nounwind memory(write) }

!lgc.client = !{!0}
Contributor

Nit: can you remove the metadata? I think most of them are irrelevant.

Contributor Author

@kmitropoulou kmitropoulou Dec 15, 2023

If I remove them, the code does not run through the LLPC pipeline and scalarization is not enabled.

; Function Attrs: nounwind
define dllexport spir_func void @lgc.shader.VS.main() local_unnamed_addr #0 !spirv.ExecutionModel !14 !lgc.shaderstage !15 {
.entry:
%0 = call <4 x i32> (...) @lgc.create.read.generic.input.v4i32(i32 2, i32 0, i32 0, i32 0, i32 0, i32 poison)
Contributor

Our tests normally have names for the instructions rather than numbers (that limits the potential diffs in the test in the future). You can do the automatic conversion by doing "opt -instnamer" on a test file.

Contributor Author

Done :)

@@ -0,0 +1,113 @@
; NOTE: Assertions have been autogenerated by tool/update_llpc_test_checks.py UTC_ARGS: --tool lgc
; RUN: lgc -mcpu=gfx1010 -print-after=lgc-builder-replayer -o - %s 2>&1 | FileCheck --check-prefixes=CHECK %s
; ModuleID = 'lgcPipeline'
Contributor

I think it would be really useful if you could add a comment that says what each test does, for example: "Make sure that there is a waterfall loop where..". Alternatively, use a more descriptive test name (file name).

Contributor Author

I added some comments at the top of the tests.

@kmitropoulou kmitropoulou changed the title from "[LLPC] Scalarize only the non-uniform load descriptors of an image call" to "[WIP][LLPC] Always move non-uniform descriptor loads inside the waterfall loop" on Dec 14, 2023
@amdvlk-admin
Collaborator

Test summary for commit 6ffd47f

CTS tests (Failed: 568/138443)
  • Built with version 1.3.5.2
  • Ubuntu navi3x, Srdcvk
    • Passed: 35154/69228 (50.8%)
    • Failed: 57/69228 (0.1%)

      Failures:

      FAILURE: dEQP-VK.subgroups.ballot_broadcast.compute.subgroupbroadcast_bool_requiredsubgroupsize64
      Stack trace: Crash
      
      FAILURE: dEQP-VK.subgroups.ballot_broadcast.compute.subgroupbroadcast_dvec2_requiredsubgroupsize32
      Stack trace: Crash
      
      FAILURE: dEQP-VK.subgroups.ballot_broadcast.compute.subgroupbroadcast_dvec4
      Stack trace: Crash
      
      FAILURE: dEQP-VK.subgroups.ballot_broadcast.compute.subgroupbroadcast_f16vec2
      Stack trace: Crash
      
      FAILURE: dEQP-VK.subgroups.ballot_broadcast.compute.subgroupbroadcast_nonconst_bvec2
      Stack trace: Crash
      
      FAILURE: dEQP-VK.subgroups.ballot_broadcast.compute.subgroupbroadcast_nonconst_bvec3
      Stack trace: Crash
      
      FAILURE: dEQP-VK.subgroups.ballot_broadcast.compute.subgroupbroadcast_nonconst_bvec4
      Stack trace: Crash
      
      FAILURE: dEQP-VK.subgroups.ballot_broadcast.compute.subgroupbroadcast_nonconst_double_requiredsubgroupsize32
      Stack trace: Crash
      
      FAILURE: dEQP-VK.subgroups.ballot_broadcast.compute.subgroupbroadcast_nonconst_dvec2
      Stack trace: Crash
      
      FAILURE: dEQP-VK.subgroups.ballot_broadcast.compute.subgroupbroadcast_nonconst_dvec4_requiredsubgroupsize64
      Stack trace: Crash
      
      FAILURE: dEQP-VK.subgroups.ballot_broadcast.compute.subgroupbroadcast_nonconst_f16vec4_requiredsubgroupsize64
      Stack trace: Crash
      
      FAILURE: dEQP-VK.subgroups.ballot_broadcast.compute.subgroupbroadcast_nonconst_float16_t
      Stack trace: Crash
      
      FAILURE: dEQP-VK.subgroups.ballot_broadcast.compute.subgroupbroadcast_nonconst_float_requiredsubgroupsize64
      Stack trace: Crash
      
      FAILURE: dEQP-VK.subgroups.ballot_broadcast.compute.subgroupbroadcast_nonconst_i16vec4_requiredsubgroupsize32
      Stack trace: Crash
      
      FAILURE: dEQP-VK.subgroups.ballot_broadcast.compute.subgroupbroadcast_nonconst_i8vec2_requiredsubgroupsize64
      Stack trace: Crash
      
      FAILURE: dEQP-VK.subgroups.ballot_broadcast.compute.subgroupbroadcast_nonconst_u16vec4_requiredsubgroupsize64
      Stack trace: Crash
      
      FAILURE: dEQP-VK.subgroups.ballot_broadcast.compute.subgroupbroadcast_nonconst_vec3_requiredsubgroupsize32
      Stack trace: Crash
      
      FAILURE: dEQP-VK.subgroups.ballot_broadcast.compute.subgroupbroadcast_u16vec3_requiredsubgroupsize64
      Stack trace: Crash
      
      FAILURE: dEQP-VK.subgroups.ballot_broadcast.compute.subgroupbroadcast_u8vec4_requiredsubgroupsize64
      Stack trace: Crash
      
      FAILURE: dEQP-VK.subgroups.ballot_broadcast.framebuffer.subgroupbroadcast_i16vec3geometry
      Stack trace: Crash
      ...
      

    • Not Supported: 34017/69228 (49.1%)
    • Warnings: 0/69228 (0.0%)
  • Ubuntu navi2x, Srdcvk
    • Passed: 34731/69215 (50.2%)
    • Failed: 511/69215 (0.7%)

      Failures:

      FAILURE: dEQP-VK.memory_model.message_passing.core11.u32.coherent.control_barrier.atomicwrite.subgroup.payload_nonlocal.image.guard_nonlocal.buffer.comp
      Stack trace: Crash
      
      FAILURE: dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.subgroup.payload_nonlocal.image.guard_local.buffer.frag
      Stack trace: Crash
      
      FAILURE: dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.subgroup.payload_nonlocal.workgroup.guard_nonlocal.image.comp
      Stack trace: Crash
      
      FAILURE: dEQP-VK.memory_model.message_passing.ext.f32.coherent.atomic_atomic.atomicrmw.subgroup.payload_local.buffer.guard_local.physbuffer.frag
      Stack trace: Crash
      
      FAILURE: dEQP-VK.memory_model.message_passing.ext.f32.coherent.atomic_atomic.atomicrmw.subgroup.payload_local.buffer.guard_nonlocal.image.frag
      Stack trace: Crash
      
      FAILURE: dEQP-VK.memory_model.message_passing.ext.f32.coherent.atomic_atomic.atomicrmw.subgroup.payload_local.physbuffer.guard_local.physbuffer.comp
      Stack trace: Crash
      
      FAILURE: dEQP-VK.memory_model.message_passing.ext.f32.coherent.atomic_atomic.atomicrmw.subgroup.payload_local.physbuffer.guard_nonlocal.image.comp
      Stack trace: Crash
      
      FAILURE: dEQP-VK.memory_model.message_passing.ext.f32.coherent.atomic_atomic.atomicrmw.subgroup.payload_local.physbuffer.guard_nonlocal.image.frag
      Stack trace: Crash
      
      FAILURE: dEQP-VK.memory_model.message_passing.ext.f32.coherent.atomic_atomic.atomicrmw.subgroup.payload_nonlocal.buffer.guard_local.image.vert
      Stack trace: Crash
      
      FAILURE: dEQP-VK.memory_model.message_passing.ext.f32.coherent.atomic_atomic.atomicrmw.subgroup.payload_nonlocal.buffer.guard_local.physbuffer.frag
      Stack trace: Crash
      
      FAILURE: dEQP-VK.memory_model.message_passing.ext.f32.coherent.atomic_atomic.atomicrmw.subgroup.payload_nonlocal.buffer.guard_nonlocal.image.comp
      Stack trace: Crash
      
      FAILURE: dEQP-VK.memory_model.message_passing.ext.f32.coherent.atomic_atomic.atomicrmw.subgroup.payload_nonlocal.image.guard_local.image.vert
      Stack trace: Crash
      
      FAILURE: dEQP-VK.memory_model.message_passing.ext.f32.coherent.atomic_atomic.atomicwrite.subgroup.payload_nonlocal.buffer.guard_nonlocal.physbuffer.comp
      Stack trace: Crash
      
      FAILURE: dEQP-VK.memory_model.message_passing.ext.f32.coherent.atomic_atomic.atomicwrite.subgroup.payload_nonlocal.workgroup.guard_nonlocal.image.comp
      Stack trace: Crash
      
      FAILURE: dEQP-VK.memory_model.message_passing.ext.f32.noncoherent.atomic_atomic.atomicrmw.subgroup.payload_local.image.guard_nonlocal.image.vert
      Stack trace: Crash
      
      FAILURE: dEQP-VK.memory_model.message_passing.ext.f32.noncoherent.atomic_atomic.atomicrmw.subgroup.payload_local.physbuffer.guard_local.buffer.vert
      Stack trace: Crash
      
      FAILURE: dEQP-VK.memory_model.message_passing.ext.f32.noncoherent.atomic_atomic.atomicrmw.subgroup.payload_nonlocal.image.guard_nonlocal.image.vert
      Stack trace: Crash
      
      FAILURE: dEQP-VK.memory_model.message_passing.ext.f32.noncoherent.atomic_atomic.atomicrmw.subgroup.payload_nonlocal.physbuffer.guard_local.image.vert
      Stack trace: Crash
      
      FAILURE: dEQP-VK.memory_model.message_passing.ext.f32.noncoherent.atomic_atomic.atomicrmw.subgroup.payload_nonlocal.physbuffer.guard_nonlocal.physbuffer.frag
      Stack trace: Crash
      
      FAILURE: dEQP-VK.memory_model.message_passing.ext.f32.noncoherent.atomic_atomic.atomicwrite.subgroup.payload_local.buffer.guard_nonlocal.physbuffer.frag
      Stack trace: Crash
      ...
      

    • Not Supported: 33973/69215 (49.1%)
    • Warnings: 0/69215 (0.0%)

@amdvlk-admin
Collaborator

ef07012 Jenkins build error.
/jenkins/workspace/vulkan/sanitized-opensource/Github-PR/llpc-github-pr/driver_build/drivers/llpc/shared/continuations/lib/DXILCont.cpp:314:19: error: ‘class llvm::IRBuilder<>’ has no member named ‘getInt8PtrTy’; did you mean ‘getIntPtrTy’?
314 | auto *PtrTy = B.getInt8PtrTy(static_cast<uint32_t>(*StackAddrspace));
| ^~~~~~~~~~~~
| getIntPtrTy
[131/322] Building CXX object compiler/llpc/CMakeFiles/llpcinternal.dir/translator/lib/SPIRV/libSPIRV/SPIRVEntry.cpp.o
[132/322] Building CXX object compiler/llpc/llvm/tools/Continuations/CMakeFiles/LLVMContinuations.dir/lib/LgcRtDialect.cpp.o

/jenkins/workspace/vulkan/sanitized-opensource/Github-PR/llpc-github-pr/driver_build/drivers/llpc/shared/continuations/lib/RegisterBuffer.cpp:197:22: error: ‘getWithSamePointeeType’ is not a member of ‘llvm::PointerType’
197 | PointerType::getWithSamePointeeType(
| ^~~~~~~~~~~~~~~~~~~~~~
/jenkins/workspace/vulkan/sanitized-opensource/Github-PR/llpc-github-pr/driver_build/drivers/llpc/shared/continuations/lib/RegisterBuffer.cpp:205:24: error: ‘getWithSamePointeeType’ is not a member of ‘llvm::PointerType’
205 | PointerType::getWithSamePointeeType(
| ^~~~~~~~~~~~~~~~~~~~~~
/jenkins/workspace/vulkan/sanitized-opensource/Github-PR/llpc-github-pr/driver_build/drivers/llpc/shared/continuations/lib/RegisterBuffer.cpp: In member function ‘llvm::Value* llvm::RegisterBufferPass::handleSingleLoadStore(llvm::IRBuilder<>&, llvm::Type*, llvm::Value*, llvm::Value*, llvm::Align, llvm::AAMDNodes, bool)’:
/jenkins/workspace/vulkan/sanitized-opensource/Github-PR/llpc-github-pr/driver_build/drivers/llpc/shared/continuations/lib/RegisterBuffer.cpp:245:29: error: ‘getWithSamePointeeType’ is not a member of ‘llvm::PointerType’
245 | Address, PointerType::getWithSamePointeeType(AddressType,
| ^~~~~~~~~~~~~~~~~~~~~~
[135/322] Building CXX object compiler/llpc/llvm/tools/Continuations/CMakeFiles/LLVMContinuations.dir/lib/LowerAwait.cpp.o
[136/322] Building CXX object compiler/llpc/llvm/tools/Continuations/CMakeFiles/LLVMContinuations.dir/lib/LegacyCleanupContinuations.cpp.o
[137/322] Building CXX object compiler/llpc/llvm/tools/Continuations/CMakeFiles/LLVMContinuations.dir/lib/RemoveTypesMetadata.cpp.o
