[MLIR][OpenMP] Lower only target related pragmas for GPU MLIR #40

Status: Open. Wants to merge 1 commit into base: amd-trunk-dev.
@@ -127,6 +127,46 @@ findAllocaInsertPoint(llvm::IRBuilderBase &builder,
&funcEntryBlock, funcEntryBlock.getFirstInsertionPt());
}

static bool isOpAllowedToBeLowered(Operation *opInst,

Nit: saying an operation is "not allowed" to be lowered implies, I think, that lowering it would be an error, which is not the intent. Maybe we can rename this to something more along the lines of "operation should be skipped" or "operation not applicable". It doesn't have to be the name I suggest, but something along those lines.

Suggested change:
-static bool isOpAllowedToBeLowered(Operation *opInst,
+static bool shouldSkipOpLowering(Operation *opInst,

If you agree to make this name change, note that the logic would be reversed.
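
For illustration, a minimal sketch of what the renamed helper and its call site could look like with the boolean sense flipped (hypothetical; the patch as posted keeps isOpAllowedToBeLowered):

static bool shouldSkipOpLowering(Operation *opInst,
                                 llvm::OpenMPIRBuilder *ompBuilder) {
  if (!opInst)
    return true; // nothing to lower, so skip
  // omp.target itself, and any OpenMP op nested inside one, is never skipped.
  if (isa<omp::TargetOp>(opInst) || opInst->getParentOfType<omp::TargetOp>())
    return false;
  // ... remaining declare-target checks with each return value inverted:
  // return true (skip) where the original returned false, and vice versa.
  return false;
}

// The call site in convertOperation() then loses its negation:
if (shouldSkipOpLowering(op, ompBuilder))
  return success();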

llvm::OpenMPIRBuilder *ompBuilder) {
if (!opInst)
return false;
Comment on lines +132 to +133

I think this wouldn't be necessary, since the original code doesn't check this.

// omp.target operation can be lowered for host and device MLIR
if (isa<omp::TargetOp>(opInst))
return true;

// OpenMP operations inside omp.target can be lowered for host and device MLIR
if (opInst->getParentOfType<omp::TargetOp>())
return true;

// TODO: Add support for test case:
// omp.parallel { //host pragma
// omp.target { }
// }
bool hasTargetRegion =
opInst->walk([](omp::TargetOp) { return WalkResult::interrupt(); })
.wasInterrupted();
if (hasTargetRegion)
opInst->emitError("Target region inside other pragma is not yet supported");
Comment on lines +146 to +150

I guess this should be checked before the previous check that looks for a parent omp.target. Otherwise, this wouldn't trigger the error:

omp.target {
  omp.teams {
    omp.target { }
  }
}

That kind of thing might eventually appear for reverse-offload regions, though we're a long way off from supporting that yet.
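
A sketch of that reordering (hypothetical; it also has to account for Operation::walk visiting opInst itself, so a plain omp.target must not trip its own error):

// Scan for nested omp.target ops *before* the parent-of-target early return,
// skipping opInst itself.
bool hasNestedTarget =
    opInst
        ->walk([&](omp::TargetOp targetOp) {
          return targetOp.getOperation() == opInst ? WalkResult::advance()
                                                   : WalkResult::interrupt();
        })
        .wasInterrupted();
if (hasNestedTarget)
  opInst->emitError("Target region inside other pragma is not yet supported");

// Only after the scan: omp.target and anything nested inside it is lowered.
if (isa<omp::TargetOp>(opInst) || opInst->getParentOfType<omp::TargetOp>())
  return true;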


Question for @skatrak - What is the meaning of an omp.target nested inside another omp.target? Is it a NOP? (I am guessing not, since you refer to "reverse offloading".)


Right, what happens when you enter an OpenMP target region is that execution moves to a target device. If a target device encounters a target region annotated with the ancestor device modifier, then that region must run on the host, and the device waits for it to finish before resuming. This is called reverse offloading, which is not implemented in either flang or clang as far as I know. Sections 13.2 and 13.8 of the 5.2 spec explain this if you'd like a bit more info about it.

Since it's not implemented yet, we can't be 100% sure of how we're going to represent this in MLIR, but I think it's likely that we would just follow the way it's done in the source and get something similar to this:

// runs in the host...
omp.target {
  // runs in the device...
  omp.target device(ancestor = 1) {
    // runs in the host...
  }
  // continues running in the device...
}
// continues running in the host...

This is where I think nested omp.target ops could come from in the future. Not sure if it's possible to create more target regions within a reverse-offload region, but if that's the case I guess it would be possible to have a multi-level nest of target regions at some point, interleaving host and device code.

@bhandarkar-pranav (Mar 13, 2024):

Thanks. I hope applications that do this have kernels big enough to make all the shuttling between host and device worthwhile (or that the cost of the shuttling is kept really low through zero-copy buffers and the like, though there can still be such a thing as too much shuttling, I think).


// Check if the given OpenMP operation belongs to a function labelled with
// the omp declare target pragma
LLVM::LLVMFuncOp funcOp = opInst->getParentOfType<LLVM::LLVMFuncOp>();
omp::DeclareTargetDeviceType declareType = omp::DeclareTargetDeviceType::host;

if (!funcOp)
return false;

Maybe the default if it's not inside of a function should be to lower it rather than ignoring it.

Suggested change:
-    return false;
+    return true;

auto declareTargetOp =
dyn_cast<omp::DeclareTargetInterface>(funcOp.getOperation());
if (declareTargetOp && declareTargetOp.isDeclareTarget())
declareType = declareTargetOp.getDeclareTargetDeviceType();
if ((declareType == omp::DeclareTargetDeviceType::host) &&
ompBuilder->Config.isGPU()) {

Suggested change:
-      ompBuilder->Config.isGPU()) {
+      ompBuilder->Config.isTargetDevice()) {

return false;
}
return true;
}

/// Converts the given region that appears within an OpenMP dialect operation to
/// LLVM IR, creating a branch from the `sourceBlock` to the entry block of the
/// region, and a branch from any block with a successor-less OpenMP terminator
@@ -3182,6 +3222,9 @@ LogicalResult OpenMPDialectLLVMIRTranslationInterface::convertOperation(

llvm::OpenMPIRBuilder *ompBuilder = moduleTranslation.getOpenMPBuilder();

// Skip lowering of an OpenMP operation if it's context is not appropriate

Suggested change:
-  // Skip lowering of an OpenMP operation if it's context is not appropriate
+  // Skip lowering of an OpenMP operation if its context is not appropriate.

if (!isOpAllowedToBeLowered(op, ompBuilder))
return success();
return llvm::TypeSwitch<Operation *, LogicalResult>(op)
.Case([&](omp::BarrierOp) {
ompBuilder->createBarrier(builder.saveIP(), llvm::omp::OMPD_barrier);
3 changes: 2 additions & 1 deletion mlir/test/Target/LLVMIR/omptarget-parallel-wsloop.mlir
@@ -6,7 +6,8 @@
module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<"dlti.alloca_memory_space", 5 : ui32>>, llvm.data_layout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8", llvm.target_triple = "amdgcn-amd-amdhsa", omp.is_gpu = true, omp.is_target_device = true } {
llvm.func @target_parallel_wsloop(%arg0: !llvm.ptr) attributes {
target_cpu = "gfx90a",
-    target_features = #llvm.target_features<["+gfx9-insts", "+wavefrontsize64"]>
+    target_features = #llvm.target_features<["+gfx9-insts", "+wavefrontsize64"]>,
+    omp.declare_target = #omp.declaretarget<device_type = (any)>
} {
omp.parallel {
%loop_ub = llvm.mlir.constant(9 : i32) : i32
2 changes: 1 addition & 1 deletion mlir/test/Target/LLVMIR/omptarget-wsloop-collapsed.mlir
@@ -4,7 +4,7 @@
// for nested omp do loop with collapse clause inside omp target region

module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<"dlti.alloca_memory_space", 5 : ui32>>, llvm.data_layout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8", llvm.target_triple = "amdgcn-amd-amdhsa", omp.is_gpu = true, omp.is_target_device = true } {
-  llvm.func @target_collapsed_wsloop(%arg0: !llvm.ptr) {
+  llvm.func @target_collapsed_wsloop(%arg0: !llvm.ptr) attributes {omp.declare_target = #omp.declaretarget<device_type = (any)> } {
%loop_ub = llvm.mlir.constant(99 : i32) : i32
%loop_lb = llvm.mlir.constant(0 : i32) : i32
%loop_step = llvm.mlir.constant(1 : index) : i32
6 changes: 4 additions & 2 deletions mlir/test/Target/LLVMIR/omptarget-wsloop.mlir
@@ -4,7 +4,8 @@
// for nested omp do loop inside omp target region

module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<"dlti.alloca_memory_space", 5 : ui32>>, llvm.data_layout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8", llvm.target_triple = "amdgcn-amd-amdhsa", omp.is_gpu = true, omp.is_target_device = true } {
-  llvm.func @target_wsloop(%arg0: !llvm.ptr ){
+  llvm.func @target_wsloop(%arg0: !llvm.ptr ) attributes {
+    omp.declare_target = #omp.declaretarget<device_type = (any)> } {
%loop_ub = llvm.mlir.constant(9 : i32) : i32
%loop_lb = llvm.mlir.constant(0 : i32) : i32
%loop_step = llvm.mlir.constant(1 : i32) : i32
@@ -16,7 +17,8 @@ module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<"dlti.alloca_memo
llvm.return
}

-  llvm.func @target_empty_wsloop(){
+  llvm.func @target_empty_wsloop() attributes {
+    omp.declare_target = #omp.declaretarget<device_type = (any)> } {
%loop_ub = llvm.mlir.constant(9 : i32) : i32
%loop_lb = llvm.mlir.constant(0 : i32) : i32
%loop_step = llvm.mlir.constant(1 : i32) : i32