Enable grid reductions within loops. #1681
Conversation
Allow re-entrant grid reductions so that they can be placed inside loops without turning the kernel cooperative.
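As a rough sketch of what the PR does (a hypothetical helper condensed from the diffs below, using only names that appear in them; the real logic lives in IndexLowering's getGridCommWorkBufferSize), the grid communication work buffer is expanded by the extent of every non-trivial, non-thread loop surrounding the reduction, so each loop iteration ("entrance") writes to its own slice of the buffer and the kernel no longer has to be launched cooperatively:

// Hypothetical condensation of the sizing logic; not the literal PR code.
Val* expandBufferSizeByLoops(
    Val* buffer_size,
    const std::vector<kir::ForLoop*>& for_loops) {
  for (auto fl : for_loops) {
    // Trivial and thread-parallelized loops do not add entrances.
    if (fl->isTrivial() || fl->iter_domain()->isThread()) {
      continue;
    }
    // Assumed IR-builder multiply; one buffer slice per loop iteration.
    buffer_size = IrBuilder::mulExpr(buffer_size, fl->iter_domain()->extent());
  }
  return buffer_size;
}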
Looks good so far. Some early comments.
    const nvfuser_index_t entrance_ind,
    const nvfuser_index_t n_entrances) {
nit: can't imagine these would need 64 bits...
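Just to illustrate the nit (not part of the PR), a narrower fixed-width type would look like this, assuming 32 bits always suffices for the entrance count:

    const uint32_t entrance_ind,  // hypothetical narrower type
    const uint32_t n_entrances) {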
    if (fl->isTrivial()) {
      continue;
    }
    if (fl->iter_domain()->isThread()) {
fl->isTrivial() should include this condition.
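A minimal sketch of the suggestion, assuming ForLoop::isTrivial() were extended to also cover thread-parallelized loops (a hypothetical change, not in this PR):

for (auto fl : for_loops_) {
  // With the extended isTrivial(), the separate iter_domain()->isThread()
  // check would no longer be needed here.
  if (fl->isTrivial()) {
    continue;
  }
  // ... only genuinely serial loops reach this point ...
}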
          (grouped_rop->isAllreduce() && is_within_a_loop ? 2 : 1)),
      output->dtype(),
      false);
  });

  const auto sync_buffer = ir_utils::allocGlobalBufferForGridComm(
-     getGridSyncBufferSize(out_domain), DataType::Int, true);
+     getGridSyncBufferSize(out_domain, for_loops_), DataType::Int, true);
GroupedReduction doesn't support reentrance, so this is not necessary right now.
@@ -271,12 +347,25 @@ void IndexLowering::handleGridReduction(

  const auto reduce_buffer = ir_utils::allocGlobalBufferForGridComm(
      getGridCommWorkBufferSize(
-         out_domain, rop->isAllreduce() && is_within_a_loop ? 2 : 1),
+         out_domain,
+         rop->isAllreduce() ? std::vector<kir::ForLoop*>() : for_loops_,
Would be nice if we could refactor the conditional logic that decides when to expand the buffer.
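One possible shape for that refactor, sketched with a hypothetical helper (loopsForBufferExpansion is not in the PR): put the allreduce-vs-loops decision in one place so the grouped and non-grouped paths share it.

// Hypothetical helper mirroring the conditional in the diff above: allreduce
// keeps the unexpanded buffer, otherwise the surrounding loops drive the
// expansion.
std::vector<kir::ForLoop*> loopsForBufferExpansion(
    bool is_allreduce,
    const std::vector<kir::ForLoop*>& for_loops) {
  return is_allreduce ? std::vector<kir::ForLoop*>() : for_loops;
}

// Both call sites could then read:
//   getGridCommWorkBufferSize(
//       out_domain, loopsForBufferExpansion(rop->isAllreduce(), for_loops_))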
    return buffer_size;
  }

- Val* getGridSyncBufferSize(const TensorDomain* td) {
+ Val* getGridSyncBufferSize(
Do we need to expand the sync buffer?
No, we shouldn't need to, because we wait until all iterations are done before starting to clean any of them up. Maybe that's a reason it's slow; I think we should use multiple sync buffers for each reduction!
I don't know if it would make a difference, but seems like it's low risk.
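A rough sketch of the multiple-sync-buffer idea (hypothetical, not what this PR does): expand the semaphore buffer per entrance as well and have each iteration index its own slot, so iteration i+1 never waits on the cleanup of iteration i's semaphores.

// Hypothetical device-side indexing; segments_per_entrance is an assumed name
// for the number of semaphores one grid reduction needs.
auto* my_sync = &sync_buffer[entrance_ind * segments_per_entrance];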
Closing in favor of #1698