IterDomain resize for pad, cat, slice #2480
Conversation
I think all my concerns have been resolved. This PR LGTM now. I am not stamping to avoid accidental merge because @csarofeen hasn't done review yet. Please let me know if you want a stamp.
I revisited them and added more descriptive error messages.

@zasdfgbnm I added this commit that adds an override for
Finished reviewing everything. I don't see any blockers. Nice work.
//! move its producer to a shared memory that is sufficient to satisfy
//! the dependency. A proper RAW sync will be automatically inserted
//! when the fusion is lowered.
TORCH_CUDA_CU_API void promoteProducerMemoryTypesOfResizedTensors(
Should this be part of the heuristics in case shared memory isn't large enough to do this?
Agreed, but I'm not sure what the right heuristics should look like. If a large enough shared memory buffer isn't available, the easiest workaround would be using global memory, but that could be a big perf overhead. Should we instead consider a different parallelization that won't require a RAW sync? To develop an effective heuristic, we would need to see real instances. I'll keep it as is for now. If the buffer space becomes a problem, we should see a hard error before a kernel is launched.
// resize is not indexed as the indexing is done using the rfactor root
// domain. This could be an issue when a resize shows up outside of
// rfactor transformations, but currently that can only happen when a
// producer tensor is transformed to look like a consumer. Since inlining is
Okay, this was what I was missing about having resize outside root->rfactor.
}

// Disabled due to the same reason as Pad8
#if 0
Shouldn't this case just get segmented so the multiple resizes don't appear in the same kernel?
I don't think we can enable support if we don't segment properly in all cases.
An interesting part of this fusion is that, although `tv1` is padded twice differently, the two padding outputs, `tv2` and `tv3`, are summed together, producing `tv4`, so the rfactor IDs of `tv2`, `tv3` and the root IDs of `tv4` are all exactly mapped. If not, multiple different resize ops should result in segmented fusions, much like how different views are handled.

This test actually works successfully. It's just scheduled as a single pointwise fusion. But I'm not sure how a fusion like this should be scheduled.
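For reference, a rough sketch of the kind of fusion being described, written in the style of the tests in this PR; the test name, tensor names, and pad widths here are illustrative, not the actual disabled test:

```cpp
// Hypothetical sketch: tv1 is padded two different ways, and the padded
// outputs are added together. tv2 and tv3 have different root->rfactor
// resize paths, but their rfactor IDs and tv4's root IDs map exactly.
TEST_F(NVFuserTest, FusionResizePadTwiceSketch_CUDA) {
  Fusion fusion;
  FusionGuard fg(&fusion);

  auto tv0 = makeSymbolicTensor(1);
  fusion.addInput(tv0);

  auto tv1 = set(tv0);
  auto tv2 = pad(tv1, {IrBuilder::create<Int>(1), IrBuilder::create<Int>(0)});
  auto tv3 = pad(tv1, {IrBuilder::create<Int>(0), IrBuilder::create<Int>(1)});
  auto tv4 = add(tv2, tv3);
  fusion.addOutput(tv4);
}
```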
So the point being root and rfactor match exactly because producers and consumers are common, but the root->rfactor paths are different. That's interesting.
That's right.
}

// Cat many tensors
TEST_F(NVFuserTest, FusionResizeCat7_CUDA) {
Do you check that an error gets thrown if runtime values don't support the requested concat?
The type and number of dimensions must match:
https://github.com/csarofeen/pytorch/pull/2480/files/82845fa11d79a153f62fd11fa4daf473476cd728#diff-bee9b138517ab66ac5e215f0603aacbacc8984fda748a6d636e36595ec0b12deR467-R494.
But that's it. For example, the extents of the non-concatenated dimensions must also match, but that isn't checked, just like normal arithmetic ops, where we assume the two operands have the same shape.
I think it makes sense to add a runtime check for this.
I'm not sure what to check. We don't check, for example, in `add(tv0, tv1)`, that `tv0` and `tv1` have the same shape, but rather we assume that's the case and use that to build the IterDomain graph. Following the same approach, non-concatenated domains should be assumed to be the same.
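To make the comparison concrete, a hypothetical example (shapes and tensor names are illustrative):

```cpp
// Concatenate along dim 0. As with add(tv0, tv1), the extent of the
// non-concatenated dim 1 (8 here) is assumed to match and is used to build
// the IterDomain graph; it is not separately validated.
auto tv0 = makeConcreteTensor({4, 8});
auto tv1 = makeConcreteTensor({3, 8});
fusion.addInput(tv0);
fusion.addInput(tv1);
auto tv2 = cat({tv0, tv1}, 0); // result shape: {7, 8}
fusion.addOutput(tv2);
```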
Heh, that's a good point, and that's why this would error out in the expression evaluator, as our evaluated symbols for parallelization wouldn't match. Okay, I will put it in the large bucket of things we should do. It would be nice to have a validation check that everything matches as expected across the iter domain relationships.
Something like a pedantic mode, where everything is validated no matter how high its overhead would be.
auto tv5 = sum(tv3, {1});
auto tv6 = sum(tv4, {1});
auto tv7 = cat({tv5, tv6}, 0);
fusion.addOutput(tv7);
Are you thinking register+smem persistent kernels here??
Yes. That's still a big gap I really wanted to fill.
I'm getting a segfault from this PR. NVIDIA/Fuser#3 (comment)

Hmm, strange. Looking into it.
Cherry-pick of csarofeen/pytorch@1e30fee Original PR: csarofeen/pytorch#2480

Cherry-picking from: csarofeen/pytorch#2480 Author: Naoya Maruyama <naoyam@users.noreply.github.com> Date: Mon Mar 13 17:12:01 2023 -0700 IterDomain resize for pad, cat, slice (#2480) --------- Co-authored-by: samnordmann <snordmann@nvidia.com> Co-authored-by: Naoya Maruyama <naoyam@users.noreply.github.com> Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
This PR adds codegen support for `pad`, `slice` and `cat`. All of them are represented using a new IterDomain expression, `Resize`. Please see test_gpu_resize.cpp for concrete examples.

The pointwise, reduction and normalization schedulers should be able to work with resized domains. Transformation and parallelization propagations are done while skipping, or forwarding, resized domains. So, for example:
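A rough, illustrative sketch of the kind of fusion and transform propagation meant here, assuming the usual nvFuser test helpers such as `makeSymbolicTensor`, `pad`, `TransformPropagator` and `MaxRootDomainInfoSpanningTree` (not the exact snippet from the tests):

```cpp
Fusion fusion;
FusionGuard fg(&fusion);

auto tv0 = makeSymbolicTensor(2);
fusion.addInput(tv0);

auto tv1 = set(tv0);
// tv2 has a resized (padded) IterDomain between its root and rfactor domains.
auto tv2 = pad(tv1, {IrBuilder::create<Int>(1), IrBuilder::create<Int>(1)});
fusion.addOutput(tv2);

// Schedule the consumer and propagate its transformations to producers.
// The propagation forwards over the resized domain instead of replaying it.
tv2->merge(0);
tv2->split(0, 128);
TransformPropagator propagator(tv2);
MaxRootDomainInfoSpanningTree(tv2).traverse(&propagator);
```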
And for parallelization:
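A similarly hedged continuation of the sketch above, with `parallelizeAllLike` assumed as the usual scheduler utility for propagating parallel types:

```cpp
// Parallelize the reference tensor and propagate the parallel types to the
// other tensors; resized domains are skipped during the propagation.
tv2->axis(0)->parallelize(ParallelType::BIDx);
tv2->axis(1)->parallelize(ParallelType::TIDx);
scheduler_utils::parallelizeAllLike(tv2);
```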
Unlike these propagations, inlining is not allowed between these domains with different extents, at least at this moment. So, `inlineMost` would not change the CA position of `tv0` from 0.

The difference of honoring (i.e., inlining) or ignoring resize (transform and parallel propagations) requires different behavior in some of the underlying replay primitives such as `BestEffortReplay`. As you can see, a new option, `skip_resize`, is added, which behaves similarly to `skip_swizzle`. This unfortunately makes them even more error prone due to optional boolean flags (here's a recent bug we found). I'm already prototyping a cleanup PR to refactor these options so that they are more explicitly configured, with little ambiguity.

I'm also thinking about adding two more IterDomain expressions, `gather` and `scatter`. With `resize` and `gather`, we should be able to implement the PyTorch gather.

Another thing I'm really interested in trying out is to use `slice` and `cat` to allow multiple memory spaces to be used for a single tensor in persistent kernels. I believe this should be possible, but it would need some more work.

TODOs
TODOs likely not in this PR