
IterDomain resize for pad, cat, slice #2480

Merged: 93 commits merged into devel on Mar 14, 2023

Conversation

@naoyam (Collaborator) commented Feb 16, 2023

This PR adds codegen support for pad, slice, and cat. All of them are represented using a new IterDomain expression, Resize. Please see test_gpu_resize.cpp for concrete examples.
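
As a rough illustration of the new ops, here is a minimal sketch modeled on the tests in test_gpu_resize.cpp (the exact builder signatures, e.g. the pad-width and slice argument forms, may differ slightly from what is shown):

Fusion fusion;
FusionGuard fg(&fusion);

auto tv0 = makeSymbolicTensor(1); // [I0]
auto tv1 = makeSymbolicTensor(1); // [I0]
fusion.addInput(tv0);
fusion.addInput(tv1);

// pad: zero-pad tv0 by one element on each side -> [I0 + 2]
auto tv2 = pad(tv0, {IrBuilder::create<Int>(1), IrBuilder::create<Int>(1)});
fusion.addOutput(tv2);

// slice: drop the first and last element of tv0 -> [I0 - 2]
auto tv3 = slice(
    tv0,
    {{IrBuilder::create<Int>(1),
      sub(tv0->axis(0)->extent(), IrBuilder::create<Int>(1))}});
fusion.addOutput(tv3);

// cat: concatenate tv0 and tv1 along axis 0 -> [2 * I0]
auto tv4 = cat({tv0, tv1}, 0);
fusion.addOutput(tv4);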

The pointwise, reduction and normalization schedulers should be able to work with resized domains. Transformation and parallelization propagations are done while skipping, or forwarding, resized domains. So, for example:

tv0: [I0, I1]
tv1: [I0+2, I1] // axis 0 is padded by one at each of its left and right sides

tv1->split(0, 4);
tv1: [ceilDiv(I0 + 2, 4), 4, I1]

// After propagating tv1 to tv0
tv0: [ceilDiv(I0, 4), 4, I1]

And for parallelization:

tv1->axis(0)->parallelize(ParallelType::BIDx);
tv1: [bidx(ceilDiv(I0 + 2, 4)), 4, I1]

// After propagating tv1 to tv0
tv0: [bidx(ceilDiv(I0, 4)), 4, I1]

Unlike these propagations, inlining is not allowed between domains with different extents, at least for now. So, inlineMost would not change the CA position of tv0 from 0.
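
For reference, a minimal scheduling sketch of the above, using the existing propagation utilities (illustrative only; the exact calls used in the tests may differ):

tv1->split(0, 4);
tv1->axis(0)->parallelize(ParallelType::BIDx);

// Propagate transforms and parallelization to producers, skipping the
// resized axis as described above.
TransformPropagator propagator(tv1);
MaxRootDomainInfoSpanningTree(tv1).traverse(&propagator);
scheduler_utils::parallelizeAllLike(tv1);

// Inlining across the resized axis is not allowed, so this leaves tv0 at
// CA position 0.
inlineMost();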

The difference between honoring resize (i.e., inlining) and ignoring it (transform and parallel propagations) requires different behavior in some of the underlying replay primitives such as BestEffortReplay. As you can see, a new option, skip_resize, is added, which behaves similarly to skip_swizzle. Unfortunately, this makes these primitives even more error prone due to the growing set of optional boolean flags (here's a recent bug we found). I'm already prototyping a cleanup PR to refactor these options so that they are configured explicitly, with little ambiguity.
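
One possible direction for that cleanup is grouping the flags into an explicitly named options struct (purely illustrative; the struct and field names below are hypothetical, not the actual BestEffortReplay interface):

struct ReplayOptions {
  bool skip_target_swizzle = false;
  bool skip_replay_swizzle = false;
  bool skip_resize = false;
};

// Call sites then state exactly what they opt into, instead of passing a
// positional list of booleans:
ReplayOptions opts;
opts.skip_resize = true;
// BestEffortReplay replay(target_domain, reference_domain, root_map, opts);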

I'm also thinking about adding two more IterDomain expressions, gather and scatter. With resize and gather, we should be able to implement the PyTorch gather.

Another thing I'm really interested in trying out is to use slice and cat to allow multiple memory spaces to be used for a single tensor in persistent kernels. I believe this should be possible, but it would need some more work.

TODOs

  • Add validation of resize usage: Resize must appear only between the root and rfactor domains when given to GpuLower (see the sketch after this list)
  • Fix string matching errors of generated code with some of the C++ tests
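
A rough sketch of what that validation could look like (hypothetical; checkResizeUsage is not an existing pass, and the traversal is hand-rolled for illustration):

// Check that no Resize expression appears between the rfactor and leaf
// domains of any tensor, i.e., Resize is confined to root->rfactor.
void checkResizeUsage(Fusion* fusion) {
  for (auto tv : ir_utils::allTvs(fusion)) {
    std::deque<Val*> to_visit(
        tv->domain()->domain().begin(), tv->domain()->domain().end());
    std::unordered_set<Val*> rfactor_ids(
        tv->getMaybeRFactorDomain().begin(),
        tv->getMaybeRFactorDomain().end());
    while (!to_visit.empty()) {
      auto val = to_visit.front();
      to_visit.pop_front();
      if (rfactor_ids.count(val) || val->definition() == nullptr) {
        continue; // reached the rfactor domain or an unscheduled ID
      }
      TORCH_INTERNAL_ASSERT(
          !val->definition()->isA<Resize>(),
          "Resize must only appear between root and rfactor domains: ",
          val->toString());
      for (auto inp : val->definition()->inputs()) {
        to_visit.push_back(inp);
      }
    }
  }
}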

TODOs likely not in this PR

  • Support non-zero constant padding

@zasdfgbnm (Collaborator) left a comment

I think all my concerns have been resolved. This PR LGTM now. I am not stamping to avoid accidental merge because @csarofeen hasn't done review yet. Please let me know if you want a stamp.

@naoyam (Collaborator Author) commented Mar 8, 2023

> It would be nice if we added some more information to some of the asserts in the PR; especially when an operator fails due to user input, it would be good to mention which operation and which input caused the issue.

I revisited them and added more descriptive error messages.

@naoyam (Collaborator Author) commented Mar 8, 2023

@zasdfgbnm I added this commit that adds an override for Resize: f187d6d. This is only used when resize is used with a tensor that has siblings, and since there's no such expr, it's just for consistency. I also noticed there's no override for Swizzle2D, which is fine as there shouldn't be any multi-output expr whose outputs should be swizzled, but added an assert just in case.

@naoyam naoyam requested a review from csarofeen March 8, 2023 05:07
@csarofeen (Owner) left a comment

Finished reviewing everything. I don't see any blockers. Nice work.

On third_party/nvfuser/csrc/lower_validation.cpp (resolved):
//! move its producer to a shared memory that is sufficient to satisfy
//! the dependency. A proper RAW sync will be automatically inserted
//! when the fusion is lowered.
TORCH_CUDA_CU_API void promoteProducerMemoryTypesOfResizedTensors(
@csarofeen (Owner)

Should this be part of the heuristics in case shared memory isn't large enough to do this?

@naoyam (Collaborator Author)

Agreed, but I'm not sure what the right heuristics should look like. If a large enough shared memory buffer isn't available, the easiest workaround would be to use global memory, but that could be a big perf overhead. Should we instead consider a different parallelization that won't require a RAW sync? To develop an effective heuristic, we would need to see real instances. I'll keep it as is for now. If the buffer space becomes a problem, we should see a hard error before a kernel is launched.
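
For illustration, one simple form such a heuristic could take is a capacity check before promoting (a hypothetical sketch; requiredPromotionBytes does not exist and just stands in for summing up the promoted buffer sizes):

// Hypothetical capacity check a heuristic could run before promoting
// producers of resized tensors to shared memory.
bool canPromoteToSharedMemory(Fusion* fusion) {
  // requiredPromotionBytes: hypothetical helper summing the promoted buffers.
  const int64_t required_bytes = requiredPromotionBytes(fusion);
  const auto* dev_prop = at::cuda::getCurrentDeviceProperties();
  const int64_t available_bytes =
      static_cast<int64_t>(dev_prop->sharedMemPerBlockOptin);
  return required_bytes <= available_bytes;
}

// If this returns false, fall back to global memory (slower) or pick a
// parallelization that avoids the RAW sync altogether.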

// resize is not indexed as the indexing is done using the rfactor root
// domain. This could be an issue when a resize shows up outside of
// rfactor transformations, but currently that can only happen when a
// producer tensor is transformed to look like a consumer. Since inlining is
@csarofeen (Owner)

Okay, this was what I was missing on having resize outside root->rfactor.

}

// Disabled due to the same reason as Pad8
#if 0
@csarofeen (Owner)

Shouldn't this case just get segmented so the multiple resizes don't appear in the same kernel?

@csarofeen (Owner)

I don't think we can enable support if we don't segment properly on all cases.

@naoyam (Collaborator Author)

An interesting part of this fusion is that, although tv1 is padded twice differently, the two padding outputs, tv2 and tv3, are summed together to produce tv4, so the rfactor IDs of tv2 and tv3 and the root IDs of tv4 are all exactly mapped. If they weren't, multiple different resize ops would result in segmented fusions, much like different views are handled.

This test actually works successfully. It's just scheduled as a single pointwise fusion. But I'm not sure how a fusion like this should be scheduled.
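
For reference, a hypothetical reconstruction of the kind of fusion being discussed (pad widths and tensor names are illustrative, not the exact disabled test):

auto tv0 = makeSymbolicTensor(1); // [I0]
fusion.addInput(tv0);
auto tv1 = set(tv0);
// Two different paddings with the same total amount, so the padded extents
// (and hence the rfactor IDs of tv2 and tv3) are exactly mapped.
auto tv2 = pad(tv1, {IrBuilder::create<Int>(1), IrBuilder::create<Int>(1)});
auto tv3 = pad(tv1, {IrBuilder::create<Int>(2), IrBuilder::create<Int>(0)});
auto tv4 = add(tv2, tv3);
fusion.addOutput(tv4);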

@csarofeen (Owner)

So the point being root and rfactor match exactly because producers and consumers are common, but the root->rfactor paths are different. That's interesting.

@naoyam (Collaborator Author)

That's right.

}

// Cat many tensors
TEST_F(NVFuserTest, FusionResizeCat7_CUDA) {
@csarofeen (Owner)

Do you check that an error gets thrown if runtime values don't support the requested concat?

@naoyam (Collaborator Author)

The type and number of dimensions must match:
https://github.com/csarofeen/pytorch/pull/2480/files/82845fa11d79a153f62fd11fa4daf473476cd728#diff-bee9b138517ab66ac5e215f0603aacbacc8984fda748a6d636e36595ec0b12deR467-R494.

But that's it. For example, the extents of the non-concatenated dimensions must match, but that's not checked, just like with normal arithmetic ops, where we assume the two operands have the same shape.

@csarofeen (Owner)

I think it makes sense to add a runtime check for this.

@naoyam (Collaborator Author)

I'm not sure what to check. For example, in add(tv0, tv1), we don't check that tv0 and tv1 have the same shape; rather, we assume that's the case and use that assumption to build the IterDomain graph. Following the same approach, non-concatenated domains should be assumed to be the same.

@csarofeen (Owner)

Heh, that's a good point, and that's why this would error out in the expression evaluator, as our evaluated symbols for parallelization wouldn't match. Okay, I'll put it in the large bucket of things we should do. It would be nice to have a validation check that everything matches as expected across the IterDomain relationships.

@naoyam (Collaborator Author)

Something like a pedantic mode, where everything is validated no matter how high its overhead would be.
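
As one illustration, a runtime check for cat in such a mode could look like this (a hypothetical sketch; validateCatInputExtents is not an existing function):

// Hypothetical pedantic-mode check: verify that the non-concatenated extents
// of all cat inputs actually match for the given runtime inputs.
void validateCatInputExtents(
    const std::vector<at::Tensor>& inputs,
    int64_t cat_dim) {
  TORCH_CHECK(!inputs.empty(), "cat expects at least one input");
  const auto ref_sizes = inputs[0].sizes();
  for (const auto& t : inputs) {
    TORCH_CHECK(
        t.dim() == inputs[0].dim(),
        "cat inputs must have the same number of dimensions");
    for (int64_t i = 0; i < t.dim(); ++i) {
      if (i == cat_dim) {
        continue; // only the concatenated extent may differ
      }
      TORCH_CHECK(
          t.size(i) == ref_sizes[i],
          "Mismatched extent in non-concatenated dimension ", i);
    }
  }
}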

auto tv5 = sum(tv3, {1});
auto tv6 = sum(tv4, {1});
auto tv7 = cat({tv5, tv6}, 0);
fusion.addOutput(tv7);
@csarofeen (Owner)

Are you thinking register+smem persistent kernels here??

@naoyam (Collaborator Author)

Yes. That's still a big gap I really wanted to fill.

@naoyam naoyam merged commit 1e30fee into devel Mar 14, 2023
@naoyam naoyam deleted the iter_domain_resize branch March 14, 2023 00:12
@jjsjann123 (Collaborator)

I'm getting a segfault from this PR. NVIDIA/Fuser#3 (comment)

@naoyam (Collaborator Author) commented Mar 14, 2023

> I'm getting a segfault from this PR. NVIDIA/Fuser#3 (comment)

Hmm, strange. Looking into it.

jacobhinkle added a commit to jacobhinkle/Fuser that referenced this pull request Mar 15, 2023
jacobhinkle pushed a commit to jacobhinkle/Fuser that referenced this pull request Mar 15, 2023
jacobhinkle pushed a commit to jacobhinkle/Fuser that referenced this pull request Mar 15, 2023
naoyam added a commit to NVIDIA/Fuser that referenced this pull request Mar 15, 2023
Cherry-picking from: csarofeen/pytorch#2480

Author: Naoya Maruyama <naoyam@users.noreply.github.com>
Date:   Mon Mar 13 17:12:01 2023 -0700

    IterDomain resize for pad, cat, slice (#2480)

---------

Co-authored-by: samnordmann <snordmann@nvidia.com>
Co-authored-by: Naoya Maruyama <naoyam@users.noreply.github.com>
Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
jacobhinkle pushed a commit to jacobhinkle/Fuser that referenced this pull request Mar 15, 2023