IterDomain resize for pad, cat, slice #2480
Conversation
I think all my concerns have been resolved. This PR LGTM now. I am not stamping to avoid accidental merge because @csarofeen hasn't done review yet. Please let me know if you want a stamp.
I revisited them and added more descriptive error messages.

@zasdfgbnm I added this commit that adds an override for
Finished reviewing everything. I don't see any blockers. Nice work.
//! move its producer to a shared memory that is sufficient to satisfy
//! the dependency. A proper RAW sync will be automatically inserted
//! when the fusion is lowered.
TORCH_CUDA_CU_API void promoteProducerMemoryTypesOfResizedTensors(
Should this be part of the heuristics in case shared memory isn't large enough to do this?
Agreed, but I'm not sure what the right heuristics should look like. If a large enough shared memory buffer isn't available, the easiest workaround would be using global memory, but that could be a big perf overhead. Should we instead consider a different parallelization that won't require a RAW sync? To develop an effective heuristic, we would need to see real instances. I'll keep it as is for now. If the buffer space becomes a problem, we should see a hard error before a kernel is launched.
// resize is not indexed as the indexing is done using the rfactor root
// domain. This could be an issue when a resize shows up outside of
// rfactor transformations, but currently that can only happen when a
// producer tensor is transformed to look like a consumer. Since inlining is
Okay, this was what I was missing about having resize outside root->rfactor.
}

// Disabled due to the same reason as Pad8
#if 0
Shouldn't this case just get segmented so the multiple resizes don't appear in the same kernel?
I don't think we can enable support if we don't segment properly in all cases.
An interesting part of this fusion is that, although `tv1` is padded twice differently, the two padding outputs, `tv2` and `tv3`, are summed together, producing `tv4`, so the rfactor IDs of `tv2`, `tv3` and the root IDs of `tv4` are all exactly mapped. If not, multiple different resize ops should result in segmented fusions, much like how different views are handled.

This test actually works successfully. It's just scheduled as a single pointwise fusion. But I'm not sure how a fusion like this should be scheduled.
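For reference, a rough sketch of the kind of fusion being described, written in the style of the tests in this PR; the test name, tensor names, and pad widths here are illustrative, not the actual disabled test:

```cpp
// Hypothetical sketch: tv1 is padded two different ways, and the padded
// outputs are added together. tv2 and tv3 have different root->rfactor
// resize paths, but their rfactor IDs and tv4's root IDs map exactly.
TEST_F(NVFuserTest, FusionResizePadTwiceSketch_CUDA) {
  Fusion fusion;
  FusionGuard fg(&fusion);

  auto tv0 = makeSymbolicTensor(1);
  fusion.addInput(tv0);

  auto tv1 = set(tv0);
  auto tv2 = pad(tv1, {IrBuilder::create<Int>(1), IrBuilder::create<Int>(0)});
  auto tv3 = pad(tv1, {IrBuilder::create<Int>(0), IrBuilder::create<Int>(1)});
  auto tv4 = add(tv2, tv3);
  fusion.addOutput(tv4);
}
```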
So the point being root and rfactor match exactly because producers and consumers are common, but the root->rfactor paths are different. That's interesting.
That's right.
}

// Cat many tensors
TEST_F(NVFuserTest, FusionResizeCat7_CUDA) {
Do you check that an error gets thrown if runtime values don't support the requested concat?
The type and number of dimensions must match:
https://github.com/csarofeen/pytorch/pull/2480/files/82845fa11d79a153f62fd11fa4daf473476cd728#diff-bee9b138517ab66ac5e215f0603aacbacc8984fda748a6d636e36595ec0b12deR467-R494.
But that's it. For example, the extents of the non-concatenated dimensions must also match, but that isn't checked, just like normal arithmetic ops, where we assume the two operands have the same shape.
I think it makes sense to add a runtime check for this.
I'm not sure what to check. We don't check, for example, in `add(tv0, tv1)`, that `tv0` and `tv1` have the same shape, but rather we assume that's the case and use that to build the IterDomain graph. Following the same approach, non-concatenated domains should be assumed to be the same.
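To make the comparison concrete, a hypothetical example (shapes and tensor names are illustrative):

```cpp
// Concatenate along dim 0. As with add(tv0, tv1), the extent of the
// non-concatenated dim 1 (8 here) is assumed to match and is used to build
// the IterDomain graph; it is not separately validated.
auto tv0 = makeConcreteTensor({4, 8});
auto tv1 = makeConcreteTensor({3, 8});
fusion.addInput(tv0);
fusion.addInput(tv1);
auto tv2 = cat({tv0, tv1}, 0); // result shape: {7, 8}
fusion.addOutput(tv2);
```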
Heh, that's a good point, and that's why this would error out in the expression evaluator, as our evaluated symbols for parallelization wouldn't match. Okay, I will put it in the large bucket of things we should do. It would be nice to have a validation check that everything matches as expected across the iter domain relationships.
Something like a pedantic mode, where everything is validated no matter how high its overhead would be.
auto tv5 = sum(tv3, {1});
auto tv6 = sum(tv4, {1});
auto tv7 = cat({tv5, tv6}, 0);
fusion.addOutput(tv7);
Are you thinking register+smem persistent kernels here??
Yes. That's still a big gap I really wanted to fill.
I'm getting a segfault from this PR. NVIDIA/Fuser#3 (comment)

Hmm, strange. Looking into it.
Cherry-pick of csarofeen/pytorch@1e30fee Original PR: csarofeen/pytorch#2480

Cherry-picking from: csarofeen/pytorch#2480 Author: Naoya Maruyama <naoyam@users.noreply.github.com> Date: Mon Mar 13 17:12:01 2023 -0700 IterDomain resize for pad, cat, slice (#2480) --------- Co-authored-by: samnordmann <snordmann@nvidia.com> Co-authored-by: Naoya Maruyama <naoyam@users.noreply.github.com> Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
This PR adds codegen support for `pad`, `slice` and `cat`. All of them are represented using a new IterDomain expression, `Resize`. Please see test_gpu_resize.cpp for concrete examples.

The pointwise, reduction and normalization schedulers should be able to work with resized domains. Transformation and parallelization propagations are done while skipping, or forwarding, resized domains. So, for example:
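A rough, illustrative sketch of the kind of fusion and transform propagation meant here, assuming the usual nvFuser test helpers such as `makeSymbolicTensor`, `pad`, `TransformPropagator` and `MaxRootDomainInfoSpanningTree` (not the exact snippet from the tests):

```cpp
Fusion fusion;
FusionGuard fg(&fusion);

auto tv0 = makeSymbolicTensor(2);
fusion.addInput(tv0);

auto tv1 = set(tv0);
// tv2 has a resized (padded) IterDomain between its root and rfactor domains.
auto tv2 = pad(tv1, {IrBuilder::create<Int>(1), IrBuilder::create<Int>(1)});
fusion.addOutput(tv2);

// Schedule the consumer and propagate its transformations to producers.
// The propagation forwards over the resized domain instead of replaying it.
tv2->merge(0);
tv2->split(0, 128);
TransformPropagator propagator(tv2);
MaxRootDomainInfoSpanningTree(tv2).traverse(&propagator);
```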
And for parallelization:
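A similarly hedged continuation of the sketch above, with `parallelizeAllLike` assumed as the usual scheduler utility for propagating parallel types:

```cpp
// Parallelize the reference tensor and propagate the parallel types to the
// other tensors; resized domains are skipped during the propagation.
tv2->axis(0)->parallelize(ParallelType::BIDx);
tv2->axis(1)->parallelize(ParallelType::TIDx);
scheduler_utils::parallelizeAllLike(tv2);
```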
Unlike these propagations, inlining is not allowed between these domains with different extents, at least at this moment. So, `inlineMost` would not change the CA position of `tv0` from 0.

The difference of honoring (i.e., inlining) or ignoring resize (transform and parallel propagations) requires different behavior in some of the underlying replay primitives such as `BestEffortReplay`. As you can see, a new option, `skip_resize`, is added, which behaves similarly to `skip_swizzle`. This unfortunately makes them even more error prone due to optional boolean flags (here's a recent bug we found). I'm already prototyping a cleanup PR to refactor these options so that they are more explicitly configured, with little ambiguity.

I'm also thinking about adding two more IterDomain expressions, `gather` and `scatter`. With `resize` and `gather`, we should be able to implement the PyTorch gather.

Another thing I'm really interested in trying out is to use `slice` and `cat` to allow multiple memory spaces to be used for a single tensor in persistent kernels. I believe this should be possible, but it would need some more work.

TODOs
TODOs likely not in this PR