Allow allocation to be a split and different from loop. #3479
Comments
Let me see if I understand the issue correctly.
It seems to me that we should change the matmul scheduler rather than Since this is only a problem with the matmul scheduler, I feel that we should extend the matmul scheduler to do scheduling of the allocation domain more explicitly. What do you think? @jacobhinkle @rdspring1 @protonu @zasdfgbnm
My .02: when I encountered this in the matmul scheduler I was surprised it propagated allocation domain. I think we should change that behavior since explicit is better than implicit.
Isn't this guarded by a flag? See Fuser/csrc/transform_replay.cpp, line 733 in 9c9c34c.
Would a quick unblock be to just set this flag to false wherever it is used? I believe it was used in cacheBefore/cacheAfter.
But on the other hand, I do think replaying allocation domain (or at least stride order) is what we want. For example, if you have a fusion input
@zasdfgbnm For Yes, setting
We have to do that conditionally -- propagate only for the matmul scheduler.
It depends on how we define a correct allocation domain, which we have not settled yet. For example, if a register tensor is:
Then do we consider the allocation of this tensor correct? The answer could be:
1 is what we currently have in the legacy indexer. If we decide to keep 1 in the new indexer, then I believe we should also allow having an allocation of

If our design decision is 2, then of course we should schedule the allocation domain separately, instead of propagating it along with the loop domain.

I personally have no preference between these two options, because to me both have pros and cons. I like 1 because it makes writing a scheduler easier: in 2, the scheduler needs to keep track of the allocation domain. I dislike 1 because, what is the point of having an allocation domain if we are not faithfully respecting it? I like 2 because it makes the concept of "allocation domain" easier to reason about, and I dislike it because it makes a scheduler harder to write. For example, if all I want is to say "[I1, I0] is the correct stride order", I have to do all the heavy work of analyzing the loop domain and working out the correct allocation domain, instead of just naively setting it as
I see your point. From the scheduler's point of view, what matters is usually just the ordering, so it makes sense that #1 would be good enough for the scheduler. We would need to make sure that the lowering can infer the true allocation domain without any ambiguity, but I think that'd be the case. In any case, I still prefer this approach:
And have the required ordering of the allocation domain for matmul done explicitly. If we all agree on this, can someone volunteer to change the matmul scheduler?
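To make the "stride order" shorthand in this exchange concrete, here is a minimal sketch in plain Python. It is not nvFuser code: `strides_for_order` and the extents are hypothetical, and it only models how an ordering such as [I1, I0] (outermost first) determines the stride of each logical axis in a contiguous buffer.

```python
def strides_for_order(extents, allocation_order):
    """Return the stride of each axis for a contiguous buffer.

    `extents` maps axis name -> extent; `allocation_order` lists axis
    names from outermost to innermost. Both are illustrative inputs,
    not nvFuser data structures.
    """
    strides = {}
    running = 1
    # Walk from the innermost axis outward, accumulating extents.
    for name in reversed(allocation_order):
        strides[name] = running
        running *= extents[name]
    return strides


extents = {"I0": 2, "I1": 3}
# Row-major with respect to the logical order [I0, I1]: I1 is innermost.
print(strides_for_order(extents, ["I0", "I1"]))  # {'I1': 1, 'I0': 3}
# Allocation order [I1, I0]: I0 becomes the innermost (stride-1) axis.
print(strides_for_order(extents, ["I1", "I0"]))  # {'I0': 1, 'I1': 2}
```

In this toy model the ordering alone is enough to recover the strides, which is the sense in which "just the ordering" may suffice from the scheduler's point of view.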
Add a knob like replay_allocation to cacheInputs and set that to true only in the matmul scheduler? If yes, I'm happy to take that.
@wujingyue Yes, that should be good enough, although I still think we should do something with the matmul scheduler since even @jacobhinkle was surprised by the automatic reordering.
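The shape of such a knob can be sketched with a toy model in plain Python. Everything here is an assumption made for illustration: `ToyTensor` and `cache_inputs` are hypothetical stand-ins, not the real `cacheInputs` signature, and only the idea of an opt-in `replay_allocation` flag that defaults to false comes from the comment above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyTensor:
    name: str
    logical: List[str]
    # None means "no explicit allocation domain; follow the logical order".
    allocation: Optional[List[str]] = None


def cache_inputs(inputs, replay_allocation=False):
    """Create a cache tensor per input (toy model, hypothetical API).

    The input's allocation order is copied to the cache only when the
    caller opts in via `replay_allocation`.
    """
    caches = []
    for tv in inputs:
        alloc = None
        if replay_allocation and tv.allocation is not None:
            alloc = list(tv.allocation)
        caches.append(ToyTensor(f"{tv.name}_cache", list(tv.logical), alloc))
    return caches


a = ToyTensor("a", ["I0", "I1"], allocation=["I1", "I0"])
# Default: the cache does not inherit the input's allocation order.
print(cache_inputs([a])[0].allocation)  # None
# Opt-in, e.g. from a matmul-like scheduler: the order is replayed.
print(cache_inputs([a], replay_allocation=True)[0].allocation)  # ['I1', 'I0']
```

Defaulting the flag to false keeps the propagation explicit: only a caller that asks for it gets the input's ordering on the cache.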
This is a spin-off from #3458 (comment).
For a multi-GPU fusion to take a sharded input tensor, the allocation domain has to be a split of logical. For example,
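As a rough illustration of what "a split of logical" means, here is a plain-Python model. It is not the nvFuser API and not the snippet from the issue: `Axis`, `split_outer`, `sharded_allocation`, and `local_shape` are hypothetical names, and the "DIDx" tag is used here only as a label for the device-parallel axis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Axis:
    name: str
    extent: int
    parallel: str = "Serial"  # "DIDx" marks a device-parallel axis in this sketch


def split_outer(axis, factor):
    """Split `axis` into an outer axis of extent `factor` and an inner remainder."""
    if axis.extent % factor != 0:
        raise ValueError("this sketch only handles divisible splits")
    outer = Axis(f"{axis.name}o", factor)
    inner = Axis(f"{axis.name}i", axis.extent // factor)
    return outer, inner


def sharded_allocation(logical, axis_index, num_devices):
    """Build an allocation domain by outer-splitting one logical axis by
    the device count and tagging the outer axis as device-parallel."""
    outer, inner = split_outer(logical[axis_index], num_devices)
    outer = Axis(outer.name, outer.extent, parallel="DIDx")
    return logical[:axis_index] + [outer, inner] + logical[axis_index + 1:]


def local_shape(allocation):
    """Extents each device actually stores: every axis except the device axis."""
    return [a.extent for a in allocation if a.parallel != "DIDx"]


logical = [Axis("i0", 8), Axis("i1", 4)]
allocation = sharded_allocation(logical, 0, num_devices=2)
print([(a.name, a.extent, a.parallel) for a in allocation])
# [('i0o', 2, 'DIDx'), ('i0i', 4, 'Serial'), ('i1', 4, 'Serial')]
print(local_shape(allocation))  # [4, 4]
```

In this model the logical domain stays [8, 4], while each of the two devices holds a [4, 4] shard described by the allocation domain.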
The loop domain, ideally, shouldn't be set or used because a fusion/segment input comes from outside and is not generated by a loop. Below is an ideal usage, which I also committed to https://github.com/NVIDIA/Fuser/tree/bug3479.
This repro currently fails with the following error:
The relevant code is around Fuser/csrc/transform_replay.cpp, lines 760 to 763 and line 776 in 9c9c34c.
While I've worked around this problem by setting loop to be the same as allocation, @naoyam and I discussed some potential solutions in the original thread. One line of thought is to improve replayCasP; another is to propagate the allocation domain within a kernel through a different mechanism, as this is currently only needed for matmul. cc @zasdfgbnm