Llu/ln bwd #207
Conversation
Let's rename the test file to test_combined_inner_outer_reduction.cpp
// This case is to test the correctness of the combined inner and outer
// scheduler used in layer norm backward. It can also be configured to test the
// performance using different data types.
TEST_F(NVFuserTest, FusionCombinedSchedulerLayerNormBackward_CUDA) {
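For context on why this test is about a "combined inner and outer" reduction, here is a minimal CPU-side sketch of the two reduction patterns in layer norm backward. The array names and shapes are illustrative only, not the fusion definition used in this PR: the per-sample sums that feed grad_input reduce over the hidden dimension (inner reduction), while grad_weight and grad_bias reduce over the batch dimension (outer reduction).

```cpp
#include <vector>

// Illustrative only: dy and xhat are [N x H] row-major buffers.
void layer_norm_backward_reductions(const std::vector<float>& dy,
                                    const std::vector<float>& xhat,
                                    int N, int H,
                                    std::vector<float>& sum_dy,       // [N], inner reduction
                                    std::vector<float>& sum_dy_xhat,  // [N], inner reduction
                                    std::vector<float>& grad_weight,  // [H], outer reduction
                                    std::vector<float>& grad_bias) {  // [H], outer reduction
  // Inner reduction: reduce over H for each sample n (feeds grad_input).
  for (int n = 0; n < N; ++n) {
    float s0 = 0.f, s1 = 0.f;
    for (int h = 0; h < H; ++h) {
      s0 += dy[n * H + h];
      s1 += dy[n * H + h] * xhat[n * H + h];
    }
    sum_dy[n] = s0;
    sum_dy_xhat[n] = s1;
  }
  // Outer reduction: reduce over N for each feature h (grad_weight / grad_bias).
  for (int h = 0; h < H; ++h) {
    float gw = 0.f, gb = 0.f;
    for (int n = 0; n < N; ++n) {
      gw += dy[n * H + h] * xhat[n * H + h];
      gb += dy[n * H + h];
    }
    grad_weight[h] = gw;
    grad_bias[h] = gb;
  }
}
```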
We are simplifying test names and the Fusion prefix is no longer used.
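A minimal sketch of what the suggested rename might look like. Only the dropped Fusion prefix is confirmed by the comment above; the rest of the name is kept unchanged as an assumption:

```cpp
// Hypothetical renamed declaration following the simplified naming convention:
TEST_F(NVFuserTest, CombinedSchedulerLayerNormBackward_CUDA) {
  // ... test body unchanged ...
}
```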
LGTM. Just added comments on test naming
!build
!build
!build
!build
!build
!build
@liqiangxl @naoyam why wasn't this implemented as a separate scheduler instead of stuffing it into the normalization scheduler? You guys know we don't need to schedule every type of fusion in a single scheduler, right?
The normalization scheduler is becoming a collection of related but significantly different schedulers: block-parallel inner normalization, grid-parallel inner normalization (we don't have this yet), block-parallel outer normalization, grid-parallel outer normalization, and grid-parallel inner-and-outer normalization (the special case for layer norm backward). Each of them could be an independent scheduler instead of being aggregated into the single normalization scheduler. I haven't thought about the potential implications of the two approaches, but the current aggregated approach seems to make a little more sense for scheduling performance: if they were all individual schedulers, we would need to call the canSchedule check of each of them separately. I'd say we should keep them as is for now. We would need to rethink the whole scheduler and segmentation design for more flexibility and composability.
@liqiangxl could you pull some timing info from some benchmarks to understand how much time is spent in the normalization canSchedule? I understand the concern @naoyam, but it seems like really messy code for the sake of maybe some compilation-time performance.
PS, you can still have "a scheduler" from the registry perspective that calls into multiple heuristic and scheduling functions.
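A rough sketch of that idea, with hypothetical names and types (not nvFuser's actual scheduler registry API): one registered entry point that internally dispatches to several independent heuristic/scheduling pairs.

```cpp
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the real types; not nvFuser's API.
struct Fusion {};
struct Heuristics {};

struct SchedulerVariant {
  // Returns heuristics if this variant can handle the fusion, nullopt otherwise.
  std::function<std::optional<Heuristics>(const Fusion&)> computeHeuristics;
  std::function<void(Fusion&, const Heuristics&)> schedule;
};

// A single "normalization scheduler" from the registry's point of view that
// dispatches to separate inner / outer / inner-and-outer variants internally.
class NormalizationScheduler {
 public:
  explicit NormalizationScheduler(std::vector<SchedulerVariant> variants)
      : variants_(std::move(variants)) {}

  bool canSchedule(const Fusion& fusion) const {
    for (const auto& v : variants_) {
      if (v.computeHeuristics(fusion).has_value()) {
        return true;
      }
    }
    return false;
  }

  void schedule(Fusion& fusion) const {
    // Use the first variant whose heuristics apply to this fusion.
    for (const auto& v : variants_) {
      if (auto h = v.computeHeuristics(fusion)) {
        v.schedule(fusion, *h);
        return;
      }
    }
  }

 private:
  std::vector<SchedulerVariant> variants_;
};
```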
Sure. The current normalization
This PR is a copy of csarofeen/pytorch#2400.
Benchmark results compared with Apex are: