
Llu/ln bwd #207

Merged
merged 4 commits into from
Apr 25, 2023
Conversation

liqiangxl
Collaborator

This PR is a copy of csarofeen/pytorch#2400.
Benchmark results compared with Apex are:
[image: benchmark results vs. Apex]

Collaborator

Let's rename the test file to test_combined_inner_outer_reduction.cpp

// This case is to test the correctness of the combined inner and outer
// scheduler used in layer norm backward. It can also be configured to test the
// performance using different data types.
TEST_F(NVFuserTest, FusionCombinedSchedulerLayerNormBackward_CUDA) {
Collaborator

We are simplifying test names and the Fusion prefix is no longer used.

Collaborator

@naoyam naoyam left a comment

LGTM. I just added comments on test naming.

@liqiangxl
Collaborator Author

!build

4 similar comments
@liqiangxl liqiangxl merged commit 0250132 into main Apr 25, 2023
@liqiangxl liqiangxl deleted the llu/ln_bwd branch April 25, 2023 21:44
Collaborator

@csarofeen csarofeen left a comment

@liqiangxl @naoyam why wasn't this implemented as a separate scheduler instead of being stuffed into the normalization scheduler? You know we don't need to schedule every type of fusion in a single scheduler, right?

@naoyam
Collaborator

naoyam commented Aug 5, 2023

The normalization schedulers are becoming a collection of related but significantly different schedulers: block-parallel inner normalization, grid-parallel inner normalization (we don't have this yet), block-parallel outer normalization, grid-parallel outer normalization, and grid-parallel inner-and-outer normalization (a special case for layer norm backward). Each of these could be an independent scheduler instead of being aggregated into the single normalization scheduler.

I haven't thought through all the implications of the two approaches, but the current aggregated approach seems to make a little more sense for scheduling performance. If they were all individual schedulers, we would need to call canSchedule on each scheduler until a successful one is found, and the canSchedule functions of all the normalization scheduler variants would likely share some common analyses, such as finding persistent buffers and analyzing their sizes, which would then be redundantly executed. In the current design, the common analyses are executed just once, since all the variants branch out from the single normalization scheduler. The downside, of course, is that the aggregated approach makes the scheduler look more unstructured.

I'd say we should keep them as is for now. We will need to rethink the whole scheduler and segmentation design for more flexibility and composability.

@csarofeen
Collaborator

@liqiangxl could you pull some timing info from some benchmarks to understand how much time is spent in the normalization scheduler's canSchedule? I understand the concern, @naoyam, but it seems like really messy code for the sake of maybe some performance at compilation time.

@csarofeen
Collaborator

PS, you can still have "a scheduler" from the registry perspective that calls into multiple heuristic and scheduling functions.

@liqiangxl
Collaborator Author

@liqiangxl could you pull some timing info from some benchmarks to understand how much time is spent in the normalization scheduler's canSchedule? I understand the concern, @naoyam, but it seems like really messy code for the sake of maybe some performance at compilation time.

Sure. The current normalization canSchedule will return true if scheduling can be achieved using one of the inner, outer, or combined heuristics. The system is already aware of the specific heuristic to utilize. However, rather than directly accessing the appropriate heuristic, it employs a broader interface, getPersistentHeuristics. This interface conducts further analysis and then directs to one of the inner, outer, or combined heuristics. This is also the case for schedulePersistentKernel, which serves as a universal interface for the three distinct heuristics. By transitioning from these general interfaces to individual interfaces tailored for each heuristic, we might achieve a cleaner code structure. The change will be like:
[image: proposed interface changes]
