Fix outer reduction performance #1698
Conversation
…oops without turning the kernel cooperative.
… outer_red_testing
I made a lot of the changes, leaving the review to Naoya.
@@ -13,6 +13,8 @@
This file should be removed from this PR!
This is from #1695, and evidently #1695 damages channels-first perf.
@@ -466,20 +517,6 @@ ReductionParams OuterReductionHeuristic(
    const int64_t n_tensor_inputs,
    const int64_t max_input_dtype_size,
    const size_t vectorize_factor) {
  // Set some targets for parallelization
Start of outer reduction heuristic changes.
@@ -111,7 +115,7 @@ class ReductionParams {
  bool attr_equal = other.fastest_dim == fastest_dim &&
      other.persistent_kernel == persistent_kernel &&
      other.project_persistent_buffers == project_persistent_buffers &&
      other.schedule_3D == schedule_3D &&
flip_grid was never used. We may want to leave it in to play with later, because I never tested whether any cases can benefit from it. Just please mark it with a TODO.
auto tv_root = TensorDomain::noReductions(
    tv->hasReduction() && tv->hasRFactor() ? tv->getRootDomain()
                                           : tv->getMaybeRFactorDomain());
auto tv_root = tv->hasReduction() && tv->hasRFactor()
This was a fun fix.
auto contiguity = tv->domain()->contiguity();
// Appears after reductions the reduction domain often has a contiguity entry.
// This only matters if the result of the reduction is an output
A little confused here. Can contiguity() sometimes account only for non-reduction domains, but other times for both reduction and non-reduction domains?
It seems that when we reduce, the resulting tensor has a contiguity entry for the reduction dims; we should probably fix that.
Filed issue #1708.
bool flip_grid = gidim > 1 && gidim < 8;
flip_grid = false;
So, currently, are we just always disabling this?
Yes, it is piped through correctly so it can be used, but I haven't enabled it yet because I didn't see any cases benefit. However, I didn't check whether any cases benefit after tuning the heuristics; I want to give it a try when I have a chance, then clean it up if it's not useful on outer reductions.
The PR looks good to me, though I have seen a vectorization error with
The error is concerning, but I can't find where the reduction scheduler could set an invalid vectorization size. Although not ideal, this shouldn't cause silent errors; it should be detected by the validation.
LGTM. Thanks for the cleanup!
Improves the performance of outer reductions, particularly for channels-last batch-norm-like normalizations.