Fix outer reduction performance #1698

Merged
merged 27 commits into from May 17, 2022
Conversation

naoyam (Collaborator) commented May 13, 2022

Improves performance of outer reductions, particularly for channels last batch norm like normalizations.
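
For context, a minimal standalone sketch of the access pattern this PR targets (illustrative C++ only, not nvfuser code): in an outer reduction the reduced dimension is the outer, slower-varying one, while the kept dimension stays innermost and contiguous, which is the shape of a channels-last batch norm reduction over N*H*W with C innermost.

#include <vector>

// Sketch only: reduce over the outer dimension of a row-major
// [reduce_dim, inner_dim] buffer, keeping the contiguous inner dimension
// (e.g. C in channels-last batch norm, where the reduction runs over N*H*W).
std::vector<float> outerReductionSketch(
    const std::vector<float>& in,
    int reduce_dim,
    int inner_dim) {
  std::vector<float> out(inner_dim, 0.0f);
  for (int r = 0; r < reduce_dim; ++r) {
    for (int i = 0; i < inner_dim; ++i) {
      out[i] += in[r * inner_dim + i]; // unit-stride inner accesses
    }
  }
  return out;
}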

@csarofeen csarofeen changed the title Outer red testing [WIP] Fix outer reduction heuristics May 16, 2022
@csarofeen csarofeen changed the title [WIP] Fix outer reduction heuristics Fix outer reduction performance May 16, 2022
@csarofeen csarofeen self-requested a review May 16, 2022 21:58
csarofeen (Owner) left a comment

I made a lot of the changes, leaving the review to Naoya.

@@ -13,6 +13,8 @@

csarofeen (Owner)

This file should be removed from this PR!

csarofeen (Owner)

This is from #1695, and evidently #1695 damages channels-first perf.

@@ -466,20 +517,6 @@ ReductionParams OuterReductionHeuristic(
const int64_t n_tensor_inputs,
const int64_t max_input_dtype_size,
const size_t vectorize_factor) {
// Set some targets for parallelization
csarofeen (Owner)

Start of outer reduction heuristic changes.

@@ -111,7 +115,7 @@ class ReductionParams {
bool attr_equal = other.fastest_dim == fastest_dim &&
other.persistent_kernel == persistent_kernel &&
other.project_persistent_buffers == project_persistent_buffers &&
other.schedule_3D == schedule_3D &&
csarofeen (Owner)

flip_grid was never used. We may want to leave it in to play with later, since in the end I didn't test whether there are any cases that can benefit. Just please mark it with a TODO.
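
A minimal sketch of what that could look like (a hypothetical ReductionParamsSketch struct, not the real class): keep the unused flag, annotate it with a TODO, and keep it in the equality comparison like the one quoted above.

// Hypothetical sketch only; the real ReductionParams has many more fields.
struct ReductionParamsSketch {
  bool fastest_dim = false;
  bool schedule_3D = false;
  // TODO: flip_grid is currently unused by the outer reduction heuristic;
  // re-evaluate whether any case benefits and remove it if none does.
  bool flip_grid = false;

  bool operator==(const ReductionParamsSketch& other) const {
    return other.fastest_dim == fastest_dim &&
        other.schedule_3D == schedule_3D &&
        other.flip_grid == flip_grid;
  }
};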

auto tv_root = TensorDomain::noReductions(
tv->hasReduction() && tv->hasRFactor() ? tv->getRootDomain()
: tv->getMaybeRFactorDomain());
auto tv_root = tv->hasReduction() && tv->hasRFactor()
csarofeen (Owner)

This was a fun fix.


auto contiguity = tv->domain()->contiguity();
// It appears that after reductions, the reduction domain often has a contiguity entry.
// This only matters if the result of the reduction is an output.
naoyam (Collaborator, author)

I'm a little confused here. Can contiguity() sometimes account only for non-reduction domains, but other times for both reduction and non-reduction domains?

csarofeen (Owner)

It seems that when we reduce, the resulting tensor has a contiguity entry for the reduction dims; we should probably fix that.

naoyam (Collaborator, author)

Filed issue #1708.
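
To make the filtering being discussed concrete, a standalone sketch (the helper name and signature are hypothetical, not nvfuser API): given per-root-domain contiguity flags and a parallel mask marking which root domains are reductions, keep only the non-reduction entries.

#include <cassert>
#include <vector>

// Hypothetical helper: drop contiguity entries that correspond to reduction
// domains, since (per the discussion above) the contiguity vector of a
// reduction output may still carry entries for the reduced dims.
std::vector<bool> nonReductionContiguity(
    const std::vector<bool>& contiguity,
    const std::vector<bool>& is_reduction) {
  assert(contiguity.size() == is_reduction.size());
  std::vector<bool> filtered;
  for (size_t i = 0; i < contiguity.size(); ++i) {
    if (!is_reduction[i]) {
      filtered.push_back(contiguity[i]);
    }
  }
  return filtered;
}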

torch/csrc/jit/codegen/cuda/test/test_gpu.cpp (outdated; resolved)
Comment on lines 771 to 772
bool flip_grid = gidim > 1 && gidim < 8;
flip_grid = false;
naoyam (Collaborator, author)

So, currently, are we just always disabling this?

csarofeen (Owner)

Yes. It's piped through correctly so it can be used, but I haven't used it yet because I didn't see any cases benefit. However, I didn't check whether any cases benefit after tuning the heuristics; I want to give it a try when I have a chance, then clean it up if it's not useful for outer reductions.
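
As a concrete form of the earlier "mark with a TODO" request, a minimal sketch of how the disabled flag could be annotated (the wrapper function is hypothetical; gidim stands for the grid-dimension extent used in the quoted snippet):

// Sketch only: keep the computation, but leave the flag off until it is
// re-evaluated against the retuned heuristics.
bool computeFlipGrid(long gidim) {
  // TODO: flip_grid is plumbed through but never enabled. Re-check whether
  // any outer reduction case benefits after the heuristic retuning; clean it
  // up if none does.
  bool flip_grid = gidim > 1 && gidim < 8;
  flip_grid = false; // Disabled for now: no benefiting case observed yet.
  return flip_grid;
}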

naoyam (Collaborator, author) commented May 17, 2022

The PR looks good to me, though I have seen a vectorization error with FusionWelfordShmoo that I can't repro.

[ RUN      ] NVFuserTest.FusionWelfordShmoo_CUDA
unknown file: Failure
C++ exception with description "Tried to vectorize a dim resulting in a word size of 32 however, vector sizes only upto and including 16 bytes are supported.
Exception raised from validate at ../torch/csrc/jit/codegen/cuda/lower_validation.cpp:396 (most recent call first):
frame #0: <unknown function> + 0xa44b0 (0x7f264ea564b0 in /home/nmaruyama/pytorch/debug2/build/lib/libc10.so)
frame #1: std::function<std::__cxx11::basic_string<char, st

The error is concerning, but I can't find where the reduction scheduler could set the invalid vectorization size. Although not ideal, this shouldn't cause silent errors, since the validation should catch it.

Also, FusionViewPersistentShmoo would hit invalid memory access errors with compute-sanitizer. Needs #1707.
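
For reference, a hedged sketch of the arithmetic behind the vectorization error above (not the actual lower_validation.cpp code): the vector word size is the vectorization factor times the element size, so for example a factor of 8 on 4-byte floats gives 32 bytes, which exceeds the 16-byte limit the validation enforces.

#include <cstddef>
#include <stdexcept>
#include <string>

// Sketch only: mirrors the check described in the error message.
void validateVectorWordSize(std::size_t vector_factor, std::size_t dtype_size) {
  const std::size_t word_size = vector_factor * dtype_size;
  constexpr std::size_t kMaxVectorWordSize = 16; // bytes
  if (word_size > kMaxVectorWordSize) {
    throw std::runtime_error(
        "Vector word size of " + std::to_string(word_size) +
        " bytes exceeds the supported maximum of 16 bytes");
  }
}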

csarofeen (Owner) left a comment

LGTM. Thanks for the cleanup!
