Fix outer reduction performance #1698

Merged
merged 27 commits into from May 17, 2022
Conversation

naoyam (Collaborator) commented May 13, 2022

Improves performance of outer reductions, particularly for channels last batch norm like normalizations.
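
For context, a minimal standalone sketch of the access pattern this PR targets (illustrative C++ only, not nvfuser code): in an outer reduction the reduced dimension is the outer, slower-varying one, while the kept dimension stays innermost and contiguous, which is the shape of a channels-last batch norm reduction over N*H*W with C innermost.

#include <vector>

// Sketch only: reduce over the outer dimension of a row-major
// [reduce_dim, inner_dim] buffer, keeping the contiguous inner dimension
// (e.g. C in channels-last batch norm, where the reduction runs over N*H*W).
std::vector<float> outerReductionSketch(
    const std::vector<float>& in,
    int reduce_dim,
    int inner_dim) {
  std::vector<float> out(inner_dim, 0.0f);
  for (int r = 0; r < reduce_dim; ++r) {
    for (int i = 0; i < inner_dim; ++i) {
      out[i] += in[r * inner_dim + i]; // unit-stride inner accesses
    }
  }
  return out;
}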

@csarofeen csarofeen changed the title Outer red testing [WIP] Fix outer reduction heuristics May 16, 2022
@csarofeen csarofeen changed the title [WIP] Fix outer reduction heuristics Fix outer reduction performance May 16, 2022
@csarofeen csarofeen self-requested a review May 16, 2022 21:58
csarofeen (Owner) left a comment

I made a lot of the changes, leaving the review to Naoya.

@@ -13,6 +13,8 @@

csarofeen (Owner)

This file should be removed from this PR!

csarofeen (Owner)

This is from #1695, and evidently #1695 damages channels-first perf.

@@ -466,20 +517,6 @@ ReductionParams OuterReductionHeuristic(
const int64_t n_tensor_inputs,
const int64_t max_input_dtype_size,
const size_t vectorize_factor) {
// Set some targets for parallelization
csarofeen (Owner)

Start of outer reduction heuristic changes.

@@ -111,7 +115,7 @@ class ReductionParams {
bool attr_equal = other.fastest_dim == fastest_dim &&
other.persistent_kernel == persistent_kernel &&
other.project_persistent_buffers == project_persistent_buffers &&
other.schedule_3D == schedule_3D &&
csarofeen (Owner)

flip_grid was never used. We may want to leave it in to play with later, since in the end I didn't test whether there are any cases that can benefit. Just please mark it with a TODO.
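
A minimal sketch of what that could look like (a hypothetical ReductionParamsSketch struct, not the real class): keep the unused flag, annotate it with a TODO, and keep it in the equality comparison like the one quoted above.

// Hypothetical sketch only; the real ReductionParams has many more fields.
struct ReductionParamsSketch {
  bool fastest_dim = false;
  bool schedule_3D = false;
  // TODO: flip_grid is currently unused by the outer reduction heuristic;
  // re-evaluate whether any case benefits and remove it if none does.
  bool flip_grid = false;

  bool operator==(const ReductionParamsSketch& other) const {
    return other.fastest_dim == fastest_dim &&
        other.schedule_3D == schedule_3D &&
        other.flip_grid == flip_grid;
  }
};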

auto tv_root = TensorDomain::noReductions(
tv->hasReduction() && tv->hasRFactor() ? tv->getRootDomain()
: tv->getMaybeRFactorDomain());
auto tv_root = tv->hasReduction() && tv->hasRFactor()
csarofeen (Owner)

This was a fun fix.


auto contiguity = tv->domain()->contiguity();
// It appears that after reductions, the reduction domain often has a contiguity entry.
// This only matters if the result of the reduction is an output.
naoyam (Collaborator, author)

I'm a little confused here. Can contiguity() sometimes account only for non-reduction domains, but other times for both reduction and non-reduction domains?

csarofeen (Owner)

It seems that when we reduce, the resulting tensor has a contiguity entry for the reduction dims; we should probably fix that.

naoyam (Collaborator, author)

Filed issue #1708.
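
To make the filtering being discussed concrete, a standalone sketch (the helper name and signature are hypothetical, not nvfuser API): given per-root-domain contiguity flags and a parallel mask marking which root domains are reductions, keep only the non-reduction entries.

#include <cassert>
#include <vector>

// Hypothetical helper: drop contiguity entries that correspond to reduction
// domains, since (per the discussion above) the contiguity vector of a
// reduction output may still carry entries for the reduced dims.
std::vector<bool> nonReductionContiguity(
    const std::vector<bool>& contiguity,
    const std::vector<bool>& is_reduction) {
  assert(contiguity.size() == is_reduction.size());
  std::vector<bool> filtered;
  for (size_t i = 0; i < contiguity.size(); ++i) {
    if (!is_reduction[i]) {
      filtered.push_back(contiguity[i]);
    }
  }
  return filtered;
}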

torch/csrc/jit/codegen/cuda/test/test_gpu.cpp (outdated; resolved)
Comment on lines 771 to 772
bool flip_grid = gidim > 1 && gidim < 8;
flip_grid = false;
naoyam (Collaborator, author)

So, currently, are we just always disabling this?

csarofeen (Owner)

Yes. It's piped through correctly so it can be used, but I haven't used it yet because I didn't see any cases benefit. However, I didn't check whether any cases benefit after tuning the heuristics; I want to give it a try when I have a chance, then clean it up if it's not useful for outer reductions.
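
As a concrete form of the earlier "mark with a TODO" request, a minimal sketch of how the disabled flag could be annotated (the wrapper function is hypothetical; gidim stands for the grid-dimension extent used in the quoted snippet):

// Sketch only: keep the computation, but leave the flag off until it is
// re-evaluated against the retuned heuristics.
bool computeFlipGrid(long gidim) {
  // TODO: flip_grid is plumbed through but never enabled. Re-check whether
  // any outer reduction case benefits after the heuristic retuning; clean it
  // up if none does.
  bool flip_grid = gidim > 1 && gidim < 8;
  flip_grid = false; // Disabled for now: no benefiting case observed yet.
  return flip_grid;
}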

naoyam (Collaborator, author) commented May 17, 2022

The PR looks good to me, though I have seen a vectorization error with FusionWelfordShmoo that I can't repro.

[ RUN      ] NVFuserTest.FusionWelfordShmoo_CUDA
unknown file: Failure
C++ exception with description "Tried to vectorize a dim resulting in a word size of 32 however, vector sizes only upto and including 16 bytes are supported.
Exception raised from validate at ../torch/csrc/jit/codegen/cuda/lower_validation.cpp:396 (most recent call first):
frame #0: <unknown function> + 0xa44b0 (0x7f264ea564b0 in /home/nmaruyama/pytorch/debug2/build/lib/libc10.so)
frame #1: std::function<std::__cxx11::basic_string<char, st

The error is concerning, but I can't find where the reduction scheduler could set the invalid vectorization size. Although not ideal, this shouldn't cause silent errors, since the validation should catch it.

Also, FusionViewPersistentShmoo would hit invalid memory access errors with compute-sanitizer. Needs #1707.
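
For reference, a hedged sketch of the arithmetic behind the vectorization error above (not the actual lower_validation.cpp code): the vector word size is the vectorization factor times the element size, so for example a factor of 8 on 4-byte floats gives 32 bytes, which exceeds the 16-byte limit the validation enforces.

#include <cstddef>
#include <stdexcept>
#include <string>

// Sketch only: mirrors the check described in the error message.
void validateVectorWordSize(std::size_t vector_factor, std::size_t dtype_size) {
  const std::size_t word_size = vector_factor * dtype_size;
  constexpr std::size_t kMaxVectorWordSize = 16; // bytes
  if (word_size > kMaxVectorWordSize) {
    throw std::runtime_error(
        "Vector word size of " + std::to_string(word_size) +
        " bytes exceeds the supported maximum of 16 bytes");
  }
}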

csarofeen (Owner) left a comment

LGTM. Thanks for the cleanup!
