
Accept Hopper matmuls and update default heuristic #3579

Open · wants to merge 33 commits into base: main
Conversation

@jacobhinkle (Collaborator) commented Dec 12, 2024

This updates the default (non-plugin) matmul heuristic to support Hopper matmuls. This change means that we can now run matmuls on Hopper similarly to how we do it on Ampere and Turing, including using the Python interface.

I tried to make the default heuristic somewhat thoughtful and not just a placeholder. Here are some notes about the Hopper heuristic in its current form:

  • I set the macro to Hopper_64_64_16. I intended to always use the largest macro for which the N size divided the problem's N, but this led to lower perf on the handful of examples I looked at. We should benchmark more and find out why this is once we have warp specialization and register stealing fully plumbed in, but for the time being I simply left it at N=64.
  • Once the instruction tile is set we set the warp tile equal to the instruction tile (we can revisit this in the future). Then to find the CTA tile we double the instruction tile in the M or N dimension until we run out of registers.
  • We start with 8 circular buffering stages and decrease until the circular buffers fit into smem.
  • We use use_smem_epilogue when possible. Whenever that is possible, we always use promote_prologue_smem_reuse even if it's not needed. This is to try to avoid bugs like Misaligned read from smem doing TMA store #3602.
  • I set the tile rasterization order so that the fast axis is the axis with the fewest tiles, which should encourage more L2 hits unless there are tons of tiles in each dimension.
  • I cannot yet set grid swizzling due to Inlining error in Hopper matmul with AxisMapping and grid swizzling #3671, but I placed a TODO comment and some code to do the proper swizzling.
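The tile-growth and stage-count steps above can be sketched roughly as follows. This is a hypothetical illustration, not the actual nvFuser implementation: the names `growCtaTile` and `pickStages`, the greedy M-before-N doubling order, and the register/byte accounting (one fp32 accumulator register per tile element, fp16 operands) are all assumptions.

```cpp
#include <cstdint>

// Illustrative tile descriptor; the real heuristic uses nvFuser's own types.
struct Tile {
  int64_t m, n, k;
};

// Starting from the warp tile, double the CTA tile in M or N until the
// accumulator registers no longer fit in the per-SM budget.
Tile growCtaTile(Tile warp_tile, int64_t max_registers_per_sm) {
  Tile cta = warp_tile;
  bool grew = true;
  while (grew) {
    grew = false;
    // Assume one fp32 accumulator register per CTA-tile output element.
    if ((cta.m * 2) * cta.n <= max_registers_per_sm) {
      cta.m *= 2;
      grew = true;
    } else if (cta.m * (cta.n * 2) <= max_registers_per_sm) {
      cta.n *= 2;
      grew = true;
    }
  }
  return cta;
}

// Start at 8 circular-buffering stages and decrease until the operand
// buffers fit in shared memory. Assumes fp16 (2-byte) A and B operands.
int64_t pickStages(Tile cta, int64_t smem_bytes) {
  int64_t stages = 8;
  auto bytesPerStage = [&]() { return 2 * (cta.m + cta.n) * cta.k; };
  while (stages > 1 && stages * bytesPerStage() > smem_bytes) {
    stages--;
  }
  return stages;
}
```

For example, with a 64x64x16 warp tile and a budget of 32768 registers (half of the 65536-register Hopper register file, per the comment later in the diff), this sketch grows M before N; the real heuristic may interleave the two dimensions differently.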

jacobhinkle added a commit that referenced this pull request Dec 16, 2024
This enables Hopper matmul in our automatic scheduler by translating
them without introducing new broadcasts. Specifically:

1. Update `mma_utils::MatmulPattern::translateToMmaOp` to optionally
avoid intermediates by using an `MmaOp::AxisMapping`. Enable this option
when the target arch is not Ampere or Turing.
2. Unguard some tests in `test_translate_mma.cpp`

This does not update the default heuristic or change the `canSchedule`
checks. See #3579 for that follow-up PR

---------

Co-authored-by: Ryan Spring <rspring@nvidia.com>
Co-authored-by: Naoya Maruyama <naoyam@users.noreply.github.com>
Co-authored-by: Jingyue Wu <wujingyue@gmail.com>
Co-authored-by: nsarka <nsarkauskas@nvidia.com>
Co-authored-by: Protonu <pbasu@nvidia.com>
Co-authored-by: samnordmann <snordmann@nvidia.com>
Comment on lines -1868 to -1865
axis_mapping.a_axes.push_back(d);
}
axis_mapping.a_axes.reserve(out_dim);
for (size_t d : c10::irange(out_dim - 2)) {
jacobhinkle (Collaborator, Author):
I think this was just due to a busted merge.

macro_encode.n = 256;
while (macro_encode.n >= 8) {
if (n_extent % macro_encode.n != 0) {
macro_encode.n /= 2;
jacobhinkle (Collaborator, Author):

Currently this only chooses powers of two. For small problems I think we could choose one of the other sizes. For example if n_extent == 72 then we should probably use that size.
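A hedged sketch of the alternative this comment suggests: instead of halving from 256 (which visits only powers of two), scan the candidate N sizes downward in steps of 8 and take the largest one that divides the problem's N, so that for example n_extent == 72 selects 72. The helper name `pickMacroN`, the 256 upper bound, and the multiple-of-8 granularity are assumptions for illustration, not nvFuser code.

```cpp
#include <cstdint>

// Pick the largest candidate macro-N (multiple of 8, at most 256) that
// evenly divides the problem's N extent; fall back to the smallest size.
int64_t pickMacroN(int64_t n_extent) {
  for (int64_t n = 256; n >= 8; n -= 8) {
    if (n_extent % n == 0) {
      return n;
    }
  }
  return 8; // no multiple-of-8 divisor; fall back to the smallest macro N
}
```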


const auto tryIncreaseM = [&]() {
if (ratiosValid(m_ratio + 1, n_ratio)) {
m_ratio++;
jacobhinkle (Collaborator, Author):

Should these also be powers of two? Currently this will choose sizes like 192.

@jacobhinkle jacobhinkle changed the title [WIP] Accept Hopper matmuls and update default heuristic Accept Hopper matmuls and update default heuristic Dec 20, 2024
@jacobhinkle (Collaborator, Author):

!test

@jacobhinkle jacobhinkle marked this pull request as ready for review January 7, 2025 02:01
@jacobhinkle (Collaborator, Author):

!test

@jacobhinkle (Collaborator, Author):

!test --matmul-bench

@jacobhinkle jacobhinkle requested a review from rdspring1 January 7, 2025 02:03
@@ -193,7 +193,7 @@ class MatmulParams : public HeuristicParams {

//! This is the CGA size on Hopper+ devices. This parameter is ignored on
//! Ampere and Turing.
- std::tuple<int64_t, int64_t, int64_t> cluster_dims = {2, 1, 1};
+ std::tuple<int64_t, int64_t, int64_t> cluster_dims = {1, 1, 1};
jacobhinkle (Collaborator, Author):

If the grid size is not divisible by the cluster size then we get a launch error, so we should default to not use cluster dims unless explicitly handled by a heuristic.
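The constraint described here can be stated as a simple divisibility check: a cluster launch fails unless every grid dimension is a multiple of the corresponding cluster dimension, so a heuristic should only enable a non-trivial cluster when this holds. `clusterDimsValid` below is a hypothetical helper for illustration, not the actual nvFuser check.

```cpp
#include <cstdint>
#include <tuple>

// True if each grid dimension is divisible by the matching cluster
// dimension, i.e. the grid can be partitioned into whole clusters.
bool clusterDimsValid(
    std::tuple<int64_t, int64_t, int64_t> grid,
    std::tuple<int64_t, int64_t, int64_t> cluster) {
  return std::get<0>(grid) % std::get<0>(cluster) == 0 &&
      std::get<1>(grid) % std::get<1>(cluster) == 0 &&
      std::get<2>(grid) % std::get<2>(cluster) == 0;
}
```

With the previous default of {2, 1, 1}, any odd grid X dimension would fail this check, which motivates the {1, 1, 1} default.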

@rdspring1 (Collaborator) left a comment:
LGTM. I just left some thoughts.

// The Hopper register file is 256KiB. We reduce this by a factor of 1/2 to
// account for overhead, since not all of the registers will hold MMA
// outputs.
const size_t max_registers_per_sm = device_prop->regsPerMultiprocessor / 2L;
rdspring1 (Collaborator):

I wonder if getRegPerThreadGivenThreadsPerSM is more accurate than registers_per_sm / 2.

Suggested change
const size_t max_registers_per_sm = device_prop->regsPerMultiprocessor / 2L;
// tma warp group + 2 * compute warp groups
constexpr int64_t threads_per_sm = 384;
const size_t max_registers_per_sm = getRegPerThreadGivenThreadsPerSM(threads_per_sm) * threads_per_sm;
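For reference, a rough sketch of what such a computation could look like. This is a hypothetical stand-in for the helper named in the suggestion, under assumed Hopper parameters: a 65536-register file per SM, a per-thread allocation granularity of 8 registers, and a 255-register architectural cap per thread; the real `getRegPerThreadGivenThreadsPerSM` may differ.

```cpp
#include <algorithm>
#include <cstdint>

// Registers usable per thread when threads_per_sm threads are resident:
// divide the register file evenly, round down to the allocation
// granularity, and clamp to the per-thread architectural maximum.
int64_t regPerThreadGivenThreadsPerSM(int64_t threads_per_sm) {
  constexpr int64_t regs_per_sm = 65536; // assumed 256 KiB register file
  constexpr int64_t granularity = 8;     // assumed allocation step per thread
  int64_t regs = regs_per_sm / threads_per_sm;
  regs = (regs / granularity) * granularity; // round down to a multiple of 8
  return std::min<int64_t>(regs, 255);       // per-thread cap
}
```

Under these assumptions, 384 threads (one TMA warp group plus two compute warp groups) gives 168 registers per thread, i.e. 64512 registers total, which is slightly tighter than the regsPerMultiprocessor / 2 = 32768 estimate is loose.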

// outputs.
const size_t max_registers_per_sm = device_prop->regsPerMultiprocessor / 2L;

const size_t regs_per_warp_group = warp_tile.m * warp_tile.n * num_problems;
rdspring1 (Collaborator):

Suggested change
const size_t regs_per_warp_group = warp_tile.m * warp_tile.n * num_problems;
// total accumulator registers for warp group
const size_t accum_regs_per_warp_group = warp_tile.m * warp_tile.n * num_problems;


const size_t regs_per_warp_group = warp_tile.m * warp_tile.n * num_problems;

const auto ratiosValid = [&](const DimType m_ratio, const DimType n_ratio) {
rdspring1 (Collaborator):

nitpick: snake_case for lambda functions.

Suggested change
const auto ratiosValid = [&](const DimType m_ratio, const DimType n_ratio) {
const auto ratios_valid = [&](const DimType m_ratio, const DimType n_ratio) {

DimType cta_n = warp_tile.n * n_ratio;
increased = false;

const auto tryIncreaseM = [&]() {
rdspring1 (Collaborator):

Suggested change
const auto tryIncreaseM = [&]() {
const auto try_increaseM = [&]() {

}
return increased;
};
const auto tryIncreaseN = [&]() {
rdspring1 (Collaborator):

Suggested change
const auto tryIncreaseN = [&]() {
const auto try_increaseN = [&]() {


const size_t regs_per_warp_group = warp_tile.m * warp_tile.n * num_problems;

const auto ratiosValid = [&](const DimType m_ratio, const DimType n_ratio) {
rdspring1 (Collaborator):

alternate name and description.

Suggested change
const auto ratiosValid = [&](const DimType m_ratio, const DimType n_ratio) {
// The cta tile is a multiple of the warp tile. This lambda checks that cta tile given by warp_tile and multiple fits on the SM.
const auto validate_cta_tile_multiple = [&](const DimType m_ratio, const DimType n_ratio) {

2 participants