persistent_use_of_buffer is accumulated over all the resolution points. #4

jjsjann123 · 2023-03-14T09:24:21Z

Cherry-picking from: csarofeen/pytorch#2576

Author: Naoya Maruyama naoyam@users.noreply.github.com
Date: Mon Mar 13 17:50:01 2023 -0700

persistent_use_of_buffer is accumulated over all the resolution points. (#2576)

Recomputation for each persistent use should be done after the
accumulation is done.

Currently, recomputation and replaceVal can be done redundantly. For
example, on A100, that happens with NvFuserScheduler_BatchNorm_fp32/64/32/256.

* fix tests for multicluster fusion

…s. (#2576) Recomputation for each persistent use should be done after the accumulation is done. Currently, recomputation and replaceVal can be done redundantly. For example, on A100, that happens with NvFuserScheduler_BatchNorm_fp32/64/32/256.

jjsjann123 · 2023-03-14T10:20:56Z

I think this one is good. I'm going to merge it with the fouling commit (#3) reverted.

This reverts commit 839a299.

…s. (NVIDIA#4) Cherry-picking from: csarofeen/pytorch#2576 Author: Naoya Maruyama naoyam@users.noreply.github.com Date: Mon Mar 13 17:50:01 2023 -0700 persistent_use_of_buffer is accumulated over all the resolution points. (NVIDIA#2576) Recomputation for each persistent use should be done after the accumulation is done. Currently, recomputation and replaceVal can be done redundantly. For example, on A100, that happens with NvFuserScheduler_BatchNorm_fp32/64/32/256. Co-authored-by: Naoya Maruyama <naoyam@users.noreply.github.com>

``` Traceback (most recent call last): File "/opt/pytorch/nvfuser/nvfuser/__init__.py", line 122, in execute result = self._execute( RuntimeError: isSame(values_[it.first], it.second) INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/evaluator_common.cpp":314, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Precomputed values failed to validate. Something unexpected changed between the compilation and execution. nan != nan Exception raised from validate at /opt/pytorch/nvfuser/csrc/evaluator_common.cpp:314 (most recent call first): frame #0: nvfuser::nvfCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x8d (0x7fdc9919fe3b in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so) frame #1: nvfuser::nvfErrorFail(char const*, char const*, unsigned int, char const*, std::string const&) + 0x53 (0x7fdc992ded63 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so) frame #2: nvfuser::PrecomputedValues::validate() + 0x172 (0x7fdc993190f2 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so) frame #3: nvfuser::PrecomputedValues::evaluate() + 0x66 (0x7fdc9931fde6 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so) frame #4: nvfuser::FusionExecutor::inferOutputSizes(nvfuser::Fusion*, nvfuser::KernelArgumentHolder const&) + 0x8d (0x7fdc992ea12d in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so) frame #5: nvfuser::FusionKernelRuntime::compileFusionParallel(nvfuser::KernelArgumentHolder) + 0x46d (0x7fdc9943a6ad in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so) frame #6: nvfuser::FusionExecutorCache::runFusionWithInputs(c10::ArrayRef<c10::IValue> const&, std::optional<nvfuser::PrimDataType>, std::optional<signed char>) + 0xa8d (0x7fdc99443c9d in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so) frame #7: nvfuser::python_frontend::FusionDefinition::execute(c10::ArrayRef<c10::IValue> const&, bool, bool, std::optional<signed char>) const + 0x331 (0x7fdc997450e1 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so) frame #8: <unknown function> + 0xeec2e (0x7fdbe8274c2e in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) frame #9: <unknown function> + 0x16e137 (0x7fdbe82f4137 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so) <omitting python frames> frame #38: <unknown function> + 0x29d90 (0x7fdd26ea0d90 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #39: __libc_start_main + 0x80 (0x7fdd26ea0e40 in /usr/lib/x86_64-linux-gnu/libc.so.6) ```

…_test_profiling Rebase reduction, add test profiling, and use FusionExecutorCache for allocation

This introduces a thread-local global memory allocator for each device and uses it whenever there is an intermediate tensor needed which requires zero-initialization. To enable use `NVFUSER_ENABLE=reuse_zeroed_memory`. You can monitor the allocator using `NVFUSER_DUMP=global_zeroed_memory`. Before we enable this feature by default, we need to have high confidence that every kernel using zero-initialized memory will always clean up their semaphores. This is currently only the case for serial grid reductions, as far as I know. This enables the basic functionality of #1829. However, it does not modify existing algorithms to clean up their memory. See `NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling`, which succeeds when using serial grid reduction, but fails (in debug mode) when using `gridReduce` (note that this test is updated to behave differently in this PR): ``` # NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling Running main() from /opt/pytorch/nvfuser/third_party/googletest/googletest/src/gtest_main.cc Note: Google Test filter = SerialGridReductionTest.Scheduling [==========] Running 1 test from 1 test suite. [----------] Global test environment set-up. [----------] 1 test from SerialGridReductionTest [ RUN ] SerialGridReductionTest.Scheduling [global zeroed memory] Resizing arena to 512 bytes [global zeroed memory] Allocating byte range: 0 to 512 bytes [global zeroed memory] Resetting allocated bytes to 0 [global zeroed memory] Allocating byte range: 0 to 512 bytes [global zeroed memory] Resetting allocated bytes to 0 [global zeroed memory] Resizing arena to 16384 bytes [global zeroed memory] Allocating byte range: 0 to 16384 bytes [global zeroed memory] Resetting allocated bytes to 0 [global zeroed memory] Allocating byte range: 0 to 16384 bytes unknown file: Failure C++ exception with description "nnz.equal(0) INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/global_allocator.cpp":88, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Global memory arena was not properly zeroed. Found 2048 bytes that are not zero Exception raised from checkZeroed at /opt/pytorch/nvfuser/csrc/global_allocator.cpp:88 (most recent call first): frame #0: <unknown function> + 0x2fde9e (0x556cdb95de9e in build/nvfuser_tests) frame #1: <unknown function> + 0x2fe0df (0x556cdb95e0df in build/nvfuser_tests) frame #2: <unknown function> + 0x3f3720 (0x556cdba53720 in build/nvfuser_tests) frame #3: <unknown function> + 0x3f33df (0x556cdba533df in build/nvfuser_tests) frame #4: <unknown function> + 0x3f38ed (0x556cdba538ed in build/nvfuser_tests) frame #5: <unknown function> + 0x315e67 (0x556cdb975e67 in build/nvfuser_tests) frame #6: <unknown function> + 0x7c5780 (0x556cdbe25780 in build/nvfuser_tests) frame #7: <unknown function> + 0x7c5877 (0x556cdbe25877 in build/nvfuser_tests) frame #8: <unknown function> + 0x138f8cc (0x556cdc9ef8cc in build/nvfuser_tests) frame #9: <unknown function> + 0x1457f0b (0x556cdcab7f0b in build/nvfuser_tests) frame #10: <unknown function> + 0x14519fd (0x556cdcab19fd in build/nvfuser_tests) frame #11: <unknown function> + 0x142de24 (0x556cdca8de24 in build/nvfuser_tests) frame #12: <unknown function> + 0x142e93f (0x556cdca8e93f in build/nvfuser_tests) frame #13: <unknown function> + 0x142f345 (0x556cdca8f345 in build/nvfuser_tests) frame #14: <unknown function> + 0x143f86c (0x556cdca9f86c in build/nvfuser_tests) frame #15: <unknown function> + 0x1458e98 (0x556cdcab8e98 in build/nvfuser_tests) frame #16: <unknown function> + 0x1452ac7 (0x556cdcab2ac7 in build/nvfuser_tests) frame #17: <unknown function> + 0x143de6d (0x556cdca9de6d in build/nvfuser_tests) frame #18: <unknown function> + 0x1407ca0 (0x556cdca67ca0 in build/nvfuser_tests) frame #19: <unknown function> + 0x1407c19 (0x556cdca67c19 in build/nvfuser_tests) frame #20: <unknown function> + 0x29d90 (0x7f616c7d4d90 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #21: __libc_start_main + 0x80 (0x7f616c7d4e40 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #22: <unknown function> + 0x11e9d5 (0x556cdb77e9d5 in build/nvfuser_tests) " thrown in the test body. To reproduce: NVFUSER_TEST_RANDOM_SEED=1711120799 NVFUSER_TEST_ATEN_RANDOM_SEED=0 nvfuser_tests --gtest_filter='SerialGridReductionTest.Scheduling' [ FAILED ] SerialGridReductionTest.Scheduling (5669 ms) [----------] 1 test from SerialGridReductionTest (5669 ms total) ``` This test runs with serial grid reduction, then with `gridReduce`. Each time it runs two grid reductions. Both serial grid reductions succeed because the semaphore buffer is properly zeroed. The `gridReduce` succeeds the first time since the memory pool calls `at::zeros` again to request a larger buffer size (`gridReduce` requires more semaphores since there is one per thread segment vs one for each each block segment). However, the second call to `gridReduce` fails because it has not cleaned up its semaphores. Hacking that function to force `PERSISTENT=1` would clean up the semaphores resulting in success in this case. I'm leaving those kind of modifications for a follow-up.

Type error was detected with #3263 while I was testing it with a Debug build. ``` pytest -v tests/python/test_python_frontend.py -k test_pad_dynamic ``` It has a fusion of: ``` Inputs: T0_g_float[ bS0{1}, iS1{i1}, iS2{i2} ] Outputs: T1_g_float[ bS11{1}, iS12{( ( i1 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}, iS13{( ( i2 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )} ] %kernel_math { f7 = (float)(7); f9 = float(2.5) * f7; i11 = (int64_t)(f9); i14 = (nvfuser_index_t)(i11); i16 = (nvfuser_index_t)(i11); i18 = (nvfuser_index_t)(i11); i20 = (nvfuser_index_t)(i11); T2_l_float[ bS3{1}, iS5{( ( i1 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}rf, iS7{( ( i2 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}rf ] = pad( T0_g_float[ bS0{1}, iS1{i1}, iS2{i2} ], {0, 0, i14, i16, i18, i20} ) T1_g_float[ bS11{1}, iS12{( ( i1 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}, iS13{( ( i2 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )} ] = Set( T2_l_float[ bS3{1}, iS5{( ( i1 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}rf, iS7{( ( i2 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}rf ], cache_op=Streaming ) } // %kernel_math ``` Stack trace: ``` #0 __cxxabiv1::__cxa_throw (obj=0xabc6ea0, tinfo=0x7ffeba81c248 <typeinfo for nvfuser::nvfError>, dest=0x7ffeba059370 <nvfuser::nvfError::~nvfError()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:80 #1 0x00007ffeba058665 in nvfuser::nvfCheckFail (func=0x7ffeb9927c33 "as", file=0x7ffeb99d9a1b "/raid/nmaruyama/debug1/csrc/utils.h", line=119, msg=0x7ffeb99c36e6 " INTERNAL ASSERT FAILED at \"/raid/nmaruyama/debug1/csrc/utils.h\":119, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. ") at /raid/nmaruyama/debug1/csrc/exceptions.cpp:283 #2 0x00007ffeb9c1be4b in nvfuser::nvfErrorFail (func=0x7ffeb9927c33 "as", file=0x7ffeb99d9a1b "/raid/nmaruyama/debug1/csrc/utils.h", line=119, condMsg=0x7ffeb99c36e6 " INTERNAL ASSERT FAILED at \"/raid/nmaruyama/debug1/csrc/utils.h\":119, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. ") at /raid/nmaruyama/debug1/csrc/exceptions.h:229 #3 0x00007ffeb9c1bbe4 in nvfuser::PolymorphicBase::as<nvfuser::TensorView> (this=0xac07490) at /raid/nmaruyama/debug1/csrc/utils.h:119 #4 0x00007ffeba54c67f in nvfuser::(anonymous namespace)::isLoadGlobalToLocal (expr=0xabd7c50) at /raid/nmaruyama/debug1/csrc/scheduler/cache_policy_refiner.cpp:61 #5 0x00007ffeba54c599 in nvfuser::refineCachePolicy (fusion=0xabef940) at /raid/nmaruyama/debug1/csrc/scheduler/cache_policy_refiner.cpp:153 ```

samnordmann and others added 3 commits March 14, 2023 02:07

Fix multidevice tests (#2574)

e050187

* fix tests for multicluster fusion

IterDomain resize for pad, cat, slice (#2480)

839a299

jjsjann123 requested a review from naoyam March 14, 2023 09:25

Revert "IterDomain resize for pad, cat, slice (#2480)"

9ef183b

This reverts commit 839a299.

jjsjann123 changed the base branch from cherry_pick_2480 to main March 14, 2023 10:22

Merge branch 'main' into cherry_pick_2576

03ea9a6

jjsjann123 merged commit cd899f6 into main Mar 14, 2023

jjsjann123 deleted the cherry_pick_2576 branch March 14, 2023 10:26

liqiangxl mentioned this pull request May 8, 2023

MatmulSASSTest on H100 #306

Closed

jjsjann123 mentioned this pull request Nov 9, 2023

Codegen error: Exception raised from inferShape at /opt/pytorch/nvfuser/csrc/executor.cpp:532 #1277

Closed

cowanmeg pushed a commit to cowanmeg/Fuser that referenced this pull request Jan 5, 2024

Merge pull request NVIDIA#4 from samnordmann/rebase_reduction_and_add…

7894794

…_test_profiling Rebase reduction, add test profiling, and use FusionExecutorCache for allocation

DariusMaham mentioned this pull request Apr 18, 2024

Internal assert failed #2107

Closed

wujingyue mentioned this pull request May 8, 2024

Write a sharded transformer block in nvFuser API. #2199

Closed

This was referenced Jun 6, 2024

Squeezed IterDomain ?S536{1} must concretize to IterType::Broadcast but found ?S536{1}. #2359

Closed

Merging IterDomains requires that their iteration types match. #2317

Closed

wujingyue mentioned this pull request Oct 18, 2024

OpInfo has problems testing define_tensor. #3225

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

persistent_use_of_buffer is accumulated over all the resolution points. #4

persistent_use_of_buffer is accumulated over all the resolution points. #4

jjsjann123 commented Mar 14, 2023

jjsjann123 commented Mar 14, 2023

persistent_use_of_buffer is accumulated over all the resolution points. #4

persistent_use_of_buffer is accumulated over all the resolution points. #4

Conversation

jjsjann123 commented Mar 14, 2023

jjsjann123 commented Mar 14, 2023