
Ifu20240625 group gemm yewang12 #59

Open · wants to merge 25 commits into base: dev
Conversation

wangye805 (Contributor)

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactor

Changes

Please list the changes introduced in this PR:

  • a4e95e8: the first commit for grouped linear, pulled in from our upstream (see the usage sketch below)
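
A minimal usage sketch of the new layer, assuming it is exposed as transformer_engine.pytorch.GroupedLinear with a forward(inp, m_splits) signature; the exact module name and arguments here are illustrative, not confirmed by this PR:

```python
import torch
import transformer_engine.pytorch as te

# Three independent GEMMs (e.g. three experts) handled by one module.
num_gemms, in_features, out_features = 3, 256, 512
layer = te.GroupedLinear(num_gemms, in_features, out_features, bias=True).cuda()

# Tokens for all groups are packed along the first dimension; m_splits gives
# the number of rows that belong to each group.
m_splits = [128, 64, 32]
inp = torch.randn(sum(m_splits), in_features, device="cuda")

out = layer(inp, m_splits)   # one grouped GEMM call instead of num_gemms separate ones
print(out.shape)             # torch.Size([224, 512])
```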

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

cyanguwa and others added 25 commits June 13, 2024 21:25
* add attention docs

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: update attention doc

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: update attention doc

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: update attention doc

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: update attn doc

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: update attn doc

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: update attn doc

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: update attention doc

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* first draft

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor tweak to first draft

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* clean up pictures

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* first draft for review

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fixes

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add logging info/debug

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fix of an SWA message

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* use subprocess instead of os.sys

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
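
(For context on the switch above: os.system hands the command string to a shell and only returns an exit code, while subprocess gives an explicit argument list, captured output, and error checking. The script name and flags below are placeholders.)

```python
import subprocess

# Equivalent of os.system("python benchmark_attention.py --batch 32"),
# but with captured output and an exception raised on a non-zero exit code.
result = subprocess.run(
    ["python", "benchmark_attention.py", "--batch", "32"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```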

* clean up benchmark script

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add example script and update notebook

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor tweak

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor tweaks

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix lint

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix Jax/Paddle related comments

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* rerun H100 benchmark

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* restrict fp8 tests to sm90+

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* move get_cudnn_version from common to pytorch utils

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Initial config test

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* remove linters, fix clang-format

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix clang-format

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix clang-format

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Remove lint

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Adjust config

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* use config file

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* adjust pylintrc

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* pre-format fixes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Python only

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Add FA module

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fixes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update CI configs

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* CRLF -> LF

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* format

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* revert accidental formatting changes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* try with sudo

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* cpp formatting

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix pylint error properly

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* some review comments

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* lint fixes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* add fp8 attn include in the correct file

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* autofix PRs

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Apply formatting

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Apply formatting

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* A hot fix to disable CE deadlock check

Signed-off-by: Pavel Shamis (Pasha) <pasharesearch@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pavel Shamis (Pasha) <pasharesearch@gmail.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* subclass DPA with BaseModule and test with test_gpt_checkpointing

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* test DPA only

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* test save and load

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove debug info

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor tweaks

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor tweak

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add hook in case core_attention._extra_state is missing

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* check named buffers in BaseModule; remove FP8 scratchpad override function; test FP8 for sm90+

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fixes: test size, interval in recipe, named_buffer loop

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* move BaseModule from FusedAttention to DPA

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…se_fused (#931)

* rm tensor check if the workspace is empty

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* add trust_remote=true for load_dataset() in the mnist test

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…937)

replaced plain C asserts with NVTE_CHECK to avoid unused-variable warnings

Signed-off-by: Alp Dener <adener@nvidia.com>
* Add the option to use SM for P2P comm in TP overlap

Signed-off-by: Sangkug Lym <slym@nvidia.com>

* cleanup

Signed-off-by: Sangkug Lym <slym@nvidia.com>

* Python formatting with black

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Format C++ with clang-format

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/pytorch/csrc/comm_gemm_overlap.h

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

---------

Signed-off-by: Sangkug Lym <slym@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Remove optional UB build leftovers

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* rm unused import

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
fix tp_initialized error

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* simplify offset tensors

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fixes; tests pass

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix C lint

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* replace with_offset with with_padding

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* replace with_padding with padded

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fixes after merge

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fix for fused attn fwd/bwd calls

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix Jax

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* adjust spacing in docstring

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix pytorch tests; fix paddle api

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix lint

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix attn_biases

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix AttnFuncWithCP backward

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix jax

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix attn with CP

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix paddle

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Release GIL in PyTorch pybind11 functions

Signed-off-by: Tim Moon <tmoon@nvidia.com>
* adding option to select only .cpp files in a dir in the build tool

* change cmake build path

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* GroupedGEMM via multi-stream cublas

* fix A/B is nullptr while D is not nullptr

* add fp8 grouped gemm

* register with TorchScript

* add the GroupedLinear layer

---------

Signed-off-by: Xin Yao <xiny@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: Jiang Shao <jiangs@nvidia.com>
Co-authored-by: Qi Zhang <qizhang@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
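
The "GroupedGEMM via multi-stream cublas" work above dispatches the independent per-group GEMMs onto several CUDA streams so they can overlap on the GPU. A minimal PyTorch-level sketch of that dispatch pattern (illustrative only, not the library's actual C++/cuBLAS implementation):

```python
import torch

def grouped_gemm_multistream(a_list, b_list, num_streams=4):
    """Run one GEMM per group, round-robining over CUDA streams so the
    independent matmuls can overlap instead of running back-to-back."""
    streams = [torch.cuda.Stream() for _ in range(num_streams)]
    main = torch.cuda.current_stream()
    outputs = [None] * len(a_list)
    for i, (a, b) in enumerate(zip(a_list, b_list)):
        s = streams[i % num_streams]
        s.wait_stream(main)          # inputs produced on the main stream are ready
        with torch.cuda.stream(s):
            outputs[i] = a @ b       # each group's GEMM is enqueued on its own stream
    for s in streams:
        main.wait_stream(s)          # rejoin before the results are consumed
    return outputs

# Example: 8 groups with different row counts, as in a mixture-of-experts layer.
a_list = [torch.randn(m, 256, device="cuda") for m in (32, 64, 16, 128, 48, 8, 96, 24)]
b_list = [torch.randn(256, 512, device="cuda") for _ in a_list]
outs = grouped_gemm_multistream(a_list, b_list)
```
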
… Ignore NVTE_FLASH_ATTN env till FA is enabled for ROCm
Fix typo when selecting tuned RMSNorm kernels

Signed-off-by: Tim Moon <tmoon@nvidia.com>