
Ifu20240625 group gemm yewang12 #59

Open · wants to merge 25 commits into base: dev
Conversation

wangye805 (Contributor)

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactor

Changes

Please list the changes introduced in this PR:

  • a4e95e8: the first commit for grouped linear, pulled in from our upstream (see the usage sketch below)
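
A minimal usage sketch of the new layer, assuming it is exposed as transformer_engine.pytorch.GroupedLinear with a forward(inp, m_splits) signature; the exact module name and arguments here are illustrative, not confirmed by this PR:

```python
import torch
import transformer_engine.pytorch as te

# Three independent GEMMs (e.g. three experts) handled by one module.
num_gemms, in_features, out_features = 3, 256, 512
layer = te.GroupedLinear(num_gemms, in_features, out_features, bias=True).cuda()

# Tokens for all groups are packed along the first dimension; m_splits gives
# the number of rows that belong to each group.
m_splits = [128, 64, 32]
inp = torch.randn(sum(m_splits), in_features, device="cuda")

out = layer(inp, m_splits)   # one grouped GEMM call instead of num_gemms separate ones
print(out.shape)             # torch.Size([224, 512])
```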

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

cyanguwa and others added 25 commits June 13, 2024 21:25
* add attention docs

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: update attention doc

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: update attention doc

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: update attention doc

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: update attn doc

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: update attn doc

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: update attn doc

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: update attention doc

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* first draft

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor tweak to first draft

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* clean up pictures

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* first draft for review

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fixes

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add logging info/debug

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fix of an SWA message

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* use subprocess instead of os.sys

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
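
(For context on the switch above: os.system hands the command string to a shell and only returns an exit code, while subprocess gives an explicit argument list, captured output, and error checking. The script name and flags below are placeholders.)

```python
import subprocess

# Equivalent of os.system("python benchmark_attention.py --batch 32"),
# but with captured output and an exception raised on a non-zero exit code.
result = subprocess.run(
    ["python", "benchmark_attention.py", "--batch", "32"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```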

* clean up benchmark script

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add example script and update notebook

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor tweak

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor tweaks

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix lint

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix Jax/Paddle related comments

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* rerun H100 benchmark

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* restrict fp8 tests to sm90+

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* move get_cudnn_version from common to pytorch utils

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Initial config test

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* remove linters, fix clang-format

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix clang-format

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix clang-format

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Remove lint

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Adjust config

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* use config file

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* adjust pylintrc

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* pre-format fixes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Python only

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Add FA module

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fixes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update CI configs

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* CRLF -> LF

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* format

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* revert accidental formatting changes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* try with sudo

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* cpp formatting

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix pylint error properly

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* some review comments

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* lint fixes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* add fp8 attn include in the correct file

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* autofix PRs

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Apply formatting

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Apply formatting

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* A hot fix to disable CE deadlock check

Signed-off-by: Pavel Shamis (Pasha) <pasharesearch@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pavel Shamis (Pasha) <pasharesearch@gmail.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* subclass DPA with BaseModule and test with test_gpt_checkpointing

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* test DPA only

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* test save and load

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove debug info

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor tweaks

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor tweak

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add hook in case core_attention._extra_state is missing

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* check named buffers in BaseModule; remove FP8 scratchpad override function; test FP8 for sm90+

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fixes: test size, interval in recipe, named_buffer loop

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* move BaseModule from FusedAttention to DPA

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…se_fused (#931)

* rm tensor check if the workspace is empty

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* add trust_remote=true for load_dataset() in the mnist test

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…937)

replaced plain C asserts with NVTE_CHECK to avoid unused-variable warnings

Signed-off-by: Alp Dener <adener@nvidia.com>
* Add the option to use SM for P2P comm in TP overlap

Signed-off-by: Sangkug Lym <slym@nvidia.com>

* cleanup

Signed-off-by: Sangkug Lym <slym@nvidia.com>

* Python formatting with black

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Format C++ with clang-format

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/pytorch/csrc/comm_gemm_overlap.h

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

---------

Signed-off-by: Sangkug Lym <slym@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Remove optional UB build leftovers

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* rm unused import

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
fix tp_initialized error

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* simplify offset tensors

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fixes; tests pass

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix C lint

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* replace with_offset with with_padding

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* replace with_padding with padded

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fixes after merge

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fix for fused attn fwd/bwd calls

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix Jax

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* adjust spacing in docstring

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix pytorch tests; fix paddle api

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix lint

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix attn_biases

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix AttnFuncWithCP backward

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix jax

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix attn with CP

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix paddle

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Release GIL in PyTorch pybind11 functions

Signed-off-by: Tim Moon <tmoon@nvidia.com>
* adding option to select only .cpp files in a dir in the build tool

* change cmake build path

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* GroupedGEMM via multi-stream cublas

* fix A/B is nullptr while D is not nullptr

* add fp8 grouped gemm

* register with TorchScript

* add the GroupedLinear layer

---------

Signed-off-by: Xin Yao <xiny@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: Jiang Shao <jiangs@nvidia.com>
Co-authored-by: Qi Zhang <qizhang@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
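
The "GroupedGEMM via multi-stream cublas" work above dispatches the independent per-group GEMMs onto several CUDA streams so they can overlap on the GPU. A minimal PyTorch-level sketch of that dispatch pattern (illustrative only, not the library's actual C++/cuBLAS implementation):

```python
import torch

def grouped_gemm_multistream(a_list, b_list, num_streams=4):
    """Run one GEMM per group, round-robining over CUDA streams so the
    independent matmuls can overlap instead of running back-to-back."""
    streams = [torch.cuda.Stream() for _ in range(num_streams)]
    main = torch.cuda.current_stream()
    outputs = [None] * len(a_list)
    for i, (a, b) in enumerate(zip(a_list, b_list)):
        s = streams[i % num_streams]
        s.wait_stream(main)          # inputs produced on the main stream are ready
        with torch.cuda.stream(s):
            outputs[i] = a @ b       # each group's GEMM is enqueued on its own stream
    for s in streams:
        main.wait_stream(s)          # rejoin before the results are consumed
    return outputs

# Example: 8 groups with different row counts, as in a mixture-of-experts layer.
a_list = [torch.randn(m, 256, device="cuda") for m in (32, 64, 16, 128, 48, 8, 96, 24)]
b_list = [torch.randn(256, 512, device="cuda") for _ in a_list]
outs = grouped_gemm_multistream(a_list, b_list)
```
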
… Ignore NVTE_FLASH_ATTN env till FA is enabled for ROCm
Fix typo when selecting tuned RMSNorm kernels

Signed-off-by: Tim Moon <tmoon@nvidia.com>