
Add sum reduction operator to TritonBench #2282

Closed
wants to merge 1 commit into from

Conversation

@jananisriram (Contributor) commented Jun 6, 2024

Summary:
Add a Triton reduction kernel for the `sum` operator where `dim=None` to TritonBench, following the [TritonBench guide](https://fb.workplace.com/notes/953949486404240). This implementation reduces matrices of any shape to a single scalar value.
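The `dim=None` reduction strategy can be modeled in plain Python. This is a hypothetical sketch of the kernel's control flow, not the PR's actual Triton code: the flattened input is split into fixed-size blocks, each Triton program instance sums one block, and the partial sums are accumulated into a single scalar output (in Triton this would be `tl.sum` per block followed by `tl.atomic_add` into the output buffer).

```python
BLOCK_SIZE = 1024  # illustrative block size; the real kernel may autotune this

def blocked_sum(x):
    """Model of a dim=None sum reduction: matrix -> scalar."""
    n = len(x)
    out = 0.0  # output is reset per run, as the PR notes, so stale
               # values from a previous benchmark iteration cannot leak in
    num_blocks = (n + BLOCK_SIZE - 1) // BLOCK_SIZE  # launch-grid size
    for pid in range(num_blocks):            # each iteration = one program id
        start = pid * BLOCK_SIZE
        block = x[start:start + BLOCK_SIZE]  # slicing models the masked tail load
        out += sum(block)                    # per-block reduce + accumulate
    return out
```

The slice at the tail models Triton's masked load: the final block may be shorter than `BLOCK_SIZE`, and masked-out lanes contribute zero to the sum.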

To measure the accuracy of the Triton reduction kernel, add an accuracy metric to the sum operator in TritonBench that checks the Triton implementation against the baseline PyTorch implementation, referencing [`torchbenchmark/operators/gemm/operator.py`](https://www.internalfb.com/code/fbsource/[767bb6faa353685b84f08a39f36fdcf6ca170c85]/fbcode/pytorch/benchmark/torchbenchmark/operators/gemm/operator.py?lines=236). The output registers are reset on each run of the Triton kernel so that stale accumulator values do not corrupt the measured output.
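A scalar version of such an accuracy check might look like the following. This is illustrative only: the function name and tolerance are assumptions, and the PR's actual metric compares tensors (typically via PyTorch's tolerance-based comparison) rather than Python floats.

```python
import math

def accuracy(output: float, baseline: float, rel_tol: float = 1e-2) -> bool:
    # Pass if the Triton result is within a relative tolerance of the
    # PyTorch baseline. A tolerance (rather than exact equality) is needed
    # because blocked/atomic accumulation reorders floating-point adds.
    return math.isclose(output, baseline, rel_tol=rel_tol)
```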

To measure the performance of the Triton reduction kernel against PyTorch, add a `gbps` (memory-bandwidth) metric, referencing [`torchbenchmark/operators/vector_add/operator.py`](https://www.internalfb.com/code/fbsource/[858eda681c7618f9427ba55cef8d4aba712cb26e]/fbcode/pytorch/benchmark/torchbenchmark/operators/vector_add/operator.py?lines=19).
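For a full reduction, the bandwidth metric counts one read per input element; the single scalar write is negligible. A minimal sketch, assuming latency is reported in milliseconds (the signature and names here are illustrative, not the PR's exact code):

```python
def gbps(num_elements: int, element_size_bytes: int, ms: float) -> float:
    # GB/s = bytes moved / seconds / 1e9. A dim=None sum reads every
    # input element exactly once and writes a single scalar, so memory
    # traffic is dominated by the input read.
    return num_elements * element_size_bytes / (ms * 1e-3) / 1e9
```

For example, summing 1M fp32 elements (4 MB) in 1 ms corresponds to roughly 4 GB/s of achieved read bandwidth.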

Referenced the existing [vector_add](https://www.internalfb.com/code/fbsource/fbcode/pytorch/benchmark/torchbenchmark/operators/vector_add/) and [grouped_gemm](https://www.internalfb.com/code/fbsource/fbcode/pytorch/benchmark/torchbenchmark/operators/grouped_gemm/) TritonBench operators as templates for this implementation.

See the [TritonBench Operator Coverage Tracker](https://docs.google.com/spreadsheets/d/1091POOPSPsUnlNVEKaz2X_DQXdIwFv-fGOH_g9by-Zo/edit#gid=0) for current operator coverage in TritonBench.

Reviewed By: xuzhao9, davidberard98

Differential Revision: D58048782

@facebook-github-bot commented

This pull request was exported from Phabricator. Differential Revision: D58048782

jananisriram added a commit to jananisriram/benchmark that referenced this pull request Jun 6, 2024


@facebook-github-bot commented

This pull request has been merged in 2d8999b.
