Reference CPU backend #24
-
I'm not a member of
-
@shintaro-iwasaki this is really nice! I love the way you've not only gotten the compilation to work but also tied everything together, generating the host code, using the plug-in model, etc., to make this all work end-to-end. I'm very excited to have something like this available as both
I'd love to move forward with accepting this into the triton-shared tree. I'm happy to start small here and see where this leads, if that's what it takes to get started. FYI, the pass list that one of my colleagues has been experimenting with for CPU codegen looks like this:

```python
"--convert-tensor-to-linalg",
"--eliminate-empty-tensors",
"--empty-tensor-to-alloc-tensor",
"--one-shot-bufferize=allow-return-allocs-from-loops=true",
"--convert-linalg-to-loops",
"--convert-scf-to-cf",
# "--convert-linalg-to-llvm",  # this is what I have in my notes, but this pass doesn't exist; could it have been moved/changed?
"--convert-cf-to-llvm",
"--convert-arith-to-llvm",
"--convert-math-to-llvm",
"--convert-complex-to-llvm",
"--convert-vector-to-llvm",
"--convert-index-to-llvm",
"--finalize-memref-to-llvm",
"--convert-func-to-llvm",
"--reconcile-unrealized-casts"
```

This list is very similar to yours (so that is a good sign), but there are some subtle differences that seem to improve the lowering. I plugged this pass list into your code and was able to get some more complex cases to compile (e.g., a version of the vector-add kernel without masking). However, when running the generated CPU code, I'm seeing a segfault on the Triton store operation when it's executed. I didn't drill in beyond that, though; I suspect this could indicate some bad pointer mapping, but then again, it could be something else entirely.
-
#26 has landed. Thank you very much for the great discussions!
-
I haven't created a PR against this official `triton-shared` repository, but I'd love to know your ideas about having a reference CPU backend for triton-shared, one that can be called from the Python side.

Proof of Concept
The following implements a CPU backend without changing OpenAI's Triton runtime, using only `triton-shared-opt`, the standard MLIR/LLVM executables, and standard MLIR lowering rules. See `examples.yml` and its CI result for details: shintaro-iwasaki#1. Note that it's WIP, so I don't plan to merge it as it is; for now it can only run empty kernels.
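To make "called from the Python side" concrete, below is a hypothetical launcher sketch. Everything in it is an assumption for illustration: the library name `kernel.so`, the symbol `add_kernel`, and its signature are not taken from the PoC.

```python
# Hypothetical launcher (illustration only). Assumes the generated LLVM IR
# was compiled into a shared library, e.g.:
#   clang -O2 -shared kernel.ll -o kernel.so
import ctypes

lib = ctypes.CDLL("./kernel.so")

# Assumed signature: three raw pointers plus the program ID and grid size.
lib.add_kernel.argtypes = [ctypes.c_void_p] * 3 + [ctypes.c_int32] * 2
lib.add_kernel.restype = None

def launch(grid_x: int, a_ptr: int, b_ptr: int, c_ptr: int) -> None:
    # On CPU the GPU grid becomes a plain loop: one call per program ID.
    for pid in range(grid_x):
        lib.add_kernel(a_ptr, b_ptr, c_ptr, pid, grid_x)
```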
Merits
Demerits
Thanks for taking a look at it, and your ideas would be highly appreciated!