Reference CPU backend #24
-
I'm not a member of
-
@shintaro-iwasaki this is really nice! I love the way you've not only gotten the compilation to work but also tied everything together, generating the host code, using the plug-in model, etc., to make this all work end-to-end. I'm very excited to have something like this available as both
I'd love to move forward with accepting this into the triton-shared tree. I'm happy to start small here and see where this leads, if that's what it takes to get started. FYI, the pass list that one of my colleagues has been experimenting with for CPU codegen looks like this:

```python
"--convert-tensor-to-linalg",
"--eliminate-empty-tensors",
"--empty-tensor-to-alloc-tensor",
"--one-shot-bufferize=allow-return-allocs-from-loops=true",
"--convert-linalg-to-loops",
"--convert-scf-to-cf",
# "--convert-linalg-to-llvm",  # this is what I have in my notes, but this pass doesn't exist; could it have been moved/changed?
"--convert-cf-to-llvm",
"--convert-arith-to-llvm",
"--convert-math-to-llvm",
"--convert-complex-to-llvm",
"--convert-vector-to-llvm",
"--convert-index-to-llvm",
"--finalize-memref-to-llvm",
"--convert-func-to-llvm",
"--reconcile-unrealized-casts"
```

This list is very similar to yours (so that is a good sign), but there are some subtle differences that seem to improve the lowering. I plugged this pass list into your code and was able to get some more complex cases to compile (e.g., a version of the vector-add kernel without masking). However, when running the generated CPU code, I'm seeing a segfault on the Triton store operation when it's executed. I didn't drill in beyond that, though; I suspect this could indicate some bad pointer mapping, but then again, it could be something else entirely.
-
#26 has landed. Thank you very much for the great discussions!
-
I haven't created a PR against this official `triton-shared` repository, but I'd love to know your ideas about having a reference CPU backend for triton-shared, one that can be called from the Python side.

Proof of Concept
The following implements a CPU backend without changing OpenAI's Triton runtime, using only `triton-shared-opt`, the standard MLIR/LLVM executables, and standard MLIR lowering rules. See `examples.yml` and its CI result for details: shintaro-iwasaki#1. Note that it's WIP, so I don't plan to merge it as it is; for now it can only run empty kernels.
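To make "called from the Python side" concrete, below is a hypothetical launcher sketch. Everything in it is an assumption for illustration: the library name `kernel.so`, the symbol `add_kernel`, and its signature are not taken from the PoC.

```python
# Hypothetical launcher (illustration only). Assumes the generated LLVM IR
# was compiled into a shared library, e.g.:
#   clang -O2 -shared kernel.ll -o kernel.so
import ctypes

lib = ctypes.CDLL("./kernel.so")

# Assumed signature: three raw pointers plus the program ID and grid size.
lib.add_kernel.argtypes = [ctypes.c_void_p] * 3 + [ctypes.c_int32] * 2
lib.add_kernel.restype = None

def launch(grid_x: int, a_ptr: int, b_ptr: int, c_ptr: int) -> None:
    # On CPU the GPU grid becomes a plain loop: one call per program ID.
    for pid in range(grid_x):
        lib.add_kernel(a_ptr, b_ptr, c_ptr, pid, grid_x)
```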
Merits
Demerits
Thanks for taking a look at it, and your ideas would be highly appreciated!