
Support clang as a CUDA compiler #4

Closed
Artem-B opened this issue May 19, 2022 · 27 comments

Artem-B commented May 19, 2022

Clang is capable of CUDA compilation these days.

It would be great to add support for using it for CUDA compilation with bazel.

cloudhan (Collaborator) commented

Yeah, there is a plan. I have local modifications, but they haven't made it in yet.

cloudhan (Collaborator) commented

@Artem-B Considering you are one of the devs of clang CUDA, I have a question about it. I now have a nearly working configuration for clang. The only problem I am facing is, for example:

__global__ void kernel() {
  // blahblah with impl
}

with nvcc, the compiled object has the symbols

000000000000019a T __device_stub__Z6kernelv()
00000000000002aa T kernel()

but with clang version 15.0.0 (https://github.com/llvm/llvm-project.git 009d56da5c4ea3666c4753ce7564c8c20d7e0255)

0000000000000000 T __device_stub__kernel()

the kernel() symbol is missing from the object, which causes linker errors for various __global__ functions in my NCCL example. In that example, a libnccl.so is produced first and then linked into various perf test binaries. I am curious what the root problem is here.
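
For reference, the symbol listings above can be reproduced with nm. A minimal sketch, assuming the kernel lives in kernel.cu and using an illustrative target arch:

# Compile the same TU with both compilers and compare the host-side symbols.
nvcc -c kernel.cu -o kernel_nvcc.o && nm -C kernel_nvcc.o | grep kernel
clang++ -x cuda --cuda-gpu-arch=sm_70 -c kernel.cu -o kernel_clang.o && nm -C kernel_clang.o | grep kernel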

Artem-B (Author) commented May 30, 2022

nvcc and clang do have somewhat different behavior under the hood and you're likely dealing with more than one issue here.

First, kernels and stubs. Clang indeed no longer emits kernel() on the host side: https://cuda.godbolt.org/z/833ejYs1c
The main reason for that was to make it possible to distinguish the host-side stub from the actual kernel function in the debugger on AMD GPUs. It does mean that CUDA objects compiled with clang will not be able to use <<<...>>> to launch kernels from objects compiled with nvcc. This is probably not a big deal, as mixing two different compilers in the same build is a bad idea to start with. I do not think that's the root cause of the problem you see.

NCCL does rely on RDC compilation (i.e. each source compiles to a GPU object file, instead of a fully linked GPU executable), and that part works very differently in clang vs nvcc. In a nutshell, the object files need an extra final device-linking step, plus a bit of extra host-side 'glue' code. NVCC does that under the hood; clang does not, yet.
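
To make that concrete, here is roughly the flow nvcc hides, spelled out by hand (a sketch; file names and arch are illustrative):

# Compile each source with relocatable device code.
nvcc -rdc=true -arch=sm_70 -c a.cu -o a.o
nvcc -rdc=true -arch=sm_70 -c b.cu -o b.o
# Extra device-link step: links the GPU code and emits the host-side 'glue'.
nvcc -dlink -arch=sm_70 a.o b.o -o dlink.o
# Final host link.
g++ a.o b.o dlink.o -L/usr/local/cuda/lib64 -lcudart -lcudadevrt -o app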

Here's how TensorFlow implements the cuda_rdc_library it uses to compile NCCL: https://github.com/tensorflow/tensorflow/blob/ee7cf722e4ca9d02b6e62eb3d1b7506ead995422/third_party/nccl/build_defs.bzl.tpl#L261

@jhuber6 has been working on clang driver changes to make compile-to-object-and-link "just work" on the GPU side, e.g. llvm/llvm-project@b7c8c4d
It's an experimental feature at the moment, but it will make things much simpler once it's ready.

Artem-B (Author) commented May 30, 2022

Also, I believe CMake has recently added support for clang as the CUDA compiler. It may be worth checking whether/how they handle RDC compilation there.

jhuber6 commented May 30, 2022

I think the expected way to perform RDC-mode compilation in CMake is via the CUDA_SEPARABLE_COMPILATION option, and judging by this issue it is supported for Clang as well.
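
Something like this should exercise it (a sketch; the cache variables are standard CMake, the project itself is hypothetical):

# Configure with clang as the CUDA compiler and separable (RDC) compilation on by default.
cmake -S . -B build -DCMAKE_CUDA_COMPILER=clang++ -DCMAKE_CUDA_SEPARABLE_COMPILATION=ON
cmake --build build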

Artem-B (Author) commented May 30, 2022

Judging by the commit that implemented it in CMake, they used the same RDC compilation process that we've implemented in TensorFlow and that I pointed to above.

You may as well just pick up tensorflow's implementation directly. @chsigg would probably be the most familiar with the details, if you have questions.

cloudhan (Collaborator) commented

OK, problem solved. It turns out that I need to compile all the C code of NCCL as CUDA with -x cu; otherwise there will be linker errors caused by __global__ functions. Also mentioned here.
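
That is, something like (a sketch; the path and arch are illustrative):

# Treat a C source as CUDA so that the host-side kernel-launch glue gets emitted.
clang++ -x cu --cuda-gpu-arch=sm_70 -c src/collectives.c -o collectives.o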

cloudhan (Collaborator) commented Jun 10, 2022

@Artem-B I think this is addressed in c13ebaa

Use --@rules_cuda//cuda:compiler=clang to select the compiler. See the detect_clang macro for auto-detection; otherwise you'll need to provide the toolchain config and registration yourself.
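
For example (the target label is illustrative):

# Build a target with clang selected as the CUDA compiler.
bazel build --@rules_cuda//cuda:compiler=clang //examples:basic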

As you are a member of the TensorFlow team, I am wondering if this can be mentioned or evaluated in TF. It might be good for the bazel community ;)

Artem-B (Author) commented Jun 10, 2022

@Artem-B I think this is addressed in c13ebaa

Do you mean that your build rules are ready to use?

I guess the right sequence to make it all work is to get these rules upstreamed into bazel, then port the TF build to them. I'll see what I can do to bring it to the attention of the right people.

cloudhan (Collaborator) commented

Not necessarily production ready, but at least usable. It needs more users to test it out before I can say it is production ready; because it is a build system, there are too many corner cases.

Artem-B (Author) commented Jun 10, 2022

One thing that could serve as motivation to adopt these changes would be to try getting TensorFlow to build using your rules instead of the ones TF carries. It would be a pretty decent test of the real-world usability of the rules -- TF is probably the largest bazel user outside of Google and is almost certainly the largest user of clang for CUDA compilation. Having them convinced would go a long way towards convincing the bazel owners that these rules should be part of bazel.

Having a proof of concept at that level would also give TF owners a rough idea of how much work it would take to adopt and whether it's worth it. One thing to keep in mind is that TF also has to work with our internal build. I don't know yet how hard it would be to switch to your rules. If it's a drop-in replacement for the cuda_library() implementation, it should be doable. NCCL and other RDC compilation users would need some more work, but that should be manageable, too.

cloudhan (Collaborator) commented

@Artem-B Do we have a prebuilt LLVM package with the NVPTX backend enabled? I'd like to add a build CI so that I can finally close this issue with confidence.

Artem-B (Author) commented Jul 12, 2022

LLVM/Clang releases should have NVPTX built in, e.g. https://github.com/llvm/llvm-project/releases/tag/llvmorg-14.0.6

On a side note, just an FYI that there have been a lot of offloading-related changes in the clang driver lately that are going to make GPU compilation much closer to C++ compilation. E.g. RDC compilation would "just work": clang -c a.cu -o a.o; clang -c b.cu -o b.o; clang -o app a.o b.o would do the job, with the compiler and linker taking care of the GPU-side linking. It's still a work in progress, but once it's ready it will simplify the build process quite a bit and make things like GPU-side LTO possible with clang.

jhuber6 commented Jul 12, 2022

If you want to try out the new driver, I would appreciate it. For compiling an application in RDC mode you can do the following:

clang++ a.cu b.cu --offload-new-driver -fgpu-rdc --offload-arch=sm_70 -c
clang++ a.o b.o --offload-link -lcudart

Right now, what's missing from the new driver is support for textures/surfaces, Windows/macOS support, and compiling in non-RDC mode. The benefits are simplified compilation, static library support, and LTO, among others.
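
For instance, device-side LTO is expected to look roughly like this with the new driver (a sketch; -foffload-lto is the relevant flag, other flags as above):

clang++ a.cu b.cu --offload-new-driver -fgpu-rdc -foffload-lto --offload-arch=sm_70 -c
clang++ a.o b.o --offload-link -lcudart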

jsharpe (Member) commented Nov 3, 2022

It'd be good to be able to load clang from https://github.com/grailbio/bazel-toolchain so that we can have a hermetic toolchain setup. I'll probably look into this at some point soon, as we're already using that toolchain for our host builds and will soon be using rules_cuda within one of our product builds.

cloudhan (Collaborator) commented Nov 5, 2022

The LLVM apt packages of clang are also built with NVPTX enabled; we can use those too.

cloudhan (Collaborator) commented Aug 9, 2023

This is partially fixed by #143. Later I will add a full integration test by adding nccl as an example. The cloudhan/nccl-example branch should be buildable with both clang and nvcc.

hypdeb (Contributor) commented Aug 10, 2023

Are there any flags I should add, besides maybe these:

build:clang --@rules_cuda//cuda:compiler='clang'
build:clang --@rules_cuda//cuda:archs=compute_61:compute_61,sm_61
build:clang --@rules_cuda//cuda:runtime=@local_cuda//:cuda_runtime_static
build:clang --@rules_cuda//cuda:copts='-stdlib=libc++'

in theory, for this to work? I'm having a weird issue: everything compiles fine, but then on execution it just dies without any output. Maybe I'm living a bit too close to the edge using clang 17 and CUDA 12.1? It does say it's only partially supported...

My whole setup is available here: https://github.com/hypdeb/lawrencium.

jhuber6 commented Aug 10, 2023

Are there any flags I should add, besides maybe these:

build:clang --@rules_cuda//cuda:compiler='clang'
build:clang --@rules_cuda//cuda:archs=compute_61:compute_61,sm_61
build:clang --@rules_cuda//cuda:runtime=@local_cuda//:cuda_runtime_static
build:clang --@rules_cuda//cuda:copts='-stdlib=libc++'

in theory, for this to work? I'm having a weird issue: everything compiles fine, but then on execution it just dies without any output. Maybe I'm living a bit too close to the edge using clang 17 and CUDA 12.1? It does say it's only partially supported...

At least, running something like this should definitely be supported. The only time I've seen errors like this in the past is when there's no supported architecture; it tends to just silently die, e.g. if I compile for sm_52 but have an sm_70 card. For executing a basic program I would expect something like the following to work:

$ clang -x cuda cuda.cpp --offload-arch=native -L/opt/cuda/lib -lcudart

If you're using RDC mode with clang you'll need to opt in:

$ clang -x cuda cuda.cpp --offload-arch=native -L/opt/cuda/lib -lcudart --offload-new-driver -fgpu-rdc

Using native should auto-detect which card you have installed, but it won't work if you're building on a different machine than the one you run on.
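
In that case, name the target architecture explicitly instead (the arch value here is illustrative):

clang -x cuda cuda.cpp --offload-arch=sm_70 -L/opt/cuda/lib -lcudart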

hypdeb (Contributor) commented Aug 10, 2023

Thanks for the extremely fast and detailed response. I just tried a few things based on your input, but no luck. I should add that I'm working on Ubuntu 22.04 in WSL 2, if that's relevant. The same code ran fine a few versions ago using nvcc.

jhuber6 commented Aug 10, 2023

Does the tool ./bin/nvptx-arch return anything? It's a pretty good litmus test for whether we're detecting the GPU and runtime correctly at all.

cloudhan (Collaborator) commented

@hypdeb Could you please try --repo_env=CC=clang? I am not sure how it goes if you mix gcc as the cc compiler and clang as the CUDA compiler.

jhuber6 commented Aug 10, 2023

@hypdeb Could you please try --repo_env=CC=clang? I am not sure how it goes if you mix gcc as the cc compiler and clang as the CUDA compiler.

If we're doing -fno-gpu-rdc (the default) I would expect it to work, because all the CUDA-specific handling is done per-TU. But it's worth a shot.

hypdeb (Contributor) commented Aug 10, 2023

@hypdeb Could you please try --repo_env=CC=clang? I am not sure how it goes if you mix gcc as the cc compiler and clang as the CUDA compiler.

I am already using clang as my cc compiler: https://github.com/hypdeb/lawrencium/blob/1694b0f1707d2bc6d2a782a734749ae1c1379336/toolchain/cc_toolchain_config.bzl#L24

Does the tool ./bin/nvptx-arch return anything? It's a pretty good litmus test for whether we're detecting the GPU and runtime correctly at all.

nvptx-arch returns sm_75. I tried with the --offload-arch=native flag too, without success.

llc -version returns:

    nvptx       - NVIDIA PTX 32-bit
    nvptx64     - NVIDIA PTX 64-bit

among many others.
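
For reference, a quick way to list just the NVPTX entries, assuming llc is on the PATH:

llc --version | grep nvptx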

hypdeb (Contributor) commented Aug 10, 2023

Here are the exact commands run by Bazel:

/usr/local/llvm/bin/clang \
    -x cu \
    '--cuda-path=/usr/local/cuda-12.1' \
    '-frandom-seed=bazel-out/amd64-fastbuild/bin/src/cuda/_objs/thrust_cu/thrust.o' \
    -iquote . \
    -iquote bazel-out/amd64-fastbuild/bin \
    -iquote external/local_cuda \
    -iquote bazel-out/amd64-fastbuild/bin/external/local_cuda \
    -isystem external/local_cuda/cuda/include \
    -isystem bazel-out/amd64-fastbuild/bin/external/local_cuda/cuda/include \
    -U_FORTIFY_SOURCE \
    -fstack-protector \
    -Wall \
    -Wthread-safety \
    -Wself-assign \
    -Wunused-but-set-parameter \
    -Wno-free-nonheap-object \
    -fcolor-diagnostics \
    -fno-omit-frame-pointer \
    '-stdlib=libc++' \
    '--offload-arch=native' \
    -c src/cuda/thrust.cu \
    -o bazel-out/amd64-fastbuild/bin/src/cuda/_objs/thrust_cu/thrust.o \
    -fPIC

and then

/usr/local/llvm/bin/clang -o bazel-out/amd64-fastbuild/bin/src/cuda/thrust_main \
    bazel-out/amd64-fastbuild/bin/src/cuda/libthrust_cu.a \
    external/local_cuda/cuda/lib64/libcudart_static.a \
    external/local_cuda/cuda/lib64/libcudadevrt.a \
    -ldl -lpthread -lrt \
    -Wl,-S \
    '-std=c++23' \
    '-stdlib=libc++' \
    '-fuse-ld=lld' \
    -lc++ -lc++abi \
    -static -lm \
    -no-canonical-prefixes \
    -L/usr/local/llvm/lib

hypdeb (Contributor) commented Aug 10, 2023

I think it's unlikely the issue is with rules_cuda at this point, to be honest, and I don't want to pollute this thread too much. I should probably try rebuilding my whole environment from scratch; it's possible I screwed something up along the way, as it was a long journey :D Thanks for the amazing rules, by the way; they work great with nvcc :)

cloudhan (Collaborator) commented Sep 4, 2023

Closed with #158.

cloudhan closed this as completed Sep 4, 2023