
Support clang as a CUDA compiler #4

Closed
Artem-B opened this issue May 19, 2022 · 27 comments

Artem-B commented May 19, 2022

Clang is capable of CUDA compilation these days.

It would be great to add support for using it for CUDA compilation with bazel.

cloudhan (Collaborator) commented

Yeah, there is a plan. I have local modifications, but they haven't made it in yet.

cloudhan (Collaborator) commented

@Artem-B Considering you are one of the devs of clang CUDA, I have a question about it. I now have a nearly working configuration for clang. The only problem I am facing is, for example:

__global__ void kernel() {
  // blahblah with impl
}

with nvcc, the compiled object has the symbols

000000000000019a T __device_stub__Z6kernelv()
00000000000002aa T kernel()

but with clang version 15.0.0 (https://github.com/llvm/llvm-project.git 009d56da5c4ea3666c4753ce7564c8c20d7e0255)

0000000000000000 T __device_stub__kernel()

the kernel() symbol is missing from the object, which causes linker errors for various __global__ functions in my NCCL example. In that example, a libnccl.so is produced first and then linked into various perf test binaries. I am curious what the root problem is here.
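
For reference, the symbol listings above can be reproduced with nm. A minimal sketch, assuming the kernel lives in kernel.cu and using an illustrative target arch:

# Compile the same TU with both compilers and compare the host-side symbols.
nvcc -c kernel.cu -o kernel_nvcc.o && nm -C kernel_nvcc.o | grep kernel
clang++ -x cuda --cuda-gpu-arch=sm_70 -c kernel.cu -o kernel_clang.o && nm -C kernel_clang.o | grep kernel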

Artem-B (Author) commented May 30, 2022

nvcc and clang do have somewhat different behavior under the hood and you're likely dealing with more than one issue here.

First, kernels and stubs. Clang indeed no longer emits kernel() on the host side: https://cuda.godbolt.org/z/833ejYs1c
The main reason for that was to make it possible to distinguish the host-side stub from the actual kernel function in the debugger on AMD GPUs. It does mean that CUDA objects compiled with clang will not be able to use <<<...>>> to launch kernels from objects compiled with nvcc. This is probably not a big deal, as mixing two different compilers in the same build is a bad idea to start with. I do not think that's the root cause of the problem you see.

NCCL does rely on RDC compilation (i.e. each source compiles to a GPU object file, instead of a fully linked GPU executable), and that part works very differently in clang vs nvcc. In a nutshell, the object files need an extra final device-linking step, plus a bit of extra host-side 'glue' code. NVCC does that under the hood; clang does not, yet.
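
To make that concrete, here is roughly the flow nvcc hides, spelled out by hand (a sketch; file names and arch are illustrative):

# Compile each source with relocatable device code.
nvcc -rdc=true -arch=sm_70 -c a.cu -o a.o
nvcc -rdc=true -arch=sm_70 -c b.cu -o b.o
# Extra device-link step: links the GPU code and emits the host-side 'glue'.
nvcc -dlink -arch=sm_70 a.o b.o -o dlink.o
# Final host link.
g++ a.o b.o dlink.o -L/usr/local/cuda/lib64 -lcudart -lcudadevrt -o app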

Here's how TensorFlow implements the cuda_rdc_library it uses to compile NCCL: https://github.com/tensorflow/tensorflow/blob/ee7cf722e4ca9d02b6e62eb3d1b7506ead995422/third_party/nccl/build_defs.bzl.tpl#L261

@jhuber6 has been working on clang driver changes to make compile-to-object-and-link "just work" on the GPU side, e.g. llvm/llvm-project@b7c8c4d
It's an experimental feature at the moment, but it will make things much simpler once it's ready.

Artem-B (Author) commented May 30, 2022

Also, I believe CMake has recently added support for clang as the CUDA compiler. It may be worth checking whether/how they handle RDC compilation there.

jhuber6 commented May 30, 2022

I think the expected way to perform RDC-mode compilation in CMake is via the CUDA_SEPARABLE_COMPILATION option, and judging by this issue it is supported for Clang as well.
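
Something like this should exercise it (a sketch; the cache variables are standard CMake, the project itself is hypothetical):

# Configure with clang as the CUDA compiler and separable (RDC) compilation on by default.
cmake -S . -B build -DCMAKE_CUDA_COMPILER=clang++ -DCMAKE_CUDA_SEPARABLE_COMPILATION=ON
cmake --build build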

Artem-B (Author) commented May 30, 2022

Judging by the commit that implemented it in CMake, they used the same RDC compilation process that we've implemented in TensorFlow and that I pointed to above.

You may as well just pick up tensorflow's implementation directly. @chsigg would probably be the most familiar with the details, if you have questions.

cloudhan (Collaborator) commented

OK, problem solved. It turns out that I need to compile all the C code of NCCL as CUDA with -x cu; otherwise there will be linker errors caused by __global__ functions. Also mentioned here.
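
That is, something like (a sketch; the path and arch are illustrative):

# Treat a C source as CUDA so that the host-side kernel-launch glue gets emitted.
clang++ -x cu --cuda-gpu-arch=sm_70 -c src/collectives.c -o collectives.o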

cloudhan (Collaborator) commented Jun 10, 2022

@Artem-B I think this is addressed in c13ebaa

Use --@rules_cuda//cuda:compiler=clang to select the compiler. See the detect_clang macro for auto-detection; otherwise you'll need to provide the toolchain config and registration yourself.
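
For example (the target label is illustrative):

# Build a target with clang selected as the CUDA compiler.
bazel build --@rules_cuda//cuda:compiler=clang //examples:basic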

As you are a member of the TensorFlow team, I am wondering if this can be mentioned or evaluated in TF. It might be good for the bazel community ;)

Artem-B (Author) commented Jun 10, 2022

@Artem-B I think this is addressed in c13ebaa

Do you mean that your build rules are ready to use?

I guess the right sequence to make it all work is to get these rules upstreamed into bazel, then port the TF build to them. I'll see what I can do to bring it to the attention of the right people.

cloudhan (Collaborator) commented

Not necessarily production ready, but at least usable. It needs more users to test it out before I can say it is production ready; because it is a build system, there are too many corner cases.

Artem-B (Author) commented Jun 10, 2022

One thing that could serve as motivation to adopt these changes would be to try getting TensorFlow to build using your rules instead of the ones TF carries. It would be a pretty decent test of the real-world usability of the rules -- TF is probably the largest bazel user outside of Google and is almost certainly the largest user of clang for CUDA compilation. Having them convinced would go a long way towards convincing the bazel owners that these rules should be part of bazel.

Having a proof of concept at that level would also give TF owners a rough idea of how much work it would take to adopt and whether it's worth it. One thing to keep in mind is that TF also has to work with our internal build. I don't know yet how hard it would be to switch to your rules. If it's a drop-in replacement for the cuda_library() implementation, it should be doable. NCCL and other RDC compilation users would need some more work, but that should be manageable, too.

cloudhan (Collaborator) commented

@Artem-B Do we have a prebuilt LLVM package with the NVPTX backend enabled? I'd like to add a build CI so that I can finally close this issue with confidence.

Artem-B (Author) commented Jul 12, 2022

LLVM/Clang releases should have NVPTX built in, e.g. https://github.com/llvm/llvm-project/releases/tag/llvmorg-14.0.6

On a side note, just an FYI that there have been a lot of offloading-related changes in the clang driver lately that are going to make GPU compilation much closer to C++ compilation. E.g. RDC compilation would "just work": clang -c a.cu -o a.o; clang -c b.cu -o b.o; clang -o app a.o b.o would do the job, with the compiler and linker taking care of the GPU-side linking. It's still a work in progress, but once it's ready it will simplify the build process quite a bit and make things like GPU-side LTO possible with clang.

jhuber6 commented Jul 12, 2022

If you want to try out the new driver, I would appreciate it. For compiling an application in RDC mode you can do the following:

clang++ a.cu b.cu --offload-new-driver -fgpu-rdc --offload-arch=sm_70 -c
clang++ a.o b.o --offload-link -lcudart

Right now, what's missing from the new driver is support for textures/surfaces, Windows/macOS support, and compiling in non-RDC mode. The benefits are simplified compilation, static library support, and LTO, among others.
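
For instance, device-side LTO is expected to look roughly like this with the new driver (a sketch; -foffload-lto is the relevant flag, other flags as above):

clang++ a.cu b.cu --offload-new-driver -fgpu-rdc -foffload-lto --offload-arch=sm_70 -c
clang++ a.o b.o --offload-link -lcudart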

jsharpe (Member) commented Nov 3, 2022

It'd be good to be able to load clang from https://github.com/grailbio/bazel-toolchain so that we can have a hermetic toolchain setup. I'll probably look into this at some point soon, as we're already using that toolchain for our host builds and will soon be using rules_cuda within one of our product builds.

cloudhan (Collaborator) commented Nov 5, 2022

The LLVM apt packages of clang are also built with NVPTX enabled; we can use those too.

cloudhan (Collaborator) commented Aug 9, 2023

This is partially fixed by #143. Later I will add a full integration test by adding nccl as an example. The cloudhan/nccl-example branch should be buildable with both clang and nvcc.

hypdeb (Contributor) commented Aug 10, 2023

Are there any flags I should add, besides maybe these:

build:clang --@rules_cuda//cuda:compiler='clang'
build:clang --@rules_cuda//cuda:archs=compute_61:compute_61,sm_61
build:clang --@rules_cuda//cuda:runtime=@local_cuda//:cuda_runtime_static
build:clang --@rules_cuda//cuda:copts='-stdlib=libc++'

in theory, for this to work? I'm having a weird issue: everything compiles fine, but then on execution it just dies without any output. Maybe I'm living a bit too close to the edge using clang 17 and CUDA 12.1? It does say it's only partially supported...

My whole setup is available here: https://github.com/hypdeb/lawrencium.

jhuber6 commented Aug 10, 2023

Are there any flags I should add, besides maybe these:

build:clang --@rules_cuda//cuda:compiler='clang'
build:clang --@rules_cuda//cuda:archs=compute_61:compute_61,sm_61
build:clang --@rules_cuda//cuda:runtime=@local_cuda//:cuda_runtime_static
build:clang --@rules_cuda//cuda:copts='-stdlib=libc++'

in theory, for this to work? I'm having a weird issue: everything compiles fine, but then on execution it just dies without any output. Maybe I'm living a bit too close to the edge using clang 17 and CUDA 12.1? It does say it's only partially supported...

At least, running something like this should definitely be supported. The only time I've seen errors like this in the past is when there's no supported architecture; it tends to just silently die, e.g. if I compile for sm_52 but have an sm_70 card. For executing a basic program I would expect something like the following to work:

$ clang -x cuda cuda.cpp --offload-arch=native -L/opt/cuda/lib -lcudart

If you're using RDC mode with clang you'll need to opt in:

$ clang -x cuda cuda.cpp --offload-arch=native -L/opt/cuda/lib -lcudart --offload-new-driver -fgpu-rdc

Using native should auto-detect which card you have installed, but it won't work if you're building on a different machine than the one you run on.
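
In that case, name the target architecture explicitly instead (the arch value here is illustrative):

clang -x cuda cuda.cpp --offload-arch=sm_70 -L/opt/cuda/lib -lcudart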

hypdeb (Contributor) commented Aug 10, 2023

Thanks for the extremely fast and detailed response. I just tried a few things based on your input, but no luck. I should add that I'm working on Ubuntu 22.04 in WSL 2, if that's relevant. The same code ran fine a few versions ago using nvcc.

jhuber6 commented Aug 10, 2023

Does the tool ./bin/nvptx-arch return anything? It's a pretty good litmus test for whether we're detecting the GPU and runtime correctly at all.

cloudhan (Collaborator) commented

@hypdeb Could you please try --repo_env=CC=clang? I am not sure how it goes if you mix gcc as the cc compiler and clang as the CUDA compiler.

jhuber6 commented Aug 10, 2023

@hypdeb Could you please try --repo_env=CC=clang? I am not sure how it goes if you mix gcc as the cc compiler and clang as the CUDA compiler.

If we're doing -fno-gpu-rdc (the default) I would expect it to work, because all the CUDA-specific handling is done per-TU. But it's worth a shot.

hypdeb (Contributor) commented Aug 10, 2023

@hypdeb Could you please try --repo_env=CC=clang? I am not sure how it goes if you mix gcc as the cc compiler and clang as the CUDA compiler.

I am already using clang as my cc compiler: https://github.com/hypdeb/lawrencium/blob/1694b0f1707d2bc6d2a782a734749ae1c1379336/toolchain/cc_toolchain_config.bzl#L24

Does the tool ./bin/nvptx-arch return anything? It's a pretty good litmus test for whether we're detecting the GPU and runtime correctly at all.

nvptx-arch returns sm_75. I tried with the --offload-arch=native flag too, without success.

llc -version returns:

    nvptx       - NVIDIA PTX 32-bit
    nvptx64     - NVIDIA PTX 64-bit

among many others.
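
For reference, a quick way to list just the NVPTX entries, assuming llc is on the PATH:

llc --version | grep nvptx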

hypdeb (Contributor) commented Aug 10, 2023

Here are the exact commands run by Bazel:

/usr/local/llvm/bin/clang \
    -x cu \
    '--cuda-path=/usr/local/cuda-12.1' \
    '-frandom-seed=bazel-out/amd64-fastbuild/bin/src/cuda/_objs/thrust_cu/thrust.o' \
    -iquote . \
    -iquote bazel-out/amd64-fastbuild/bin \
    -iquote external/local_cuda \
    -iquote bazel-out/amd64-fastbuild/bin/external/local_cuda \
    -isystem external/local_cuda/cuda/include \
    -isystem bazel-out/amd64-fastbuild/bin/external/local_cuda/cuda/include \
    -U_FORTIFY_SOURCE \
    -fstack-protector \
    -Wall \
    -Wthread-safety \
    -Wself-assign \
    -Wunused-but-set-parameter \
    -Wno-free-nonheap-object \
    -fcolor-diagnostics \
    -fno-omit-frame-pointer \
    '-stdlib=libc++' \
    '--offload-arch=native' \
    -c src/cuda/thrust.cu \
    -o bazel-out/amd64-fastbuild/bin/src/cuda/_objs/thrust_cu/thrust.o \
    -fPIC

and then

/usr/local/llvm/bin/clang -o bazel-out/amd64-fastbuild/bin/src/cuda/thrust_main \
    bazel-out/amd64-fastbuild/bin/src/cuda/libthrust_cu.a \
    external/local_cuda/cuda/lib64/libcudart_static.a \
    external/local_cuda/cuda/lib64/libcudadevrt.a \
    -ldl -lpthread -lrt \
    -Wl,-S \
    '-std=c++23' \
    '-stdlib=libc++' \
    '-fuse-ld=lld' \
    -lc++ -lc++abi \
    -static -lm \
    -no-canonical-prefixes \
    -L/usr/local/llvm/lib

hypdeb (Contributor) commented Aug 10, 2023

I think it's unlikely the issue is with rules_cuda at this point, to be honest, and I don't want to pollute this thread too much. I should probably try rebuilding my whole environment from scratch; it's possible I screwed something up along the way, as it was a long journey :D Thanks for the amazing rules, by the way; they work great with nvcc :)

cloudhan (Collaborator) commented Sep 4, 2023

Closed with #158.

cloudhan closed this as completed Sep 4, 2023