
ORT GPU build #5622

Merged: 31 commits merged into alisw:master on Jan 22, 2025
Conversation

ChSonnabend
Collaborator

This is a draft PR to discuss possible changes to onnxruntime.sh for GPU builds on the EPNs and potentially CUDA (to be tested).

@ChSonnabend
Collaborator Author

Ping @davidrohr

@davidrohr (Contributor) left a comment

You should also include a few environment variables that we use in o2.sh and handle them analogously: https://github.com/alisw/alidist/blob/1916f6d88d42959097998d9481b517dc1c1ea84d/o2.sh#L191C9-L191C30

  • ALIBUILD_O2_FORCE_GPU
  • DISABLE_GPU
  • ALIBUILD_ENABLE_CUDA
  • ALIBUILD_ENABLE_HIP
  • ALIBUILD_O2_OVERRIDE_HIP_ARCHS
  • ALIBUILD_O2_OVERRIDE_CUDA_ARCHS

If ENABLE_CUDA or ENABLE_HIP is set, the build should fail if it cannot build CUDA/HIP.
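A minimal sketch of how onnxruntime.sh could honour the explicit-enable variables and fail hard when the requested toolchain is missing (an assumption about the intended behaviour, not the final recipe):

# Sketch only: if CUDA/HIP was explicitly requested, abort instead of silently building without it.
if [ -n "$ALIBUILD_ENABLE_CUDA" ] && ! command -v nvcc >/dev/null 2>&1; then
  echo "ALIBUILD_ENABLE_CUDA is set but nvcc was not found" >&2
  exit 1
fi
if [ -n "$ALIBUILD_ENABLE_HIP" ] && [ ! -x /opt/rocm/bin/hipcc ]; then
  echo "ALIBUILD_ENABLE_HIP is set but /opt/rocm/bin/hipcc was not found" >&2
  exit 1
fi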

onnxruntime.sh Outdated
"
elif command -v nvcc >/dev/null 2>&1; then
CUDA_VERSION=$(nvcc --version | grep "release" | awk '{print $NF}' | cut -d. -f1)
if [[ "$CUDA_VERSION" == "V11" ]]; then
Contributor

I think you can drop CUDA 11 and just assume >=12.
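A minimal sketch of such a check, assuming only the major release number matters (the parsing is not the PR's final code):

# Sketch only: require CUDA >= 12, fail otherwise.
CUDA_MAJOR=$(nvcc --version | sed -n 's/.*release \([0-9][0-9]*\)\..*/\1/p')
if [ "${CUDA_MAJOR:-0}" -lt 12 ]; then
  echo "CUDA >= 12 is required, found major version '${CUDA_MAJOR:-unknown}'" >&2
  exit 1
fi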

onnxruntime.sh Outdated
ORT_BUILD_FLAGS=""
case $ARCHITECTURE in
osx_*)
if [[ $ARCHITECTURE == *_x86-64 ]]; then
Contributor

I would drop printouts like these; they are mainly for debugging.

Collaborator Author

Yes, but I assume there is also a macOS build that targets the Mac GPU. I still need to dig around a bit; the if block could then be used to put the build flags in there. But yes, of course I will remove the printouts at the end.

onnxruntime.sh Outdated
fi
;;
*)
if command -v rocminfo >/dev/null 2>&1; then
Contributor

  • the ROCm version check is missing
  • It is not clear whether rocminfo is in the path. You should at least test /opt/rocm/bin/rocminfo. Also, MIGraphX is a separate ROCm package, i.e. rocminfo being present does not mean that MIGraphX is present. You should test explicitly for MIGraphX.
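A minimal sketch of those checks; the MIGraphX header location is an assumption, not the PR's final code:

# Sketch only: fall back to the default ROCm location and test for MIGraphX explicitly.
ROCMINFO="$(command -v rocminfo || echo /opt/rocm/bin/rocminfo)"
if [ -x "$ROCMINFO" ] && [ -d /opt/rocm/include/migraphx ]; then  # assumed MIGraphX header location
  ORT_MIGRAPHX_BUILD=1
else
  ORT_MIGRAPHX_BUILD=0
fi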

Collaborator Author

Good point, I'll check that again.

onnxruntime.sh Outdated
ORT_BUILD_FLAGS=" -Donnxruntime_USE_CUDA=ON \
-DCUDA_TOOLKIT_ROOT_DIR=$CUDA_ROOT \
-Donnxruntime_USE_CUDA_NHWC_OPS=ON \
-Donnxruntime_CUDA_USE_TENSORRT=ON \
Contributor

If you use TensorRT, do you then have to check whether it is explicitly installed? Or does it always come with the CUDA SDK?

Collaborator Author

It does not seem to come automatically (https://docs.nvidia.com/deeplearning/tensorrt/install-guide/index.html)... OK, I'll add a check for that as well.
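One way such a check might look; the library name and search paths are assumptions, not the PR's final code:

# Sketch only: enable TensorRT only if libnvinfer can be found in a few common locations.
if find /usr/local/cuda* /usr/lib64 /usr/lib/x86_64-linux-gnu -name "libnvinfer.so*" -print -quit 2>/dev/null | grep -q .; then
  ORT_TENSORRT_BUILD=1
else
  ORT_TENSORRT_BUILD=0
fi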

onnxruntime.sh Outdated
-Donnxruntime_USE_CUDA_NHWC_OPS=ON \
-Donnxruntime_CUDA_USE_TENSORRT=ON \
"
elif [[ "$CUDA_VERSION" == "V12" ]]; then
Contributor

What if ROCm and CUDA are both present? Can't we then build both?

@ChSonnabend (Collaborator Author) commented Sep 17, 2024

@ktf
Member

ktf commented Sep 17, 2024

Exactly...

…adding env-variables for GPU enabling during code execution. For al9_gpu container and simultaneous CUDA & ROCm build, this requires ChSonnabend/onnxruntime@6ffc40c
…e build with CUDA and ROCm fails due to a ROCm internal check for THRUST and CUB libraries, which are not in sync (file: /opt/rocm/include/thrust/system/cuda/config.h)
onnxruntime.sh Outdated
export ORT_MIGRAPHX_BUILD=0
fi
### TensorRT
if [ "$ORT_CUDA_BUILD" -eq 1 ] && [ $(find /opt/rocm* -name "libnvinfer*" -print -quit | wc -l 2>&1) -eq 1 ]; then
Contributor

Why do you search for TensorRT in /opt/rocm?

Collaborator Author

😅

onnxruntime.sh Outdated
-DCMAKE_CXX_FLAGS="$CXXFLAGS -Wno-unknown-warning -Wno-unknown-warning-option -Wno-error=unused-but-set-variable -Wno-error=deprecated" \
-DCMAKE_C_FLAGS="$CFLAGS -Wno-unknown-warning -Wno-unknown-warning-option -Wno-error=unused-but-set-variable -Wno-error=deprecated"
# Check ROCm build conditions
if { [ "$ALIBUILD_O2_FORCE_GPU" -ne 0 ] || [ "$ALIBUILD_ENABLE_HIP" -ne 0 ] || command -v rocminfo >/dev/null 2>&1; } && \
Contributor

Is it guaranteed that rocminfo is in the path? Perhaps try rocminfo or /opt/rocm/bin/rocminfo.
Also, perhaps it needs HIP, not rocminfo, so perhaps check for /opt/rocm/bin/hipcc?
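A minimal sketch of that fallback, assuming the standard /opt/rocm prefix:

# Sketch only: prefer hipcc from PATH, fall back to the default ROCm install location.
HIPCC="$(command -v hipcc || echo /opt/rocm/bin/hipcc)"
if [ -x "$HIPCC" ]; then
  HAVE_HIP=1
else
  HAVE_HIP=0
fi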

onnxruntime.sh Outdated
-Donnxruntime_CUDA_HOME=/usr/local/cuda \
-DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ \
-D__HIP_PLATFORM_AMD__=1 \
-DCMAKE_HIP_ARCHITECTURES=gfx906,gfx908 \
Contributor

I don't understand this. You first set CMAKE_HIP_ARCHITECTURES, and then possibly override it in the next line? Why don't you expand ALIBUILD_O2_OVERRIDE_HIP_ARCHS to the defaults if it is empty, i.e. ${...:-default}?
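A minimal sketch of the suggested pattern; the default list gfx906,gfx908 is taken from the snippet under review:

# Sketch only: let the override expand to the defaults when it is unset or empty.
HIP_ARCHS="${ALIBUILD_O2_OVERRIDE_HIP_ARCHS:-gfx906,gfx908}"
ORT_BUILD_FLAGS="$ORT_BUILD_FLAGS -DCMAKE_HIP_ARCHITECTURES=$HIP_ARCHS"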

onnxruntime.sh Outdated
# Check CUDA build conditions
if { [ "$ALIBUILD_O2_FORCE_GPU" -ne 0 ] || [ "$ALIBUILD_ENABLE_CUDA" -ne 0 ] || command -v nvcc >/dev/null 2>&1; } && \
{ [ -z "$DISABLE_GPU" ] || [ "$DISABLE_GPU" -eq 0 ]; }; then
export ORT_CUDA_BUILD=1
Contributor

I would also set a default for ALIBUILD_O2_OVERRIDE_CUDA_ARCHS to the sm_86 or sm_89 architecture for now.
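Analogous to the HIP case, a minimal sketch with sm_86 as the assumed default (CMake's CMAKE_CUDA_ARCHITECTURES takes the bare compute capability):

# Sketch only: default the CUDA architecture override to sm_86.
CUDA_ARCHS="${ALIBUILD_O2_OVERRIDE_CUDA_ARCHS:-86}"
ORT_BUILD_FLAGS="$ORT_BUILD_FLAGS -DCMAKE_CUDA_ARCHITECTURES=$CUDA_ARCHS"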

@ChSonnabend ChSonnabend marked this pull request as ready for review November 22, 2024 21:43
@ChSonnabend ChSonnabend requested a review from a team as a code owner November 22, 2024 21:43
@davidrohr (Contributor) left a comment

if [ "$ALIBUILD_O2_FORCE_GPU" -eq 1 ]
will not work if the variable is not defined.
You can either do [ "0$FOO" == "01" ] or use the bash [[ syntax. Please try with all variables undefined.
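A minimal sketch of both workarounds; with the variable undefined, the plain [ "$VAR" -eq 1 ] test emits a "unary operator expected" error, while these forms do not:

# Sketch only: both forms behave sanely when ALIBUILD_O2_FORCE_GPU is undefined.
if [ "0$ALIBUILD_O2_FORCE_GPU" == "01" ]; then
  echo "GPU build forced (string comparison)"
fi
if [[ "$ALIBUILD_O2_FORCE_GPU" -eq 1 ]]; then
  echo "GPU build forced (bash [[ ]] test)"
fi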

@ChSonnabend ChSonnabend requested a review from a team as a code owner November 28, 2024 13:29
@ChSonnabend
Collaborator Author

ChSonnabend commented Dec 2, 2024

The macOS failures seem unrelated to this PR. It can be merged from my side if there are no objections (@ktf).

@ChSonnabend
Collaborator Author

The PR is ready from my side. The GPU build should be disabled on the EPNs due to missing ROCm packages and the AlmaLinux major version mismatch (8 instead of 9).

@singiamtel
Collaborator

@pzhristov just FYI, this would leave our ONNX version outside the tf2onnx compatibility range: https://github.com/onnx/tensorflow-onnx?tab=readme-ov-file#tf2onnx---convert-tensorflow-keras-tensorflowjs-and-tflite-models-to-onnx

@ChSonnabend
Collaborator Author

> @pzhristov just FYI, this would leave our ONNX version outside the tf2onnx compatibility range: https://github.com/onnx/tensorflow-onnx?tab=readme-ov-file#tf2onnx---convert-tensorflow-keras-tensorflowjs-and-tflite-models-to-onnx

Why actually? If you refer to the opset number: I think the ONNX opset number does not necessarily correlate with the ONNX version (I might be wrong here). Does it actually break something for the converter?

@ChSonnabend
Collaborator Author

The macOS-arm build failure seems unrelated. Can this PR be merged? @singiamtel @ktf

o2.sh Outdated
@@ -139,6 +139,12 @@ valid_defaults:
#!/bin/sh
export ROOTSYS=$ROOT_ROOT

source $ONNXRUNTIME_ROOT/etc/ort-init.sh
Member

This should be made optional via $ONNXRUNTIME_REVISION.
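A minimal sketch of such a guard, assuming a non-empty $ONNXRUNTIME_REVISION indicates that ONNXRuntime is part of the build:

# Sketch only: source the init script only when ONNXRuntime is actually built.
if [ -n "$ONNXRUNTIME_REVISION" ]; then
  source "$ONNXRUNTIME_ROOT/etc/ort-init.sh"
fi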

onnxruntime.sh Outdated
[[ -d /opt/rocm/include/rccl ]] && \
[[ -z "$ORT_ROCM_BUILD" ]] ) && \
([[ -z "$ALMA_LINUX_MAJOR_VERSION" ]] || [[ "$ALMA_LINUX_MAJOR_VERSION" -eq 9 ]]); then
export ORT_ROCM_BUILD=1
Member

Please fix the linter comments.

@ktf
Member

ktf commented Jan 17, 2025

It seems to me that a number of old comments / linter requests have not been taken into account. Please have a look.

If you are ready to integrate, can you change the name of the PR to "ORT GPU build" as well?

I might have other comments as well.

@ChSonnabend ChSonnabend changed the title Draft of ORT GPU build ORT GPU build Jan 17, 2025
@ChSonnabend
Collaborator Author

ChSonnabend commented Jan 22, 2025

The macOS failure seems unrelated. Can we merge? (@ktf)

@ktf ktf merged commit 54466f4 into alisw:master Jan 22, 2025
11 of 12 checks passed
ktf added a commit that referenced this pull request Jan 22, 2025
@ktf ktf mentioned this pull request Jan 22, 2025
ktf added a commit that referenced this pull request Jan 22, 2025