Docker gpu mode not working #19

Open
7Koris opened this issue Oct 28, 2024 · 3 comments

7Koris commented Oct 28, 2024

I tested on a server with an A30 GPU and a laptop with an RTX 3060.
I believe I followed all steps in the setup guide.

docker run -t --rm --gpus all -v  /home/koris/BERTax/in:/in/ fkre/bertax:latest /in/fungi1000.fa
2024-10-28 02:51:58.592252: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2024-10-28 02:52:01.872815: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2024-10-28 02:52:01.874382: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2024-10-28 02:52:02.129207: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:927] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-10-28 02:52:02.129421: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 3060 Laptop GPU computeCapability: 8.6
coreClock: 1.425GHz coreCount: 30 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 312.97GiB/s
2024-10-28 02:52:02.129502: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2024-10-28 02:52:02.198718: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2024-10-28 02:52:02.198855: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2024-10-28 02:52:02.233217: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2024-10-28 02:52:02.240395: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2024-10-28 02:52:02.291158: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2024-10-28 02:52:02.307438: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2024-10-28 02:52:02.400190: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2024-10-28 02:52:02.400936: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:927] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-10-28 02:52:02.401359: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:927] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-10-28 02:52:02.401504: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2024-10-28 02:52:02.402187: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-28 02:52:02.410838: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:927] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-10-28 02:52:02.411007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 3060 Laptop GPU computeCapability: 8.6
coreClock: 1.425GHz coreCount: 30 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 312.97GiB/s
2024-10-28 02:52:02.411152: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2024-10-28 02:52:02.411276: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2024-10-28 02:52:02.411343: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2024-10-28 02:52:02.411423: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2024-10-28 02:52:02.411487: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2024-10-28 02:52:02.411523: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2024-10-28 02:52:02.411581: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2024-10-28 02:52:02.411616: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2024-10-28 02:52:02.412251: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:927] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-10-28 02:52:02.412716: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:927] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-10-28 02:52:02.412760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2024-10-28 02:52:02.413368: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2024-10-28 03:00:04.387965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2024-10-28 03:00:04.388080: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
2024-10-28 03:00:04.388143: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N 
2024-10-28 03:00:04.389542: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:927] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-10-28 03:00:04.389596: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1489] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2024-10-28 03:00:04.390013: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:927] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-10-28 03:00:04.390372: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:927] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-10-28 03:00:04.390625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4678 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6)
2024-10-28 03:00:04.392831: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
WARNING:tensorflow:AutoGraph could not transform <bound method PositionEmbedding.call of <keras_pos_embd.pos_embd.PositionEmbedding object at 0x7f4f0816b5b0>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Index'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <bound method MultiHeadAttention.call of <keras_multi_head.multi_head_attention.MultiHeadAttention object at 0x7f4f0816b760>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Index'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <bound method ScaledDotProductAttention.call of <keras_self_attention.scaled_dot_attention.ScaledDotProductAttention object at 0x7f4ea8136f10>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Index'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <bound method Extract.call of <keras_bert.layers.extract.Extract object at 0x7f4f08093820>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Index'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
2024-10-28 03:00:09.215262: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2024-10-28 03:00:09.219918: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3193910000 Hz
2024-10-28 03:00:12.537181: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2024-10-28 03:02:12.883371: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cwise_op_gpu_base.cc:89 : Internal: Failed to load in-memory CUBIN: CUDA_ERROR_NO_BINARY_FOR_GPU: no kernel image is available for execution on the device
Traceback (most recent call last):
  File "/opt/conda/bin/bertax", line 10, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.9/site-packages/bertax/bertax.py", line 112, in main
    preds = model.predict(x, verbose=int(args.verbose), batch_size=args.batch_size)
  File "/opt/conda/lib/python3.9/site-packages/tensorflow/python/keras/engine/training.py", line 1629, in predict
    tmp_batch_outputs = self.predict_function(iterator)
  File "/opt/conda/lib/python3.9/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/opt/conda/lib/python3.9/site-packages/tensorflow/python/eager/def_function.py", line 894, in _call
    return self._concrete_stateful_fn._call_flat(
  File "/opt/conda/lib/python3.9/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/opt/conda/lib/python3.9/site-packages/tensorflow/python/eager/function.py", line 555, in call
    outputs = execute.execute(
  File "/opt/conda/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError:  Failed to load in-memory CUBIN: CUDA_ERROR_NO_BINARY_FOR_GPU: no kernel image is available for execution on the device
         [[node model/Encoder-1-FeedForward/Tanh (defined at /lib/python3.9/site-packages/keras_transformer/gelu.py:11) ]] [Op:__inference_predict_function_12441]

Function call stack:
predict_function

7Koris (Author) commented Oct 28, 2024

Of note, the server doesn't show the NUMA warnings, but the output was otherwise identical (the log above is from a WSL instance).

flomock (Collaborator) commented Oct 31, 2024

Hello,
sorry for the inconvenience. I guess you have a cuDNN and/or CUDA version that is incompatible with the tensorflow_gpu version installed in the container.
See the list on the following Stack Overflow page for more details:
https://stackoverflow.com/questions/75789104/cubin-cuda-error-no-binary-for-gpu-error-while-running-attention-layer-with-bid
Please try to identify the tensorflow_gpu version in the container, find and install the compatible CUDA/cuDNN versions, and let us know whether this fixes the issue. :)

PS: As far as I remember, we needed a GPU with at least 11 GB of VRAM to run BERTax (at least while training). So I would try the changes discussed above on the A30 first, if possible. :)
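
A quick way to check the GPU setup inside the container, independent of BERTax, is to open a shell and force a small TensorFlow computation onto the GPU (this is a generic sanity check, not a BERTax command; if the versions are mismatched, it may fail immediately with the same CUBIN error):

docker run --gpus all -it --rm --entrypoint bash fkre/bertax:latest
# inside the container:
python3 -c "import tensorflow as tf; print(tf.__version__); print(tf.config.list_physical_devices('GPU')); print(tf.reduce_sum(tf.random.normal([1000, 1000])))"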

7Koris (Author) commented Nov 5, 2024

Hi! Thank you for the response.

First, I spin up the Docker container with the entrypoint set to bash:
docker run --gpus all -it --rm --name bertyfix --entrypoint bash fkre/bertax:latest

From inside the container I confirmed it is running Debian 11 on x86_64.

Then I check the TensorFlow version:

(base) root@66819c1d89d9:/# python3 -c "import tensorflow as tf; print(tf.__version__)"
2024-11-05 04:08:13.850100: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2.4.1

This confirms that I should need cuDNN 8.0 and CUDA 11.0.
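
Since the log above still shows libcudart.so.10.1 being loaded, it may also be worth checking which CUDA/cuDNN versions the TensorFlow wheel itself was built against (assuming tf.sysconfig.get_build_info() is available in this 2.4.1 build):

python3 -c "import tensorflow as tf; info = tf.sysconfig.get_build_info(); print(info.get('cuda_version'), info.get('cudnn_version'))"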

I have tried to install both manually and with Conda, but I keep getting output similar to the logs above (with some variation depending on the CUDA version; I tested up to 11.3). I could not successfully install CUDA by manual means.

Here are things I've tried:

Conda:

conda install cuda -c nvidia/label/cuda-11.3.0 -c nvidia/label/cuda-11.3.1

conda install https://anaconda.org/nvidia/cudatoolkit/11.0.221/download/linux-64/cudatoolkit-11.0.221-h6bb024c_0.tar.bz2
conda install https://anaconda.org/conda-forge/cudnn/8.0.5.39/download/linux-64/cudnn-8.0.5.39-hc0a50b0_1.tar.bz2
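
If the pinned packages above don't help, another variant I may try is pulling the exact versions TF 2.4.1 expects from a single channel (the channel/version combination here is a guess on my part, not something I have verified):

conda install -c conda-forge cudatoolkit=11.0 cudnn=8.0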

Manual:

The CUDA download page doesn't provide a setup for Debian 11 until 11.5, so I attempted to install CUDA 11.5.

Before installing CUDA, I set up add-apt-repository:

apt-get install software-properties-common
apt update

Then I installed gnupg2 with apt-get install gnupg2.

Next, I followed the network install instructions for my platform and architecture here.
There was no public key available, so I went into the sources list with apt edit-sources to manually mark the NVIDIA URL as trusted.

There's a snag at this point:

Errors were encountered while processing:
 /tmp/apt-dpkg-install-Mcd51K/076-nvidia-persistenced_560.35.03-1_amd64.deb
 /tmp/apt-dpkg-install-Mcd51K/211-nvidia-cuda-mps_560.35.03-1_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)

So I switched to the local runfile installer:

wget https://developer.download.nvidia.com/compute/cuda/11.5.0/local_installers/cuda_11.5.0_495.29.05_linux.run
sh cuda_11.5.0_495.29.05_linux.run

Which also fails.

I have hit a wall and am unsure how to proceed. In the meantime I'll keep trying configurations.
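
One configuration I haven't tried yet, but might: the dpkg errors above all come from driver-side packages (nvidia-persistenced, nvidia-cuda-mps), which shouldn't be needed inside a container, so installing only the toolkit meta-package from the network repository could avoid them (package name taken from NVIDIA's usual cuda-toolkit-X-Y naming; untested on my side):

apt-get -y install cuda-toolkit-11-5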

Thank you and kind regards.
