
Packaged RF2-linux.yml pins pytorch-cuda=11.7, may lead to issues with CUDA version #25

Open
matspunt opened this issue Nov 14, 2023 · 8 comments


@matspunt

Hi,

To users: if RF2 defaults to CPU and torch.cuda.is_available() returns False, read below.

Be careful when building your conda environment: the CUDA version found in the RF2 conda environment (which nvcc) must be compatible with the pytorch-cuda version in the environment. That is, if the system CUDA is used, it cannot be newer than 11.7 (check nvidia-smi). If a Python CUDA package is used instead, ensure the cudatoolkit version in your environment matches 11.7. Default behaviour for conda is to install the latest version, cudatoolkit-12.2, which leads to the PyTorch issue.
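The match described above can be sketched as a tiny version check (the helper names below are hypothetical, not part of RF2; versions are compared on major.minor only):

```python
# Sketch: check that the env's cudatoolkit matches the pytorch-cuda pin.
# Hypothetical helpers for illustration only, not part of RF2.

def cuda_major_minor(version: str) -> tuple[int, int]:
    """Reduce a CUDA version string like '11.7.1' to (11, 7)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)

def toolkit_matches_pin(cudatoolkit: str, pytorch_cuda: str) -> bool:
    """True when the env's cudatoolkit matches the pytorch-cuda pin."""
    return cuda_major_minor(cudatoolkit) == cuda_major_minor(pytorch_cuda)

print(toolkit_matches_pin("11.7.1", "11.7"))  # True: compatible pair
print(toolkit_matches_pin("12.2.0", "11.7"))  # False: conda's default 12.2 breaks it
```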

To developers: perhaps a dependency on cudatoolkit=11.7 or cudatoolkit-dev=11.7 can be added to the environment?

Note: I have used CUDA 12.0 successfully (with an upgraded pytorch-cuda) and saw no difference in the performance or output of RoseTTAFold2, but I can't comment in detail on that. 11.7 works fine too.

Cheers,

Mats

@debadutta-patra

Hi @matspunt,
The issue you mentioned is not because of which nvcc is installed, but because the yml file doesn't pin the version of pytorch to look for. As of now conda will try to fetch pytorch=2.1.1, which is not compatible with pytorch-cuda=11.7 (as listed in the yml file). A quick fix is to change pytorch-cuda=11.7 to pytorch-cuda=11.8, which is supported by the current release of pytorch. Your nvcc installation does not need to exactly match the pytorch-cuda version; in fact, nvcc doesn't even need to be on the PATH for it to work. You can use the instructions on pytorch.org to set it up properly.
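In RF2-linux.yml terms, the quick fix amounts to pinning both packages together (a sketch of just the relevant dependency lines, not the full file):

```yaml
dependencies:
  # Pin pytorch explicitly; left unpinned, conda resolves the newest build.
  - pytorch=2.1.1
  # 11.8 is the CUDA build published for the pytorch 2.1.x conda packages.
  - pytorch-cuda=11.8
```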

Hope this clarifies the issue.

Debadutta

lloydtripp added a commit to lloydtripp/RoseTTAFold2 that referenced this issue Feb 9, 2024
Restricting PyTorch to 2.1.1 and updating pytorch-cuda to 11.8.

Per instructions here: uw-ipd#25

Motivating issue: 
Traceback (most recent call last):
  File "/storage1/fs1/ghaller/Active/lloydt/LT2_Protein-Modeling/RosettaFold2/network/predict.py", line 493, in <module>
    pred.predict(
  File "/storage1/fs1/ghaller/Active/lloydt/LT2_Protein-Modeling/RosettaFold2/network/predict.py", line 316, in predict
    torch.cuda.reset_peak_memory_stats()
  File "/opt/conda/envs/RF2/lib/python3.10/site-packages/torch/cuda/memory.py", line 307, in reset_peak_memory_stats
    return torch._C._cuda_resetPeakMemoryStats(device)
AttributeError: module 'torch._C' has no attribute '_cuda_resetPeakMemoryStats'
@lloydtripp

This was very helpful to me in my HPC environment (RIS WUSTL)!

@stianale

> The issue you mentioned is not because of which nvcc is installed, but because the yml file doesn't mention the version of pytorch to look for. […] A quick fix is to change pytorch-cuda=11.7 to pytorch-cuda=11.8 […]
>
> Debadutta

Wrong. The current yml does not work with the actual RF2 code in this repo.

@debadutta-patra

> Wrong. The current yml does not work with the actual RF2 code in this repo.

Hey @stianale, as I mentioned in my comment, you need to change pytorch-cuda=11.7 to pytorch-cuda=11.8 in the yml file, or explicitly pin pytorch to the last version that supports pytorch-cuda=11.7. For an easier time, you can copy the yml from @lloydtripp's RosettaFold2 repository.

@stianale

> Hey @stianale, as I mentioned in my comment, you need to change pytorch-cuda=11.7 to pytorch-cuda=11.8 in the yml file, or explicitly pin pytorch to the last version that supports pytorch-cuda=11.7. For an easier time, you can copy the yml from @lloydtripp's RosettaFold2 repository.

For me that yields the following errors:

Downloading and Extracting Packages:

Preparing transaction: done
Verifying transaction:
SafetyError: The package for pytorch located at /home/stian/miniconda3/pkgs/pytorch-2.1.1-py3.10_cuda11.8_cudnn8.7.0_0
appears to be corrupted. The path 'lib/python3.10/site-packages/torch/cuda/memory.py'
has an incorrect size.
  reported size: 34961 bytes
  actual size: 34955 bytes

ClobberError: This transaction has incompatible packages due to a shared path.
  packages: nvidia/linux-64::cuda-cupti-11.8.87-0, nvidia/linux-64::cuda-nvtx-11.8.86-0
  path: 'LICENSE'


ClobberError: This transaction has incompatible packages due to a shared path.
  packages: defaults/linux-64::intel-openmp-2023.1.0-hdb19cb5_46306, defaults/linux-64::llvm-openmp-14.0.6-h9e868ea_0
  path: 'lib/libiomp5.so'


ClobberError: This transaction has incompatible packages due to a shared path.
  packages: defaults/linux-64::intel-openmp-2023.1.0-hdb19cb5_46306, defaults/linux-64::llvm-openmp-14.0.6-h9e868ea_0
  path: 'lib/libomptarget.so'

@stianale

stianale commented Feb 14, 2024

I got around those errors, but now the same errors that appeared with the old yml file still arise with the new one:

Running on CPU
Traceback (most recent call last):
  File "/media/stian/hgst6tb/OneDrive/DUS/PhD/All_Neis/Representative_genomes/RoseTTAFold2/network/predict.py", line 493, in <module>
    pred.predict(
  File "/media/stian/hgst6tb/OneDrive/DUS/PhD/All_Neis/Representative_genomes/RoseTTAFold2/network/predict.py", line 316, in predict
    torch.cuda.reset_peak_memory_stats()
  File "/home/stian/miniconda3/envs/RF2/lib/python3.10/site-packages/torch/cuda/memory.py", line 307, in reset_peak_memory_stats
    return torch._C._cuda_resetPeakMemoryStats(device)
RuntimeError: invalid argument to reset_peak_memory_stats

@stianale

The Rosettafold repos are train wrecks as of now, with recipes not coming close to working with the code provided... Similar, although not identical, issues are faced with the RF2NA software, and it feels as if it is up to the users themselves to figure a way out of the incompatibilities.

@austinweigle

austinweigle commented Aug 6, 2024

@stianale, I thought I would add to this thread. I was able to get RF2 to install today, August 7th, 2024. I am using a WSL CUDA install of cuda_11.8.r11.8/compiler.31833905_0.

First, I edited @lloydtripp 's yml file to read:

name: RF2
channels:
  - pytorch
  - nvidia
  - defaults
  - conda-forge
dependencies:
  - python=3.10
  - pip
  - cudatoolkit=11.8
  - pytorch=2.1.1
  - pytorch-cuda=11.8
  - dglteam/label/cu117::dgl
  - pyg::pyg
  - bioconda::hhsuite
  - pandas=2.2.0

That way, the needed cudatoolkit is already installed before we reinstall pytorch. To my understanding, the pytorch error with respect to cuda arises mainly because cuda does not appear available under the yml-directed install of pytorch. Additionally, parts of pytorch that were used to make RF2 functional are already deprecated, so installing RF2 will likely continue to be a serious difficulty. I recommend looking into old forums/github posts, or even looking at the backend of Google Colab notebooks; those notebooks have to perform fresh installs of software upon every callable instance, which may provide some clues. Anyway, here are the steps I took, in order, to have success:
STEP 1. conda install ipython
STEP 2. conda uninstall pytorch
STEP 3. conda uninstall pytorch-cuda
STEP 4. Get the correct pytorch install command from the pytorch website:
/PATH/TO/miniconda3/envs/RF2/bin/pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 --index-url https://download.pytorch.org/whl/cu118
STEP 5. Test that cuda is available in ipython:

import torch 
print(torch.cuda.is_available()) # should read "True"

STEP 6. /PATH/TO/miniconda3/envs/RF2/bin/pip install torchdata
STEP 7. /PATH/TO/miniconda3/envs/RF2/bin/pip install pydantic
STEP 8. Download the correct dgl pip whl corresponding to pytorch 2.1.1 from https://data.dgl.ai/wheels/cu118/repo.html. In the future, this may need to be changed, so you can just look at https://data.dgl.ai/wheels/repo.html. Then, with the whl downloaded: /PATH/TO/miniconda3/envs/RF2/bin/pip install dgl-2.1.0+cu118-cp310-cp310-manylinux1_x86_64.whl
STEP 9. Now you need to create this file: /PATH/TO/miniconda3/envs/RF2/lib/python3.10/site-packages/torch/utils/_import_utils.py. You can copy the code from this repo [LINK]
STEP 10. Now install the transformer from the RoseTTAFold2/SE3Transformer directory:
/PATH/TO/miniconda3/envs/RF2/bin/pip install --no-cache-dir -r requirements.txt
python setup.py install

Lastly, when actually running the predictions I had to export MKL_NUM_THREADS=1 in the bash script that executes predict.py. Alternatively, you could specify this in Python using import mkl and mkl.set_num_threads(1).
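The environment-variable form of that workaround can also be applied from Python itself, as a minimal sketch (assumption: MKL reads MKL_NUM_THREADS when the library first loads, so this must run before the first import of torch in the process):

```python
import os

# Limit MKL to one thread. This must run before the first `import torch`
# in the process, because MKL reads the variable at library load time.
os.environ["MKL_NUM_THREADS"] = "1"

# ...only now import torch and run predict.py's logic.
```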

I can confirm that this has worked for me. Again, while I pose a solution, there may be some underlying difficulties that vary based on your computing environment. But overall, the main issue is that the pytorch installation directed by the yml file does not natively read your cuda library. This thread has done a good job identifying the specific cuda and pytorch versions that are needed, but there may (likely) come a time when the default pulls grab the wrong dependencies and mess everything up. Here I went directly to the pytorch website for the installation command, and then recreated the deprecated files that are imported by RF2's ./network/predict.py script.

I think we can close this ticket. Hope this helps,
Austin
