RuntimeError: CudaError: device-side assert triggered #1
Hi! Sorry to hear this is not working for you. Can you try using Python 3.7 and running locally? It uses the CPU if there's no GPU available. Just let it train for half a minute to see if it runs smoothly. I just tried it again from a clean conda environment with Python 3.7, installed the requirements with pip install -r requirements.txt, and it works. I suggest you try with conda too. If it works locally but not on Colab, I'm not sure what the problem might be. Btw, I just updated the requirements to add tensorboard, which I had forgotten, but for me it also works without tensorboard installed (in that case it doesn't save logs).
Hi. I ran into the same problem. I found that it is mainly because the weights of the LSTM and the predictor turn into NaN. This makes the output NaN too, which results in the error.
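To narrow down which weights turn into NaN, as described in the comment above, a small check like the following can be run after each training step. This is only a sketch in plain Python: `find_nan_entries` and the dictionary layout are hypothetical and not part of this repo. With PyTorch, the analogous check would iterate over `model.named_parameters()` and test each tensor with `torch.isnan`.

```python
import math


def find_nan_entries(named_weights):
    """Return the names of weight groups that contain at least one NaN.

    `named_weights` is a hypothetical mapping from a layer name to a flat
    list of float weights; it stands in for a model's named parameters.
    """
    return [
        name
        for name, values in named_weights.items()
        if any(math.isnan(v) for v in values)
    ]


# Example: only the "lstm" group contains a NaN.
weights = {
    "lstm": [0.1, float("nan"), -0.3],
    "predictor": [0.2, 0.5],
}
print(find_nan_entries(weights))
```

Checking after every step shows the first iteration at which a weight group goes bad, which is usually more informative than the device-side assert raised later.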
Thanks for looking into this. I never had these problems, but I was using the package versions that were around at that time (see requirements.txt). Maybe something changed in newer versions of PyTorch (or some other package) that makes this unstable? This can typically happen if some activation becomes too large, e.g. when taking the log of something that gets really close to zero. Usually some form of clipping helps (e.g.
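As an illustration of the clipping idea mentioned above, here is a minimal sketch in plain Python that clamps a value before taking its log. The name `safe_log` and the `eps` value are assumptions made for this example, not code from this repo, and the upper bound of 1 assumes the input is a probability. In PyTorch the same idea is usually expressed by applying `torch.clamp` to the tensor before `torch.log`.

```python
import math


def safe_log(p, eps=1e-8):
    """Log of a probability p, clamped to [eps, 1] so the result stays finite.

    eps is an assumed lower bound; it keeps log() away from zero, where the
    result would otherwise blow up towards minus infinity.
    """
    clamped = min(max(p, eps), 1.0)
    return math.log(clamped)


# Without clamping, math.log(0.0) raises ValueError; with clamping the
# result is a large but finite negative number.
print(safe_log(0.0))
print(safe_log(0.5))
```

The trade-off is a small bias for values below `eps` in exchange for finite losses and gradients.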
Thank you for your detailed reply! I finally figured out where the NaN comes from. At the later stage of training, the value of
Nice, thank you! So did this solve the problem? Would you mind sending a PR with the change (linked to this issue)?
Hi. I tried running the code in this repo on Google Colab. I get the following CUDA error every time I run the code.
I have set the CUDA_LAUNCH_BLOCKING flag to 1, following the solutions on Stack Overflow, but the same traceback is returned. The traceback points to a problem in inference_step. Please help.
This is my code:
from google.colab import drive
from os import path
if not path.exists('/content/drive'):
    drive.mount('/content/drive')
!apt-get update
!apt-get install python3.8
!apt-get install python3.8-dev
!wget https://bootstrap.pypa.io/get-pip.py && python3.8 get-pip.py
import sys
sys.path[2] = '/usr/lib/python38.zip'
sys.path[3] = '/usr/lib/python3.8'
sys.path[4] = '/usr/lib/python3.8/lib-dynload'
sys.path[5] = '/usr/local/lib/python3.8/dist-packages'
sys.path[7] ='/usr/local/lib/python3.8/dist-packages/IPython/extensions'
import os
os.environ['CUDA_LAUNCH_BLOCKING']="1"
!python3.8 -m pip install -r requirements.txt #This returns an error -- ignore
!pip install -r /my/project/directory/attend-infer-repeat-pytorch/requirements.txt
#Restart runtime after executing the above code
#Create another code cell
%cd /my/project/directory/attend-infer-repeat-pytorch
#import os
os.environ['CUDA_LAUNCH_BLOCKING']="1"
!CUDA_LAUNCH_BLOCKING=1 python3.8 main.py
Thank you.