RuntimeError: CudaError: device-side assert triggered #1

Open · Vinit-source opened this issue Feb 8, 2021 · 5 comments

@Vinit-source commented Feb 8, 2021

Hi. I tried running the code in this repo on Google Colab. I get the following CUDA error every time I run the code.
(Screenshots 23 and 25: error traceback.)
I have set the CUDA_LAUNCH_BLOCKING flag to 1, following the Stack Overflow suggestions, but the same traceback is returned. The traceback points to problems in the inference_step. Please help.

This is my code:
from google.colab import drive
from os import path
if not path.exists('/content/drive'):
    drive.mount('/content/drive')
!apt-get update
!apt-get install python3.8
!apt-get install python3.8-dev
!wget https://bootstrap.pypa.io/get-pip.py && python3.8 get-pip.py
import sys
sys.path[2] = '/usr/lib/python38.zip'
sys.path[3] = '/usr/lib/python3.8'
sys.path[4] = '/usr/lib/python3.8/lib-dynload'
sys.path[5] = '/usr/local/lib/python3.8/dist-packages'
sys.path[7] = '/usr/local/lib/python3.8/dist-packages/IPython/extensions'
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
!python3.8 -m pip install -r requirements.txt  # This returns an error -- ignore
!pip install -r /my/project/directory/attend-infer-repeat-pytorch/requirements.txt
# Restart the runtime after executing the above code

# In a new code cell:
%cd /my/project/directory/attend-infer-repeat-pytorch
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
!CUDA_LAUNCH_BLOCKING=1 python3.8 main.py

Thank you.

@addtt (Owner) commented Mar 2, 2021

Hi! Sorry to hear this is not working for you. Can you try using Python 3.7 and running locally? It uses the CPU if there's no GPU available. Just let it train for half a minute to see if it runs smoothly. I just tried it again from a clean conda environment with Python 3.7, installed the requirements with pip install -r requirements.txt, and it works. I suggest you try with conda too. If it works locally but not on Colab, I'm not sure what the problem might be.

Btw I just updated the requirements to add tensorboard which I had forgotten, but for me it also works without tensorboard installed (in that case it doesn't save logs).

@fengxyStar commented:

Hi. I ran into the same problem. I found it is mainly because the weights of the LSTM and the predictor become NaN, which makes the output NaN as well and triggers the error.
Looking at the tensorboard visualization, I found a sudden jump in accuracy during training (shown below; note that the NaN does not appear immediately after the jump). The model then continues to train for several more iterations before the NaN suddenly appears. However, I have no idea why this happens.

(Screenshot: TensorBoard accuracy curve, 2021-08-02 12:11 PM.)
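
For reference, a minimal sketch of such a NaN check (assuming a generic torch.nn.Module named model; not code from this repo):

import torch

def find_nan_params(model):
    # Return names of parameters whose values or gradients contain NaN.
    bad = []
    for name, p in model.named_parameters():
        if torch.isnan(p).any():
            bad.append(name)
        if p.grad is not None and torch.isnan(p.grad).any():
            bad.append(name + '.grad')
    return bad

# Call this after loss.backward() every few iterations to catch the first
# step where NaNs show up, e.g.:
#   bad = find_nan_params(model)
#   if bad:
#       print('NaN at step', step, ':', bad)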

@addtt (Owner) commented Aug 2, 2021

Thanks for looking into this. I never had these problems, but I was using package versions that were around at that time (see requirements.txt). Maybe something changed in the newer versions of PyTorch (or some other package) that makes this unstable?

This can typically happen if some activation becomes too large, e.g. when taking the log of something that gets really close to zero. Usually some form of clipping helps (e.g. log(clip(x, min=1e-4)) or log(x + 1e-4)), but I'm still puzzled why I never encountered such problems with this model. It might be useful to save checkpoints every time step (keeping only the last few of them) and then do a post-mortem by loading the last checkpoints and seeing what went wrong. Or logging weights and gradients of all parameters to see exactly where in the network things start to explode.
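
For concreteness, a small sketch of the kind of clipping I mean (x here is just a placeholder tensor, not a specific variable in this repo):

import torch

x = torch.tensor([0.0, 1e-9, 0.5, 1.0])

# Option 1: clamp the argument away from zero before taking the log
safe_log_1 = torch.log(x.clamp(min=1e-4))

# Option 2: add a small constant instead
safe_log_2 = torch.log(x + 1e-4)

# Either way log(0) = -inf (and its infinite gradient) is avoided,
# so NaNs cannot propagate from this term during backprop.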

@fengxyStar commented:

Thank you for your detailed reply! I finally figured out where the NaN comes from. At a later stage of training, the value of z_pres_p can reach 1.0. Although the code applies z_pres_p.clamp(min=eps, max=1.0-eps) with eps=1e-12, that eps is too small and z_pres_p is still exactly 1.0 after the clamp (this is probably due to the limited floating-point precision of the GPU computations, which would also explain why you never had the problem). This makes the derivative of kl_pres become NaN. To solve the problem, just make eps no smaller than 1e-7.
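
A quick standalone check of the precision issue (not code from the repo), assuming float32 tensors as used on the GPU:

import torch

one = torch.tensor(1.0)  # float32, as in the GPU computation

print(one.clamp(max=1.0 - 1e-12) == 1.0)  # tensor(True): 1 - 1e-12 rounds to 1.0 in float32
print(one.clamp(max=1.0 - 1e-7) == 1.0)   # tensor(False): 1 - 1e-7 is representable below 1.0

# With eps = 1e-12 the clamp is effectively a no-op and z_pres_p can stay exactly 1.0,
# so (presumably) a term like log(1 - z_pres_p) in kl_pres blows up; with eps >= 1e-7
# the clamp actually moves the value away from 1.0.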

@addtt (Owner) commented Aug 5, 2021

Nice, thank you! So did this solve the problem? Would you mind sending a PR with the change (linked to this issue)?
