RuntimeError: CudaError: device-side assert triggered #1

Open · Vinit-source opened this issue Feb 8, 2021 · 5 comments

@Vinit-source commented Feb 8, 2021

Hi. I tried running the code in this repo on Google Colab. I get the following CUDA error every time I run the code.
(Screenshots 23 and 25: error traceback.)
I have set the CUDA_LAUNCH_BLOCKING flag to 1, following the Stack Overflow suggestions, but the same traceback is returned. The traceback points to problems in the inference_step. Please help.

This is my code:
from google.colab import drive
from os import path
if not path.exists('/content/drive'):
    drive.mount('/content/drive')
!apt-get update
!apt-get install python3.8
!apt-get install python3.8-dev
!wget https://bootstrap.pypa.io/get-pip.py && python3.8 get-pip.py
import sys
sys.path[2] = '/usr/lib/python38.zip'
sys.path[3] = '/usr/lib/python3.8'
sys.path[4] = '/usr/lib/python3.8/lib-dynload'
sys.path[5] = '/usr/local/lib/python3.8/dist-packages'
sys.path[7] = '/usr/local/lib/python3.8/dist-packages/IPython/extensions'
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
!python3.8 -m pip install -r requirements.txt  # This returns an error -- ignore
!pip install -r /my/project/directory/attend-infer-repeat-pytorch/requirements.txt
# Restart the runtime after executing the above code

# In a new code cell:
%cd /my/project/directory/attend-infer-repeat-pytorch
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
!CUDA_LAUNCH_BLOCKING=1 python3.8 main.py

Thank you.

@addtt (Owner) commented Mar 2, 2021

Hi! Sorry to hear this is not working for you. Can you try using Python 3.7 and running locally? It uses the CPU if there's no GPU available. Just let it train for half a minute to see if it runs smoothly. I just tried it again from a clean conda environment with Python 3.7, installed the requirements with pip install -r requirements.txt, and it works. I suggest you try with conda too. If it works locally but not on Colab, I'm not sure what the problem might be.

Btw I just updated the requirements to add tensorboard which I had forgotten, but for me it also works without tensorboard installed (in that case it doesn't save logs).

@fengxyStar commented:

Hi. I ran into the same problem. I found it is mainly because the weights of the LSTM and the predictor become NaN, which makes the output NaN as well and triggers the error.
Looking at the tensorboard visualization, I found a sudden jump in accuracy during training (shown below; note that the NaN does not appear immediately after the jump). The model then continues to train for several more iterations before the NaN suddenly appears. However, I have no idea why this happens.

(Screenshot: TensorBoard accuracy curve, 2021-08-02 12:11 PM.)
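
For reference, a minimal sketch of such a NaN check (assuming a generic torch.nn.Module named model; not code from this repo):

import torch

def find_nan_params(model):
    # Return names of parameters whose values or gradients contain NaN.
    bad = []
    for name, p in model.named_parameters():
        if torch.isnan(p).any():
            bad.append(name)
        if p.grad is not None and torch.isnan(p.grad).any():
            bad.append(name + '.grad')
    return bad

# Call this after loss.backward() every few iterations to catch the first
# step where NaNs show up, e.g.:
#   bad = find_nan_params(model)
#   if bad:
#       print('NaN at step', step, ':', bad)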

@addtt (Owner) commented Aug 2, 2021

Thanks for looking into this. I never had these problems, but I was using package versions that were around at that time (see requirements.txt). Maybe something changed in the newer versions of PyTorch (or some other package) that makes this unstable?

This can typically happen if some activation becomes too large, e.g. when taking the log of something that gets really close to zero. Usually some form of clipping helps (e.g. log(clip(x, min=1e-4)) or log(x + 1e-4)), but I'm still puzzled why I never encountered such problems with this model. It might be useful to save checkpoints every time step (keeping only the last few of them) and then do a post-mortem by loading the last checkpoints and seeing what went wrong. Or logging weights and gradients of all parameters to see exactly where in the network things start to explode.
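
For concreteness, a small sketch of the kind of clipping I mean (x here is just a placeholder tensor, not a specific variable in this repo):

import torch

x = torch.tensor([0.0, 1e-9, 0.5, 1.0])

# Option 1: clamp the argument away from zero before taking the log
safe_log_1 = torch.log(x.clamp(min=1e-4))

# Option 2: add a small constant instead
safe_log_2 = torch.log(x + 1e-4)

# Either way log(0) = -inf (and its infinite gradient) is avoided,
# so NaNs cannot propagate from this term during backprop.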

@fengxyStar commented:

Thank you for your detailed reply! I finally figured out where the NaN comes from. At a later stage of training, the value of z_pres_p can reach 1.0. Although the code applies z_pres_p.clamp(min=eps, max=1.0-eps) with eps=1e-12, that eps is too small and z_pres_p is still exactly 1.0 after the clamp (this is probably due to the limited floating-point precision of the GPU computations, which would also explain why you never had the problem). This makes the derivative of kl_pres become NaN. To solve the problem, just make eps no smaller than 1e-7.
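
A quick standalone check of the precision issue (not code from the repo), assuming float32 tensors as used on the GPU:

import torch

one = torch.tensor(1.0)  # float32, as in the GPU computation

print(one.clamp(max=1.0 - 1e-12) == 1.0)  # tensor(True): 1 - 1e-12 rounds to 1.0 in float32
print(one.clamp(max=1.0 - 1e-7) == 1.0)   # tensor(False): 1 - 1e-7 is representable below 1.0

# With eps = 1e-12 the clamp is effectively a no-op and z_pres_p can stay exactly 1.0,
# so (presumably) a term like log(1 - z_pres_p) in kl_pres blows up; with eps >= 1e-7
# the clamp actually moves the value away from 1.0.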

@addtt (Owner) commented Aug 5, 2021

Nice, thank you! So did this solve the problem? Would you mind sending a PR with the change (linked to this issue)?
