
[Issue]: Stable Diffusion, PyTorch conv2d breaks in ROCm 6.0 #3418

Open
AphidGit opened this issue Feb 17, 2024 · 7 comments

@AphidGit

Problem Description

Running stable-diffusion-webui worked when ROCm was at version 5.7. The 6.0 update, installed Feb 15, breaks it. While I had the occasional hiccup, lockup, or reboot with 5.7, it was fairly stable and could produce images. With 6.0, the webui consistently crashes when trying to load any non-trivial data onto the GPU.

It reports the following stack traces. In between, I can see a RuntimeError, probably from a different thread. When loading multiple models (such as when using Low-Rank Adaptations), I get a RuntimeError for each one.

Traceback (most recent call last):
  File "/usr/lib/python3.11/threading.py", line 1002, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/initialize.py", line 147, in load_model
    shared.sd_model  # noqa: B018
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/shared_items.py", line 128, in sd_model
    return modules.sd_models.model_data.get_sd_model()
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_models.py", line 531, in get_sd_model
    load_model()
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_models.py", line 658, in load_model
    load_model_weights(sd_model, checkpoint_info, state_dict, timer)
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_models.py", line 375, in load_model_weights
    model.load_state_dict(state_dict, strict=False)
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_disable_initialization.py", line 223, in <lambda>
    module_load_state_dict = self.replace(torch.nn.Module, 'load_state_dict', lambda *args, **kwargs: load_state_dict(module_load_state_dict, *args, **kwargs))
                                                                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_disable_initialization.py", line 221, in load_state_dict
    original(module, state_dict, strict=strict)
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2139, in load_state_dict
    load(self, state_dict)
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2127, in load
    load(child, child_state_dict, child_prefix)
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2127, in load
    load(child, child_state_dict, child_prefix)
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2127, in load
    load(child, child_state_dict, child_prefix)
  [Previous line repeated 1 more time]
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2121, in load
    module._load_from_state_dict(
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_disable_initialization.py", line 225, in <lambda>
    linear_load_from_state_dict = self.replace(torch.nn.Linear, '_load_from_state_dict', lambda *args, **kwargs: load_from_state_dict(linear_load_from_state_dict, *args, **kwargs))
                                                                                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_disable_initialization.py", line 191, in load_from_state_dict
    module._parameters[name] = torch.nn.parameter.Parameter(torch.zeros_like(param, device=device, dtype=dtype), requires_grad=param.requires_grad)
                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/torch/_meta_registrations.py", line 4815, in zeros_like
    res.fill_(0)
RuntimeError: HIP error: shared object initialization failed
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.

Exception in thread Thread-2 (load_model):
Traceback (most recent call last):
  File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/initialize.py", line 153, in load_model
    devices.first_time_calculation()
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/devices.py", line 166, in first_time_calculation
    conv2d(x)
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hdd/AI/sd-venv/stable-diffusion-webui/extensions-builtin/Lora/networks.py", line 501, in network_Conv2d_forward
    return originals.Conv2d_forward(self, input)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 462, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 458, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: HIP error: shared object initialization failed
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
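The second traceback fails inside the webui's conv2d warm-up. The failing path can be isolated outside the webui with a minimal sketch like the following (the shapes are illustrative, not the ones the webui uses; the CPU fallback is only so the snippet runs anywhere):

```python
import torch

# Minimal isolation test for the conv2d path that fails above. On the
# affected ROCm 6.0 setup, the conv call on the GPU raises
# "RuntimeError: HIP error: shared object initialization failed";
# on a healthy stack it prints the output shape. "cuda" maps to HIP
# on ROCm builds of PyTorch.
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(1, 4, 32, 32, device=device)
conv = torch.nn.Conv2d(4, 8, kernel_size=3, padding=1).to(device)
y = conv(x)  # this is where the HIP error surfaces on ROCm 6.0
print(tuple(y.shape))  # (1, 8, 32, 32)
```

Running this in the same venv as the webui tells you whether the breakage is in the torch/ROCm stack itself rather than in the webui.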

Digging further, I found that raising the AMD_LOG_LEVEL environment variable (anything above zero was enough, so try env AMD_LOG_LEVEL=1) gave me another clue:

:1:hip_code_object.cpp      :616 : 66280489141 us: [pid:379524 tid:0x769a212006c0] Cannot find the function: Cijk_Ailk_Bljk_HHS_BH_MT16x64x16_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS0_FSSC10_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR1_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT2_8_TLDS0_UMLDSA0_UMLDSB0_USFGRO0_VAW2_VS1_VW1_WSGRA0_WSGRB0_WS64_WG8_8_1_WGM8 
:1:hip_module.cpp           :83  : 66280489163 us: [pid:379524 tid:0x769a212006c0] Cannot find the function: Cijk_Ailk_Bljk_HHS_BH_MT16x64x16_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS0_FSSC10_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR1_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT2_8_TLDS0_UMLDSA0_UMLDSB0_USFGRO0_VAW2_VS1_VW1_WSGRA0_WSGRB0_WS64_WG8_8_1_WGM8 for module: 0x1c4f5db0

I edited the webui code to add a little 'press any key' prompt, attached gdb, and made it break at that line. Here's a full backtrace. The following components are involved:

  • rocm
  • blas
  • hip
  • pytorch
  • torchvision
  • stable-diffusion-webui
(gdb) bt
#0  hip::DynCO::getDynFunc (func_name=..., hfunc=<optimized out>, this=0x76732cb4b7d0) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/hipamd/src/hip_code_object.cpp:616
#1  PlatformState::getDynFunc (
    func_name=0x76732c011850 "Cijk_Ailk_Bljk_HHS_BH_MT16x64x16_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS0_FSSC10_FL0_GRPM1_GRVW1_GSU1_GSUASB_"..., hmod=0x76732c0b6ed0, hfunc=<optimized out>, this=0x6550370672d0) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/hipamd/src/hip_platform.cpp:747
#2  hipModuleGetFunction (hfunc=<optimized out>, hmod=<optimized out>, 
    name=0x76732c011850 "Cijk_Ailk_Bljk_HHS_BH_MT16x64x16_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS0_FSSC10_FL0_GRPM1_GRVW1_GSU1_GSUASB_"...) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/hipamd/src/hip_module.cpp:82
#3  0x00007675c994165c in ?? () from /opt/rocm/lib/librocblas.so.4
#4  0x00007675c994250c in ?? () from /opt/rocm/lib/librocblas.so.4
#5  0x00007675c994280c in ?? () from /opt/rocm/lib/librocblas.so.4
#6  0x00007675c8f88a6f in ?? () from /opt/rocm/lib/librocblas.so.4
#7  0x00007675c9061ec9 in ?? () from /opt/rocm/lib/librocblas.so.4
#8  0x00007675c905fa3c in ?? () from /opt/rocm/lib/librocblas.so.4
#9  0x00007675c905ac1e in ?? () from /opt/rocm/lib/librocblas.so.4
#10 0x00007675c90586d9 in rocblas_gemm_ex () from /opt/rocm/lib/librocblas.so.4
#11 0x0000767646fbda9a in ?? () from /usr/lib/libtorch_hip.so
#12 0x0000767646fdb254 in ?? () from /usr/lib/libtorch_hip.so
#13 0x0000767647122601 in ?? () from /usr/lib/libtorch_hip.so
#14 0x00007676471226a4 in ?? () from /usr/lib/libtorch_hip.so
#15 0x00007676a8c791cc in at::_ops::addmm::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) () from /usr/lib/libtorch_cpu.so
#16 0x00007676aad66f13 in ?? () from /usr/lib/libtorch_cpu.so
#17 0x00007676aad67e46 in ?? () from /usr/lib/libtorch_cpu.so
#18 0x00007676a8cee25b in at::_ops::addmm::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) () from /usr/lib/libtorch_cpu.so
#19 0x00007676a851bd80 in at::native::linear(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&) () from /usr/lib/libtorch_cpu.so
#20 0x00007676a975fb5b in ?? () from /usr/lib/libtorch_cpu.so
#21 0x00007676a8cd56c7 in at::_ops::linear::call(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&) () from /usr/lib/libtorch_cpu.so
#22 0x00007676b3222be8 in ?? () from /usr/lib/python3.11/site-packages/torch/lib/libtorch_python.so
#23 0x00007676be1fdd41 in cfunction_call (func=0x7674af8af470, args=<optimized out>, kwargs=<optimized out>) at Objects/methodobject.c:542
#24 0x00007676be1dc054 in _PyObject_MakeTpCall (tstate=0x65504125ade0, callable=0x7674af8af470, args=<optimized out>, nargs=3, keywords=0x0) at Objects/call.c:214
#25 0x00007676be1e76e1 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:4769
#26 0x00007676be22de9f in _PyEval_EvalFrame (throwflag=0, frame=0x7676053e9428, tstate=0x65504125ade0) at ./Include/internal/pycore_ceval.h:73
#27 _PyEval_Vector (kwnames=0x0, argcount=<optimized out>, args=0x7673311ff1b0, locals=0x0, func=0x76733198c5e0, tstate=0x65504125ade0) at Python/ceval.c:6434
#28 _PyFunction_Vectorcall (kwnames=0x0, nargsf=<optimized out>, stack=0x7673311ff1b0, func=0x76733198c5e0) at Objects/call.c:393
#29 _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7673311ff1b0, callable=0x76733198c5e0, tstate=0x65504125ade0) at ./Include/internal/pycore_call.h:92
#30 method_vectorcall (method=<optimized out>, args=<optimized out>, nargsf=<optimized out>, kwnames=0x0) at Objects/classobject.c:89
#31 0x00007676be1eb6a3 in do_call_core (use_tracing=<optimized out>, kwdict=0x7673304003c0, callargs=0x7673301e9840, func=0x76734ac3d600, tstate=<optimized out>) at Python/ceval.c:7352
#32 _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:5376
#33 0x00007676be22de9f in _PyEval_EvalFrame (throwflag=0, frame=0x7676053e9308, tstate=0x65504125ade0) at ./Include/internal/pycore_ceval.h:73
#34 _PyEval_Vector (kwnames=0x0, argcount=<optimized out>, args=0x7673311ff3e0, locals=0x0, func=0x7674ae114680, tstate=0x65504125ade0) at Python/ceval.c:6434
#35 _PyFunction_Vectorcall (kwnames=0x0, nargsf=<optimized out>, stack=0x7673311ff3e0, func=0x7674ae114680) at Objects/call.c:393
#36 _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7673311ff3e0, callable=0x7674ae114680, tstate=0x65504125ade0) at ./Include/internal/pycore_call.h:92
#37 method_vectorcall (method=<optimized out>, args=<optimized out>, nargsf=<optimized out>, kwnames=0x0) at Objects/classobject.c:89
#38 0x00007676be1eb6a3 in do_call_core (use_tracing=<optimized out>, kwdict=0x76732b538e00, callargs=0x76734ac34d90, func=0x76734ac0a080, tstate=<optimized out>) at Python/ceval.c:7352
#39 _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:5376
#40 0x00007676be20e8c0 in _PyEval_EvalFrame (throwflag=0, frame=0x7676053e9280, tstate=0x65504125ade0) at ./Include/internal/pycore_ceval.h:73
#41 _PyEval_Vector (kwnames=<optimized out>, argcount=2, args=0x7673311ff6a0, locals=0x0, func=0x7674ae1145e0, tstate=0x65504125ade0) at Python/ceval.c:6434
#42 _PyFunction_Vectorcall (func=0x7674ae1145e0, stack=0x7673311ff6a0, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:393
#43 0x00007676be1e0d97 in _PyObject_FastCallDictTstate (tstate=0x65504125ade0, callable=0x7674ae1145e0, args=<optimized out>, nargsf=<optimized out>, kwargs=<optimized out>) at Objects/call.c:141
#44 0x00007676be216b3d in _PyObject_Call_Prepend (tstate=0x65504125ade0, callable=0x7674ae1145e0, obj=0x76734ac3d6d0, args=<optimized out>, kwargs=0x0) at Objects/call.c:482
#45 0x00007676be2dba82 in slot_tp_call (self=0x76734ac3d6d0, args=0x76734ac34d30, kwds=0x0) at Objects/typeobject.c:7623
#46 0x00007676be1dc054 in _PyObject_MakeTpCall (tstate=0x65504125ade0, callable=0x76734ac3d6d0, args=<optimized out>, nargs=1, keywords=0x0) at Objects/call.c:214
#47 0x00007676be1e76e1 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:4769
#48 0x00007676be20e8c0 in _PyEval_EvalFrame (throwflag=0, frame=0x7676053e91f8, tstate=0x65504125ade0) at ./Include/internal/pycore_ceval.h:73
#49 _PyEval_Vector (kwnames=<optimized out>, argcount=0, args=0x7676be574010 <_PyRuntime+58928>, locals=0x0, func=0x767349893880, tstate=0x65504125ade0) at Python/ceval.c:6434
#50 _PyFunction_Vectorcall (func=0x767349893880, stack=0x7676be574010 <_PyRuntime+58928>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:393
#51 0x00007676be2e2297 in bounded_lru_cache_wrapper (self=0x767349937ed0, args=0x7676be573ff8 <_PyRuntime+58904>, kwds=0x0) at ./Modules/_functoolsmodule.c:1021
#52 0x00007676be1dc054 in _PyObject_MakeTpCall (tstate=0x65504125ade0, callable=0x767349937ed0, args=<optimized out>, nargs=0, keywords=0x0) at Objects/call.c:214
#53 0x00007676be1e76e1 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:4769
#54 0x00007676be20e8c0 in _PyEval_EvalFrame (throwflag=0, frame=0x7676053e9188, tstate=0x65504125ade0) at ./Include/internal/pycore_ceval.h:73
#55 _PyEval_Vector (kwnames=<optimized out>, argcount=0, args=0x7676be574010 <_PyRuntime+58928>, locals=0x0, func=0x76733132b9c0, tstate=0x65504125ade0) at Python/ceval.c:6434
#56 _PyFunction_Vectorcall (func=0x76733132b9c0, stack=0x7676be574010 <_PyRuntime+58928>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:393
#57 0x00007676be1eb6a3 in do_call_core (use_tracing=<optimized out>, kwdict=0x767331384e40, callargs=0x7676be573ff8 <_PyRuntime+58904>, func=0x76733132b9c0, tstate=<optimized out>) at Python/ceval.c:7352
#58 _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:5376
#59 0x00007676be22e583 in _PyEval_EvalFrame (throwflag=0, frame=0x7676053e9020, tstate=0x65504125ade0) at ./Include/internal/pycore_ceval.h:73
#60 _PyEval_Vector (kwnames=<optimized out>, argcount=<optimized out>, args=0x7673311ffe28, locals=0x0, func=0x7676bd476ca0, tstate=0x65504125ade0) at Python/ceval.c:6434
#61 _PyFunction_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, stack=0x7673311ffe28, func=0x7676bd476ca0) at Objects/call.c:393
#62 _PyObject_VectorcallTstate (tstate=0x65504125ade0, callable=0x7676bd476ca0, args=0x7673311ffe28, nargsf=<optimized out>, kwnames=<optimized out>) at ./Include/internal/pycore_call.h:92
#63 0x00007676be22e070 in method_vectorcall (method=<optimized out>, args=0x7676be574010 <_PyRuntime+58928>, nargsf=<optimized out>, kwnames=0x0) at Objects/classobject.c:67
#64 0x00007676be2f4df8 in thread_run (boot_raw=0x655044198250) at ./Modules/_threadmodule.c:1124
#65 0x00007676be2cc538 in pythread_wrapper (arg=<optimized out>) at Python/thread_pthread.h:241
#66 0x00007676bdea955a in start_thread (arg=<optimized out>) at pthread_create.c:447
#67 0x00007676bdf26a3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Operating System

Arch Linux, kernel 6.7.4-arch1-1

CPU

AMD Threadripper 1950X

GPU

AMD Radeon RX 7900 XTX

ROCm Version

ROCm 6.0.0

ROCm Component

No response

Steps to Reproduce

To reproduce:

1. Create a venv and enter it.
2. Install stable-diffusion-webui, following https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs#user-content-install-on-amd-and-arch-linux
3. Download any SD model and place it in the models folder.
4. Run either ./webui.sh or python launch.py

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen Threadripper 1950X 16-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen Threadripper 1950X 16-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3400                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            32                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    131743808(0x7da4040) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    131743808(0x7da4040) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    131743808(0x7da4040) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1100                            
  Uuid:                    GPU-<redacted>               
  Marketing Name:          AMD Radeon RX 7900 XTX             
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      32(0x20) KB                        
    L2:                      6144(0x1800) KB                    
    L3:                      98304(0x18000) KB                  
  Chip ID:                 29772(0x744c)                      
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2482                               
  BDFID:                   17152                              
  Internal Node ID:        1                                  
  Compute Unit:            96                                 
  SIMDs per CU:            2                                  
  Shader Engines:          6                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 528                                
  SDMA engine uCode::      19                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    25149440(0x17fc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    25149440(0x17fc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1100         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

Additional Information

No response

@Kamishirasawa-keine

Same issue.

@alexxu-amd

Yes, I am able to reproduce this error using the installation steps from https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs#user-content-install-on-amd-and-arch-linux.

Can you try reinstalling torch and torchvision using the following?

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.0

@ppanchad-amd

@AphidGit @Kamishirasawa-keine Please try @alexxu-amd's suggestion above. Thanks!

@AphidGit
Author

AphidGit commented Jul 5, 2024

That suggestion breaks things even more.

RuntimeError: Torch is not able to use GPU; add --skip-torch-cuda-test

@xangelix

xangelix commented Sep 5, 2024

Unfortunately I'm still getting the same error with ROCm 5.7 / ROCm 6.0 and Python 3.10 when attempting to use stable-diffusion-webui. Did anyone ever find a solution to this?

7900 XTX

glibc version is 2.40
Check TCMalloc: libtcmalloc_minimal.so.4
libtcmalloc_minimal.so.4 is linked with libc.so,execute LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4
Python 3.10.14 (main, Sep  5 2024, 22:06:38) [GCC 14.2.1 20240904]
Version: v1.10.1
Commit hash: 82a973c04367123ae98bd9abdf80d9eda9b910e2
Installing torch and torchvision
Looking in indexes: https://download.pytorch.org/whl/rocm6.0
Collecting torch
  Downloading https://download.pytorch.org/whl/rocm6.0/torch-2.4.1%2Brocm6.0-cp310-cp310-linux_x86_64.whl (2363.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.4/2.4 GB 53.4 MB/s eta 0:00:00
Collecting torchvision
  Downloading https://download.pytorch.org/whl/rocm6.0/torchvision-0.19.1%2Brocm6.0-cp310-cp310-linux_x86_64.whl (65.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65.8/65.8 MB 21.0 MB/s eta 0:00:00
Collecting torchaudio
  Downloading https://download.pytorch.org/whl/rocm6.0/torchaudio-2.4.1%2Brocm6.0-cp310-cp310-linux_x86_64.whl (1.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 13.2 MB/s eta 0:00:00
Collecting filelock (from torch)
  Downloading https://download.pytorch.org/whl/filelock-3.13.1-py3-none-any.whl (11 kB)
Collecting typing-extensions>=4.8.0 (from torch)
  Downloading https://download.pytorch.org/whl/typing_extensions-4.9.0-py3-none-any.whl (32 kB)
Collecting sympy (from torch)
  Downloading https://download.pytorch.org/whl/sympy-1.12-py3-none-any.whl (5.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.7/5.7 MB 28.9 MB/s eta 0:00:00
Collecting networkx (from torch)
  Downloading https://download.pytorch.org/whl/networkx-3.2.1-py3-none-any.whl (1.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 30.4 MB/s eta 0:00:00
Collecting jinja2 (from torch)
  Downloading https://download.pytorch.org/whl/Jinja2-3.1.3-py3-none-any.whl (133 kB)
Collecting fsspec (from torch)
  Downloading https://download.pytorch.org/whl/fsspec-2024.2.0-py3-none-any.whl (170 kB)
Collecting pytorch-triton-rocm==3.0.0 (from torch)
  Downloading https://download.pytorch.org/whl/pytorch_triton_rocm-3.0.0-cp310-cp310-linux_x86_64.whl (341.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 341.7/341.7 MB 58.9 MB/s eta 0:00:00
Collecting numpy (from torchvision)
  Downloading https://download.pytorch.org/whl/numpy-1.26.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.2/18.2 MB 61.1 MB/s eta 0:00:00
Collecting pillow!=8.3.*,>=5.3.0 (from torchvision)
  Downloading https://download.pytorch.org/whl/pillow-10.2.0-cp310-cp310-manylinux_2_28_x86_64.whl (4.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.5/4.5 MB 56.8 MB/s eta 0:00:00
Collecting MarkupSafe>=2.0 (from jinja2->torch)
  Downloading https://download.pytorch.org/whl/MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
Collecting mpmath>=0.19 (from sympy->torch)
  Downloading https://download.pytorch.org/whl/mpmath-1.3.0-py3-none-any.whl (536 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 248.3 MB/s eta 0:00:00
Installing collected packages: mpmath, typing-extensions, sympy, pillow, numpy, networkx, MarkupSafe, fsspec, filelock, pytorch-triton-rocm, jinja2, torch, torchvision, torchaudio
Successfully installed MarkupSafe-2.1.5 filelock-3.13.1 fsspec-2024.2.0 jinja2-3.1.3 mpmath-1.3.0 networkx-3.2.1 numpy-1.26.3 pillow-10.2.0 pytorch-triton-rocm-3.0.0 sympy-1.12 torch-2.4.1+rocm6.0 torchaudio-2.4.1+rocm6.0 torchvision-0.19.1+rocm6.0 typing-extensions-4.9.0
Installing clip
Installing open_clip
Cloning assets into /home/tux/stable-diffusion-webui/repositories/stable-diffusion-webui-assets...
Cloning into '/home/tux/stable-diffusion-webui/repositories/stable-diffusion-webui-assets'...
Cloning Stable Diffusion into /home/tux/stable-diffusion-webui/repositories/stable-diffusion-stability-ai...
Cloning into '/home/tux/stable-diffusion-webui/repositories/stable-diffusion-stability-ai'...
Cloning Stable Diffusion XL into /home/tux/stable-diffusion-webui/repositories/generative-models...
Cloning into '/home/tux/stable-diffusion-webui/repositories/generative-models'...
Cloning K-diffusion into /home/tux/stable-diffusion-webui/repositories/k-diffusion...
Cloning into '/home/tux/stable-diffusion-webui/repositories/k-diffusion'...
Cloning BLIP into /home/tux/stable-diffusion-webui/repositories/BLIP...
Cloning into '/home/tux/stable-diffusion-webui/repositories/BLIP'...
Installing requirements

---

[automatic] | glibc version is 2.40
[automatic] | Check TCMalloc: libtcmalloc_minimal.so.4
[automatic] | libtcmalloc_minimal.so.4 is linked with libc.so,execute LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4
[automatic] | Python 3.10.14 (main, Sep  5 2024, 10:36:08) [GCC 14.2.1 20240805]
[automatic] | Version: v1.10.1
[automatic] | Commit hash: 82a973c04367123ae98bd9abdf80d9eda9b910e2
[automatic] | Launching Web UI with arguments: 
[automatic] | amdgpu.ids: No such file or directory
[automatic] | amdgpu.ids: No such file or directory
[automatic] | no module 'xformers'. Processing without...
[automatic] | no module 'xformers'. Processing without...
[automatic] | No module 'xformers'. Proceeding without it.
[automatic] | Calculating sha256 for /home/tux/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors: Running on local URL:  http://127.0.0.1:7860
[automatic] | 
[automatic] | To create a public link, set `share=True` in `launch()`.
[automatic] | Startup time: 7.2s (prepare environment: 2.5s, import torch: 2.1s, import gradio: 0.5s, setup paths: 0.8s, other imports: 0.5s, list SD models: 0.1s, load scripts: 0.2s, create ui: 0.3s).
[automatic] | 6ce0161689b3853acaa03779ec93eafe75a02f4ced659bee03f50797806fa2fa
[automatic] | Loading weights [6ce0161689] from /home/tux/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors
[automatic] | Creating model from config: /home/tux/stable-diffusion-webui/configs/v1-inference.yaml
[automatic] | /home/tux/stable-diffusion-webui/venv/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[automatic] |   warnings.warn(
[automatic] | loading stable diffusion model: RuntimeError
[automatic] | Traceback (most recent call last):
[automatic] |   File "/home/tux/.pyenv/versions/3.10.14/lib/python3.10/threading.py", line 973, in _bootstrap
[automatic] |     self._bootstrap_inner()
[automatic] |   File "/home/tux/.pyenv/versions/3.10.14/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
[automatic] |     self.run()
[automatic] |   File "/home/tux/.pyenv/versions/3.10.14/lib/python3.10/threading.py", line 953, in run
[automatic] |     self._target(*self._args, **self._kwargs)
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/initialize.py", line 149, in load_model
[automatic] |     shared.sd_model  # noqa: B018
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/shared_items.py", line 175, in sd_model
[automatic] |     return modules.sd_models.model_data.get_sd_model()
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/sd_models.py", line 693, in get_sd_model
[automatic] |     load_model()
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/sd_models.py", line 845, in load_model
[automatic] |     load_model_weights(sd_model, checkpoint_info, state_dict, timer)
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/sd_models.py", line 440, in load_model_weights
[automatic] |     model.load_state_dict(state_dict, strict=False)
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/sd_disable_initialization.py", line 223, in <lambda>
[automatic] |     module_load_state_dict = self.replace(torch.nn.Module, 'load_state_dict', lambda *args, **kwargs: load_state_dict(module_load_state_dict, *args, **kwargs))
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/sd_disable_initialization.py", line 221, in load_state_dict
[automatic] |     original(module, state_dict, strict=strict)
[automatic] |   File "/home/tux/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2175, in load_state_dict
[automatic] |     load(self, state_dict)
[automatic] |   File "/home/tux/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2163, in load
[automatic] |     load(child, child_state_dict, child_prefix)  # noqa: F821
[automatic] |   File "/home/tux/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2163, in load
[automatic] |     load(child, child_state_dict, child_prefix)  # noqa: F821
[automatic] |   File "/home/tux/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2163, in load
[automatic] |     load(child, child_state_dict, child_prefix)  # noqa: F821
[automatic] |   [Previous line repeated 1 more time]
[automatic] |   File "/home/tux/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2157, in load
[automatic] |     module._load_from_state_dict(
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/sd_disable_initialization.py", line 225, in <lambda>
[automatic] |     linear_load_from_state_dict = self.replace(torch.nn.Linear, '_load_from_state_dict', lambda *args, **kwargs: load_from_state_dict(linear_load_from_state_dict, *args, **kwargs))
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/sd_disable_initialization.py", line 191, in load_from_state_dict
[automatic] |     module._parameters[name] = torch.nn.parameter.Parameter(torch.zeros_like(param, device=device, dtype=dtype), requires_grad=param.requires_grad)
[automatic] |   File "/home/tux/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/_meta_registrations.py", line 4964, in zeros_like
[automatic] |     res.fill_(0)
[automatic] | RuntimeError: HIP error: shared object initialization failed
[automatic] | HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[automatic] | For debugging consider passing AMD_SERIALIZE_KERNEL=3.
[automatic] | Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
[automatic] | 
[automatic] | 
[automatic] | 
[automatic] | Stable diffusion model failed to load
[automatic] | Applying attention optimization: Doggettx... done.
[automatic] | ./webui.sh: line 304:   191 Segmentation fault      (core dumped) "${python_cmd}" -u "${LAUNCH_SCRIPT}" "$@"
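As the error text in the log suggests, HIP reports kernel failures asynchronously, so the Python stack trace above may point at the wrong call. A minimal sketch of how to get a more accurate trace before relaunching (the relaunch command is illustrative; adjust to however you start the web UI):

```shell
# Make HIP kernel launches synchronous so the failing kernel is reported at
# the API call that actually issued it (as suggested by the error message).
export AMD_SERIALIZE_KERNEL=3
# Then relaunch the web UI as usual, e.g.:
#   ./webui.sh
echo "AMD_SERIALIZE_KERNEL=$AMD_SERIALIZE_KERNEL"
```

This only affects debugging output; it does not fix the underlying "shared object initialization failed" error, but it makes the reported stack trace trustworthy.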

@xangelix

xangelix commented Sep 5, 2024

Okay, restarting fixed the issue for me. If you haven't tried that, definitely do, even if host system libraries haven't changed. It appears that sometimes a previous GPU compute operation doesn't end or close properly, and it generates seemingly random errors until the GPU is fully reset.

@schung-amd

schung-amd commented Oct 1, 2024

I can't reproduce this on a fresh install of Arch with ROCm 6.0.2 (from pacman) on a 7900XTX, but I did have to modify the installation process slightly as the instructions in https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs#user-content-install-on-amd-and-arch-linux weren't working out of the box. The changes I made:

  • Specifically installed Python 3.10 and used it to create the venv
  • Manually downgraded the torch packages, as the requirements pulled in torch packages that weren't built for ROCm:
        pip3.10 uninstall torch torchvision
        pip3.10 install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm6.0

I am also using the command-line args --no-half and --no-half-vae as these are required to avoid crashes on many AMD GPUs; these may or may not be necessary on your system.
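After swapping the torch packages, it's worth confirming that the build inside the venv actually targets ROCm before relaunching the web UI. A small sketch using standard torch attributes (`torch.version.hip` is `None` on non-ROCm builds; the guard lets it run even where torch is absent):

```python
# Sanity-check the torch install inside the webui venv.
import importlib.util

if importlib.util.find_spec("torch") is None:
    print("torch is not installed in this environment")
else:
    import torch
    print("torch version:", torch.__version__)   # should contain "+rocm"
    print("HIP runtime:", torch.version.hip)     # None means not a ROCm build
    print("GPU visible:", torch.cuda.is_available())
```

If the version string lacks a "+rocm" suffix, pip pulled a CPU or CUDA wheel and the downgrade commands above need to be re-run with the ROCm index URL.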

@AphidGit @Kamishirasawa-keine Are you still experiencing this issue? If so, can you try switching your torch packages for these?
@xangelix Was your issue a one-time thing, or do you have to restart periodically to fix this error?
