
[Issue]: Stable Diffusion, PyTorch conv2d breaks in ROCm 6.0 #3418

Open
AphidGit opened this issue Feb 17, 2024 · 7 comments

@AphidGit

Problem Description

Running stable-diffusion-webui worked when ROCm was at version 5.7. The 6.0 update, installed Feb 15, breaks it. While I had the occasional hiccup, lockup, or reboot with 5.7, it was fairly stable and could produce images. With 6.0, the webui consistently crashes when trying to load any non-trivial data onto the GPU.

It reports the following stack traces. In between, I can see a RuntimeError, probably from a different thread. When loading multiple models (such as when using Low-Rank Adaptations), I get a RuntimeError for each one.

Traceback (most recent call last):
  File "/usr/lib/python3.11/threading.py", line 1002, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/initialize.py", line 147, in load_model
    shared.sd_model  # noqa: B018
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/shared_items.py", line 128, in sd_model
    return modules.sd_models.model_data.get_sd_model()
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_models.py", line 531, in get_sd_model
    load_model()
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_models.py", line 658, in load_model
    load_model_weights(sd_model, checkpoint_info, state_dict, timer)
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_models.py", line 375, in load_model_weights
    model.load_state_dict(state_dict, strict=False)
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_disable_initialization.py", line 223, in <lambda>
    module_load_state_dict = self.replace(torch.nn.Module, 'load_state_dict', lambda *args, **kwargs: load_state_dict(module_load_state_dict, *args, **kwargs))
                                                                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_disable_initialization.py", line 221, in load_state_dict
    original(module, state_dict, strict=strict)
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2139, in load_state_dict
    load(self, state_dict)
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2127, in load
    load(child, child_state_dict, child_prefix)
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2127, in load
    load(child, child_state_dict, child_prefix)
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2127, in load
    load(child, child_state_dict, child_prefix)
  [Previous line repeated 1 more time]
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2121, in load
    module._load_from_state_dict(
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_disable_initialization.py", line 225, in <lambda>
    linear_load_from_state_dict = self.replace(torch.nn.Linear, '_load_from_state_dict', lambda *args, **kwargs: load_from_state_dict(linear_load_from_state_dict, *args, **kwargs))
                                                                                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_disable_initialization.py", line 191, in load_from_state_dict
    module._parameters[name] = torch.nn.parameter.Parameter(torch.zeros_like(param, device=device, dtype=dtype), requires_grad=param.requires_grad)
                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/torch/_meta_registrations.py", line 4815, in zeros_like
    res.fill_(0)
RuntimeError: HIP error: shared object initialization failed
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.

Exception in thread Thread-2 (load_model):
Traceback (most recent call last):
  File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/initialize.py", line 153, in load_model
    devices.first_time_calculation()
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/devices.py", line 166, in first_time_calculation
    conv2d(x)
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hdd/AI/sd-venv/stable-diffusion-webui/extensions-builtin/Lora/networks.py", line 501, in network_Conv2d_forward
    return originals.Conv2d_forward(self, input)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 462, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 458, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: HIP error: shared object initialization failed
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
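The second traceback fails inside the webui's conv2d warm-up. The failing path can be isolated outside the webui with a minimal sketch like the following (the shapes are illustrative, not the ones the webui uses; the CPU fallback is only so the snippet runs anywhere):

```python
import torch

# Minimal isolation test for the conv2d path that fails above. On the
# affected ROCm 6.0 setup, the conv call on the GPU raises
# "RuntimeError: HIP error: shared object initialization failed";
# on a healthy stack it prints the output shape. "cuda" maps to HIP
# on ROCm builds of PyTorch.
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(1, 4, 32, 32, device=device)
conv = torch.nn.Conv2d(4, 8, kernel_size=3, padding=1).to(device)
y = conv(x)  # this is where the HIP error surfaces on ROCm 6.0
print(tuple(y.shape))  # (1, 8, 32, 32)
```

Running this in the same venv as the webui tells you whether the breakage is in the torch/ROCm stack itself rather than in the webui.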

Digging further, I found that raising the AMD_LOG_LEVEL environment variable (anything above zero was enough, so try env AMD_LOG_LEVEL=1) gave me another clue:

:1:hip_code_object.cpp      :616 : 66280489141 us: [pid:379524 tid:0x769a212006c0] Cannot find the function: Cijk_Ailk_Bljk_HHS_BH_MT16x64x16_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS0_FSSC10_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR1_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT2_8_TLDS0_UMLDSA0_UMLDSB0_USFGRO0_VAW2_VS1_VW1_WSGRA0_WSGRB0_WS64_WG8_8_1_WGM8 
:1:hip_module.cpp           :83  : 66280489163 us: [pid:379524 tid:0x769a212006c0] Cannot find the function: Cijk_Ailk_Bljk_HHS_BH_MT16x64x16_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS0_FSSC10_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR1_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT2_8_TLDS0_UMLDSA0_UMLDSB0_USFGRO0_VAW2_VS1_VW1_WSGRA0_WSGRB0_WS64_WG8_8_1_WGM8 for module: 0x1c4f5db0

I edited the webui code to add a little 'press any key' prompt, attached gdb, and made it break at that line. Here's a full backtrace. The following components are involved:

  • rocm
  • blas
  • hip
  • pytorch
  • torchvision
  • stable-diffusion-webui
(gdb) bt
#0  hip::DynCO::getDynFunc (func_name=..., hfunc=<optimized out>, this=0x76732cb4b7d0) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/hipamd/src/hip_code_object.cpp:616
#1  PlatformState::getDynFunc (
    func_name=0x76732c011850 "Cijk_Ailk_Bljk_HHS_BH_MT16x64x16_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS0_FSSC10_FL0_GRPM1_GRVW1_GSU1_GSUASB_"..., hmod=0x76732c0b6ed0, hfunc=<optimized out>, this=0x6550370672d0) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/hipamd/src/hip_platform.cpp:747
#2  hipModuleGetFunction (hfunc=<optimized out>, hmod=<optimized out>, 
    name=0x76732c011850 "Cijk_Ailk_Bljk_HHS_BH_MT16x64x16_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS0_FSSC10_FL0_GRPM1_GRVW1_GSU1_GSUASB_"...) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/hipamd/src/hip_module.cpp:82
#3  0x00007675c994165c in ?? () from /opt/rocm/lib/librocblas.so.4
#4  0x00007675c994250c in ?? () from /opt/rocm/lib/librocblas.so.4
#5  0x00007675c994280c in ?? () from /opt/rocm/lib/librocblas.so.4
#6  0x00007675c8f88a6f in ?? () from /opt/rocm/lib/librocblas.so.4
#7  0x00007675c9061ec9 in ?? () from /opt/rocm/lib/librocblas.so.4
#8  0x00007675c905fa3c in ?? () from /opt/rocm/lib/librocblas.so.4
#9  0x00007675c905ac1e in ?? () from /opt/rocm/lib/librocblas.so.4
#10 0x00007675c90586d9 in rocblas_gemm_ex () from /opt/rocm/lib/librocblas.so.4
#11 0x0000767646fbda9a in ?? () from /usr/lib/libtorch_hip.so
#12 0x0000767646fdb254 in ?? () from /usr/lib/libtorch_hip.so
#13 0x0000767647122601 in ?? () from /usr/lib/libtorch_hip.so
#14 0x00007676471226a4 in ?? () from /usr/lib/libtorch_hip.so
#15 0x00007676a8c791cc in at::_ops::addmm::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) () from /usr/lib/libtorch_cpu.so
#16 0x00007676aad66f13 in ?? () from /usr/lib/libtorch_cpu.so
#17 0x00007676aad67e46 in ?? () from /usr/lib/libtorch_cpu.so
#18 0x00007676a8cee25b in at::_ops::addmm::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) () from /usr/lib/libtorch_cpu.so
#19 0x00007676a851bd80 in at::native::linear(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&) () from /usr/lib/libtorch_cpu.so
#20 0x00007676a975fb5b in ?? () from /usr/lib/libtorch_cpu.so
#21 0x00007676a8cd56c7 in at::_ops::linear::call(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&) () from /usr/lib/libtorch_cpu.so
#22 0x00007676b3222be8 in ?? () from /usr/lib/python3.11/site-packages/torch/lib/libtorch_python.so
#23 0x00007676be1fdd41 in cfunction_call (func=0x7674af8af470, args=<optimized out>, kwargs=<optimized out>) at Objects/methodobject.c:542
#24 0x00007676be1dc054 in _PyObject_MakeTpCall (tstate=0x65504125ade0, callable=0x7674af8af470, args=<optimized out>, nargs=3, keywords=0x0) at Objects/call.c:214
#25 0x00007676be1e76e1 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:4769
#26 0x00007676be22de9f in _PyEval_EvalFrame (throwflag=0, frame=0x7676053e9428, tstate=0x65504125ade0) at ./Include/internal/pycore_ceval.h:73
#27 _PyEval_Vector (kwnames=0x0, argcount=<optimized out>, args=0x7673311ff1b0, locals=0x0, func=0x76733198c5e0, tstate=0x65504125ade0) at Python/ceval.c:6434
#28 _PyFunction_Vectorcall (kwnames=0x0, nargsf=<optimized out>, stack=0x7673311ff1b0, func=0x76733198c5e0) at Objects/call.c:393
#29 _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7673311ff1b0, callable=0x76733198c5e0, tstate=0x65504125ade0) at ./Include/internal/pycore_call.h:92
#30 method_vectorcall (method=<optimized out>, args=<optimized out>, nargsf=<optimized out>, kwnames=0x0) at Objects/classobject.c:89
#31 0x00007676be1eb6a3 in do_call_core (use_tracing=<optimized out>, kwdict=0x7673304003c0, callargs=0x7673301e9840, func=0x76734ac3d600, tstate=<optimized out>) at Python/ceval.c:7352
#32 _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:5376
#33 0x00007676be22de9f in _PyEval_EvalFrame (throwflag=0, frame=0x7676053e9308, tstate=0x65504125ade0) at ./Include/internal/pycore_ceval.h:73
#34 _PyEval_Vector (kwnames=0x0, argcount=<optimized out>, args=0x7673311ff3e0, locals=0x0, func=0x7674ae114680, tstate=0x65504125ade0) at Python/ceval.c:6434
#35 _PyFunction_Vectorcall (kwnames=0x0, nargsf=<optimized out>, stack=0x7673311ff3e0, func=0x7674ae114680) at Objects/call.c:393
#36 _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7673311ff3e0, callable=0x7674ae114680, tstate=0x65504125ade0) at ./Include/internal/pycore_call.h:92
#37 method_vectorcall (method=<optimized out>, args=<optimized out>, nargsf=<optimized out>, kwnames=0x0) at Objects/classobject.c:89
#38 0x00007676be1eb6a3 in do_call_core (use_tracing=<optimized out>, kwdict=0x76732b538e00, callargs=0x76734ac34d90, func=0x76734ac0a080, tstate=<optimized out>) at Python/ceval.c:7352
#39 _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:5376
#40 0x00007676be20e8c0 in _PyEval_EvalFrame (throwflag=0, frame=0x7676053e9280, tstate=0x65504125ade0) at ./Include/internal/pycore_ceval.h:73
#41 _PyEval_Vector (kwnames=<optimized out>, argcount=2, args=0x7673311ff6a0, locals=0x0, func=0x7674ae1145e0, tstate=0x65504125ade0) at Python/ceval.c:6434
#42 _PyFunction_Vectorcall (func=0x7674ae1145e0, stack=0x7673311ff6a0, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:393
#43 0x00007676be1e0d97 in _PyObject_FastCallDictTstate (tstate=0x65504125ade0, callable=0x7674ae1145e0, args=<optimized out>, nargsf=<optimized out>, kwargs=<optimized out>) at Objects/call.c:141
#44 0x00007676be216b3d in _PyObject_Call_Prepend (tstate=0x65504125ade0, callable=0x7674ae1145e0, obj=0x76734ac3d6d0, args=<optimized out>, kwargs=0x0) at Objects/call.c:482
#45 0x00007676be2dba82 in slot_tp_call (self=0x76734ac3d6d0, args=0x76734ac34d30, kwds=0x0) at Objects/typeobject.c:7623
#46 0x00007676be1dc054 in _PyObject_MakeTpCall (tstate=0x65504125ade0, callable=0x76734ac3d6d0, args=<optimized out>, nargs=1, keywords=0x0) at Objects/call.c:214
#47 0x00007676be1e76e1 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:4769
#48 0x00007676be20e8c0 in _PyEval_EvalFrame (throwflag=0, frame=0x7676053e91f8, tstate=0x65504125ade0) at ./Include/internal/pycore_ceval.h:73
#49 _PyEval_Vector (kwnames=<optimized out>, argcount=0, args=0x7676be574010 <_PyRuntime+58928>, locals=0x0, func=0x767349893880, tstate=0x65504125ade0) at Python/ceval.c:6434
#50 _PyFunction_Vectorcall (func=0x767349893880, stack=0x7676be574010 <_PyRuntime+58928>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:393
#51 0x00007676be2e2297 in bounded_lru_cache_wrapper (self=0x767349937ed0, args=0x7676be573ff8 <_PyRuntime+58904>, kwds=0x0) at ./Modules/_functoolsmodule.c:1021
#52 0x00007676be1dc054 in _PyObject_MakeTpCall (tstate=0x65504125ade0, callable=0x767349937ed0, args=<optimized out>, nargs=0, keywords=0x0) at Objects/call.c:214
#53 0x00007676be1e76e1 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:4769
#54 0x00007676be20e8c0 in _PyEval_EvalFrame (throwflag=0, frame=0x7676053e9188, tstate=0x65504125ade0) at ./Include/internal/pycore_ceval.h:73
#55 _PyEval_Vector (kwnames=<optimized out>, argcount=0, args=0x7676be574010 <_PyRuntime+58928>, locals=0x0, func=0x76733132b9c0, tstate=0x65504125ade0) at Python/ceval.c:6434
#56 _PyFunction_Vectorcall (func=0x76733132b9c0, stack=0x7676be574010 <_PyRuntime+58928>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:393
#57 0x00007676be1eb6a3 in do_call_core (use_tracing=<optimized out>, kwdict=0x767331384e40, callargs=0x7676be573ff8 <_PyRuntime+58904>, func=0x76733132b9c0, tstate=<optimized out>) at Python/ceval.c:7352
#58 _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:5376
#59 0x00007676be22e583 in _PyEval_EvalFrame (throwflag=0, frame=0x7676053e9020, tstate=0x65504125ade0) at ./Include/internal/pycore_ceval.h:73
#60 _PyEval_Vector (kwnames=<optimized out>, argcount=<optimized out>, args=0x7673311ffe28, locals=0x0, func=0x7676bd476ca0, tstate=0x65504125ade0) at Python/ceval.c:6434
#61 _PyFunction_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, stack=0x7673311ffe28, func=0x7676bd476ca0) at Objects/call.c:393
#62 _PyObject_VectorcallTstate (tstate=0x65504125ade0, callable=0x7676bd476ca0, args=0x7673311ffe28, nargsf=<optimized out>, kwnames=<optimized out>) at ./Include/internal/pycore_call.h:92
#63 0x00007676be22e070 in method_vectorcall (method=<optimized out>, args=0x7676be574010 <_PyRuntime+58928>, nargsf=<optimized out>, kwnames=0x0) at Objects/classobject.c:67
#64 0x00007676be2f4df8 in thread_run (boot_raw=0x655044198250) at ./Modules/_threadmodule.c:1124
#65 0x00007676be2cc538 in pythread_wrapper (arg=<optimized out>) at Python/thread_pthread.h:241
#66 0x00007676bdea955a in start_thread (arg=<optimized out>) at pthread_create.c:447
#67 0x00007676bdf26a3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Operating System

Arch Linux, kernel 6.7.4-arch1-1

CPU

AMD Threadripper 1950X

GPU

AMD Radeon RX 7900 XTX

ROCm Version

ROCm 6.0.0

ROCm Component

No response

Steps to Reproduce

To reproduce:

1. Create a venv and enter it.
2. Install stable-diffusion-webui, following https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs#user-content-install-on-amd-and-arch-linux
3. Download any SD model and place it in the models folder.
4. Run either ./webui.sh or python launch.py

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen Threadripper 1950X 16-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen Threadripper 1950X 16-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3400                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            32                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    131743808(0x7da4040) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    131743808(0x7da4040) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    131743808(0x7da4040) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1100                            
  Uuid:                    GPU-<redacted>               
  Marketing Name:          AMD Radeon RX 7900 XTX             
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      32(0x20) KB                        
    L2:                      6144(0x1800) KB                    
    L3:                      98304(0x18000) KB                  
  Chip ID:                 29772(0x744c)                      
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2482                               
  BDFID:                   17152                              
  Internal Node ID:        1                                  
  Compute Unit:            96                                 
  SIMDs per CU:            2                                  
  Shader Engines:          6                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 528                                
  SDMA engine uCode::      19                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    25149440(0x17fc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    25149440(0x17fc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1100         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

Additional Information

No response

@Kamishirasawa-keine

Same issue.

@alexxu-amd

Yes, I am able to reproduce this error using the installation steps from https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs#user-content-install-on-amd-and-arch-linux.

Can you try reinstalling torch and torchvision using the following?

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.0

@ppanchad-amd

@AphidGit @Kamishirasawa-keine Please try @alexxu-amd's suggestion above. Thanks!

@AphidGit
Author

AphidGit commented Jul 5, 2024

That suggestion breaks things even more.

RuntimeError: Torch is not able to use GPU; add --skip-torch-cuda-test

@xangelix

xangelix commented Sep 5, 2024

Unfortunately I'm still getting the same error with ROCm 5.7 / ROCm 6.0 and Python 3.10 when attempting to use stable-diffusion-webui. Did anyone ever find a solution to this?

7900 XTX

glibc version is 2.40
Check TCMalloc: libtcmalloc_minimal.so.4
libtcmalloc_minimal.so.4 is linked with libc.so,execute LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4
Python 3.10.14 (main, Sep  5 2024, 22:06:38) [GCC 14.2.1 20240904]
Version: v1.10.1
Commit hash: 82a973c04367123ae98bd9abdf80d9eda9b910e2
Installing torch and torchvision
Looking in indexes: https://download.pytorch.org/whl/rocm6.0
Collecting torch
  Downloading https://download.pytorch.org/whl/rocm6.0/torch-2.4.1%2Brocm6.0-cp310-cp310-linux_x86_64.whl (2363.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.4/2.4 GB 53.4 MB/s eta 0:00:00
Collecting torchvision
  Downloading https://download.pytorch.org/whl/rocm6.0/torchvision-0.19.1%2Brocm6.0-cp310-cp310-linux_x86_64.whl (65.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65.8/65.8 MB 21.0 MB/s eta 0:00:00
Collecting torchaudio
  Downloading https://download.pytorch.org/whl/rocm6.0/torchaudio-2.4.1%2Brocm6.0-cp310-cp310-linux_x86_64.whl (1.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 13.2 MB/s eta 0:00:00
Collecting filelock (from torch)
  Downloading https://download.pytorch.org/whl/filelock-3.13.1-py3-none-any.whl (11 kB)
Collecting typing-extensions>=4.8.0 (from torch)
  Downloading https://download.pytorch.org/whl/typing_extensions-4.9.0-py3-none-any.whl (32 kB)
Collecting sympy (from torch)
  Downloading https://download.pytorch.org/whl/sympy-1.12-py3-none-any.whl (5.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.7/5.7 MB 28.9 MB/s eta 0:00:00
Collecting networkx (from torch)
  Downloading https://download.pytorch.org/whl/networkx-3.2.1-py3-none-any.whl (1.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 30.4 MB/s eta 0:00:00
Collecting jinja2 (from torch)
  Downloading https://download.pytorch.org/whl/Jinja2-3.1.3-py3-none-any.whl (133 kB)
Collecting fsspec (from torch)
  Downloading https://download.pytorch.org/whl/fsspec-2024.2.0-py3-none-any.whl (170 kB)
Collecting pytorch-triton-rocm==3.0.0 (from torch)
  Downloading https://download.pytorch.org/whl/pytorch_triton_rocm-3.0.0-cp310-cp310-linux_x86_64.whl (341.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 341.7/341.7 MB 58.9 MB/s eta 0:00:00
Collecting numpy (from torchvision)
  Downloading https://download.pytorch.org/whl/numpy-1.26.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.2/18.2 MB 61.1 MB/s eta 0:00:00
Collecting pillow!=8.3.*,>=5.3.0 (from torchvision)
  Downloading https://download.pytorch.org/whl/pillow-10.2.0-cp310-cp310-manylinux_2_28_x86_64.whl (4.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.5/4.5 MB 56.8 MB/s eta 0:00:00
Collecting MarkupSafe>=2.0 (from jinja2->torch)
  Downloading https://download.pytorch.org/whl/MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
Collecting mpmath>=0.19 (from sympy->torch)
  Downloading https://download.pytorch.org/whl/mpmath-1.3.0-py3-none-any.whl (536 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 248.3 MB/s eta 0:00:00
Installing collected packages: mpmath, typing-extensions, sympy, pillow, numpy, networkx, MarkupSafe, fsspec, filelock, pytorch-triton-rocm, jinja2, torch, torchvision, torchaudio
Successfully installed MarkupSafe-2.1.5 filelock-3.13.1 fsspec-2024.2.0 jinja2-3.1.3 mpmath-1.3.0 networkx-3.2.1 numpy-1.26.3 pillow-10.2.0 pytorch-triton-rocm-3.0.0 sympy-1.12 torch-2.4.1+rocm6.0 torchaudio-2.4.1+rocm6.0 torchvision-0.19.1+rocm6.0 typing-extensions-4.9.0
Installing clip
Installing open_clip
Cloning assets into /home/tux/stable-diffusion-webui/repositories/stable-diffusion-webui-assets...
Cloning into '/home/tux/stable-diffusion-webui/repositories/stable-diffusion-webui-assets'...
Cloning Stable Diffusion into /home/tux/stable-diffusion-webui/repositories/stable-diffusion-stability-ai...
Cloning into '/home/tux/stable-diffusion-webui/repositories/stable-diffusion-stability-ai'...
Cloning Stable Diffusion XL into /home/tux/stable-diffusion-webui/repositories/generative-models...
Cloning into '/home/tux/stable-diffusion-webui/repositories/generative-models'...
Cloning K-diffusion into /home/tux/stable-diffusion-webui/repositories/k-diffusion...
Cloning into '/home/tux/stable-diffusion-webui/repositories/k-diffusion'...
Cloning BLIP into /home/tux/stable-diffusion-webui/repositories/BLIP...
Cloning into '/home/tux/stable-diffusion-webui/repositories/BLIP'...
Installing requirements

---

[automatic] | glibc version is 2.40
[automatic] | Check TCMalloc: libtcmalloc_minimal.so.4
[automatic] | libtcmalloc_minimal.so.4 is linked with libc.so,execute LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4
[automatic] | Python 3.10.14 (main, Sep  5 2024, 10:36:08) [GCC 14.2.1 20240805]
[automatic] | Version: v1.10.1
[automatic] | Commit hash: 82a973c04367123ae98bd9abdf80d9eda9b910e2
[automatic] | Launching Web UI with arguments: 
[automatic] | amdgpu.ids: No such file or directory
[automatic] | amdgpu.ids: No such file or directory
[automatic] | no module 'xformers'. Processing without...
[automatic] | no module 'xformers'. Processing without...
[automatic] | No module 'xformers'. Proceeding without it.
[automatic] | Calculating sha256 for /home/tux/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors: Running on local URL:  http://127.0.0.1:7860
[automatic] | 
[automatic] | To create a public link, set `share=True` in `launch()`.
[automatic] | Startup time: 7.2s (prepare environment: 2.5s, import torch: 2.1s, import gradio: 0.5s, setup paths: 0.8s, other imports: 0.5s, list SD models: 0.1s, load scripts: 0.2s, create ui: 0.3s).
[automatic] | 6ce0161689b3853acaa03779ec93eafe75a02f4ced659bee03f50797806fa2fa
[automatic] | Loading weights [6ce0161689] from /home/tux/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors
[automatic] | Creating model from config: /home/tux/stable-diffusion-webui/configs/v1-inference.yaml
[automatic] | /home/tux/stable-diffusion-webui/venv/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[automatic] |   warnings.warn(
[automatic] | loading stable diffusion model: RuntimeError
[automatic] | Traceback (most recent call last):
[automatic] |   File "/home/tux/.pyenv/versions/3.10.14/lib/python3.10/threading.py", line 973, in _bootstrap
[automatic] |     self._bootstrap_inner()
[automatic] |   File "/home/tux/.pyenv/versions/3.10.14/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
[automatic] |     self.run()
[automatic] |   File "/home/tux/.pyenv/versions/3.10.14/lib/python3.10/threading.py", line 953, in run
[automatic] |     self._target(*self._args, **self._kwargs)
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/initialize.py", line 149, in load_model
[automatic] |     shared.sd_model  # noqa: B018
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/shared_items.py", line 175, in sd_model
[automatic] |     return modules.sd_models.model_data.get_sd_model()
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/sd_models.py", line 693, in get_sd_model
[automatic] |     load_model()
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/sd_models.py", line 845, in load_model
[automatic] |     load_model_weights(sd_model, checkpoint_info, state_dict, timer)
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/sd_models.py", line 440, in load_model_weights
[automatic] |     model.load_state_dict(state_dict, strict=False)
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/sd_disable_initialization.py", line 223, in <lambda>
[automatic] |     module_load_state_dict = self.replace(torch.nn.Module, 'load_state_dict', lambda *args, **kwargs: load_state_dict(module_load_state_dict, *args, **kwargs))
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/sd_disable_initialization.py", line 221, in load_state_dict
[automatic] |     original(module, state_dict, strict=strict)
[automatic] |   File "/home/tux/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2175, in load_state_dict
[automatic] |     load(self, state_dict)
[automatic] |   File "/home/tux/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2163, in load
[automatic] |     load(child, child_state_dict, child_prefix)  # noqa: F821
[automatic] |   File "/home/tux/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2163, in load
[automatic] |     load(child, child_state_dict, child_prefix)  # noqa: F821
[automatic] |   File "/home/tux/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2163, in load
[automatic] |     load(child, child_state_dict, child_prefix)  # noqa: F821
[automatic] |   [Previous line repeated 1 more time]
[automatic] |   File "/home/tux/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2157, in load
[automatic] |     module._load_from_state_dict(
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/sd_disable_initialization.py", line 225, in <lambda>
[automatic] |     linear_load_from_state_dict = self.replace(torch.nn.Linear, '_load_from_state_dict', lambda *args, **kwargs: load_from_state_dict(linear_load_from_state_dict, *args, **kwargs))
[automatic] |   File "/home/tux/stable-diffusion-webui/modules/sd_disable_initialization.py", line 191, in load_from_state_dict
[automatic] |     module._parameters[name] = torch.nn.parameter.Parameter(torch.zeros_like(param, device=device, dtype=dtype), requires_grad=param.requires_grad)
[automatic] |   File "/home/tux/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/_meta_registrations.py", line 4964, in zeros_like
[automatic] |     res.fill_(0)
[automatic] | RuntimeError: HIP error: shared object initialization failed
[automatic] | HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[automatic] | For debugging consider passing AMD_SERIALIZE_KERNEL=3.
[automatic] | Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
[automatic] | 
[automatic] | 
[automatic] | 
[automatic] | Stable diffusion model failed to load
[automatic] | Applying attention optimization: Doggettx... done.
[automatic] | ./webui.sh: line 304:   191 Segmentation fault      (core dumped) "${python_cmd}" -u "${LAUNCH_SCRIPT}" "$@"
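As the error text in the log suggests, HIP reports kernel failures asynchronously, so the Python stack trace above may point at the wrong call. A minimal sketch of how to get a more accurate trace before relaunching (the relaunch command is illustrative; adjust to however you start the web UI):

```shell
# Make HIP kernel launches synchronous so the failing kernel is reported at
# the API call that actually issued it (as suggested by the error message).
export AMD_SERIALIZE_KERNEL=3
# Then relaunch the web UI as usual, e.g.:
#   ./webui.sh
echo "AMD_SERIALIZE_KERNEL=$AMD_SERIALIZE_KERNEL"
```

This only affects debugging output; it does not fix the underlying "shared object initialization failed" error, but it makes the reported stack trace trustworthy.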

@xangelix

xangelix commented Sep 5, 2024

Okay, restarting fixed the issue for me. If you haven't tried that, definitely do, even if host system libraries haven't changed. It appears that sometimes a previous GPU compute operation doesn't end or close properly, and it generates seemingly random errors until the GPU is fully reset.

@schung-amd

schung-amd commented Oct 1, 2024

I can't reproduce this on a fresh install of Arch with ROCm 6.0.2 (from pacman) on a 7900XTX, but I did have to modify the installation process slightly as the instructions in https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs#user-content-install-on-amd-and-arch-linux weren't working out of the box. The changes I made:

  • Specifically installed Python 3.10 and used it to create the venv
  • Manually downgraded the torch packages, as the requirements pulled in torch packages that weren't built for ROCm:
        pip3.10 uninstall torch torchvision
        pip3.10 install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm6.0

I am also using the command-line args --no-half and --no-half-vae as these are required to avoid crashes on many AMD GPUs; these may or may not be necessary on your system.
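After swapping the torch packages, it's worth confirming that the build inside the venv actually targets ROCm before relaunching the web UI. A small sketch using standard torch attributes (`torch.version.hip` is `None` on non-ROCm builds; the guard lets it run even where torch is absent):

```python
# Sanity-check the torch install inside the webui venv.
import importlib.util

if importlib.util.find_spec("torch") is None:
    print("torch is not installed in this environment")
else:
    import torch
    print("torch version:", torch.__version__)   # should contain "+rocm"
    print("HIP runtime:", torch.version.hip)     # None means not a ROCm build
    print("GPU visible:", torch.cuda.is_available())
```

If the version string lacks a "+rocm" suffix, pip pulled a CPU or CUDA wheel and the downgrade commands above need to be re-run with the ROCm index URL.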

@AphidGit @Kamishirasawa-keine Are you still experiencing this issue? If so, can you try switching your torch packages for these?
@xangelix Was your issue a one-time thing, or do you have to restart periodically to fix this error?
