Remaining issues with RDNA3 and 0.5.2 (kernel 6.7) #255

Open
43615 opened this issue Jan 24, 2024 · 41 comments

Comments

@43615

43615 commented Jan 24, 2024

Note that I do have the kernel parameter amdgpu.ppfeaturemask=0xffffffff.
image

  • Fan speed always shows 0. Seems to be a global kernel/driver issue (sensors also shows 0).
  • Static fan control doesn't work. Can't tell if there's a cutoff due to the above. Curve works fine.
  • Power limit can't be raised above 350 W. Slider goes to 389, which is also wrong (this card has a default of 420).
    image
@ilya-zlobintsev
Owner

  • Fan speed reading: not much can be done on LACT's side if it's a lower level reporting issue
  • Static fan speed: the static setting works by setting a curve with all of the points at the same speed. Are you sure the behaviour is different when you're using a custom curve? What I've previously seen during testing is that the GPU might have a point below which it turns off the fan regardless of settings, but once it crosses that point it starts using the configured speed. Maybe that's what is happening?
  • Power limit: this is a known issue on the kernel side, it's being worked on.

@43615
Author

43615 commented Jan 24, 2024

Glad to hear about the power limit, hopefully that's coming soon.
As for the fan control: It ramps up correctly when using the curve, but static speed doesn't seem to do anything at any level (even 100%). Seems like it doesn't change the speed at all. I might test it some more tomorrow, but it's hard to get accurate results due to the broken speed reading.

@ilya-zlobintsev
Owner

If you can manage to replicate the proper "static speed" behaviour using a curve (by having all of its points on the same speed), then please tell me what that curve looks like. The current implementation uses a single minimum temperature point and fills the rest with the maximum; it might perform differently if the curve is configured in some other way.

You can check the actual curve that's applied in:

cat /sys/class/drm/card*/device/gpu_od/fan_ctrl/fan_curve
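
For comparing the applied curve between tools, it may help to reduce that output to plain `index temp speed` triples. The following is only a sketch: `parse_fan_curve` is a made-up helper name, and it assumes the output consists of an `OD_FAN_CURVE:` header followed by lines such as `0: 30C 55%`, ending at `OD_RANGE:`.

```shell
# Hypothetical helper: reduce fan_curve sysfs output to "index temp speed" lines.
# Assumes point lines of the form "N: <temp>C <speed>%" under "OD_FAN_CURVE:".
parse_fan_curve() {
    awk '
        /^OD_FAN_CURVE:/ { in_curve = 1; next }
        /^OD_RANGE:/     { in_curve = 0 }
        in_curve && /^[0-9]+:/ {
            idx = $1;   sub(/:$/, "", idx)     # strip the trailing colon
            temp = $2;  sub(/C$/, "", temp)    # strip the unit
            speed = $3; sub(/%$/, "", speed)   # strip the percent sign
            print idx, temp, speed
        }
    '
}

# Usage (reads the sysfs output from stdin; adjust the card number):
#   parse_fan_curve < /sys/class/drm/card0/device/gpu_od/fan_ctrl/fan_curve
```

Saving the result once after setting a curve manually and once after setting it through LACT makes the two easy to `diff`.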

@43615
Author

43615 commented Jan 28, 2024

Sorry, didn't get around to this until now.
Apparently I can't actually get it to a custom speed at all! The curve does get applied judging by your command, but there doesn't seem to be any effect.
It only ramps up when hot (testing with a short benchmark) and nothing I change affects that behavior. Again, I can only test it by ear due to the broken readout.

@Dominik-Zehnter-17

I have a 7800XT and have tested controlling the fan curves. If you remain below a usage limit on the GPU, fan control does nothing. Only once you have higher usage (e.g. when booting up a game) does the GPU apply your fan curves. This seems to be a hardware/driver issue that LACT has no control over. Run a game in the background, then play with your fan curves; that's what worked for me.

@Nama

Nama commented Jan 30, 2024

On my XFX SPEEDSTER MERC 310 AMD Radeon RX 7900 XT the fan speed is correctly read.

Setting a static fan speed or a curve doesn't produce an error, but doesn't work either. Same with tuxclocker.

This works:

echo "0 36 20" | sudo tee /sys/class/drm/card0/device/gpu_od/fan_ctrl/fan_curve
echo "1 40 30" | sudo tee /sys/class/drm/card0/device/gpu_od/fan_ctrl/fan_curve
echo "2 45 35" | sudo tee /sys/class/drm/card0/device/gpu_od/fan_ctrl/fan_curve
echo "3 50 40" | sudo tee /sys/class/drm/card0/device/gpu_od/fan_ctrl/fan_curve
echo "4 55 45" | sudo tee /sys/class/drm/card0/device/gpu_od/fan_ctrl/fan_curve
echo "c" | sudo tee /sys/class/drm/card*/device/gpu_od/fan_ctrl/fan_curve
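
A sketch of how those writes could be generated with a range check first. `fan_curve_lines` and `is_uint` are hypothetical helper names, and the limits are passed in by hand; they are assumed to come from the `OD_RANGE` section that the `fan_curve` file reports on your own card.

```shell
# Hypothetical helpers: turn "temp:speed" pairs into the lines to write to fan_curve.
# The first four arguments are the limits reported under OD_RANGE on your card.
is_uint() { case "$1" in ''|*[!0-9]*) return 1 ;; esac; }

fan_curve_lines() {
    local min_temp="$1" max_temp="$2" min_speed="$3" max_speed="$4"
    shift 4
    local idx=0 lines="" point temp speed
    for point in "$@"; do
        temp="${point%%:*}"
        speed="${point##*:}"
        if ! is_uint "$temp" || ! is_uint "$speed" || [ "$point" != "$temp:$speed" ]; then
            echo "invalid point: $point" >&2
            return 1
        fi
        if [ "$temp" -lt "$min_temp" ] || [ "$temp" -gt "$max_temp" ]; then
            echo "point $idx: temperature $temp outside $min_temp-$max_temp" >&2
            return 1
        fi
        if [ "$speed" -lt "$min_speed" ] || [ "$speed" -gt "$max_speed" ]; then
            echo "point $idx: speed $speed outside $min_speed-$max_speed" >&2
            return 1
        fi
        lines="${lines}${idx} ${temp} ${speed}
"
        idx=$((idx + 1))
    done
    # Nothing is printed unless every point passed; "c" commits the curve.
    printf '%sc\n' "$lines"
}

# Usage (each line is written separately, as in the commands above):
#   curve=/sys/class/drm/card0/device/gpu_od/fan_ctrl/fan_curve
#   fan_curve_lines 25 100 15 100 36:20 40:30 45:35 50:40 55:45 |
#       while read -r line; do echo "$line" | sudo tee "$curve" > /dev/null; done
```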

But it's not possible to change the mode to manual for static fan speed:

# echo 1 > /sys/class/drm/card0/device/hwmon/hwmon2/pwm1_enable
# cat /sys/class/drm/card0/device/hwmon/hwmon2/pwm1_enable
2
# echo 100 > /sys/class/drm/card0/device/hwmon/hwmon2/pwm1
# dmesg
amdgpu: manual fan speed control should be enabled first

debug.tgz
Can't upload .tar files here, maybe change it to .tgz

PS: I had to chmod o+rw /var/run/lactd.sock to make the GUI connect to the daemon.

@ilya-zlobintsev
Owner

This works:

Setting a curve through LACT should use exactly the same commands. Can you check how the contents of fan_curve differ between setting it manually through these commands and setting the same curve in LACT?

But it's not possible to change the mode to manual for static fan speed:

This is expected, the hwmon interface is readonly on RDNA3

I had to chmod o+rw /var/run/lactd.sock to make the GUI connect to the daemon.

You should add your user's group to the start of admin_groups under daemon in /etc/lact/config.yaml.

@Nama

Nama commented Jan 31, 2024

nvm, can't get it working again
echoing:

# cat /sys/class/drm/card0/device/gpu_od/fan_ctrl/fan_curve
OD_FAN_CURVE:
0: 30C 55%
1: 40C 65%
2: 45C 70%
3: 50C 75%
4: 55C 80%
OD_RANGE:
FAN_CURVE(hotspot temp): 25C 100C
FAN_CURVE(fan speed): 15% 100%

LACT:

# cat /sys/class/drm/card0/device/gpu_od/fan_ctrl/fan_curve
OD_FAN_CURVE:
0: 40C 100%
1: 50C 100%
2: 60C 100%
3: 70C 100%
4: 80C 100%
OD_RANGE:
FAN_CURVE(hotspot temp): 25C 100C
FAN_CURVE(fan speed): 15% 100%

I got it to spin up once a few days ago, but it was weird and didn't seem right...

This is expected, the hwmon interface is readonly on RDNA3

maaaaan, I thought everythings implemented on 6.7 >_<

@ilya-zlobintsev
Owner

maaaaan, I thought everythings implemented on 6.7 >_<

This isn't a missing feature; it's a change in how the GPU firmware works. You're supposed to use the new fan_curve and target temperature/speed interfaces instead.

@ilya-zlobintsev
Owner

There have been some updates regarding the power limit setting in kernel 6.7.3:

drm/amd/pm: update the power cap setting
drm/amd/pm: Fetch current power limit from FW

It's worth checking if that helps with the incorrect limit

@FerrumMaster

But there is a problem with OC in general. When you enable a static fan speed, the OC settings are no longer saved; they reset back to stock.

OC-wise it is still a mess. While kernel 6.8 now allows you to set the right power limit, it uses it in a weird fashion and breaks higher clocks, so you get slower performance.

@meminens

meminens commented Feb 23, 2024

I have a 7900 XTX on Arch Linux. Power limit works but fan control doesn't. The system still turns the fan on/off at the card's built-in thresholds instead of following the custom curve I set up using the LACT GUI. I have tried both the curve and static. No changes to the fan speed at all. Any recommendations?

Note that the OC is enabled, system rebooted, and I do have the kernel parameter amdgpu.ppfeaturemask=0xffffffff

Debug file: LACT-sysfs-snapshot-20240223-193349.zip

@ilya-zlobintsev
Owner

System still turns on/off the fan at built-in card thresholds

Unfortunately there isn't anything you can do about this currently. It will use your custom settings after it crosses the threshold, but you cannot configure this threshold.

@meminens

System still turns on/off the fan at built-in card thresholds

Unfortunately there isn't anything you can do about this currently. It will use your custom settings after it crosses the threshold, but you cannot configure this threshold.

Thanks for responding to my message. Does this mean LACT will never work for my card? Or is it something that can be fixed?

@ilya-zlobintsev
Owner

If the driver adds support for configuring this, then LACT will have an option for it.

@meminens

meminens commented Feb 24, 2024

Thank you again. I much appreciate you taking the time to reply. If you don't mind, one final question: what's the best way to find out if the driver will add support for configuring fan curves? Should I follow the Linux kernel updates?

I found the following which appears to be adding fan control support for RDNA3 cards. Not sure why mine still doesn't work though.

https://lore.kernel.org/lkml/CAPM=9txd+1FtqU-R_8Zr_UePUzu7QUWsDBV1syKBo16v_gx2XQ@mail.gmail.com/

Linux arch 6.7.6-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 23 Feb 2024 16:31:48 +0000 x86_64 GNU/Linux

@ilya-zlobintsev
Owner

Fan control itself is supported, the card will use your custom fan speed settings, but only after a builtin threshold when the fan gets turned on - that's the part you cannot currently configure.

As for updates: kernel changelog will have info about it if something changes, you can also track these issues in amd's repo:
https://gitlab.freedesktop.org/drm/amd/-/issues/2406
https://gitlab.freedesktop.org/drm/amd/-/issues/2402

@meminens

but only after a builtin threshold when the fan gets turned on

Oh I see it now. That makes sense. I was wondering why the fan speed goes up and down randomly. So it does recognize my custom curve but is still tied to the built-in thresholds. Thanks for the explanation. I will check out the links you included.

@In-line
Contributor

In-line commented Mar 3, 2024

XFX merc 310 7900XTX
ArchLinux 6.7.6-zen-1-2

Any change to the fan settings results in an Input/Output error and a failure to change settings again, but the following manual script works. I would still prefer the GUI to be fixed.

#!/usr/bin/bash

GPU_DEVICE="/sys/class/drm/card1/device"
GPU_SYSFS_FAN="$GPU_DEVICE/gpu_od/fan_ctrl"
GPU_SYSFS_HWMON="$GPU_DEVICE/hwmon/hwmon0"

POWER_LIMIT=402 # watts
GPU_FAN_CURVE="$GPU_SYSFS_FAN/fan_curve"
GPU_FAN_CURVE_0="0 30 15"
GPU_FAN_CURVE_1="1 40 30"
GPU_FAN_CURVE_2="2 50 60"
GPU_FAN_CURVE_3="3 60 70"
GPU_FAN_CURVE_4="4 75 100"

GPU_FAN_TARGET="$GPU_SYSFS_FAN/fan_target_temperature"
GPU_FAN_TARGET_TEMP="85"

echo "Setting fan curve"
echo "$GPU_FAN_CURVE_0" > "$GPU_FAN_CURVE"
echo "$GPU_FAN_CURVE_1" > "$GPU_FAN_CURVE"
echo "$GPU_FAN_CURVE_2" > "$GPU_FAN_CURVE"
echo "$GPU_FAN_CURVE_3" > "$GPU_FAN_CURVE"
echo "$GPU_FAN_CURVE_4" > "$GPU_FAN_CURVE"
echo "c" > "$GPU_FAN_CURVE"
echo "Committed fan curve"

echo "Setting power limit"
echo "$((POWER_LIMIT * 1000000))" > "$GPU_SYSFS_HWMON/power1_cap"
echo "Committed power limit"

# Print the curve that is now applied
cat "$GPU_FAN_CURVE"

@In-line
Contributor

In-line commented Mar 3, 2024

After some debugging, the problematic line of code turns out to be the one below; ignoring the error here fixes the issue on the 7900 XTX. I would try to prepare a patch or workaround, but I'm not sure how this will impact RDNA2 or older cards.

// Reset the power profile mode for switching to/from manual performance level
self.daemon_client
    .set_power_profile_mode(&gpu_id, None)
    .context("Could not set default power profile mode")?;

@ilya-zlobintsev
Owner

@In-line could you post the full error that happens when you try to apply settings as well as your /etc/lact/config.yaml? The line you linked doesn't change anything fan related, but it does trigger a reapply of existing settings, so maybe it is trying to apply an invalid configuration.

@In-line
Contributor

In-line commented Mar 3, 2024

@ilya-zlobintsev Already fixed it myself in #279

@dinotheextinct

Uhm, I have the problem that, regardless of which game I start with LACT, my GPU is stuck at 100% usage. The GPU clock is kind of "locked" around 2200 MHz and the voltage stays around 750 mV.

signal-2024-05-04-112043

Once I change ANY setting and apply it while the game is running that "lock" is lifted and the GPU seems to ignore any settings made with LACT.

LACT-sysfs-snapshot-20240504-113539.tar.gz

@In-line
Contributor

In-line commented May 4, 2024

@dinotheextinct More info please.

Kernel version, mesa version, LACT version, distribution, etc..

@dinotheextinct

Is the info not in the sysfs snapshot?

@dinotheextinct

Kernel 6.8.8-1-default
glxinfo | grep Mesa
client glx vendor string: Mesa Project and SGI
OpenGL core profile version string: 4.6 (Core Profile) Mesa 24.0.5
OpenGL version string: 4.6 (Compatibility Profile) Mesa 24.0.5
OpenGL ES profile version string: OpenGL ES 3.2 Mesa 24.0.5
LACT 0.5.4
openSUSE Tumbleweed

@In-line
Contributor

In-line commented May 4, 2024

@dinotheextinct You're using version 0.5.3 of LACT. It has known problems with the RX 7900; update to the latest version. This is what I found in info.json in the sysfs snapshot:

{
  "initramfs_type": null,
  "system_info": {
    "amdgpu_overdrive_enabled": true,
    "commit": "d99cfdf",
    "kernel_version": "6.8.8-1-default",
    "profile": "release",
    "version": "0.5.3"
  }
}

@dinotheextinct

Sorry, I just updated it. The issue is exactly the same after updating. I've attached a new sysfs snapshot:
LACT-sysfs-snapshot-20240504-120732.tar.gz

@JosephM0on

Hello, I wanted to add some information in case it is useful. I bought an AMD 7800XT Sapphire Nitro+ graphics card but I can't control the fans. I noticed that other users have reported the same situation here. I'm new to this and don't know exactly how to report it, so I'll put as much information as possible here. Thank you.
LACT-sysfs-snapshot-20240708-100500.tar.gz

I am using the BazziteOS operating system

ostree-image-signed:docker://ghcr.io/ublue-os/bazzite:stable
Version: 40.20240707.0 (2024-07-07T23:38:35Z)
LayeredPackages: coolercontrol liquidctl mullvad-vpn
LocalPackages: lact-0.5.4-0.x86_64

information
oc
software
thermals

@ilya-zlobintsev
Owner

@JosephM0on please check if the problem is resolved in the test build

@JosephM0on

@ilya-zlobintsev Sorry for the delay in responding. I'm just leaving work now and will check as soon as possible. Thank you.

@JosephM0on

I performed the following steps
-> rpm-ostree remove lact
-> systemctl reboot
-> rpm-ostree install lact-0.5.5-0.x86_64.fedora-40.rpm
-> systemctl reboot

It looks like it's still the same, do I need to do any more steps?

LACT-sysfs-snapshot-20240708-204119.tar.gz
thermals

@ilya-zlobintsev
Owner

Could you be more specific about "can't control the fans"? There's a known limitation to RDNA3 fan control; is that the problem?

There is an unconfigurable temperature threshold below which the fan does not get turned on, even with a custom curve.

@43615
Author

43615 commented Jul 8, 2024

Haven't checked on this in a while, glad to hear that someone else also cares about this problem. @ilya-zlobintsev what do you mean by "unconfigurable"? Is it being worked on or abandoned? I know firsthand that it is configurable on Windows with AMD's official software (and the power limit is also correct for non-stock cards).

@ilya-zlobintsev
Owner

Unconfigurable means that the GPU keeps the fan off below a certain temperature, even with a custom curve. There have been mentions of it on the drm issue tracker, but unfortunately I don't believe there has been any recent activity on it.

@JosephM0on

Could you be more specific about "can't control the fans"? There's a known limitation to rdna3 fan control, is that the problem?

There is an unconfigurable temperature threshold below which the fan does not get turned on, even with a custom curve.

Sorry, I didn't know that this was a limitation. I thought it was possible to control the fan speed as soon as the values are entered, as in the image in my last comment.

@neon-grim

I don't know if this is related to this issue, but when maxing my power limit through LACT (402 W), my average power draw lands around 350 W instead of 402 W under full load. This isn't the case when using corectrl: when setting the power limit to 402 W there, my 7900XTX draws 402 W on average under full load. As a consequence of this behavior, the overclock on my 7900XTX Pulse becomes unstable under LACT when compared to corectrl, which leads to artifacting in certain games.

This would be my system:

  • OS: Nixos unstable branch
  • Mesa: 24.2.1
  • Kernel: xanmod-6.10.7
  • GPU: Sapphire 7900XTX pulse
  • CPU: 7800X3D
  • RAM: 2x16GB 6000mhz CL30 DDR5

@robertosw

robertosw commented Sep 19, 2024

I've upgraded to a Sapphire Nitro+ 7800XT today. This thread really helped relieve my worries that this card might be broken, because the fans seemed to do weird things :D

I can confirm the problem that the fans start at some temperature which the card has in its firmware. The point at which the fans start is definitely not related to power draw or usage, but to one of the three temperatures (junction / mem / normal) - I'll test around to see when my card turns its fans off and on and share some results.

Changing the power limit seems to work fine for me. At least it's set without errors. I can't check this well (physically), because at 100% GPU utilization the card never pulls more than 190W.
190W is the lowest power limit setting for me. Setting the limit to 190W results in the same fluctuations of ±5W around 190W as before with a higher power limit.
I'll try reaching a stable 200W power draw by overclocking slightly, to test if the power limit works.

What I did notice: There has to be some reporting error in the clock speeds or voltage:

  • "performance level" set to "Lowest Clocks":
    • target core clock and avg core clock are 2.550GHz
    • PPT is 145W
    • 100FPS.
  • "performance level" set to "Automatic" or "Highest Clocks":
    • target core clock and avg core clock are still at 2.550GHz
    • PPT is 190W
    • 130FPS.

Fiddling around with the clock speeds on "Automatic", it appears that 2.15 GHz leads to ~145W power draw. So there really is some clock speed reporting error in the "Lowest Clocks" mode.

I tested this in the pause menu of horizon zero dawn which actually keeps on rendering the still image over and over.

Update1

Overclocking way above 190W doesn't work for my card. Even though "Minimum GPU Clock" and "Maximum GPU Clock" can be set up to 5GHz (at least the slider goes to these values), setting the minimum clock above 2.565GHz results in an error. At this clock speed my GPU consumes just slightly more than 190W, but not enough to reliably test if applying the power target works.

Update2

After a lot of testing, I am 100% certain that my GPU uses the "junction" temperature as the determining factor for fan speed. This would explain why my fan settings felt so weird: junction behaves non-linearly relative to the edge temp but is always at least 8°C hotter. ADDITIONALLY, the fans turn off at ~55°C junction but turn back on at ~65°C. This behavior is so annoying, because I can hear the fans ramp up in 20s intervals in some games. Can't wait for AMD to give us full access to the fans.
Is there some way of detecting which temperature is used by the onboard firmware for fan control? (This is not just used for deciding when to turn the fan on, but also for deciding the actual fan speed, if set via curve)
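
The on/off behavior described above amounts to a hysteresis band, which can be modeled with a small sketch. `next_fan_state` is a hypothetical name, and the 55/65 defaults are only the approximate values observed on this one card, not constants documented by AMD.

```shell
# Illustrative model of the observed fan on/off hysteresis (not firmware code).
# Arguments: previous state ("on"/"off"), junction temperature in whole degrees C,
# optional off/on thresholds (defaults are the approximate values observed above).
next_fan_state() {
    local prev="$1" temp="$2"
    local off_at="${3:-55}" on_at="${4:-65}"
    if [ "$temp" -ge "$on_at" ]; then
        echo on
    elif [ "$temp" -le "$off_at" ]; then
        echo off
    else
        echo "$prev"   # between the thresholds the previous state is kept
    fi
}

# Usage:
#   next_fan_state off 60   # fan stays off while warming up through the band
#   next_fan_state on 60    # fan stays on while cooling down through the band
```

A load that keeps the junction temperature drifting between the two thresholds would produce exactly the kind of periodic ramp-up described here.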

Maybe a fan curve that's controlled by the average wattage of the last 5s would be more intuitive?
After calculating the 5s average, LACT could look up the wattage-to-fan-speed curve to obtain the target fan speed and set it in static mode, so it would constantly apply a new static fan speed.
This might be a bit special on the implementation side, but I feel this would be more intuitive. Any opinions?
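
A rough sketch of that idea, assuming linear interpolation between curve points. The helper names, the sample values, and the `watts:speed` curve format are all made up for illustration; nothing here is part of LACT, and whether a repeatedly applied static speed behaves well on RDNA3 is not established in this thread.

```shell
# average_watts W1 W2 ...: rounded mean of recent power samples (whole watts).
average_watts() {
    printf '%s\n' "$@" | awk '{ sum += $1 } END { if (NR) printf "%d\n", sum / NR + 0.5 }'
}

# speed_for_watts WATTS "watts:speed" ...: fan speed in percent for a given power draw.
# Points must be sorted by ascending wattage; values outside the curve are clamped.
speed_for_watts() {
    local watts="$1"
    shift
    [ "$#" -ge 1 ] || return 1
    printf '%s\n' "$@" | awk -F: -v w="$watts" '
        { x[NR] = $1 + 0; y[NR] = $2 + 0 }
        END {
            w += 0
            if (w <= x[1])  { printf "%d\n", y[1];  exit }
            if (w >= x[NR]) { printf "%d\n", y[NR]; exit }
            for (i = 2; i <= NR; i++) {
                if (w <= x[i]) {
                    # Linear interpolation between the two surrounding points
                    s = y[i-1] + (y[i] - y[i-1]) * (w - x[i-1]) / (x[i] - x[i-1])
                    printf "%d\n", s + 0.5
                    exit
                }
            }
        }'
}

# Usage: five one-second samples, then a lookup on an example curve.
#   avg="$(average_watts 140 150 160 155 145)"
#   speed_for_watts "$avg" 100:30 200:60 300:100
```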

@ilya-zlobintsev
Owner

@SeekerOfAsh could you please do the following:

  • Make a debug snapshot with the setting applied through LACT
  • Disable the lact service, apply the configuration through corectrl, then make a debug snapshot in LACT (running in embedded mode with no service)

And link both of them here to see the difference in how the settings are applied.

@neon-grim

Reply to #255 (comment)

Hi, I'll try to test once I have the time. But it seems the bug is not unique to LACT; instead, it's the kernel driver misbehaving. Corectrl exhibits the same issue when a profile is activated, resulting in an average power draw of only 350 W. For the correct power limit to take effect, I need to apply a different power limit and then change back to 402 W (the max power limit) while the profile is active. This results in the correct average power draw of 402 W under 100% load.
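
For anyone who wants to try the same toggle outside a GUI, it could be sketched as two writes to `power1_cap`. This is untested guesswork: `reapply_power_cap` is a hypothetical helper, the interim value and delay are arbitrary, and whether doing this directly through sysfs has the same effect as doing it in corectrl is an assumption.

```shell
# Hypothetical helper: write an interim power cap, wait, then write the target cap.
# power1_cap takes microwatts, as in the script posted earlier in this thread.
reapply_power_cap() {
    local cap_file="$1" target_watts="$2" interim_watts="$3" delay="${4:-1}"
    echo "$((interim_watts * 1000000))" > "$cap_file"
    sleep "$delay"
    echo "$((target_watts * 1000000))" > "$cap_file"
}

# Usage (as root; adjust the card and hwmon numbers for your system):
#   reapply_power_cap /sys/class/drm/card1/device/hwmon/hwmon0/power1_cap 402 390
```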

@ilya-zlobintsev ilya-zlobintsev pinned this issue Oct 26, 2024
@ilya-zlobintsev
Owner

Apparently someone made a patch that adds a zero RPM setting: https://gitlab.freedesktop.org/drm/amd/-/issues/3489#note_2626120

Hopefully it's upstreamed, then this can become a setting in LACT.
