Input/Output Errors and PCI devices unavailable after suspend #3049

Closed
danjeffery opened this issue Aug 24, 2017 · 97 comments

Comments

@danjeffery

danjeffery commented Aug 24, 2017

Qubes OS version (e.g., R3.2):

3.2 and 4.0rc1

Affected TemplateVMs (e.g., fedora-23, if applicable):

dom0, sys-net


Expected behavior:

Qubes can be suspended and recover from suspend

Actual behavior:

After suspend Qubes is unstable.
Behavior is inconsistent. Sometimes only networking is disabled: nmcli in the sys-net system VM reports that the ethernet and wireless devices are unavailable, and the system is otherwise fine. At other times the sys-net VM is unresponsive or shuts down completely, and dom0 gives input/output errors when attempting to open terminals or shut down. The errors from dom0 are also bizarre in that they affect different commands from one test to the next. Sometimes lspci will throw the error; other times dmesg, ls or less will throw an error and lspci is fine. Often initctl will give the input/output error and the system must be restarted.
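
A minimal sketch for recording the "unavailable" state described above, assuming it is run inside sys-net and that nmcli's terse output prints one DEVICE:STATE pair per line. The function name unavailable_devices is made up for illustration and is not part of Qubes or NetworkManager.

```shell
#!/bin/sh
# Hypothetical helper: print the names of devices whose state is exactly
# "unavailable". Expects terse nmcli output (DEVICE:STATE per line) on stdin.
unavailable_devices() {
    awk -F: '$2 == "unavailable" { print $1 }'
}

# Intended use inside sys-net (assumption: nmcli is installed there):
#   nmcli -t -f DEVICE,STATE device | unavailable_devices
```

Running this before and after a suspend gives a quick list to compare, which may be easier to paste into a report than the full nmcli table.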

Steps to reproduce the behavior:

Suspend Qubes (closing the lid, using the menu, and running echo mem > /sys/power/state all produce the same result)
Wake it (lift the lid or push the power button; my laptop ignores keystrokes)

General notes:

Hardware is a Lenovo X1 Carbon gen3, wireless adapter is Intel 7265 rev 59. I've run full system diagnostics on the laptop and it passes. I've tested different nvme drives with no benefit.

I've tried the steps from #2922 and https://www.qubes-os.org/doc/wireless-troubleshooting/#automatically-reloading-drivers-on-suspendresume, but they have not helped.


Related issues:

https://groups.google.com/forum/#!topic/qubes-users/LkP-6ORGwME
#2922

@andrewdavidwong
Member

This sounds like it might be a duplicate of #3008. The workaround is to blacklist iwlmvm. (See issue comments for details.)

@andrewdavidwong andrewdavidwong added the R: duplicate Resolution: Another issue exists that is very similar to or subsumes this one. label Aug 25, 2017
@andrewdavidwong
Member

(If it turns out not to be a duplicate, let me know, and we can reopen this.)

@danjeffery
Author

I wish it were. That step is among the troubleshooting steps recommended in the wireless-troubleshooting doc. Unfortunately, adding iwlmvm to /rw/config/suspend-module-blacklist made no difference for me in testing.

@danjeffery
Author

I'm going to test downgrading to 3.2 and rolling back the kernel to see if that corrects it.

@z4ppy

z4ppy commented Aug 25, 2017 via email

@danjeffery
Author

@z4ppy, unfortunately I didn't have it until I updated to 4.9.35-19 :(

@rtiangha

@danjeff: Yes, you need to restart sys-net once you've modified the blacklist file. You also need to make sure to blacklist both iwlmvm and iwlwifi in the file, not just iwlmvm.
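
The blacklist step above can be sketched as a small idempotent helper. This is only an illustration: the function name blacklist_modules is hypothetical, and the file path and module names are the ones mentioned in this thread.

```shell
#!/bin/sh
# Hypothetical helper: append each module name to a blacklist file unless a
# line with exactly that name is already present.
blacklist_modules() {
    file=$1
    shift
    touch "$file" || return 1
    # If the file does not end with a newline, add one so that appended
    # names do not get glued onto the last existing line.
    if [ -n "$(tail -c 1 "$file")" ]; then
        printf '\n' >> "$file"
    fi
    for module in "$@"; do
        grep -qxF "$module" "$file" || printf '%s\n' "$module" >> "$file"
    done
}

# Intended use inside sys-net, as root (path taken from this thread):
#   blacklist_modules /rw/config/suspend-module-blacklist iwlmvm iwlwifi
# Restart sys-net afterwards so the change takes effect.
```

Because existing entries are matched as whole lines, running the helper twice leaves each module listed once.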

@danjeffery
Author

danjeffery commented Aug 25, 2017

I blacklisted both iwlmvm and iwlwifi. I restarted Qubes entirely. The problem persisted. Just to be sure I am not remembering incorrectly or making some mistake, I will try it again with the latest kernel.

I have reinstalled 3.2 fresh with it running 4.4.14-11 and there is no problem. Suspend works fine and I do not get the odd lockups or input/output errors. I'll update now to latest 4.9.35-19 and fedora 25 for the template and put in the blacklist lines.

@danjeffery
Author

danjeffery commented Aug 25, 2017

It looks like I spoke too soon. The sys-net VM had not crashed after the suspend on the older kernel and NetworkManager reported the wifi connection was still up, but it wasn't passing any traffic and once I downed the connection nmcli couldn't bring it up again.

I've blacklisted iwlwifi and iwlmvm in sys-net:/rw/config/suspend-module-blacklist and restarted the sys-net VM. The behavior from the previous paragraph was exactly repeated.

A key difference worth noting in the behavior on the older kernel is that Network Manager still thinks the connection is active and the device is connected. Also, I don't seem to be getting the bizarre behavior in dom0. Also, restarting the sys-net VM is possible and everything works again afterward.

I'm going to proceed to update to fedora-25 and the newer kernel.

@danjeffery
Author

Okay, on 3.2 with kernel 4.9.35-19 and the fedora 25 template, I am currently seeing the same suspend behavior as on 4.4.14-11 with fedora 23. sys-net:/rw/config/suspend-module-blacklist contains iwlmvm and iwlwifi, each on their own line.

Since I don't seem to be getting the input/output errors anymore (for no reason?) on 3.2 I guess I'll stay here for now and just not suspend. I am very open to other ideas or troubleshooting.

@danjeffery
Author

Hooked back up my USB mouse and found the sys-net VM has the same problems. After suspend, USB is also broken until the VM is restarted.

And now the input/output errors are back in dom0 and I can't stop and restart the VMs :)

I'm not sure if that is the result of just running it long enough or because I restarted the VMs and suspended a second time without a reboot, but they're back. I'm really wondering if this is a hardware failure at this point, but all the Lenovo system diagnostics come back fine.

@rtiangha

rtiangha commented Aug 26, 2017

I don't know; these symptoms are weird. I have an Intel 7260 dual ac card and it seems to work fine, although it's not a 7265 but one would think it was close enough. But it's also on a Dell L502X.

I noticed in your log output on the mail list that it couldn't load the wifi firmware. Just to double check, but is it actually installed (I assume it is, but you never know)?

sudo dnf install iwl7260-firmware or sudo dnf install linux-firmware (I'm not sure which; I'm a Debian guy)

Also, check the Lenovo website for any BIOS updates and if they exist, try applying them. Maybe this is a known hardware issue that's already been fixed and there are a few cases out there with similar symptoms on other distros and they all seem to come from Lenovo users so maybe this is something the manufacturer has already addressed in a BIOS update.

I'd also go into your BIOS and double check any ACPI, Power Management, and Virtualization settings and ensure that they are all enabled properly.

@danjeffery
Author

It is weird. BIOS is a good point and I had upgraded the BIOS to latest right at the beginning of troubleshooting. The iwl7260 firmware is installed correctly by default. I don't know that the wireless or the USB layers are the right place to look at this point, though. This seems to be an issue with PCI passthrough after suspend or some other events that happen with time since I've had the issue start just after the machine has been running for a while.

The most frustrating part of this is the inconsistency. I have not had this problem since updating to 3.2 as soon as it released last year, but now it's present even on fresh 3.2 install. The problem seems to have started about 2 weeks ago.

I was able to get a hold of another identical (checked all the hardware chips and revisions) gen 3 X1 Carbon and compare against 3 more identical machines running Qubes, but not in my possession. The two in my possession that I have wiped and tested on Qubes 4, 3.2 as-installed and 3.2 up-to-date all exhibit the same behavior. Two of the other 3 seem to also exhibit the suspend behavior, but not the freezes while the 3rd is reported to be fine and is fully patched like the others.

I went ahead and booted up a live disk of Kali and there seem to be no issues. Suspend works fine and there are no unusual freezes or input/output errors. I'm at a bit of a loss as to where to look next, but at this point I can only use Qubes if I don't let it suspend, and even then it has repeatedly locked up on me and lost work in progress.

If it will help, I am perfectly willing to overnight or 2-day one of these laptops to a dev to help sort this out.

@rtiangha

rtiangha commented Aug 26, 2017

As a last resort, maybe try one of @fepitre's 4.12 kernels that he posted to the mail list to see if it helps? The kernel options shouldn't be much different from 4.9's except for the new drivers introduced, but maybe the power management stuff is better. It seems kind of flaky in 4.9, especially when it comes to Intel wifi: having Intel power management disabled by default in the kernel makes suspend work for some cards and not others, and enabling it in the kernel flips that around. Currently it's disabled in-kernel because having it enabled was causing too many issues. There's a sysctl or kernel value you can toggle to enable it yourself, but I don't know it off the top of my head.

https://sourceforge.net/projects/qubes-linux-kernel/files/

@andrewdavidwong andrewdavidwong added T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists. and removed R: duplicate Resolution: Another issue exists that is very similar to or subsumes this one. labels Aug 27, 2017
@andrewdavidwong andrewdavidwong added this to the Release 3.2 updates milestone Aug 27, 2017
@danjeffery
Author

Well, I thought maybe I'd gotten somewhere over the weekend as I reinstalled 3.2 again and ... everything worked. I suspended and restarted several times and everything seemed fine, so I crossed my fingers and updated the kernel and the template VM, but then it went back to locked up PCI devices and input/output errors. I tried setting everything to use the old kernel, 4.4.14-11, but the problems persisted. I tried Reg's suggestion of using the 4.12 kernels and still had the same problems.

My conclusion at this point is that while the kernel may be involved, it's not just the kernel that's the problem. I'm going to try reinstalling 3.2 fresh, again, and see if I can get the state where everything works. I have done a fresh install of 3.2 about 5 times in the last several days and only had it work correctly that once, so I'm not optimistic it will work, but I am at a loss as to why it would be different. BIOS settings are the same, install parameters are the same.

One issue I don't think I've mentioned is that when this was working correctly, the system could shut down and restart successfully. In the bad configuration it hangs on shutdown. Even before anything appears to be broken in the bad configuration (network and USB still working fine, no input/output errors), if I attempt to shut down or restart I can watch it attempting 8 different background processes, all of which appear to be unmounts, until it hits the 1m30s limit and then hangs with a series of errors like device-mapper: remove ioctl on [device] failed: Device or resource busy. At this point I'm forced to hard power off by holding down the power button.

Once I've done a suspend or waited long enough for input/output errors and/or net and usb errors, when I attempt to restart I get blk_update_request: I/O error, dev sda sector [some number that changes each time]. The second error made me wonder about bad nvme, but swapping it didn't change anything.
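
For comparing runs, a rough sketch that counts the two error messages quoted above in a saved log file. The function name count_suspend_errors is hypothetical, and how the log gets saved (for example from the console or a previous boot's journal) is left as an assumption, since the thread notes that reading logs often fails in the bad state.

```shell
#!/bin/sh
# Hypothetical helper: count lines in a saved log that match the two error
# messages quoted in this thread.
count_suspend_errors() {
    log=$1
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    io=$(grep -c 'blk_update_request: I/O error' "$log" || true)
    dm=$(grep -c 'device-mapper: remove ioctl on .* failed: Device or resource busy' "$log" || true)
    printf 'io_errors=%s dm_busy=%s\n' "$io" "$dm"
}

# Example (the file name is illustrative only):
#   count_suspend_errors saved-kernel-log.txt
```

Tracking the two counts separately keeps the shutdown hang (device-mapper busy) apart from the post-suspend block I/O errors, which the thread describes as happening at different stages.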

@danjeffery danjeffery changed the title Input/Output Errors and network devices unavailable on suspend Input/Output Errors and PCI devices unavailable after suspend Aug 28, 2017
@rtiangha

rtiangha commented Aug 28, 2017

So just to confirm, installing fresh from ISO works, updating system afterwards doesn't, and switching to an older kernel from that point still doesn't?

If you're going to re-install from scratch, can you capture dmesg output from both dom0 and sys-net and/or attach system logs while it's working? Then update just kernel and kernel-qubes-vm (and kernel-devel if you have it) in dom0 (sudo qubes-dom0-update kernel kernel-qubes-vm), try again with the new kernel, and report back whether it still works (capture dmesg if it doesn't). If it does work, update dom0 and sys-net's template with the regular system updates and try again.

@rtiangha

Also, at each step, verify the running kernel in both dom0 and sys-net by running uname -r as a sanity check.
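
The capture-and-verify routine suggested above could look roughly like this. It is a sketch under assumptions: the function name capture_state and the directory layout are invented for illustration, and it would need to be run separately in dom0 and in sys-net.

```shell
#!/bin/sh
# Hypothetical helper: save the running kernel version and the kernel log
# into a labelled directory, recording failures instead of aborting.
capture_state() {
    outdir=$1/$2
    mkdir -p "$outdir" || return 1
    uname -r > "$outdir/kernel.txt"
    # dmesg may fail (for instance with an input/output error in the bad
    # state), so keep its exit status and any error text as evidence.
    status=0
    dmesg > "$outdir/dmesg.txt" 2> "$outdir/dmesg.err" || status=$?
    printf '%s\n' "$status" > "$outdir/dmesg.status"
}

# Example, once per test step (directory and labels are illustrative):
#   capture_state "$HOME/suspend-logs" fresh-install
#   capture_state "$HOME/suspend-logs" after-kernel-update
```

Saving the dmesg exit status alongside the output means a failed read still leaves a record of which step it failed at.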

@danjeffery
Author

I changed the title to better reflect this does not appear to be primarily about the network. Both USB and Network VMs lose their devices and dom0 is having issues even running dmesg and sometimes lspci or reading logs.

@rtiangha It doesn't always work fresh from the ISO. I reinstalled 3.2 at least 5 times over the last week and only once did it work correctly. It's reinstalling right now. If it works, I'll collect the logs, update just the kernel and kernel-qubes-vm packages on dom0 and see what that gets us. As noted, I probably can't capture dmesg when it's not working as that command throws the input/output error nearly all the time once we're in the bad state (as well as trying to less/vi/grep anything in /var/log). I've been uname'ing for exactly that reason all along the way. :) Thanks for your help.

@rtiangha

Cool. Keep the post updated. Full logs where possible would be helpful to at least see what's going on. Personally, I've never seen this behaviour ever.

Also, am I correct in thinking that you've got sys-net acting as a combined USBvm as well, or do you have a separate sys-usb VM?

@danjeffery
Author

For all the tests over the last week I've had separate sys-usb and sys-net VMs. Are there any logs other than dmesg you'd like me to capture?

@danjeffery
Author

The thing driving me nuts about this behavior is how inconsistent it's being. I'm trying to hold my install and config parameters totally consistent and think of anything I or the hardware are doing that could give different results, but I'm at a bit of a loss. One difference I just thought of between the two machines I'm testing with and the other three also running Qubes is the BIOS revision. These two are fully patched and I'm not sure if the other three have ever been patched from what the factory shipped.

@qubesos-bot

Automated announcement from builder-github

The package core-agent-linux has been pushed to the r4.0 stable repository for the Fedora centos7 template.
To install this update, please use the standard update command:

sudo yum update

Changes included in this update

marmarek added a commit to QubesOS/qubes-core-agent-linux that referenced this issue Feb 12, 2018
It is necessary to blacklist them on (almost?) any hardware, so lets do
this by default.

Fixes QubesOS/qubes-issues#3049

(cherry picked from commit cfbc953)
@qubesos-bot

Automated announcement from builder-github

The component core-agent-linux (including package python2-dnf-plugins-qubes-hooks-3.2.23-1.fc26) has been pushed to the r3.2 testing repository for the Fedora template.
To test this update, please install it with the following command:

sudo yum update --enablerepo=qubes-vm-r3.2-current-testing

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The package qubes-core-agent_3.2.23-1+deb9u1 has been pushed to the r3.2 testing repository for the Debian template.
To test this update, first enable the testing repository in /etc/apt/sources.list.d/qubes-*.list by uncommenting the line containing stretch-testing (or appropriate equivalent for your template version), then use the standard update command:

sudo apt-get update && sudo apt-get dist-upgrade
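
The "uncomment the line containing stretch-testing" step could be scripted roughly as follows. This is a sketch, not the project's own tooling: the function name enable_testing_suite is hypothetical, it assumes GNU sed (for -i) as found in Debian templates, and it assumes the testing line is commented out with a leading "#".

```shell
#!/bin/sh
# Hypothetical helper: uncomment every line that mentions the given suite
# (for example stretch-testing) in an APT sources file.
enable_testing_suite() {
    file=$1
    suite=$2
    # Only lines matching the suite name are touched; the leading "#" and
    # any whitespace after it are removed.
    sed -i "/$suite/ s/^#[[:space:]]*//" "$file"
}

# Intended use inside the Debian template, as root:
#   for f in /etc/apt/sources.list.d/qubes-*.list; do
#       enable_testing_suite "$f" stretch-testing
#   done
```

Reviewing the file by hand afterwards is still worthwhile, since the testing repository stays enabled until the line is commented out again.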

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The package qubes-core-agent_3.2.23-1+deb10u1 has been pushed to the r3.2 testing repository for the Debian template.
To test this update, first enable the testing repository in /etc/apt/sources.list.d/qubes-*.list by uncommenting the line containing buster-testing (or appropriate equivalent for your template version), then use the standard update command:

sudo apt-get update && sudo apt-get dist-upgrade

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The package qubes-core-agent_3.2.25-1+deb10u1 has been pushed to the r3.2 stable repository for the Debian template.
To install this update, please use the standard update command:

sudo apt-get update && sudo apt-get dist-upgrade

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The package qubes-core-agent_3.2.25-1+deb9u1 has been pushed to the r3.2 stable repository for the Debian template.
To install this update, please use the standard update command:

sudo apt-get update && sudo apt-get dist-upgrade

Changes included in this update

@qubesos-bot

Automated announcement from builder-github

The component core-agent-linux (including package python2-dnf-plugins-qubes-hooks-3.2.25-1.fc26) has been pushed to the r3.2 stable repository for the Fedora template.
To install this update, please use the standard update command:

sudo yum update

Changes included in this update
