-
Notifications
You must be signed in to change notification settings - Fork 205
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
usb: fix enumeration issues by propagating buffer overflow error. #738
Conversation
0a4c056
to
6e8743e
Compare
Rebased on present HEAD~1 to avoid inheriting CI errors. |
@mattthebaker, I'm going to run a few tests with this PR, but it seems like it solves some USB issues on my HW as well. |
There's some infrastructure for testing USB, related to the usb-device crate. Offhand, I don't know if it would be feasible to make a test case that reproduces this issue as the host side of that uses libusb, however it would be worth running the suite against this PR to catch any regressions (and to dust off the tests :) ). A good entry point to the current situation might be this issue - sorry I've not advanced it, but am now gainfully unemployed and maybe even getting past a period of burnout so could try to get back in to it. rust-embedded-community/usb-device#100 |
This will need to be rebased to fix the ci errors. |
6e8743e
to
c324f30
Compare
Excellent!
Hmm, I'm not sure if it can be easily created in user space tests since the OS will typically handle enumeration and believe SETUP is primarily used there? Any test that resets a device might be able to look for re-enumeration within some short period of time, but the host would need to be confirmed to be susceptible to the issue. Any kind of extended regression coverage would be nice. Fix is already in production for me and will quickly see use on at least a couple hundred devices and a lot of unique hosts (pretty much all Linux though). During debug I tested against usb_device 0.3.1 and 0.3.2 -- new debug hooks there ultimately helped identify the root cause. My device only uses CDC/ACM. |
@mattthebaker - thanks very much for your work on this, and apologies it's taken me so long to do anything with the PR! I've run the usb-device test suite against your branch on SAMD21 and SAMD51 (Adafruit Metro M0 and M4 boards), can't see any change there. I'm not quite sure this fixes the underlying issue, but I'm still getting back up to speed on USB and don't have a firm idea of a different fix. A couple items I'm currently pondering, which you might have considered already: In the circumstance where How about the case where the caller expects more data in a transfer, so supplies a larger buffer than an unexpected SETUP? I wonder if we should split Tangentially related: we should update the usb-device crate used by the HAL. I'll put that on my TO-DO list. |
The SAMD21 datasheet (DS40001882H) in section 32.6.2.6 describes management around SETUP transactions being handled largely in hardware. The USB peripheral will overwrite the endpoint buffer with an incoming SETUP transaction so long as EPINTFLAG.RXSTP is 0; effectively, SETUP transactions can't overwrite each other, but they can overwrite non-SETUP transactions. In a hypothetical |
I'm thinking the next step is to create a test case that aims to reproduce the issue that this PR is meant to fix. How does the failure manifest - is it just that the device doesn't enumerate successfully at all, does it take noticeably longer (I guess due to a retry), something else? (updated to add) My current thought is two types of test:
|
No problem, trying to help others that have likely seen the issue.
So the specific case I'm aware of where this happens, the status stage of a control transfer gets its ZLP clobbered by the next control transfer's SETUP. In this case, that ZLP is the final action of the first control transfer, and on reception by the device it will be ACKed if the OUT buffer is ready. The device could STALL if its not ready, but then the host would try again rather than proceeding. From the host's perspective the control transfer is now entirely complete with no errors. If it is a fast host it may move immediately into another control transfer SETUP. If there were something the device side stack could discard, in this case I believe you would prefer to accept it. However, the only case I'm aware that this happens is when the control transfer is reading -- IN transfers -- so afaik there isn't any device bound transfer that could be uncertain. To resolve the issue the next control transfer will need to start correctly.
I'm not aware of a case where the host would behave this way, but I'm not an expert. AFAIK the rapid SETUP only comes due to how the final status stage works. In other cases if the state machines get out of sync or have bugs they would STALL / NAK until the host decides to just reset the device bus and try again. Only the SETUP per spec bypasses those mechanisms and can lead to this issue. Splitting the read function into two might make sense.
Do you have a spot in the datasheet where you saw that comment about RXSTP? It seems that reading
It depends on how rapidly the host sends the SETUP and how quick the device is with clearing:
This should hit it directly with the caveat that the host its run on would need to have fast setup behavior and/or artificial delays slowing down the device to force the behavior. For reference it seems like many host controllers will only send 1 control transfer per start of frame, so 1ms, but others will blast them out with basically zero delay.
If that is possible with the host libraries that would be great for extra targeted coverage. A possible 3. is considering the timing dependent nature, would be to have a device build with intentional long delays (somewhere around 1-3ms?) inserted between handling events, and run significant parts of the existing regressions against it. |
Cool, sounds like progress. I don't have much time for programming today, but will try to implement a couple tests and a possible fix (which will involve changes to usb-device) soon.
It may be that a reasonable host implementation would never do that - I'd forgotten that the SETUP includes a length field, so the device knows how much data to expect.
Section 32.6.2.6 though it's a bit narrative. I'm envisioning that
Maybe it's not a real problem if the OUT ZLP is the only case where this occurs, since the device behaviour doesn't change depending on whether it successfully reads the ZLP or if the SETUP arrives before/during the ZLP handling. I guess this hinges on whether a host can write OUT data that the device needs, then follow up with a SETUP too quickly (as above) - my hunch is that this doesn't happen.
Do you have examples of devices (or maybe device+driver combos) in those two categories? I've got a small assortment of SBCs and laptops, have seen the occasional weird enumeration issue but nothing as consistent as your 2.
I think it should be - the usb-device test suite is pretty minimal, but based on libusb which can do control transfers.
Yep, that seems like a great idea. Should be easy to do between calls to |
Sounds pretty clean to me.
The two I've debugged personally are my desktop PC -- an Intel i7-12700K / Z690 chipset and Win10 Pro -- and the raspberry pi 5. This is in conjunction with a samd21e clocked at 32mhz and RTIC where USB had to be lower priority than some real time tasks. Older firmware build that had a polling loop instead was slightly more reliable with the pi5 (more likely to glitch or slowly enumerate), but never worked on the desktop PC. My users had issues on some other various other embedded linux boards but I would need to hunt down old notes. A Pi5 would be the simplest way to get a hardware test setup going, but need to note now that some field updates may have to be rolled back -- they issued their own fix (1 SETUP per SOF) as unrelated devices had similar issues, I'm not sure if they even ran a rust stack. https://forums.raspberrypi.com/viewtopic.php?t=365410 I'll be able to test any proposed alternate fixes in a pi5 setup. |
I've started working on this today, but only got a bit of refactoring the test suite done. WIP on usb-device is at https://github.com/ianrrees/usb-device/tree/enumeration .
Great, thanks! I don't have a Pi5 yet. |
I've implemented the enumeration time test and found that the control loopback test exists already[1]. With my laptop and a Metro M0 Express (SAMD21, no bootloader) I see about 95% of enumerations happening in something like 8-900ms, but some are about 6 seconds. Haven't yet made the delay test; I started looking at where to put the delay and didn't find a spot where it seemed to make the enumeration problem more likely. Proposed fixes span usb-device and atsamd-rs, I'd suggest using https://github.com/ianrrees/test-usb-device/tree/enumeration as an entry point to try the changes out. It's a little bit of a Git hack as the usb-device submodule commit is one from my usb-device repo. Probably the easiest thing to do is clone the test-usb-device repo with With those changes, I haven't seen any enumeration failures in ~40 runs of the test suite. (edit: now a few thousand, with both SAMD21 and SAMD51) [1] here, though we might want to revise the list of lengths. |
Forgot to mention, there's a little bit of Cargo secret sauce that you probably know about already, but is quite helpful here. The "patch" feature that can be used to replace the source of a crate through the whole dependency tree, for example here replaces usb-device. |
@mattthebaker, just checking in on this as it's been a few days. I've had a brief chat with the original author of usb-device and it seems the mixing up of SETUP and OUT data is a known issue so it would be good to PR a fix to usb-device, and once that's done we'd update the HAL to suit. With my laptop, I can see the changes I referenced above make enumerations more reliable, but it would be good to confirm whether is addresses the complete enumeration failures you've reported with RPi5 as well. Great find! I suspect that sorting this out will fix a longstanding annoyance with https://github.com/ianrrees/usb-spi which is used at my old job. |
Sorry, just realised I misread and the complete failures were with a desktop! @9names ran some tests of my proposed fix with RPi5 today and was able to reproduce the failures using current code, and see that the fix it appears to work. So, I'm thinking the next step is to get those changes in to usb-device, and once that's done we can make corresponding changes to the HAL. Will close this PR out in the mean time, since the new approach would overwrite these changes anyway. Thanks everyone! |
Summary
Fixes USB enumeration issues with fast host controllers.
The USB spec permits SETUP for control transfers to be sent at any time. The device is not permitted to STALL or NAK, so the USBD controller ignores BK0RDY status when it receives a SETUP.
Some faster host controllers may send a SETUP very soon after the previous OUT, for example the Raspberry Pi5 has been confirmed on USB analyzer to SETUP in as little as 20us after the previous OUT. This is a very narrow window to handle the OUT and maintain proper state.
In my project (atsamd21 at 32mhz cpu clock with USB at mid priority), by the time the OUT in the endpoint buffer is read it has already been overwritten by the SETUP packet. The USB stack is expecting a zero length packet to end the previous descriptor read, so it has passed in a zero length slice. The bank read finds that the endpoint has more than zero data and throws a BufferOverflow.
This change allows the error to propagate without dropping the SETUP in the endpoint buffer. The USB stack will then exit the previous descriptor read and proceed with the SETUP of the next control transfer.
In my project this fixes (at least) enumeration issues with Pi5, Pi CM4, and so far a couple of x86 class PCs running Windows or Linux. So far all hosts known to have enumeration issues that were tested with this change have been resolved.
Checklist
CHANGELOG.md
for the BSP or HAL updated