
[BUG] No exit from function capture_request() #1086

Open
tutoduino opened this issue Aug 8, 2024 · 9 comments

Comments

@tutoduino

Describe the bug
Creating a time-lapse, I am capturing a large number of frames with a delay of 10 seconds between each capture.
Sometimes the capture_request() function takes an unexpectedly long time to return (>100 seconds), or never returns at all.
When this happens, I observe the CPU at 100% (see attached htop screenshot).

To Reproduce
The problem happens randomly after several tens of captures, but it is easy to reproduce.

Expected behaviour
An almost constant capture time (in seconds) for each capture_request() call.

Console Output, Screenshots
htop screenshot attached, showing the CPU at 100%.

Hardware:
Raspberry Pi Zero 2 W with a Raspberry Pi Camera Module 3. Happens with several up-to-date OS releases, for example Raspberry Pi OS Lite (32-bit).

Additional context
The code and steps are available in my tutorial here: https://tutoduino.fr/tutoriels/raspberry-timelapse/
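For reference, the capture loop described above has roughly this shape (a minimal sketch based on the description, not the exact tutorial code; the FakeCamera stand-in is mine so the sketch runs without camera hardware — with picamera2 you would pass a started Picamera2 instance instead):

```python
import time

def timelapse(camera, num_frames, delay_s=10.0, name_fmt="image{:05d}.jpg"):
    """Capture num_frames stills, delay_s seconds apart, and record the
    wall-clock time of each capture so abnormal gaps become visible."""
    start = time.monotonic()
    times = []
    for i in range(num_frames):
        request = camera.capture_request()      # the call that sometimes hangs
        try:
            request.save("main", name_fmt.format(i))
        finally:
            request.release()                   # always hand the buffer back
        times.append(time.monotonic() - start)
        time.sleep(delay_s)
    return times

# Stand-in objects so the sketch runs without hardware (hypothetical names).
class FakeRequest:
    def save(self, stream, filename): pass
    def release(self): pass

class FakeCamera:
    def capture_request(self): return FakeRequest()

print(timelapse(FakeCamera(), 3, delay_s=0.01))
```

With the real library, forgetting request.release() would eventually starve the camera of buffers, so the try/finally is deliberate.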

@davidplowman
Collaborator

Hi, thanks for the report. I set this running overnight as described and indeed, it locked up with 2 Python processes at 100%. Very strange. The bad news is that it took 16 hours to fail! So I'll have to gradually start adding some instrumentation, perhaps cut the delay from 10 to 5 seconds, and hope the next run gets a bit "luckier"...
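One possible piece of instrumentation is a deadline wrapper around the capture call (a sketch, function name is mine; note it can only *detect* a hang — a thread spinning at 100% CPU inside native code cannot be cancelled from Python):

```python
import concurrent.futures
import time

def capture_with_deadline(capture_fn, deadline_s=30.0):
    """Run capture_fn in a worker thread and report whether it returned
    within deadline_s. Returns (result, elapsed_seconds, timed_out)."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    t0 = time.monotonic()
    future = pool.submit(capture_fn)
    try:
        result = future.result(timeout=deadline_s)
        timed_out = False
    except concurrent.futures.TimeoutError:
        result, timed_out = None, True
    pool.shutdown(wait=False)       # don't block the caller on a hung worker
    return result, time.monotonic() - t0, timed_out

# Demonstration with stand-in capture functions:
ok, _, late = capture_with_deadline(lambda: "frame", deadline_s=1.0)
print(ok, late)                     # frame False
_, _, late = capture_with_deadline(lambda: time.sleep(0.3), deadline_s=0.05)
print(late)                         # True
```

On a timeout the script could at least log the event and the frame number before the process becomes unresponsive.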

@tutoduino
Author

I added debug traces. We see normal behaviour at first, with 10 seconds between pictures 136, 137 and 138, but then an interval of 110 seconds before the next picture, 139.
The capture time itself is correct, and there is no specific debug trace to help understand what happens:

Capture saved: image00136.jpg at time 1371.8
picamera2.picamera2 DEBUG: Execute job: <picamera2.job.Job object at 0x665239b0>
picamera2.request INFO: Saved <picamera2.request.Helpers object at 0x66518fb0> to file image00137.jpg.
picamera2.request INFO: Time taken for encode: 517.1655340000143 ms.
Capture saved: image00137.jpg at time 1381.8
picamera2.picamera2 DEBUG: Execute job: <picamera2.job.Job object at 0x66523990>
picamera2.request INFO: Saved <picamera2.request.Helpers object at 0x66518fb0> to file image00138.jpg.
picamera2.request INFO: Time taken for encode: 517.6941099998658 ms.
Capture saved: image00138.jpg at time 1391.8
picamera2.picamera2 DEBUG: Execute job: <picamera2.job.Job object at 0x66523990>
picamera2.request INFO: Saved <picamera2.request.Helpers object at 0x66518fb0> to file image00139.jpg.
picamera2.request INFO: Time taken for encode: 517.8634320000128 ms.
Capture saved: image00139.jpg at time 1500.4
Timing issue !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
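The "Timing issue" flag above can be reproduced from the logged capture timestamps with a simple interval check (a sketch; the function name and threshold values are mine, not from the capture script):

```python
def flag_slow_captures(timestamps, expected_s=10.0, tolerance_s=5.0):
    """Return indices of frames whose gap from the previous frame exceeds
    expected_s + tolerance_s -- the 'Timing issue' condition."""
    return [i for i in range(1, len(timestamps))
            if timestamps[i] - timestamps[i - 1] > expected_s + tolerance_s]

# Times from the trace above, pictures 136-139:
ts = [1371.8, 1381.8, 1391.8, 1500.4]
print(flag_slow_captures(ts))   # [3] -> picture 139 arrived ~109 s late
```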

@nzottmann

Could you look at the dmesg -T output after the lockup to see whether there are any messages from the point in time when the script stopped working?

@davidplowman
Collaborator

Hi @nzottmann, I did try that actually when mine locked up (seeing as it took 16 hours, I had a good snoop around before rebooting), and there was absolutely nothing there. Turning on debug in the camera CSI2 receiver showed that the camera was still running but not filling any buffers. Quite why anything would be running at 100% is a bit of a mystery. I found I couldn't even kill the processes, and had to pull the power cord out.

Interestingly, it just locked up again for me, for about 1 minute. I noticed that htop was reporting 100% for those two processes again, so the symptoms look identical to the "permanent" lock-up. Except that after a minute it just carried on as if nothing had happened.

@tutoduino
Author

Maybe not linked, but calling dmesg -T in an ssh session froze the other ssh session that was running the capture program (at capture nb 669)... But frames were still being captured (I see pictures recorded correctly up to capture nb 730). Something weird with ssh, since it was running at 100%.
The capture program stopped capturing after capture 730.
No information in dmesg -T.
htop screenshot attached.

@tutoduino
Author

tutoduino commented Aug 10, 2024

Not able to reproduce yet on bullseye (Operating System: Raspbian GNU/Linux 11 (bullseye), Kernel: Linux 6.1.21-v7+); it looks much more stable under bullseye when another ssh session is opened in parallel.

Version of libcamera on Bullseye:

pi@raspi2w:~ $ libcamera-hello --version
libcamera-apps build: 7e4d3d71867f 17-07-2023 (07:34:37)
libcamera build: v0.0.5+83-bde9b04f

Version of libcamera on Bookworm:

pi@raspi2w:~ $ libcamera-hello --version
rpicam-apps build: 49344f2a8d18 17-06-2024 (12:19:10)
libcamera build: v0.3.0+65-6ddd79b5

@davidplowman
Collaborator

When I've been running this and watching it with htop, I frequently find that htop appears to lock up, and go to 100% cpu, while the camera carries on. So it's starting to feel like there's something more general going wrong. I'll keep investigating.

@davidplowman
Collaborator

As far as I can tell, this is related to the Linux kernel version. Everything with a 6.1 kernel is fine, and everything with a 6.6 kernel is not. I think it's something to do with how the kernel swaps between processes when resources are getting tight. I don't think it's camera related at all, except insofar as the camera eats system resources in such a way as to provoke problems. This is all several light years outside any areas of expertise that I have, so I'm going to have to ask around.

@tutoduino
Author

tutoduino commented Aug 20, 2024

> As far as I can tell, this is related to the Linux kernel version. Everything with a 6.1 kernel is fine, and everything with a 6.6 kernel is not. I think it's something to do with how the kernel swaps between processes when resources are getting tight. I don't think it's camera related at all, except insofar as the camera eats system resources in such a way as to provoke problems. This is all several light years outside any areas of expertise that I have, so I'm going to have to ask around.

I tried with the latest firmware (using rpi-update); the 6.6.45 kernel initially seemed more stable, but it also led to the same issue... :(
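Given the 6.1-good / 6.6-bad pattern described above, a script could log which kernel series it is running on, to correlate failures with the kernel. A small sketch (the function name is mine):

```python
import platform
import re

def kernel_series(release=None):
    """Extract the major.minor kernel series, e.g. '6.6' from '6.6.45-v8+'.
    Defaults to the running kernel via platform.release()."""
    release = release or platform.release()
    m = re.match(r"(\d+)\.(\d+)", release)
    return f"{m.group(1)}.{m.group(2)}" if m else release

# The kernel versions mentioned in this thread:
print(kernel_series("6.1.21-v7+"))   # 6.1 (reported fine)
print(kernel_series("6.6.45"))       # 6.6 (reported to lock up)
```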
