Check for corrupted models #3184

Open
caco3 opened this issue Aug 17, 2024 · 4 comments
Assignees
Labels
enhancement New feature or request

Comments

@caco3
Collaborator

caco3 commented Aug 17, 2024

The Feature

We sporadically see reports of the system crashing after loading a model, e.g. #3177 (reply in thread).
Usually this is due to an SD card going bad.

I think it would be wise to check the models before we use them (and so prevent a crash).
The best way would be to handle the error when a model turns out to be corrupted. I am not sure if this is possible with the tflite library we use.
The other way would be to provide a second file per model containing its CRC32 or MD5 sum. The firmware could then check the model against it and handle a mismatch.
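
A minimal sketch of the checksum idea, assuming a sidecar file named "<model>.crc32" that holds the expected CRC-32 as hex text. The sidecar naming, the function names and the use of std::ifstream are assumptions for illustration, not something the firmware provides today.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>

// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
// Pass 0 as the initial value; feed the previous result back in to
// continue over the next chunk of data.
static uint32_t crc32_update(uint32_t crc, const uint8_t* data, size_t len) {
    crc = ~crc;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Computes the CRC-32 of a whole file in small chunks, so no large
// buffer is needed. Returns false if the file cannot be opened or read.
static bool crc32_of_file(const std::string& path, uint32_t& out) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    uint32_t crc = 0;
    char buf[512];
    while (in.read(buf, sizeof(buf)) || in.gcount() > 0) {
        crc = crc32_update(crc, reinterpret_cast<const uint8_t*>(buf),
                           static_cast<size_t>(in.gcount()));
    }
    out = crc;
    return !in.bad();
}

// Hypothetical convention: "<model>.crc32" holds the expected checksum
// as hexadecimal text, e.g. "cbf43926".
static bool model_matches_checksum(const std::string& model_path) {
    std::ifstream sidecar(model_path + ".crc32");
    uint32_t expected = 0;
    if (!(sidecar >> std::hex >> expected)) return false;
    uint32_t actual = 0;
    return crc32_of_file(model_path, actual) && actual == expected;
}
```

The firmware could call model_matches_checksum() on the model path before handing the file to the tflite library, and report an error instead of loading the model when it returns false. A missing or unreadable sidecar file is treated as a failure here; whether that is the right policy for models uploaded without a checksum is an open design choice.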

@caco3 caco3 added the enhancement New feature or request label Aug 17, 2024
@caco3
Collaborator Author

caco3 commented Sep 1, 2024

It crashes here:
this->interpreter = new tflite::MicroInterpreter(this->model, resolver, this->tensor_arena, this->kTensorArenaSize);

https://github.com/jomjol/AI-on-the-edge-device/blob/rolling/code/components/jomjol_tfliteclass/CTfLiteClass.cpp#L208

@caco3 caco3 self-assigned this Sep 1, 2024
@caco3
Collaborator Author

caco3 commented Sep 2, 2024

@Slider0007 @SybexX @jomjol Do you have experience with enabling exception handling?
IMO this is the only way to catch the crash which is inside the tflite library.
However, I am unable to enable exception handling in the platformio.ini file.
Whatever I do, I get "error: exception handling disabled, use '-fexceptions' to enable", even though I already replaced -fno-exceptions with -fexceptions...

@Slider0007
Collaborator

Slider0007 commented Sep 2, 2024

@caco3: I've never used exception handling in the ESP-IDF environment, so I cannot assist with this topic. If I understand this correctly, it could be tricky because every potential exception anywhere in the software needs a catch, otherwise processing is aborted in the error case. That would potentially be a lot of work...

Besides exception handling, at least some sort of version check could be added. Maybe this helps a bit, depending on how the file is getting corrupted. The question is what the reaction should be, because the flow in the jomjol firmware cannot be aborted gracefully anyway...

https://github.com/Slider0007/AI-on-the-edge-device/blob/7f14d89bc013f6db145eac343d90b4b457ae11b3/code/components/jomjol_tfliteclass/CTfLiteClass.cpp#L325
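
In the same spirit as a version check, a cheap header check on the raw model bytes could look like the sketch below. It relies on two facts about the format: a FlatBuffers file starts with a 4-byte little-endian offset to its root table, and TFLite models carry the file identifier "TFL3" in bytes 4 to 7. The function name is hypothetical, and this is not the code behind the link above.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Cheap sanity check on a model buffer that was already read into memory.
// It only inspects the FlatBuffers header, so it can catch truncated or
// overwritten files, but not bit flips deeper inside the model data.
static bool looks_like_tflite_model(const uint8_t* buf, size_t size) {
    if (buf == nullptr || size < 8) return false;

    // Bytes 0..3: little-endian offset to the root table.
    const uint32_t root = static_cast<uint32_t>(buf[0]) |
                          static_cast<uint32_t>(buf[1]) << 8 |
                          static_cast<uint32_t>(buf[2]) << 16 |
                          static_cast<uint32_t>(buf[3]) << 24;

    // The root table must start after the 8-byte header and leave room
    // for at least its own 4-byte vtable offset inside the buffer.
    if (root < 8 || root > size - 4) return false;

    // Bytes 4..7: file identifier of the TFLite schema.
    return std::memcmp(buf + 4, "TFL3", 4) == 0;
}
```

Such a check could run on the bytes read from the SD card before they are passed on to the tflite library. It is far weaker than a checksum over the whole file, so it would complement rather than replace one.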

Nevertheless, I raise the question: what will the user do if it's not crashing but still not working anymore because the model is corrupted anyway? This does not solve the root cause, which is SD cards wearing out quickly because of tons of read cycles on the same files...

@caco3
Collaborator Author

caco3 commented Sep 2, 2024

this could be tricky because every potential exception anywhere in the software needs a catch, otherwise processing is aborted in the error case. That would potentially be a lot of work

Well, without exceptions (as it is now), it goes directly into an abort!

The question is what the reaction should be

The crash happens within the tflite library. The right way would be to patch that library, but I fear it might take a lot of learning first. The other way is to add a try/catch around the call. This way we can notify the user gracefully without crashing.
As of now, I don't know how to do that, so the only thing we can do is add a debug log message just before that call. The current implementation is that after a crash, we stay at DEBUG log level and delay the first round by 5 minutes. This way we will see the log message indicating the issue. See the example in #3220.

Your proposal with the version check sounds like a good start, but it will not be able to catch all corruptions.

Nevertheless, I raise the question: what will the user do if it's not crashing but still not working anymore because the model is corrupted anyway?

We can simply show an error in the UI/MQTT, ...

This does not solve the root cause, which is SD cards wearing out quickly because of tons of read cycles on the same files...

Yes, that's right, but I have the feeling we have had quite a few bug reports caused by corrupted models/filesystems.
Because of this I investigated and saw that there is actually no validation at all.
