
Fixing errors in the filesystem #632

Open

Ulrar opened this issue Mar 27, 2024 · 2 comments
Labels
bug (Something isn't working), v2 (This affects only Operator v2)

Comments


Ulrar commented Mar 27, 2024

Hi,

Unfortunately I'm still having big issues on every resync, which (almost always) lead to corruption in the filesystem. Is there a way to mark a volume as needing an e2fsck before its next mount?

The issue is that it's always in use, so I can't do it manually. In this specific case it's in use by cloudnative-pg, which doesn't use Deployments, so I can't even scale it down to 0 replicas to free it up.

Is there a trick to run it automatically, or maybe just to prevent the PVC from being mounted, to give me a chance to do it manually?

Thanks

@WanzenBug (Member) commented

There may be a mount option that does that, but I am not sure.
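Something along these lines might work, but I have not verified it on this setup: kubelet's standard mount utilities run fsck -a on a block device before mounting it, so flagging the ext4 filesystem as needing a check should be enough, assuming the CSI driver goes through that path. The device path below is a placeholder for the volume's DRBD device.

```sh
# Unverified sketch; /dev/drbd1000 is a placeholder for the volume's DRBD device.
# The resource has to be Primary on this node for the device to be writable.

# Mark the ext4 filesystem as having errors, so the next fsck will not skip it:
tune2fs -E force_fsck /dev/drbd1000

# Or force a full check on every mount via the maximum mount count:
tune2fs -c 1 /dev/drbd1000
```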

Data corruption is a serious issue, so please attach an sos-report (linstor sos-report download) so we can assess it.

WanzenBug added the bug and v2 labels on Mar 27, 2024

Ulrar commented Mar 27, 2024

Sure, of course, here you go.
sos_2024-03-27_10-44-24.tar.gz

My underlying problem is still #579, for which I still have no clue. I physically replaced the node that was crashing, so now all three of them do stay up all the time, but they blink in and out constantly when a resync is needed. There's no packet loss, but the response times are all over the place, sometimes with pauses of a few seconds for some of the nodes.
It still looks to me like the actual sync is fine once it's going, even with all the volumes syncing at the same time. The etcd leader keeps changing during the issue, so it's clear the whole node (or nodes) freezes for long enough to trigger an election.
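For reference, this is roughly how I watch the connection and resync state while it happens; nothing exotic, just the LINSTOR client plus DRBD's own tools on the nodes:

```sh
# From the LINSTOR client: resource and volume states as LINSTOR sees them.
linstor resource list
linstor volume list

# On each node: DRBD's own view, including connection state and resync progress.
drbdadm status
# More detail (out-of-sync counters, per-peer statistics):
drbdsetup status --verbose --statistics
```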

The issue appears during the bitmap calculation (or maybe another step around then that I can't see), with other volumes randomly going through various states like disconnected, unconnected, broken pipe, and then retrying. I usually have to disconnect all the volumes myself and connect them one by one. Once a volume is actually syncing I can move on to the next one; the actual sync is stable and works fine.
All the while I'm fighting the operator, which doesn't like having disconnected nodes; every time it tries to reconnect them all at the same time the whole thing explodes again, so I have to disconnect them and reconnect them one by one, quickly enough to finish before the operator notices.
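Concretely, the per-volume dance looks roughly like this (the resource name is a placeholder):

```sh
# One resource at a time, on the affected node:
drbdadm disconnect pvc-0123abcd   # placeholder resource name
drbdadm connect pvc-0123abcd

# Wait until the status shows SyncSource/SyncTarget, i.e. the resync is
# actually running, before touching the next resource.
drbdadm status pvc-0123abcd
```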

Because of all those disconnects, the quorum keeps getting lost and I often end up in weird states where the nodes don't agree on who's UpToDate / Inconsistent. Disconnecting and reconnecting them one by one sometimes clears it, but I also sometimes have to manually pick one. The EXT4 filesystem on the volumes almost always ends up with some amount of errors, which is easy enough to fix while the pods can't be scheduled, but once they're running it's a lot trickier to do.
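If I ever need to run e2fsck by hand, the only way I've found to free the volume is to take the whole Postgres cluster down first. A rough sketch, assuming cloudnative-pg's declarative hibernation annotation and placeholder names for the cluster, resource, and DRBD device:

```sh
# Rough sketch; cluster name, resource name and DRBD minor are placeholders,
# and the hibernation annotation is assumed from cloudnative-pg's docs.

# 1. Hibernate the Postgres cluster (pods go away, PVCs are kept):
kubectl annotate clusters.postgresql.cnpg.io my-pg-cluster cnpg.io/hibernation=on

# 2. On the node holding the data, promote the resource and run the check:
drbdadm primary pvc-0123abcd
e2fsck -f /dev/drbd1000
drbdadm secondary pvc-0123abcd

# 3. Wake the cluster back up:
kubectl annotate clusters.postgresql.cnpg.io my-pg-cluster cnpg.io/hibernation=off --overwrite
```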
