Fixing errors in the filesystem #632

Ulrar · 2024-03-27T10:28:05Z

Hi,

Unfortunately still having big issues on every resync which (almost always) lead to corruption in the filesystem. Is there a way to mark a volume as needing an e2fsck before it's next mount ?

The issue is it's always in use, so I can't manually do it. In this specific case it's in use by cloudnative-pg which does not use deployments, so I can't even force it down to 0 replicas to free it up.

Is there a trick to automatically run it, or maybe just to prevent the pvc from being mounted to give me a chance to manually do it ?

Thanks

WanzenBug · 2024-03-27T10:41:59Z

There may be a mount option that does that, but I am not sure.

Data corruption is a serious issue, so if you could please attach an sos-report (linstor sos-report download) so we could assess it.

Ulrar · 2024-03-27T10:55:00Z

Sure of course, here you go.
sos_2024-03-27_10-44-24.tar.gz

My underlying problem is still #579 for which I have no clue. I physically replaced the node that was crashing, so now all three of them do stay up all the time but they blink in an out constantly when a resync is needed. No packet loss but the response times are all over the place with sometimes pauses of a few seconds for some of the nodes.
It still looks to me like the actual sync is fine, once it's going, even if all the volumes are syncing at the same time. The etcd leader keeps changing during the issue, so it's clear the whole node(s) freezes for long enough to trigger an election.

The issue appears during the bitmap calculation (or maybe another step around then that I can't see), with other volumes randomly going through various states like disconnected, unconnected, broken pipe .. and then re-trying. I usually have to disconnect all the volumes myself, and connect them one by one. Once a volume is actually syncing I can move on to the next one, the actual sync is stable and works fine.
All the while fighting the operator that doesn't like having disconnected nodes, but every time it tries to re-connect them all at the same time it causes the whole thing to explode again, so I have to disconnect them and re-connect them one by one quickly enough to be finished before the operator notices.

Because of all those disconnects every time, the quorum keeps getting lost and I often end up in weird states where nodes don't agree on who's UpToDate / Inconsistent. Disconnecting / re-connecting them one by one sometimes clears it, but I also sometimes have to manually pick one. The EXT4 on the volumes themselves almost always ends up with some amount of errors, which is easy enough to fix if the pods can't be scheduled, but once they're running it's a lot trickier to do.

WanzenBug added bug Something isn't working v2 This affects only Operator v2 labels Mar 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixing errors in the filesystem #632

Fixing errors in the filesystem #632

Ulrar commented Mar 27, 2024

WanzenBug commented Mar 27, 2024

Ulrar commented Mar 27, 2024 •

edited

Loading

Fixing errors in the filesystem #632

Fixing errors in the filesystem #632

Comments

Ulrar commented Mar 27, 2024

WanzenBug commented Mar 27, 2024

Ulrar commented Mar 27, 2024 • edited Loading

Ulrar commented Mar 27, 2024 •

edited

Loading