use parent process to write db and child to detect deadlock instead o… #1

Open
wants to merge 1 commit into
base: master

glennhickey

When trying to run this in cactus, I keep getting deadlocks whenever it tries to write a snapshot.

The problem seems to be that TimedDB::dump_snapshot_atomic calls fork() (after checking the db type; StashDB, the one used here, counts as "forkable"). The child process goes on to write the snapshot, and the parent waits for it to complete. If the child crashes or hangs, the parent logs an error and returns false.

But I think there's a bug somewhere where the parent thread isn't locking everything it needs to, in which case fork's results are undefined. So when the child goes to lock a mutex, it waits forever. Rather than spend all year looking for the root problem, I'm just flipping the roles here: the parent writes the db and the child hangs around watching for a deadlock. This seems to fix the issue I'm seeing. Side effects will be:

  • If ktserver crashes during the db snapshot, the whole process will exit, whereas before it would log an error.
  • I've boosted the wait time for deadlocks considerably.
  • There may be some kind of crazy race condition causing an error if the snapshot hangs exactly 2.7ish hours (to the microsecond) after writing successfully.

I think that all this is fine for the purposes of cactus, but I will try it on a big example before merging this to master here and in cactus.
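
For illustration only, the role flip described above can be sketched roughly as follows. This is not the code in this PR or the actual body of TimedDB::dump_snapshot_atomic: `write_with_watchdog`, its parameters, and the pipe-based signalling are hypothetical names and choices made up for the example. The parent does the write, and the forked child takes no locks at all; it only waits on a pipe until the parent finishes or a timeout expires.

```cpp
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <functional>

// Hypothetical helper (not part of kyototycoon): the parent runs
// write_snapshot() itself, while a forked child acts as a watchdog.
// Returns true only if the write succeeded and the watchdog saw it
// finish before timeout_ms elapsed.
bool write_with_watchdog(const std::function<bool()>& write_snapshot,
                         int timeout_ms) {
  assert(timeout_ms >= 0);  // a negative poll() timeout would wait forever
  int fds[2];
  if (pipe(fds) != 0) return false;

  pid_t pid = fork();
  if (pid < 0) {
    close(fds[0]);
    close(fds[1]);
    return false;
  }

  if (pid == 0) {
    // Child: never touches the db or any mutex, so it cannot block on a
    // lock that was held by another thread at fork() time.
    close(fds[1]);  // drop our copy so the parent's close() produces EOF
    struct pollfd pfd = {fds[0], POLLIN, 0};
    int rv = poll(&pfd, 1, timeout_ms);
    if (rv == 0) {
      // Timeout: the parent still holds the write end, so presume a hang.
      // write(2) is async-signal-safe, unlike stdio, after fork().
      const char msg[] = "watchdog: snapshot write timed out\n";
      ssize_t n = write(STDERR_FILENO, msg, sizeof(msg) - 1);
      (void)n;
      _exit(1);
    }
    _exit(rv > 0 ? 0 : 2);  // 0: parent closed the pipe, 2: poll() failed
  }

  // Parent: do the real work with all threads and locks intact.
  close(fds[0]);
  bool ok = write_snapshot();
  close(fds[1]);  // EOF on the pipe wakes the watchdog

  int status = 0;
  if (waitpid(pid, &status, 0) != pid) return false;
  return ok && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

In this sketch the child only reports the timeout; the timeout value and what the real code does once a hang is detected are not reproduced here. Note also that if the parent crashes mid-write, the pipe closes and the child simply exits, which matches the first side effect listed above: nothing is left alive to log the failure and carry on.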

…f vice versa to avoid hanging on ubuntu 18.04