[EXPERIMENT] stub test for harmless=true #1555
base: main
pjfanning commented on Nov 8, 2024
- relates to "Clustering issues leading to all nodes being downed" #578
- work in progress: the test does not yet reproduce the issue
- can be refactored into its own test if we manage to reproduce the issue
- we may need to modify the harmless=true test to send a message from one cluster member to the other to cause the shutdown issue
Update OutboundIdleShutdownSpec.scala
Without an active SBR, no node will be shut down: it is the SBR that downs itself when receiving
"eliminate quarantined association when not used (harmless=true)" in withAssociation {
  (remoteSystem, remoteAddress, _, localArtery, localProbe) =>
    // event to watch out for, indicator of the issue
    remoteSystem.eventStream.subscribe(testActor, classOf[ThisActorSystemQuarantinedEvent])

    val remoteEcho = remoteSystem.actorSelection("/user/echo").resolveOne(remainingOrDefault).futureValue
    val localAddress = RARP(system).provider.getDefaultAddress
    val localEchoRef = remoteSystem
      .actorSelection(RootActorPath(localAddress) / localProbe.ref.path.elements)
      .resolveOne(remainingOrDefault)
      .futureValue
    remoteEcho.tell("ping", localEchoRef)
    localProbe.expectMsg("ping")

    val association = localArtery.association(remoteAddress)
    val remoteUid = futureUniqueRemoteAddress(association).futureValue.uid
    localArtery.quarantine(remoteAddress, Some(remoteUid), "HarmlessTest", harmless = true)
    association.associationState.isQuarantined(remoteUid) shouldBe true

    eventually {
      // trigger sending a message from remote to local, which will trigger local to
      // wrongfully notify remote that it is quarantined
      remoteEcho.tell("ping", localEchoRef)
      // this is what remote emits when it learns it is quarantined by local. This is not
      // correct and is what (with SBR enabled) triggers killing the node.
      expectMsgType[ThisActorSystemQuarantinedEvent]
    }
}
I added the new test case, but I am aware that it needs to be moved to the cluster or cluster-tests project, with the Split Brain Resolver added. I am busy with other tasks, so don't expect me to get back to this for a while.
What would it add to move the test to the cluster or cluster-tests projects? To me this is a bug of the
I've added a change to InboundQuarantineCheck based on #578 (comment). This may not be the best solution, but it seems to help in this one test case.
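The change itself is not shown in this conversation. As a rough illustration of the idea discussed here (a harmless quarantine should not be reported back to the remote system), the decision could be modelled roughly as below. This is a toy sketch only: `HarmlessQuarantineSketch`, `QuarantineRecord` and `shouldNotifySender` are hypothetical names invented for this example and are not part of Pekko or of the actual patch.

```scala
// Toy model only: none of these names exist in Pekko; they are hypothetical
// stand-ins used to illustrate the decision, not the real InboundQuarantineCheck.
object HarmlessQuarantineSketch {

  // Hypothetical record of a quarantine decision made by the local system
  // for one incarnation (uid) of a remote system.
  final case class QuarantineRecord(remoteUid: Long, harmless: Boolean)

  // Returns true when the local side should tell the sender of an inbound
  // message that it has been quarantined.
  // Assumption: a harmless quarantine is never reported back to the remote,
  // so the remote does not emit ThisActorSystemQuarantinedEvent because of it.
  def shouldNotifySender(records: Map[Long, QuarantineRecord], originUid: Long): Boolean =
    records.get(originUid).exists(record => !record.harmless)
}
```

Under this assumption, the scenario in the test above (a message arriving from a uid that was quarantined with harmless = true) would not produce a notification, while a regular quarantine still would.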
That looks good to me, thank you!