Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dist blocker #51

Merged
merged 9 commits into from
Mar 8, 2024
Merged

Dist blocker #51

merged 9 commits into from
Mar 8, 2024

Conversation

arcusfelis
Copy link
Contributor

@arcusfelis arcusfelis commented Mar 6, 2024

In this PR we introduce cets_dist_blocker, which will set cookies that would prevent nodes from reconnecting before cleaning is done.

We use set_cookie/2, which allows to set a particular cookie for a specific remote node (and specific local node). If we set cookie to some random value, the node would not be able to connect to us (also we would not able to connect). But the existing connection will remain. Also, it only affects that specific node, other nodes would use cookie provided by get_cookie/0 function.

The idea is:

  • on nodeup we set cookie to something different for that particular node, so the next time it tries to reconnect, it will not able to.
  • on nodedown we would wait for all the cleaner processes to ack that they've finished handling nodedown message (sent using monitor_nodes call).
  • we only do it on a local node (i.e. no coordination between dist_blockers in cluster). Each MongooseIM node has a dist_blocker process though, just no messaging between them. Reason: simplicity. Also, that approach is enough to fix most of the issues.

This PR addresses "sessionCount is incorrect during upgrades".
The final goal is to make global's prevent_overlapped_partitions reliable.

Proposed changes include:

  • cets_dist_blocker server, which provides API to block node from reconnecting, unless we do cleaning first.
  • Test helper to start peer nodes.

PR to MongooseIM: esl/MongooseIM#4234
PR to MongooseHelm: esl/MongooseHelm#38

Copy link

codecov bot commented Mar 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.79%. Comparing base (fbaf1d8) to head (b14d8b8).
Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #51      +/-   ##
==========================================
+ Coverage   98.35%   98.79%   +0.43%     
==========================================
  Files          10       11       +1     
  Lines         792      831      +39     
==========================================
+ Hits          779      821      +42     
+ Misses         13       10       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

…s_done

Add dist_blocker_skip_blocking_if_no_cleaners testcase
dist_blocker_blocks_if_cleaner_says_done_and_second_cleaner_does_not_ack testcase
Common connect_and_disconnect function
Copy link
Member

@chrzaszcz chrzaszcz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. My only concern is how to protect against the cleaner taking too long on some node. Some timeout maybe? But let's not make it too complex at once.

src/cets_dist_blocker.erl Outdated Show resolved Hide resolved
src/cets_dist_blocker.erl Outdated Show resolved Hide resolved
@arcusfelis
Copy link
Contributor Author

Looks good. My only concern is how to protect against the cleaner taking too long on some node. Some timeout maybe? But let's not make it too complex at once.

Could be timeout. Could be a very noisy logging. With timeout allowing node to connect would lead to the "unusual state" (i.e. untested state). It could be ok, it could be just adding more load to the cluster and causing errors.

In this case there is a chance that there would be "that one node, that prevented the join".
But it is fine-tuning, core functionality is here.

Currently we have:

  • nodes are getting cleaned one-by-one.
  • cleaning would take some reasonably short amount of time. Though not always, we have code that calls mod_last to update last_seen timestamp from SM on cleanup.
  • there is a chance that node would be able to connect to one of the nodes that finished cleaning, and not connect to other nodes. Some coordination could be required. Could make a small test and an issue later on, but it is a separate issue.
  • The idea is to have the basic dist_cleaner, and then maybe we can add cross-node cleaners (i.e. node would set cookie to unblocked on all nodes at once, if at least one node sets cookie to blocked - all nodes in the cluster would set to blocked).

Oh, I think there is 30 seconds grace period during which cets_discovery would not try to reconnect - so, it should be enough for the cluster to be able to accept the node again. Maybe it could be enough.

@chrzaszcz chrzaszcz merged commit 0f69ec9 into main Mar 8, 2024
9 checks passed
@chrzaszcz chrzaszcz deleted the dist-blocker branch March 8, 2024 16:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants