-
Notifications
You must be signed in to change notification settings - Fork 95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: atomically add and remove shares from a committee #1760
base: stage
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets add a test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This solution indeed works at the moment, but there's something I don't like about it. We still have AddShare
and RemoveShare
methods in committee.go
. This means someone can use them without the atomicity you just added. This means if someone add a usage of them in the future, we might have a race condition again.
I can think of 2 options we have here:
- Emphasize in name and comments that these methods are practically unsafe, and that any consumer should use the atomic function in
ValidatorsMap
- Move the atomicity to inside the
Committee
struct, which will inherently stop itself when it has no shares left. this way you can probably check invalidatorsmap
whether the committee has stopped, and if so, remove it from the map. and still keep the functions safe.
I'm leaning more towards 2, what do you guys think? @anatolie-ssv , @MatusKysel , @moshe-blox
6c95e40
to
f0a092c
Compare
I moved the atomicity functionality to the committee object. |
@@ -87,6 +90,9 @@ func (c *Committee) RemoveShare(validatorIndex phase0.ValidatorIndex) { | |||
if share, exist := c.Shares[validatorIndex]; exist { | |||
c.dutyGuard.StopValidator(share.ValidatorPubKey) | |||
delete(c.Shares, validatorIndex) | |||
if len(c.Shares) == 0 { | |||
c.stop() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@y0sher now that we are able to stop the committee should we be checking its state before doing actions like AddShare
, ProcessMessage
etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it's better (more straightforward) to just call Committee.cancel
once we determined that committee can stop (shares dropped to 0), presumably this should be enough to finish all the work Committee
was doing (while it's controller
's job to make sure no new tasks will be assigned to such Committee
- which it does handle by atomically removing it from controller.validatorsMap
),
and, from my reading of this code, unless we do something with stopped Committee
(have some business logic for it in stopped state, like restarting it) there is no reason to even store this stopped atomic.Bool
(and have corresponding methods stop
and Stopped
) - you could simply remove these and adjust the way UpdateCommitteeAtomic
works (so that instead of checking Stopped()
func mutate
will return a bool
to signal that committee was stopped by this mutation - which will immediately trigger all the cleanup necessary).
7223a0e
to
bea9f9a
Compare
deployed to stage |
9e64123
to
4d86735
Compare
4d86735
to
25f8e3d
Compare
25f8e3d
to
60d5e2e
Compare
0332913
to
9a78188
Compare
func (vm *ValidatorsMap) cleanup() { | ||
var stopped []spectypes.CommitteeID | ||
for k, v := range vm.committees { | ||
if v.Stopped() { | ||
stopped = append(stopped, k) | ||
} | ||
} | ||
for _, k := range stopped { | ||
delete(vm.committees, k) | ||
} | ||
return nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not something more simple
func (vm *ValidatorsMap) cleanup() {
for k, v := range vm.committees {
if v.Stopped() {
delete(vm.committees, k)
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also do we want to iterate or just clean up one that's been modified in UpdateCommitteeAtomic
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just to cleanup, fixed now!
9a78188
to
b008038
Compare
0c1e89e
to
4853472
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code should work, I left couple of nits, but mostly I don't quite understand why we need the concept of "stopped" committee - if it isn't something we are planning to use I'd rather we simplify code by getting rid of it.
operator/validator/controller.go
Outdated
@@ -698,10 +698,9 @@ func (c *controller) UpdateValidatorsMetadata(data map[spectypes.ValidatorPK]*be | |||
if err != nil { | |||
c.logger.Warn("could not start validator", zap.Error(err)) | |||
} | |||
vc, found := c.validatorsMap.GetCommittee(v.Share().CommitteeID()) | |||
if found { | |||
_ = c.validatorsMap.UpdateCommitteeAtomic(v.Share().CommitteeID(), func(vc *validator.Committee) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we log an error or something when UpdateCommitteeAtomic
returns false here (just as a sanity check/thing) ?
operator/validator/controller.go
Outdated
var vc *validator.Committee | ||
|
||
// Start a committee validator. | ||
vc, found := c.validatorsMap.GetCommittee(operator.CommitteeID) | ||
found = c.validatorsMap.UpdateCommitteeAtomic(operator.CommitteeID, func(fvc *validator.Committee) { | ||
fvc.AddShare(&share.Share) | ||
vc = fvc | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be cleaner to return validator.Committee
from UpdateCommitteeAtomic
, wdyt ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could, but there are 3 call sites to this function and only this one would benefit from the extra return value. @y0sher @MatusKysel your take here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would vote for returning the value
if !ok || v.Stopped() { | ||
return nil, false | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem like "stopped" committee can be present in vm.committees
map because we immediately remove stopped committees in UpdateCommitteeAtomic
, so perhaps no need to add this check here ?
Applies to all methods in ValidatorsMap
I think, but only because we don't really "stop" committee from anywhere else other than ValidatorsMap
(via UpdateCommitteeAtomic
-> RemoveShare
) - like I mentioned in #1760 (comment). In theory we could "stop" committee from somewhere else, I just don't see why we would (and it makes code harder to understand or see the whole picture).
deletedCommittee.Stop() | ||
} | ||
}) { | ||
c.logger.Warn("committee not found", fields.PubKey(pubKey[:])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to https://github.com/ssvlabs/ssv/pull/1760/files#r1840850336, probably want to log this as Error
(not Warn
) ? Because if committee is not found here - looks like something is wrong with the way we manage it's lifecycle, and we'd need to address it (warnings in logs are pretty easy to miss, compared to errors).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I remember in past conversations it was decided not to change the log level and keep it as it was previously (line 898)
4853472
to
e771d25
Compare
Co-authored-by: rehs0y <lyosher@gmail.com>
Co-authored-by: Nikita Kryuchkov <nkryuchkov10@gmail.com>
Share addition/removal takes place under a single lock.
Close: #1558