Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: atomically add and remove shares from a committee #1760

Open
wants to merge 12 commits into
base: stage
Choose a base branch
from

Conversation

anatolie-ssv
Copy link

@anatolie-ssv anatolie-ssv commented Sep 24, 2024

Share addition/removal takes place under a single lock.

Close: #1558

@anatolie-ssv anatolie-ssv self-assigned this Sep 24, 2024
@anatolie-ssv anatolie-ssv added bug Something isn't working and removed bug Something isn't working labels Sep 24, 2024
operator/validator/controller.go Outdated Show resolved Hide resolved
@anatolie-ssv anatolie-ssv changed the title fix: atomically add and remove shares from a commitee fix: atomically add and remove shares from a committee Sep 24, 2024
@anatolie-ssv anatolie-ssv linked an issue Sep 24, 2024 that may be closed by this pull request
Copy link
Contributor

@y0sher y0sher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets add a test

y0sher

This comment was marked as outdated.

@y0sher y0sher dismissed their stale review September 25, 2024 20:19

more comments

Copy link
Contributor

@y0sher y0sher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This solution indeed works at the moment, but there's something I don't like about it. We still have AddShare and RemoveShare methods in committee.go . This means someone can use them without the atomicity you just added. This means if someone add a usage of them in the future, we might have a race condition again.
I can think of 2 options we have here:

  1. Emphasize in name and comments that these methods are practically unsafe, and that any consumer should use the atomic function in ValidatorsMap
  2. Move the atomicity to inside the Committee struct, which will inherently stop itself when it has no shares left. this way you can probably check in validatorsmap whether the committee has stopped, and if so, remove it from the map. and still keep the functions safe.

I'm leaning more towards 2, what do you guys think? @anatolie-ssv , @MatusKysel , @moshe-blox

@anatolie-ssv anatolie-ssv force-pushed the fix/add-share-rc branch 4 times, most recently from 6c95e40 to f0a092c Compare September 26, 2024 09:41
@anatolie-ssv anatolie-ssv changed the title fix: atomically add and remove shares from a committee Atomically add and remove shares from a committee Sep 30, 2024
@anatolie-ssv anatolie-ssv changed the title Atomically add and remove shares from a committee fix: atomically add and remove shares from a committee Sep 30, 2024
@anatolie-ssv
Copy link
Author

This solution indeed works at the moment, but there's something I don't like about it. We still have AddShare and RemoveShare methods in committee.go . This means someone can use them without the atomicity you just added. This means if someone add a usage of them in the future, we might have a race condition again. I can think of 2 options we have here:

1. Emphasize in name and comments that these methods are practically unsafe, and that any consumer should use the atomic function in `ValidatorsMap`

2. Move the atomicity to inside the `Committee` struct, which will inherently stop itself when it has no shares left. this way you can probably check in `validatorsmap` whether the committee has stopped, and if so, remove it from the map. and still keep the functions safe.

I'm leaning more towards 2, what do you guys think? @anatolie-ssv , @MatusKysel , @moshe-blox

I moved the atomicity functionality to the committee object.

@@ -87,6 +90,9 @@ func (c *Committee) RemoveShare(validatorIndex phase0.ValidatorIndex) {
if share, exist := c.Shares[validatorIndex]; exist {
c.dutyGuard.StopValidator(share.ValidatorPubKey)
delete(c.Shares, validatorIndex)
if len(c.Shares) == 0 {
c.stop()
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@y0sher now that we are able to stop the committee should we be checking its state before doing actions like AddShare, ProcessMessage etc?

Copy link
Contributor

@iurii-ssv iurii-ssv Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it's better (more straightforward) to just call Committee.cancel once we determined that committee can stop (shares dropped to 0), presumably this should be enough to finish all the work Committee was doing (while it's controller's job to make sure no new tasks will be assigned to such Committee - which it does handle by atomically removing it from controller.validatorsMap),

and, from my reading of this code, unless we do something with stopped Committee (have some business logic for it in stopped state, like restarting it) there is no reason to even store this stopped atomic.Bool (and have corresponding methods stop and Stopped) - you could simply remove these and adjust the way UpdateCommitteeAtomic works (so that instead of checking Stopped() func mutate will return a bool to signal that committee was stopped by this mutation - which will immediately trigger all the cleanup necessary).

@anatolie-ssv
Copy link
Author

anatolie-ssv commented Oct 3, 2024

deployed to stage 311-314 epoch 83532 to 83652

@anatolie-ssv anatolie-ssv force-pushed the fix/add-share-rc branch 2 times, most recently from 9e64123 to 4d86735 Compare October 10, 2024 10:33
Comment on lines 182 to 178
func (vm *ValidatorsMap) cleanup() {
var stopped []spectypes.CommitteeID
for k, v := range vm.committees {
if v.Stopped() {
stopped = append(stopped, k)
}
}
for _, k := range stopped {
delete(vm.committees, k)
}
return nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not something more simple

func (vm *ValidatorsMap) cleanup() {
	for k, v := range vm.committees {
		if v.Stopped() {
			delete(vm.committees, k)
		}
	}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also do we want to iterate or just clean up one that's been modified in UpdateCommitteeAtomic?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to cleanup, fixed now!

operator/validators/validators_map_test.go Outdated Show resolved Hide resolved
operator/validator/controller.go Outdated Show resolved Hide resolved
operator/validators/validators_map_test.go Outdated Show resolved Hide resolved
Copy link
Contributor

@iurii-ssv iurii-ssv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code should work, I left couple of nits, but mostly I don't quite understand why we need the concept of "stopped" committee - if it isn't something we are planning to use I'd rather we simplify code by getting rid of it.

@@ -698,10 +698,9 @@ func (c *controller) UpdateValidatorsMetadata(data map[spectypes.ValidatorPK]*be
if err != nil {
c.logger.Warn("could not start validator", zap.Error(err))
}
vc, found := c.validatorsMap.GetCommittee(v.Share().CommitteeID())
if found {
_ = c.validatorsMap.UpdateCommitteeAtomic(v.Share().CommitteeID(), func(vc *validator.Committee) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we log an error or something when UpdateCommitteeAtomic returns false here (just as a sanity check/thing) ?

Comment on lines 960 to 965
var vc *validator.Committee

// Start a committee validator.
vc, found := c.validatorsMap.GetCommittee(operator.CommitteeID)
found = c.validatorsMap.UpdateCommitteeAtomic(operator.CommitteeID, func(fvc *validator.Committee) {
fvc.AddShare(&share.Share)
vc = fvc
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be cleaner to return validator.Committee from UpdateCommitteeAtomic, wdyt ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but there are 3 call sites to this function and only this one would benefit from the extra return value. @y0sher @MatusKysel your take here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would vote for returning the value

Comment on lines +144 to +146
if !ok || v.Stopped() {
return nil, false
}
Copy link
Contributor

@iurii-ssv iurii-ssv Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem like "stopped" committee can be present in vm.committees map because we immediately remove stopped committees in UpdateCommitteeAtomic, so perhaps no need to add this check here ?

Applies to all methods in ValidatorsMap I think, but only because we don't really "stop" committee from anywhere else other than ValidatorsMap (via UpdateCommitteeAtomic -> RemoveShare) - like I mentioned in #1760 (comment). In theory we could "stop" committee from somewhere else, I just don't see why we would (and it makes code harder to understand or see the whole picture).

deletedCommittee.Stop()
}
}) {
c.logger.Warn("committee not found", fields.PubKey(pubKey[:]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to https://github.com/ssvlabs/ssv/pull/1760/files#r1840850336, probably want to log this as Error (not Warn) ? Because if committee is not found here - looks like something is wrong with the way we manage it's lifecycle, and we'd need to address it (warnings in logs are pretty easy to miss, compared to errors).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I remember in past conversations it was decided not to change the log level and keep it as it was previously (line 898)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

rethink committee share management
5 participants