Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chain: fix NeutrinoClient segfault on NotifyReceived call #822

Merged
merged 6 commits into from
May 10, 2023

Conversation

MStreet3
Copy link
Contributor

@MStreet3 MStreet3 commented Oct 26, 2022

Fixes #819
Replaces #820

Background

There is datarace in the neutrino client implementation that is causing test
flakes in lnd. The shared rescan object and the granular
locking strategy allow for the possibility of a segmentation fault when
NotifyReceived attempts to call Update on a rescan that is nil. This
is because the clientMtx lock is released after rescan is set to nil.

Changes

The solution here is to be minimally invasive. Add a new rescanMtx mutex
and require that the NotifyReceived method and the Rescan method
hold the rescanMtx lock for the duration of their execution. Locking and
unlocking of the client is unaffected and changes to the rescan object
become atomic.

Testing

This PR adds two interfaces rescanner and nutrinoChainService
to abstract out the dependency on structs from the neutrino package.

This PR adds a file of mock implementations for these interfaces and uses them
in a new unit test file.

@MStreet3 MStreet3 mentioned this pull request Oct 26, 2022
@MStreet3 MStreet3 changed the title Bug/rescan data race Fix NeutrinoClient segfault on NotifyReceived call Oct 26, 2022
@MStreet3 MStreet3 changed the title Fix NeutrinoClient segfault on NotifyReceived call chain: fix NeutrinoClient segfault on NotifyReceived call Oct 26, 2022
Copy link
Collaborator

@guggero guggero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for re-structuring the PR! I left a few suggestions for how to improve it even more for easier review.
But I was able to understand all the changes, nice work! Sorry it took so long to get to the first pass review.

I think we can further reduce the diff by passing in the newRescanFunc in the NewNeutrinoClient. That removes the need for any type casting or having two versions of the cs member around. See my diff with the proposed changes (might need some additional cleanup, this is just to get my idea across):

pr-822.diff.txt

chain/internal/rescan/rescan.go Outdated Show resolved Hide resolved
chain/neutrino.go Outdated Show resolved Hide resolved
chain/neutrino.go Outdated Show resolved Hide resolved
chain/mocks_test.go Outdated Show resolved Hide resolved
chain/mocks_test.go Outdated Show resolved Hide resolved
chain/neutrino.go Show resolved Hide resolved
chain/internal/rescan/rescan.go Outdated Show resolved Hide resolved
chain/neutrino_test.go Outdated Show resolved Hide resolved
chain/neutrino.go Outdated Show resolved Hide resolved
chain/neutrino.go Outdated Show resolved Hide resolved
@MStreet3 MStreet3 force-pushed the bug/rescan-data-race branch 5 times, most recently from 4d87ccb to 45a359f Compare November 18, 2022 02:55
@MStreet3 MStreet3 requested a review from guggero November 18, 2022 02:56
@MStreet3 MStreet3 force-pushed the bug/rescan-data-race branch from 45a359f to 54710ef Compare November 20, 2022 14:38
Copy link
Collaborator

@guggero guggero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, only nits left 🎉 Thanks for all your work on this!

btcwallet.go Outdated Show resolved Hide resolved
chain/neutrino.go Outdated Show resolved Hide resolved
chain/rescan.go Show resolved Hide resolved
chain/mocks_test.go Outdated Show resolved Hide resolved
chain/neutrino_test.go Outdated Show resolved Hide resolved
chain/neutrino_test.go Outdated Show resolved Hide resolved
chain/neutrino_test.go Outdated Show resolved Hide resolved
chain/neutrino.go Show resolved Hide resolved
@guggero
Copy link
Collaborator

guggero commented Nov 25, 2022

@ellemouton could you take a look at this as well please?

@MStreet3 MStreet3 force-pushed the bug/rescan-data-race branch 10 times, most recently from a175d3f to baead51 Compare November 26, 2022 16:47
@MStreet3 MStreet3 requested a review from guggero November 26, 2022 16:50
Copy link
Contributor

@chappjc chappjc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rescanMtx resolution seems fine, but this PR introduces some notable and breaking API changes and abstraction, which create some challenges and confusion for consumers.

The abstraction seems unnecessary, except for test stubbing, but even then I don't think the constructors and exported types need to change. Namely, the test code can and does set the fields directly with its mock stubs (without using the constructors), since e.g. newMockNeutrinoClient manually creates the NeutrinoClient, and the newly added NewRescanFunc isn't used. Could the exported constructor, which consumers like the btcwallet app use, stay the same as before but deal with setting up the rescanner fields from the provided chain service?

This is only from a cursory scan, so I apologize in advance if my assessment is off base, but I don't think this improves the API in its present state, and I suspect the goals of testing with a mock chain service can still be achieved without the API changes.

chain/neutrino.go Outdated Show resolved Hide resolved
chain/neutrino.go Outdated Show resolved Hide resolved
@MStreet3 MStreet3 force-pushed the bug/rescan-data-race branch 3 times, most recently from 0b24151 to a6507c0 Compare November 27, 2022 00:55
@MStreet3 MStreet3 removed the request for review from guggero November 27, 2022 00:57
@MStreet3 MStreet3 force-pushed the bug/rescan-data-race branch 2 times, most recently from 47929e7 to 9d91d11 Compare December 2, 2022 01:54
@MStreet3 MStreet3 requested a review from guggero December 2, 2022 01:57
@MStreet3 MStreet3 force-pushed the bug/rescan-data-race branch 3 times, most recently from df17f40 to 9498860 Compare December 2, 2022 03:48
Copy link
Collaborator

@guggero guggero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, LGTM 🎉

chain/rescan.go Outdated Show resolved Hide resolved
chain/neutrino_test.go Outdated Show resolved Hide resolved
@MStreet3 MStreet3 force-pushed the bug/rescan-data-race branch from 9498860 to 49e8366 Compare December 2, 2022 11:38
@MStreet3 MStreet3 requested a review from guggero December 2, 2022 11:39
Copy link
Collaborator

@guggero guggero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for all the quick updates on this. LGTM 🎉
Before we merge this, can you please update (or create?) the PR on the lnd side to make sure all neutrino integration tests still pass?

@MStreet3
Copy link
Contributor Author

MStreet3 commented Dec 2, 2022

Thanks a lot for all the quick updates on this. LGTM 🎉

Before we merge this, can you please update (or create?) the PR on the lnd side to make sure all neutrino integration tests still pass?

Yea will do.

@MStreet3 MStreet3 force-pushed the bug/rescan-data-race branch from c5e8a86 to 53904c1 Compare December 2, 2022 12:16
@MStreet3
Copy link
Contributor Author

MStreet3 commented Dec 2, 2022

@guggero the corresponding LND PR with updated btcwallet is here: lightningnetwork/lnd#7049

@MStreet3 MStreet3 force-pushed the bug/rescan-data-race branch from 53904c1 to 12cb2a6 Compare December 11, 2022 14:17
Copy link
Collaborator

@ellemouton ellemouton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! Love the new abstractions that make things way more testable. Great job!

@guggero
Copy link
Collaborator

guggero commented May 8, 2023

@MStreet3 can you do a rebase please?
Also, @chappjc could I ask you for another round of review please, since you did a first round a while ago.

Copy link
Contributor

@chappjc chappjc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still builds and runs fine in my use case.

With limited familiarity with btcwallet's code, the atomicity changes look fine too. I don't see any lock ordering pitfalls that could lead to deadlock, but it's a risk of nested locking.

I think I see some ancient data races on the rescanQuit field in the notification handlers like onBlockConnected, but that's unrelated to the issue being solved in this PR.

MStreet3 added 6 commits May 9, 2023 22:35
Define an abstract rescanner interface and a
newRescanFunc constructor type.  Define
a constructor adapter that adapts a
neutrino.NewRescan constructor into the
required type.  Replace direct calls in the
NeutrinoClient to the neutrino.NewRescan
constructor with calls to the abstract type.
Define NeutrinoChainService interface to provide
abstract access to all the public methods defined
for a *neutrino.ChainService object.  The interface
defines all public methods to ensure downstream
consumers of the package maintain compatibility as
the previous reference to a *neutrino.ChainService
is exposed publicly on the NeutrinoClient.
Add TestNeutrinoClientSequentialStartStop to verify that the
neutrino client can have Start() and Stop() called sequentially.
Test case fails with -race flag enabled.

Add TestNeutrinoClientNotifyReceived to verify that calls to
NotifyReceived call the Update method of the rescan object.  Verify
that there is no race condition on any state variables of the
NeutrinoClient.  Pass test with -race flag enabled.

Add a unit test that demonstrates a segmentation fault exists when
concurrent calls to NotifyReceived, NotifyBlocks and Rescan methods of
the NeutrinoClient are executed.  The rescan property is set to nil
and a segmentation fault arises.
Add a mutex to synchronize access to the rescan prop of the
neutrino client.  Make calls to Rescan and NotifyReceived
require the rescan mutex lock to execute.  Resolve the segmentation
fault in the TestNeutrinoClientNotifyReceivedRescan unit test.
Resolve the data race in the sequential start stop unit test
by ensuring that each goroutine launched in the Start method
is properly accounted for in the client's waitgroup.

Add a TODO to shutdown the rescan goroutine when the Stop
method is called.
@MStreet3 MStreet3 force-pushed the bug/rescan-data-race branch from 12cb2a6 to 8c31629 Compare May 10, 2023 02:36
@MStreet3 MStreet3 requested a review from guggero May 10, 2023 02:41
@guggero guggero merged commit 4383930 into btcsuite:master May 10, 2023
buck54321 pushed a commit to buck54321/btcwallet that referenced this pull request Apr 21, 2024
chain: fix NeutrinoClient segfault on NotifyReceived call
buck54321 pushed a commit to buck54321/btcwallet that referenced this pull request Apr 21, 2024
chain: fix NeutrinoClient segfault on NotifyReceived call
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NeutrinoClient data race
4 participants