Fix sending of WantBlocks messages and tracking of peerWants #1019
base: master
Conversation
I gave this a review and there are a couple of things that need addressing.
codex_block_exchange_want_have_lists_sent.inc()
await b.sendWantBlock(@[address], selected)
await b.sendWantHave(@[address], peers.without)
Ugh, why are we sending a WantHave to the peers that we know don't have the block we need?
Peers only report their haves when asked, so we have to send a wantHave before we know whether they have the block.
Ah, you're right.
@@ -32,6 +32,9 @@ logScope:
type
  PeerCtxStore* = ref object of RootObj
    peers*: OrderedTable[PeerId, BlockExcPeerCtx]
  PeersForBlock* = ref object of RootObj
Doesn't have to be a ref object (maybe a tuple works better here as well?).
I would prefer an object over a tuple. After compilation they should be very similar, right?
Sure, but let's not make it a ref for no reason.
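For illustration, a minimal sketch of the two alternatives, assuming `with`/`without` fields as suggested by the diff's `peers.without`; `BlockExcPeerCtx` is reduced to a stub here:

```nim
type
  # Stub stand-in for the real peer context type.
  BlockExcPeerCtx = ref object
    id: string

  # Plain value object: no heap allocation, copied by value, cannot be nil.
  PeersForBlock = object
    with: seq[BlockExcPeerCtx]     # peers that reported having the block
    without: seq[BlockExcPeerCtx]  # peers with no known presence for it

  # Tuple alternative: structurally identical, but positional and anonymous
  # rather than a named nominal type.
  PeersForBlockTuple = tuple[
    with: seq[BlockExcPeerCtx], without: seq[BlockExcPeerCtx]]

when isMainModule:
  let peers = PeersForBlock(
    with: @[BlockExcPeerCtx(id: "peer-a")],
    without: @[BlockExcPeerCtx(id: "peer-b")])
  echo peers.without.len  # 1
```

Either form avoids the extra indirection and nil-ability of a ref; the object keeps named construction, which is the main practical difference from the tuple.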
@@ -427,10 +425,8 @@ proc wantListHandler*(
          address: e.address,
          `type`: BlockPresenceType.Have,
          price: price))
      elif e.wantType == WantType.WantBlock:
There is a good reason why I separated WantBlock into its own branch. We used to schedule a task for every single WantList update, but that isn't necessary and, worse, it was severely affecting performance. We should not attempt to schedule tasks on WantHave, since that is just a presence update.
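A rough, self-contained sketch of that branching, using simplified stand-ins for the engine and its helpers (only the WantType/WantListEntry names echo the codebase):

```nim
type
  WantType = enum
    WantHave, WantBlock

  WantListEntry = object
    address: string   # block address, simplified to a string here
    wantType: WantType

  Engine = object
    taskQueue: seq[string]  # peers that currently have a task scheduled

proc respondPresence(e: var Engine, peer: string, entry: WantListEntry) =
  # Stand-in for sending a Have/DontHave presence message back to `peer`.
  echo "presence for ", entry.address, " -> ", peer

proc wantListHandler(e: var Engine, peer: string, entries: seq[WantListEntry]) =
  for entry in entries:
    case entry.wantType
    of WantHave:
      # Presence update only: answer immediately, never schedule a task.
      e.respondPresence(peer, entry)
    of WantBlock:
      # Actual block request: schedule at most one task for this peer.
      if peer notin e.taskQueue:
        e.taskQueue.add peer

when isMainModule:
  var engine = Engine()
  engine.wantListHandler("peer-a", @[
    WantListEntry(address: "block-1", wantType: WantHave),
    WantListEntry(address: "block-2", wantType: WantBlock)])
  echo engine.taskQueue  # @["peer-a"]
```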
"We track all peerWants instead of only the wantBlocks. This is according to spec, and solves an issue in the case where a block is received/detected after the want-list message."
For more context, this isn't actually true: we do track WantHave messages, but we respond to them immediately with a presence message.
OK, I see the problem with the scheduling of tasks for every wantList. Makes perfect sense and I'll fix that right now.
Regarding the tracking of wantHaves: responding immediately is good but not enough. We're expected to keep the want in the peerCtx. If we encounter the block later, we should still react to the want with a presence message. The tests in "NetworkStore - multiple nodes" in testblockexc.nim actually check this behavior perfectly. They were previously passing only because 1) we were sending wantBlocks when we shouldn't and 2) we were keeping only the wantBlocks. So fixing the sending of wantBlock caused this correct test to reveal the other issue.
I would like to take a closer look at exactly how the stored wants will trigger messages when blocks are received. But I'm trying to keep the scope of this PR small.
OK, I see. Then what we need to do is add these to the BlockExcPeerCtx.peerWants list in addition to responding to the request.
For a later PR, we should take a closer look at how these lists are kept. For example, peerWants is a simple seq, which has no cap and no ordering. Some lists, like peerWants, in addition to being capped, should also allow for ordering; for example, WantListEntry has a priority field, which should be respected.
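As a follow-up illustration, a minimal sketch of what a capped, priority-ordered peerWants could look like; only the priority field mirrors WantListEntry, while the cap value and helper name are made up for the example:

```nim
import std/algorithm

type
  WantListEntry = object
    address: string
    priority: int32

const MaxPeerWants = 256  # arbitrary example cap

proc pushWant(wants: var seq[WantListEntry], entry: WantListEntry) =
  wants.add entry
  # Keep the highest-priority entries first, respecting the priority field.
  wants.sort(proc(a, b: WantListEntry): int = cmp(b.priority, a.priority))
  # Enforce the cap by dropping the lowest-priority tail.
  if wants.len > MaxPeerWants:
    wants.setLen(MaxPeerWants)

when isMainModule:
  var peerWants: seq[WantListEntry]
  peerWants.pushWant(WantListEntry(address: "block-1", priority: 1))
  peerWants.pushWant(WantListEntry(address: "block-2", priority: 5))
  echo peerWants[0].address  # "block-2" (higher priority first)
```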
For more context, what needs to happen is that when adding a DontHave to the presence seq, we also add it to the peerCtx.peerWants list. We only need to remember the DontHaves, because the remote should keep track of the Haves we send back.
Yes: upgrading the peerWants to a capped, prioritized list is definitely needed.
"what needs to happen is that when adding a DontHave to the presence seq, we also add it to peerCtx.peerWants list."
In the PR, yes, we are adding it to the peerWants regardless of Have/DontHave.
"We only need to remember the DontHaves, because the remote should keep track of the Haves we send back."
I understand this to mean:
- If we Have, don't add the want to the peerCtx, only send presence.
- If we don't Have, add the want to the peerCtx, and send a dontHave presence if asked (e.sendDontHave).
I think bitswap adds the want to the peerCtx in both cases.
For the record, e.sendDontHave has always been false up to this point except for some tests, and that's a good thing, because the peerContext peerHave proc doesn't currently consider the have-flag.
"If we Have, don't add the want to the peerCtx, only send presence."
"If we don't Have, add the want to the peerCtx, and send a dontHave presence if asked (e.sendDontHave)."
Yep, that's what I mean.
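A compact sketch of that agreed behavior; the types and helpers below are illustrative stand-ins rather than the PR's actual code:

```nim
type
  Presence = enum
    Have, DontHave

  # Stand-in for the real peer context; peerWants is reduced to addresses.
  BlockExcPeerCtx = ref object
    peerWants: seq[string]

proc sendPresence(peerCtx: BlockExcPeerCtx, address: string, presence: Presence) =
  echo "presence ", presence, " for ", address

proc handleWantHave(localStore: seq[string], peerCtx: BlockExcPeerCtx,
                    address: string, sendDontHave: bool) =
  if address in localStore:
    # We have it: answer Have and rely on the remote to remember that.
    peerCtx.sendPresence(address, Have)
  else:
    # We don't have it: remember the want so a later block arrival can still
    # trigger a message, and answer DontHave only if the peer asked for it.
    peerCtx.peerWants.add address
    if sendDontHave:
      peerCtx.sendPresence(address, DontHave)
```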
OK, I need to update the record with what I've been up to. I isolated the slow-down to my commit "Review comment by Dmitriy". Reviewing the changes did not explain the cause. Eventually, through a process of reverting and retesting, I was able to fix the performance in a late-evening commit fashionably titled "fast" (dcf43d3). After this, performance picked back up, averaging between 8 and 9 seconds, slightly better than 0.1.9 (more tests needed!).
What I learned:
My plan is to finalize this PR with as few changes as possible. That means leaving this puzzling behavior in place as-is.
Further delays/sidetracks:
# Conflicts:
#   tests/integration/testsales.nim
Since it's been a week, I created a new image from commit 4b5c355 and reran some dist-tests. It's working and performance is as it should be.
Spoke with Dmitriy about the following change:
I reran some dist-tests after. Performance, and the cancel-presence message performance impact, are unaffected.
In the current situation, when we initiate a block request, we send a wantBlock to one peer (even if they don't have the block, so it will be ignored) and we send wantHave messages to the others, which are then discarded.
This PR fixes two issues: the sending of wantBlock messages, and the tracking of peerWants.
I did not want to solve both issues in the same PR, but the test "NetworkStore - multiple nodes" in testblockexc.nim failed because of the second issue after I fixed the first one. So instead of having to "break" a correct test, I figured it was better to fix the actual problem.
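To tie the description back to the diff at the top: a sketch of the intended request flow, using the assumed `with`/`without` fields of PeersForBlock and stubbed send procs; the peer-selection heuristic here is an assumption, not the PR's actual logic:

```nim
type
  PeersForBlock = object
    with: seq[string]     # peers that reported having the block
    without: seq[string]  # peers with no known presence for it

proc sendWantBlock(peer: string, address: string) =
  echo "wantBlock ", address, " -> ", peer

proc sendWantHave(peers: seq[string], address: string) =
  for p in peers:
    echo "wantHave ", address, " -> ", p

proc requestBlock(peers: PeersForBlock, address: string) =
  if peers.with.len == 0 and peers.without.len == 0:
    return
  # Send wantBlock to a single selected peer, preferring one known to have
  # the block, and wantHave to the peers whose presence is still unknown.
  let selected =
    if peers.with.len > 0: peers.with[0]
    else: peers.without[0]
  sendWantBlock(selected, address)
  sendWantHave(peers.without, address)
```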