
Fix sending of WantBlocks messages and tracking of peerWants #1019

Open: benbierens wants to merge 20 commits into master from feature/blkexc-peer-selection

Conversation

@benbierens (Contributor) commented Dec 4, 2024:

Currently, when we initiate a block request, we send a wantBlock to one peer (even if that peer doesn't have the block, in which case the message is ignored) and we send wantHave messages to the other peers, which are then discarded.

This PR fixes two issues:

  • We only send wantBlock messages to peers who have the block. We use the same pseudo-random method for selecting a peer when multiple peers have the block; a sketch of this selection follows below.
  • We track all peerWants instead of only the wantBlocks. This matches the spec and solves an issue where a block is received or detected after the want-list message arrives.

I did not want to solve both issues in the same PR, but the test NetworkStore - multiple nodes in testblockexc.nim failed because of the second issue once the first was fixed. Rather than "break" a correct test, I figured it was better to fix the actual problem.
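A minimal, self-contained sketch (not the PR's actual code; the type names and the splitWants helper are illustrative) of the selection rule described above: one pseudo-randomly chosen peer that is known to have the block receives the wantBlock, every other peer receives a wantHave.

```nim
import std/hashes

type
  PeerId = string        # stand-in for the real libp2p PeerId type
  BlockAddress = string  # stand-in for the real block address type
  PeersSplit = tuple[wantBlockPeer: PeerId, wantHavePeers: seq[PeerId]]

proc splitWants(address: BlockAddress,
                peersWithBlock, peersWithoutBlock: seq[PeerId]): PeersSplit =
  ## Pick one peer that is known to have the block, pseudo-randomly but
  ## deterministically per address; everyone else only gets a wantHave.
  doAssert peersWithBlock.len > 0, "caller ensures at least one peer has the block"
  let selected = peersWithBlock[abs(hash(address)) mod peersWithBlock.len]
  var others = peersWithoutBlock
  for p in peersWithBlock:
    if p != selected:
      others.add p
  (wantBlockPeer: selected, wantHavePeers: others)

when isMainModule:
  let (target, rest) = splitWants("blk:abc", @["peerA", "peerB"], @["peerC"])
  echo "wantBlock -> ", target
  echo "wantHave  -> ", rest
```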

@benbierens benbierens changed the title Feature/blkexc peer selection Initial blockRequest messages Dec 4, 2024
@benbierens benbierens changed the title Initial blockRequest messages Fix sending of WantBlocks messages and tracking of peerWants Dec 4, 2024
@benbierens benbierens marked this pull request as ready for review December 4, 2024 15:15
@benbierens benbierens self-assigned this Dec 4, 2024
@benbierens benbierens added the Client label (see https://miro.com/app/board/uXjVNZ03E-c=/ for details) Dec 4, 2024
@gmega gmega self-requested a review December 4, 2024 17:03
@dryajov (Contributor) left a comment:

I gave this a review and there are a couple of things that need addressing.

codex_block_exchange_want_have_lists_sent.inc()
await b.sendWantBlock(@[address], selected)

await b.sendWantHave(@[address], peers.without)
Contributor:

Ugh, why are we sending a WantHave to the peers that we know don't have the block we need?

Contributor (Author):

Peers only report their haves when asked. We have to send a wantHave before we can know whether they have the block.

Contributor:

Ah, you're right.

@@ -32,6 +32,9 @@ logScope:
type
  PeerCtxStore* = ref object of RootObj
    peers*: OrderedTable[PeerId, BlockExcPeerCtx]
  PeersForBlock* = ref object of RootObj
Contributor:

Doesn't have to be a ref object (maybe a tuple works better here as well?)

Contributor (Author):

I would prefer an object over a tuple. After compilation they should be very similar, right?

Contributor:

Sure, but let's not make it a ref for no reason.
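For illustration, a plain value object along the lines discussed here might look as follows. This is a sketch only: the with/without field names come from the review context above, and the peersForBlock helper is hypothetical. A value object avoids the heap allocation a ref would introduce and can still be returned and iterated conveniently.

```nim
type
  PeerId = string          # stand-in for the real PeerId type

  PeersForBlock = object   # a plain value object, no `ref`
    with: seq[PeerId]      # peers known to have the block
    without: seq[PeerId]   # peers not (yet) known to have it

proc peersForBlock(all, haves: seq[PeerId]): PeersForBlock =
  # Split the known peers by whether they reported having the block.
  for p in all:
    if p in haves:
      result.with.add p
    else:
      result.without.add p

when isMainModule:
  let split = peersForBlock(@["a", "b", "c"], @["b"])
  echo split.with, " / ", split.without
```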

@@ -427,10 +425,8 @@ proc wantListHandler*(
          address: e.address,
          `type`: BlockPresenceType.Have,
          price: price))
      elif e.wantType == WantType.WantBlock:
Contributor:

There is a good reason why I separated WantBlock into its own branch. We used to schedule a task for every single WantList update, but that isn't necessary and, worse, it was severely affecting performance. We should not attempt to schedule tasks on WantHave, since that is just a presence update.
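An illustrative sketch of the branching being described, using hypothetical, simplified types rather than the engine's actual API: WantHave entries only produce a presence reply, and a task is scheduled at most once per want-list, and only when WantBlock entries are present.

```nim
type
  WantType = enum WantHave, WantBlock
  WantEntry = object
    address: string
    wantType: WantType

proc handleWantList(entries: seq[WantEntry],
                    sendPresence: proc (address: string),
                    scheduleTask: proc ()) =
  var haveWantBlocks = false
  for e in entries:
    case e.wantType
    of WantHave:
      sendPresence(e.address)   # just a presence update, no task
    of WantBlock:
      haveWantBlocks = true     # the peer actually wants block data
  if haveWantBlocks:
    scheduleTask()              # schedule once per want-list, not per entry

when isMainModule:
  proc presence(address: string) = echo "presence for ", address
  proc task() = echo "task scheduled"
  handleWantList(@[
    WantEntry(address: "blk:1", wantType: WantHave),
    WantEntry(address: "blk:2", wantType: WantBlock)
  ], presence, task)
```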

Contributor:

> We track all peerWants instead of only the wantBlocks. This is according to spec, and solves an issue in the case where a block is received/detected after the want-list message.

For more context, this isn't actually true: we do track WantHave messages, but we respond to them immediately with a presence message.

Contributor (Author):

OK, I see the problem with the scheduling of tasks for every wantList. Makes perfect sense and I'll fix that right now.

Regarding the tracking of wantHaves: responding immediately is good, but not enough. We're expected to keep the want in the peerCtx; if we encounter the block later, we should still react to the want with a presence message. The tests in NetworkStore - multiple nodes in testblockexc.nim actually check this behavior perfectly. They were previously passing only because 1) we were sending wantBlocks when we shouldn't and 2) we were keeping only the wantBlocks. So fixing the sending of wantBlock caused this correct test to reveal the other issue.

I would like to take a closer look at exactly how the stored wants will trigger messages when blocks are received. But I'm trying to keep the scope of this PR small.
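A sketch of the behavior being argued for, using hypothetical names rather than the engine's API: when a block is stored after the want-list was received, the remembered wants for its address should still trigger a presence message to the peers that asked.

```nim
import std/tables

type
  PeerId = string
  WantEntry = object
    address: string

proc onBlockStored(address: string,
                   peerWants: Table[PeerId, seq[WantEntry]],
                   sendHavePresence: proc (peer: PeerId, address: string)) =
  ## React to remembered wants once the block shows up.
  for peer, wants in peerWants:
    for w in wants:
      if w.address == address:
        sendHavePresence(peer, address)

when isMainModule:
  var wants = initTable[PeerId, seq[WantEntry]]()
  wants["peerA"] = @[WantEntry(address: "blk:1")]
  proc notify(peer: PeerId, address: string) = echo peer, " gets Have for ", address
  onBlockStored("blk:1", wants, notify)
```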

Contributor:

OK, I see. Then what we need to do is add these to the BlockExcPeerCtx.peerWants list in addition to responding to the request.

For a later PR, we should take a closer look at how these lists are kept; for example, peerWants is a simple seq with no cap and no ordering.

Some lists, like peerWants, in addition to being capped, should also allow for ordering; for example, WantListEntry has a priority field, which should be respected.
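As a rough illustration of the capped, priority-ordered list suggested here for a later PR (the cap value and the addWant helper are made up; only the priority field on WantListEntry comes from the discussion):

```nim
import std/algorithm

type
  WantListEntry = object
    address: string
    priority: int32

const maxPeerWants = 1024   # illustrative cap; not a value from the codebase

proc addWant(wants: var seq[WantListEntry], entry: WantListEntry) =
  ## Keep the list sorted by descending priority and enforce a cap; a real
  ## implementation would insert in place instead of re-sorting every time.
  wants.add entry
  wants.sort(proc (a, b: WantListEntry): int = cmp(b.priority, a.priority))
  if wants.len > maxPeerWants:
    wants.setLen(maxPeerWants)  # drop the lowest-priority overflow

when isMainModule:
  var wants: seq[WantListEntry]
  wants.addWant WantListEntry(address: "blk:1", priority: 1)
  wants.addWant WantListEntry(address: "blk:2", priority: 5)
  echo wants[0].address   # highest-priority want first
```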

@dryajov (Contributor) commented Dec 5, 2024:

For more context, what needs to happen is that when we add a DontHave to the presence seq, we also add the want to the peerCtx.peerWants list. We only need to remember the DontHaves, because the remote should keep track of the Haves we send back.

@benbierens (Contributor, Author) commented Dec 17, 2024:

Yes: upgrading the peerWants to a capped, prioritized list is definitely needed.

"what needs to happen is that when adding a DontHave to the presence seq, we also add it to peerCtx.peerWants list."
In the PR, yes: we are adding it to the peerWants regardless of Have/DontHave.

"We only need to remember the DontHaves, because the remote should keep track of the Haves we send back."
I understand this to mean:

  • If we Have: don't add the want to the peerCtx, only send a presence message.
  • If we don't Have: add the want to the peerCtx, and send a dontHave presence if asked (e.sendDontHave).

I think bitswap adds the want to the peerCtx in both cases.

For the record, e.sendDontHave has always been false up to this point, except in some tests, and that's a good thing, because the peerContext peerHave proc currently doesn't consider the have-flag.

Contributor:

> If we Have, don't add the want to the peerCtx, only send presence.
> If we don't Have, add the want to the peerCtx, and send a dontHave presence if asked (e.sendDontHave).

Yep, that's what I mean.
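A sketch of the agreed behavior, with simplified, hypothetical types standing in for the engine's want-list and presence types: wants we can satisfy immediately get a Have presence and are not stored; wants we cannot satisfy are remembered and optionally answered with a DontHave.

```nim
type
  PresenceType = enum Have, DontHave
  Presence = object
    address: string
    kind: PresenceType
  WantEntry = object
    address: string
    sendDontHave: bool

proc handleWant(peerWants: var seq[WantEntry], e: WantEntry,
                haveBlock: bool, presence: var seq[Presence]) =
  if haveBlock:
    # We can satisfy the want right away: reply with Have, nothing to remember.
    presence.add Presence(address: e.address, kind: Have)
  else:
    # Remember the want so the peer can be notified once the block shows up.
    peerWants.add e
    if e.sendDontHave:
      presence.add Presence(address: e.address, kind: DontHave)

when isMainModule:
  var wants: seq[WantEntry]
  var replies: seq[Presence]
  handleWant(wants, WantEntry(address: "blk:1", sendDontHave: true),
             haveBlock = false, presence = replies)
  echo wants.len, " stored, ", replies.len, " presence replies"
```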

@benbierens benbierens force-pushed the feature/blkexc-peer-selection branch from d92ec2d to ed3e91c on December 5, 2024 09:21
@benbierens benbierens requested a review from dryajov December 5, 2024 10:18
@benbierens (Contributor, Author):

OK, I need to update the record with what I've been up to.
After the commit "prevent duplicate scheduling. Fixes cancellation" I re-ran the dist-tests and noticed a huge performance deterioration: a transfer that takes 12 seconds on v0.1.9 took 1 min 30 s. This is an unacceptable downgrade, so I set out to discover the cause.

I isolated the slow-down to my commit "Review comment by Dmitriy", but reviewing the changes did not explain the cause. Eventually, through a process of reverting and retesting, I was able to fix the performance in a late-evening commit fashionably titled "fast" (dcf43d3). After this, performance picked back up, averaging between 8 and 9 seconds, slightly better than v0.1.9 (more tests needed!).

What I learned:
In the wantListHandler, if the want-type is WantHave, we must send a presence response even if cancel = true. I figured that if it is a cancel message, there is no need to respond with a presence list, and indeed I later confirmed that the presence lists sent in response to a cancel are ignored by the peer that receives them. Yet still, somehow, not sending them hugely degrades transfer performance.

My plan is to finalize this PR with as few changes as possible, which means leaving this puzzling behavior in place as-is.
After that, I'm going to add logs, make changes, and run tests until I can explain why these cancel-triggered presence messages have such an impact. I have a few ideas, but those will have to wait until Monday, when I pick this back up.

@benbierens (Contributor, Author):

Further delays/sidetracks:
tests/integration/testsales.nim fails. The failure is caused (at least sometimes) by a failure in the blockexchange engine, yet the integration tests for block exchange all pass, and enabling logging makes the test pass. So I'm going to investigate what's wrong here, upgrade the block exchange integration tests if necessary, and hopefully solve the instability (in a new PR, of course). Whatever is failing this test might have the same cause as the instability we saw in the testnet, which would mean the changes in this PR reveal the issue in a way that we can capture in an integration test.
It's kind of annoying that this PR keeps spawning offshoots, but this is the road to more reliability.

@benbierens (Contributor, Author):

Since it's been a week, I created a new image from commit 4b5c355 and reran some dist-tests. It's working and performance is as it should be.

@benbierens benbierens enabled auto-merge December 18, 2024 11:21
@benbierens (Contributor, Author):

Spoke with Dmitriy about the following change:
Only add the want to the peerWants list if we don't have the block. If we do have the block, respond with a presence message, but there is no need to remember the want. This change was pushed in 23c0f30.

I reran some dist-tests afterwards. Overall performance is unaffected, and the performance impact of the cancel-triggered presence messages described above is also unchanged.

Labels: Client (see https://miro.com/app/board/uXjVNZ03E-c=/ for details)