
Fix sending of WantBlocks messages and tracking of peerWants #1019

Open: benbierens wants to merge 20 commits into master from feature/blkexc-peer-selection

Conversation

@benbierens (Contributor) commented Dec 4, 2024:

Currently, when we initiate a block request, we send a wantBlock to one peer (even if that peer doesn't have the block, in which case the message is ignored) and we send wantHave messages to the other peers, which are then discarded.

This PR fixes two issues:

  • We only send wantBlock messages to peers who have the block. We use the same pseudo-random method for selecting a peer when multiple peers have the block; a sketch of this selection follows below.
  • We track all peerWants instead of only the wantBlocks. This matches the spec and solves an issue where a block is received or detected after the want-list message arrives.

I did not want to solve both issues in the same PR, but the test NetworkStore - multiple nodes in testblockexc.nim failed because of the second issue once the first was fixed. Rather than "break" a correct test, I figured it was better to fix the actual problem.
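A minimal, self-contained sketch (not the PR's actual code; the type names and the splitWants helper are illustrative) of the selection rule described above: one pseudo-randomly chosen peer that is known to have the block receives the wantBlock, every other peer receives a wantHave.

```nim
import std/hashes

type
  PeerId = string        # stand-in for the real libp2p PeerId type
  BlockAddress = string  # stand-in for the real block address type
  PeersSplit = tuple[wantBlockPeer: PeerId, wantHavePeers: seq[PeerId]]

proc splitWants(address: BlockAddress,
                peersWithBlock, peersWithoutBlock: seq[PeerId]): PeersSplit =
  ## Pick one peer that is known to have the block, pseudo-randomly but
  ## deterministically per address; everyone else only gets a wantHave.
  doAssert peersWithBlock.len > 0, "caller ensures at least one peer has the block"
  let selected = peersWithBlock[abs(hash(address)) mod peersWithBlock.len]
  var others = peersWithoutBlock
  for p in peersWithBlock:
    if p != selected:
      others.add p
  (wantBlockPeer: selected, wantHavePeers: others)

when isMainModule:
  let (target, rest) = splitWants("blk:abc", @["peerA", "peerB"], @["peerC"])
  echo "wantBlock -> ", target
  echo "wantHave  -> ", rest
```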

@benbierens benbierens changed the title Feature/blkexc peer selection Initial blockRequest messages Dec 4, 2024
@benbierens benbierens changed the title Initial blockRequest messages Fix sending of WantBlocks messages and tracking of peerWants Dec 4, 2024
@benbierens benbierens marked this pull request as ready for review December 4, 2024 15:15
@benbierens benbierens self-assigned this Dec 4, 2024
@benbierens benbierens added the Client label (see https://miro.com/app/board/uXjVNZ03E-c=/ for details) Dec 4, 2024
@gmega gmega self-requested a review December 4, 2024 17:03
@dryajov (Contributor) left a comment:

I gave this a review and there are a couple of things that need addressing.

codex_block_exchange_want_have_lists_sent.inc()
await b.sendWantBlock(@[address], selected)

await b.sendWantHave(@[address], peers.without)
Contributor:

Ugh, why are we sending a WantHave to the peers that we know don't have the block we need?

Contributor (Author):

Peers only report their haves when asked. We have to send a wantHave before we can know whether they have the block.

Contributor:

Ah, you're right.

@@ -32,6 +32,9 @@ logScope:
type
  PeerCtxStore* = ref object of RootObj
    peers*: OrderedTable[PeerId, BlockExcPeerCtx]
  PeersForBlock* = ref object of RootObj
Contributor:

Doesn't have to be a ref object (maybe a tuple works better here as well?)

Contributor (Author):

I would prefer an object over a tuple. After compilation they should be very similar, right?

Contributor:

Sure, but let's not make it a ref for no reason.
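For illustration, a plain value object along the lines discussed here might look as follows. This is a sketch only: the with/without field names come from the review context above, and the peersForBlock helper is hypothetical. A value object avoids the heap allocation a ref would introduce and can still be returned and iterated conveniently.

```nim
type
  PeerId = string          # stand-in for the real PeerId type

  PeersForBlock = object   # a plain value object, no `ref`
    with: seq[PeerId]      # peers known to have the block
    without: seq[PeerId]   # peers not (yet) known to have it

proc peersForBlock(all, haves: seq[PeerId]): PeersForBlock =
  # Split the known peers by whether they reported having the block.
  for p in all:
    if p in haves:
      result.with.add p
    else:
      result.without.add p

when isMainModule:
  let split = peersForBlock(@["a", "b", "c"], @["b"])
  echo split.with, " / ", split.without
```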

@@ -427,10 +425,8 @@ proc wantListHandler*(
          address: e.address,
          `type`: BlockPresenceType.Have,
          price: price))
      elif e.wantType == WantType.WantBlock:
Contributor:

There is a good reason why I separated WantBlock into its own branch. We used to schedule a task for every single WantList update, but that isn't necessary and, worse, it was severely affecting performance. We should not attempt to schedule tasks on WantHave, since that is just a presence update.
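An illustrative sketch of the branching being described, using hypothetical, simplified types rather than the engine's actual API: WantHave entries only produce a presence reply, and a task is scheduled at most once per want-list, and only when WantBlock entries are present.

```nim
type
  WantType = enum WantHave, WantBlock
  WantEntry = object
    address: string
    wantType: WantType

proc handleWantList(entries: seq[WantEntry],
                    sendPresence: proc (address: string),
                    scheduleTask: proc ()) =
  var haveWantBlocks = false
  for e in entries:
    case e.wantType
    of WantHave:
      sendPresence(e.address)   # just a presence update, no task
    of WantBlock:
      haveWantBlocks = true     # the peer actually wants block data
  if haveWantBlocks:
    scheduleTask()              # schedule once per want-list, not per entry

when isMainModule:
  proc presence(address: string) = echo "presence for ", address
  proc task() = echo "task scheduled"
  handleWantList(@[
    WantEntry(address: "blk:1", wantType: WantHave),
    WantEntry(address: "blk:2", wantType: WantBlock)
  ], presence, task)
```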

Contributor:

> We track all peerWants instead of only the wantBlocks. This is according to spec, and solves an issue in the case where a block is received/detected after the want-list message.

For more context, this isn't actually true: we do track WantHave messages, but we respond to them immediately with a presence message.

Contributor (Author):

OK, I see the problem with the scheduling of tasks for every wantList. Makes perfect sense and I'll fix that right now.

Regarding the tracking of wantHaves: responding immediately is good, but not enough. We're expected to keep the want in the peerCtx; if we encounter the block later, we should still react to the want with a presence message. The tests in NetworkStore - multiple nodes in testblockexc.nim actually check this behavior perfectly. They were previously passing only because 1) we were sending wantBlocks when we shouldn't and 2) we were keeping only the wantBlocks. So fixing the sending of wantBlock caused this correct test to reveal the other issue.

I would like to take a closer look at exactly how the stored wants will trigger messages when blocks are received. But I'm trying to keep the scope of this PR small.
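A sketch of the behavior being argued for, using hypothetical names rather than the engine's API: when a block is stored after the want-list was received, the remembered wants for its address should still trigger a presence message to the peers that asked.

```nim
import std/tables

type
  PeerId = string
  WantEntry = object
    address: string

proc onBlockStored(address: string,
                   peerWants: Table[PeerId, seq[WantEntry]],
                   sendHavePresence: proc (peer: PeerId, address: string)) =
  ## React to remembered wants once the block shows up.
  for peer, wants in peerWants:
    for w in wants:
      if w.address == address:
        sendHavePresence(peer, address)

when isMainModule:
  var wants = initTable[PeerId, seq[WantEntry]]()
  wants["peerA"] = @[WantEntry(address: "blk:1")]
  proc notify(peer: PeerId, address: string) = echo peer, " gets Have for ", address
  onBlockStored("blk:1", wants, notify)
```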

Contributor:

OK, I see. Then what we need to do is add these to the BlockExcPeerCtx.peerWants list in addition to responding to the request.

For a later PR, we should take a closer look at how these lists are kept; for example, peerWants is a simple seq with no cap and no ordering.

Some lists, like peerWants, in addition to being capped, should also allow for ordering; for example, WantListEntry has a priority field, which should be respected.
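As a rough illustration of the capped, priority-ordered list suggested here for a later PR (the cap value and the addWant helper are made up; only the priority field on WantListEntry comes from the discussion):

```nim
import std/algorithm

type
  WantListEntry = object
    address: string
    priority: int32

const maxPeerWants = 1024   # illustrative cap; not a value from the codebase

proc addWant(wants: var seq[WantListEntry], entry: WantListEntry) =
  ## Keep the list sorted by descending priority and enforce a cap; a real
  ## implementation would insert in place instead of re-sorting every time.
  wants.add entry
  wants.sort(proc (a, b: WantListEntry): int = cmp(b.priority, a.priority))
  if wants.len > maxPeerWants:
    wants.setLen(maxPeerWants)  # drop the lowest-priority overflow

when isMainModule:
  var wants: seq[WantListEntry]
  wants.addWant WantListEntry(address: "blk:1", priority: 1)
  wants.addWant WantListEntry(address: "blk:2", priority: 5)
  echo wants[0].address   # highest-priority want first
```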

@dryajov (Contributor) commented Dec 5, 2024:

For more context, what needs to happen is that when we add a DontHave to the presence seq, we also add the want to the peerCtx.peerWants list. We only need to remember the DontHaves, because the remote should keep track of the Haves we send back.

@benbierens (Contributor, Author) commented Dec 17, 2024:

Yes: upgrading the peerWants to a capped, prioritized list is definitely needed.

"what needs to happen is that when adding a DontHave to the presence seq, we also add it to peerCtx.peerWants list."
In the PR, yes: we are adding it to the peerWants regardless of Have/DontHave.

"We only need to remember the DontHaves, because the remote should keep track of the Haves we send back."
I understand this to mean:

  • If we Have: don't add the want to the peerCtx, only send a presence message.
  • If we don't Have: add the want to the peerCtx, and send a dontHave presence if asked (e.sendDontHave).

I think bitswap adds the want to the peerCtx in both cases.

For the record, e.sendDontHave has always been false up to this point, except in some tests, and that's a good thing, because the peerContext peerHave proc currently doesn't consider the have-flag.

Contributor:

> If we Have, don't add the want to the peerCtx, only send presence.
> If we don't Have, add the want to the peerCtx, and send a dontHave presence if asked (e.sendDontHave).

Yep, that's what I mean.
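A sketch of the agreed behavior, with simplified, hypothetical types standing in for the engine's want-list and presence types: wants we can satisfy immediately get a Have presence and are not stored; wants we cannot satisfy are remembered and optionally answered with a DontHave.

```nim
type
  PresenceType = enum Have, DontHave
  Presence = object
    address: string
    kind: PresenceType
  WantEntry = object
    address: string
    sendDontHave: bool

proc handleWant(peerWants: var seq[WantEntry], e: WantEntry,
                haveBlock: bool, presence: var seq[Presence]) =
  if haveBlock:
    # We can satisfy the want right away: reply with Have, nothing to remember.
    presence.add Presence(address: e.address, kind: Have)
  else:
    # Remember the want so the peer can be notified once the block shows up.
    peerWants.add e
    if e.sendDontHave:
      presence.add Presence(address: e.address, kind: DontHave)

when isMainModule:
  var wants: seq[WantEntry]
  var replies: seq[Presence]
  handleWant(wants, WantEntry(address: "blk:1", sendDontHave: true),
             haveBlock = false, presence = replies)
  echo wants.len, " stored, ", replies.len, " presence replies"
```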

@benbierens benbierens force-pushed the feature/blkexc-peer-selection branch from d92ec2d to ed3e91c on December 5, 2024 09:21
@benbierens benbierens requested a review from dryajov December 5, 2024 10:18
@benbierens (Contributor, Author):

OK, I need to update the record with what I've been up to.
After the commit "prevent duplicate scheduling. Fixes cancellation" I re-ran the dist-tests and noticed a huge performance deterioration: a transfer that takes 12 seconds on v0.1.9 took 1 min 30 s. This is an unacceptable downgrade, so I set out to discover the cause.

I isolated the slow-down to my commit "Review comment by Dmitriy", but reviewing the changes did not explain the cause. Eventually, through a process of reverting and retesting, I was able to fix the performance in a late-evening commit fashionably titled "fast" (dcf43d3). After this, performance picked back up, averaging between 8 and 9 seconds, slightly better than v0.1.9 (more tests needed!).

What I learned:
In the wantListHandler, if the want-type is WantHave, we must send a presence response even if cancel = true. I figured that if it is a cancel message, there is no need to respond with a presence list, and indeed I later confirmed that the presence lists sent in response to a cancel are ignored by the peer that receives them. Yet still, somehow, not sending them hugely degrades transfer performance.

My plan is to finalize this PR with as few changes as possible, which means leaving this puzzling behavior in place as-is.
After that, I'm going to add logs, make changes, and run tests until I can explain why these cancel-triggered presence messages have such an impact. I have a few ideas, but those will have to wait until Monday, when I pick this back up.

@benbierens (Contributor, Author):

Further delays/sidetracks:
tests/integration/testsales.nim fails. The failure is caused (at least sometimes) by a failure in the blockexchange engine, yet the integration tests for block exchange all pass, and enabling logging makes the test pass. So I'm going to investigate what's wrong here, upgrade the block exchange integration tests if necessary, and hopefully solve the instability (in a new PR, of course). Whatever is failing this test might have the same cause as the instability we saw in the testnet, which would mean the changes in this PR reveal the issue in a way that we can capture in an integration test.
It's kind of annoying that this PR keeps spawning offshoots, but this is the road to more reliability.

@benbierens (Contributor, Author):

Since it's been a week, I created a new image from commit 4b5c355 and reran some dist-tests. It's working and performance is as it should be.

@benbierens benbierens enabled auto-merge December 18, 2024 11:21
@benbierens (Contributor, Author):

Spoke with Dmitriy about the following change:
Only add the want to the peerWants list if we don't have the block. If we do have the block, respond with a presence message, but there is no need to remember the want. This change was pushed in 23c0f30.

I reran some dist-tests afterwards. Overall performance is unaffected, and the performance impact of the cancel-triggered presence messages described above is also unchanged.

Labels: Client (see https://miro.com/app/board/uXjVNZ03E-c=/ for details)