
fix: No time-based loss detection before largest_acked & fixes #1998

Draft: wants to merge 76 commits into main

Conversation

@larseggert (Collaborator) commented on Jul 21, 2024

I will refactor this PR into smaller standalone ones.

#1603 broke this by incorrectly enabling time-based loss detection before a valid largest_acked has been received.

RFC 9002, Section 6.1.2 (Time Threshold):

Once a later packet within the same packet number space has been acknowledged, an endpoint SHOULD declare an earlier packet lost if it was sent a threshold amount of time in the past.

This makes it clear that time-based loss detection should be disabled before an ACK has been received.

RFC 9002, Appendix A.10 (Detecting Lost Packets):

assert(largest_acked_packet[pn_space] != infinite)

The DetectAndRemoveLostPackets pseudocode also contains the assert above.
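As an illustration of the intended guard, here is a minimal sketch in Rust (function and parameter names are hypothetical, not the actual neqo code):

```rust
use std::time::{Duration, Instant};

// Minimal sketch (hypothetical names): the time-threshold check only runs once a
// largest_acked exists for this packet number space.
fn time_threshold_lost(
    largest_acked: Option<u64>, // None until a valid ACK has been received
    sent_time: Instant,
    now: Instant,
    loss_delay: Duration,
) -> bool {
    match largest_acked {
        // Mirrors RFC 9002 A.10: assert(largest_acked_packet[pn_space] != infinite);
        // with no ACK in this space, time-based loss detection must not fire.
        None => false,
        Some(_) => now.duration_since(sent_time) >= loss_delay,
    }
}
```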

When a PTO fires, we now also explicitly perform a congestion control reaction.

Fixes #1874

This PR also includes some fixes to make retransmissions during the handshake more robust: when a PTO fires, all packets in all affected packet number spaces are timed out (making their contents eligible for retransmission), and ack_delay is not applied to the application packet number space PTO during the handshake.
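For the ack_delay part, a rough sketch of a PTO computation along these lines (hypothetical names, simplified from RFC 9002 Section 6.2.1, not the actual neqo API):

```rust
use std::cmp::max;
use std::time::Duration;

// Rough sketch (hypothetical names): only include max_ack_delay in the
// application-space PTO once the handshake is confirmed.
fn pto_duration(
    smoothed_rtt: Duration,
    rttvar: Duration,
    max_ack_delay: Duration,
    is_app_space: bool,
    handshake_confirmed: bool,
) -> Duration {
    let granularity = Duration::from_millis(1); // kGranularity, assumed value
    let base = smoothed_rtt + max(rttvar * 4, granularity);
    if is_app_space && handshake_confirmed {
        base + max_ack_delay
    } else {
        // During the handshake, the peer is expected to acknowledge promptly,
        // so ack_delay is not added to the application-space PTO.
        base
    }
}
```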

github-actions bot commented Jul 21, 2024

Failed Interop Tests
QUIC Interop Runner, client vs. server: neqo-latest as client, neqo-latest as server (all results)

Succeeded Interop Tests
QUIC Interop Runner, client vs. server: neqo-latest as client, neqo-latest as server

Unsupported Interop Tests
QUIC Interop Runner, client vs. server: neqo-latest as client, neqo-latest as server

github-actions bot commented Jul 22, 2024

Benchmark results

Performance differences relative to d6279bf.

coalesce_acked_from_zero 1+1 entries: No change in performance detected.
       time:   [99.781 ns 100.06 ns 100.35 ns]
       change: [-0.0691% +0.3854% +0.8727%] (p = 0.13 > 0.05)

Found 12 outliers among 100 measurements (12.00%)
9 (9.00%) high mild
3 (3.00%) high severe

coalesce_acked_from_zero 3+1 entries: No change in performance detected.
       time:   [117.84 ns 118.21 ns 118.61 ns]
       change: [-0.1498% +0.3222% +0.8142%] (p = 0.19 > 0.05)

Found 21 outliers among 100 measurements (21.00%)
3 (3.00%) low severe
1 (1.00%) low mild
4 (4.00%) high mild
13 (13.00%) high severe

coalesce_acked_from_zero 10+1 entries: No change in performance detected.
       time:   [117.66 ns 118.18 ns 118.78 ns]
       change: [-0.3301% +0.3180% +0.8978%] (p = 0.32 > 0.05)

Found 21 outliers among 100 measurements (21.00%)
6 (6.00%) low severe
2 (2.00%) low mild
3 (3.00%) high mild
10 (10.00%) high severe

coalesce_acked_from_zero 1000+1 entries: Change within noise threshold.
       time:   [98.191 ns 98.432 ns 98.741 ns]
       change: [+0.2702% +1.2029% +2.1014%] (p = 0.01 < 0.05)

Found 9 outliers among 100 measurements (9.00%)
2 (2.00%) high mild
7 (7.00%) high severe

RxStreamOrderer::inbound_frame(): No change in performance detected.
       time:   [110.98 ms 111.11 ms 111.30 ms]
       change: [-0.1477% -0.0232% +0.1488%] (p = 0.82 > 0.05)

Found 11 outliers among 100 measurements (11.00%)
7 (7.00%) low mild
3 (3.00%) high mild
1 (1.00%) high severe

transfer/pacing-false/varying-seeds: No change in performance detected.
       time:   [28.531 ms 29.337 ms 30.147 ms]
       change: [-0.8630% +4.2908% +9.9250%] (p = 0.12 > 0.05)
transfer/pacing-true/varying-seeds: Change within noise threshold.
       time:   [37.814 ms 39.294 ms 40.754 ms]
       change: [+0.5884% +7.1476% +14.299%] (p = 0.04 < 0.05)

Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) low mild

transfer/pacing-false/same-seed: 💚 Performance has improved.
       time:   [20.227 ms 21.095 ms 21.965 ms]
       change: [-23.657% -19.714% -15.468%] (p = 0.00 < 0.05)
transfer/pacing-true/same-seed: 💚 Performance has improved.
       time:   [20.852 ms 21.936 ms 23.034 ms]
       change: [-52.505% -49.129% -45.393%] (p = 0.00 < 0.05)
1-conn/1-100mb-resp (aka. Download)/client: No change in performance detected.
       time:   [113.70 ms 114.11 ms 114.50 ms]
       thrpt:  [873.36 MiB/s 876.33 MiB/s 879.50 MiB/s]
change:
       time:   [-0.7008% -0.2614% +0.1863%] (p = 0.26 > 0.05)
       thrpt:  [-0.1860% +0.2621% +0.7057%]

Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low severe
1 (1.00%) low mild

1-conn/10_000-parallel-1b-resp (aka. RPS)/client: No change in performance detected.
       time:   [312.75 ms 316.12 ms 319.49 ms]
       thrpt:  [31.300 Kelem/s 31.633 Kelem/s 31.974 Kelem/s]
change:
       time:   [-2.4277% -0.8752% +0.6564%] (p = 0.27 > 0.05)
       thrpt:  [-0.6521% +0.8830% +2.4881%]

Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild

1-conn/1-1b-resp (aka. HPS)/client: No change in performance detected.
       time:   [33.867 ms 34.070 ms 34.294 ms]
       thrpt:  [29.160  elem/s 29.351  elem/s 29.527  elem/s]
change:
       time:   [-1.5619% -0.6708% +0.2182%] (p = 0.13 > 0.05)
       thrpt:  [-0.2177% +0.6753% +1.5866%]

Found 12 outliers among 100 measurements (12.00%)
3 (3.00%) low mild
2 (2.00%) high mild
7 (7.00%) high severe

Client/server transfer results

Transfer of 33554432 bytes over loopback.

| Client | Server | CC | Pacing | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|---|---|---|
| msquic | msquic | | | 173.1 ± 109.5 | 100.9 | 564.9 | 1.00 |
| neqo | msquic | reno | on | 243.7 ± 61.5 | 212.3 | 434.9 | 1.00 |
| neqo | msquic | reno | | 257.1 ± 67.1 | 218.3 | 428.7 | 1.00 |
| neqo | msquic | cubic | on | 242.9 ± 63.0 | 213.1 | 439.9 | 1.00 |
| neqo | msquic | cubic | | 294.2 ± 98.2 | 217.5 | 455.0 | 1.00 |
| msquic | neqo | reno | on | 174.8 ± 101.8 | 81.9 | 336.3 | 1.00 |
| msquic | neqo | reno | | 147.6 ± 94.0 | 84.3 | 360.6 | 1.00 |
| msquic | neqo | cubic | on | 139.7 ± 78.4 | 83.8 | 324.7 | 1.00 |
| msquic | neqo | cubic | | 137.2 ± 89.0 | 80.5 | 373.3 | 1.00 |
| neqo | neqo | reno | on | 196.4 ± 71.6 | 129.8 | 366.3 | 1.00 |
| neqo | neqo | reno | | 190.6 ± 95.6 | 125.1 | 434.5 | 1.00 |
| neqo | neqo | cubic | on | 196.8 ± 74.1 | 131.4 | 381.0 | 1.00 |
| neqo | neqo | cubic | | 249.7 ± 111.5 | 133.1 | 471.2 | 1.00 |


@larseggert marked this pull request as draft on July 22, 2024 00:13

github-actions bot commented Jul 22, 2024

Firefox builds for this PR

The following builds are available for testing. Crossed-out builds did not succeed.

  • Linux: Debug Release
  • macOS: Debug Release
  • Windows: Debug Release

@larseggert changed the title from "fix: Factor in max ACK delay when declaring packets lost" to "fix: Don't do time-based loss detection before largest_acked" on Jul 22, 2024

codecov bot commented Jul 22, 2024

Codecov Report

Attention: Patch coverage is 99.08257% with 1 line in your changes missing coverage. Please review.

Please upload report for BASE (main@d6279bf). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| neqo-transport/src/path.rs | 87.50% | 1 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1998   +/-   ##
=======================================
  Coverage        ?   95.31%           
=======================================
  Files           ?      112           
  Lines           ?    36352           
  Branches        ?        0           
=======================================
  Hits            ?    34648           
  Misses          ?     1704           
  Partials        ?        0           


@larseggert (Collaborator, Author) commented:

Kwik failure: ptrd/kwik#46

@larseggert (Collaborator, Author) commented:

Quinn failure: quinn-rs/quinn#1929

@larseggert marked this pull request as ready for review on July 22, 2024 04:53
@mxinden (Collaborator) left a comment

Good catch! Patch looks good to me.

Review threads (outdated, resolved):
  • neqo-transport/src/recovery/mod.rs
  • test-fixture/src/lib.rs
@larseggert marked this pull request as draft on July 22, 2024 13:20
@larseggert mentioned this pull request on Jul 22, 2024
@martinthomson (Member) left a comment

Seems good.

Review thread (outdated, resolved): neqo-transport/src/recovery/mod.rs
@larseggert marked this pull request as ready for review on July 22, 2024 21:00
larseggert added a commit to larseggert/neqo that referenced this pull request Sep 16, 2024
If we pace, we might get the initial server flight before sending
sufficient 0-RTT data to pass the QNS check.

Broken out of mozilla#1998
larseggert added a commit to larseggert/neqo that referenced this pull request Sep 16, 2024
This caused some test failures, and @martinthomson discovered this
as the reason at the IETF 120 hackathon.

Broken out of mozilla#1998
larseggert added a commit to larseggert/neqo that referenced this pull request Sep 16, 2024
Packets are only declared as lost relative to `largest_acked`. If we hit
a PTO while we don't have a largest_acked yet, also do a congestion
control reaction (because otherwise none would happen).

Broken out of mozilla#1998
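A minimal sketch of this idea (trait and function names are hypothetical, not neqo's actual types):

```rust
// When a PTO fires before any ACK has established largest_acked, loss detection
// cannot declare packets lost, so trigger a congestion-control reaction explicitly.
trait CongestionControl {
    fn on_congestion_event(&mut self);
}

fn on_pto_fired<C: CongestionControl>(cc: &mut C, largest_acked: Option<u64>) {
    if largest_acked.is_none() {
        // No ACK in this packet number space yet: the time/packet thresholds are
        // disabled, so react to the PTO here instead of doing nothing.
        cc.on_congestion_event();
    }
    // Otherwise, normal loss detection relative to largest_acked will drive the
    // congestion controller as packets cross the thresholds.
}
```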
larseggert added a commit to larseggert/neqo that referenced this pull request Sep 16, 2024
This is what RFC 9000 says, and it helps with servers that expect to see ACKs in
the Initial packet number space. It also eliminates a
difference in behavior we had when a client received an Initial packet
containing an ACK and a CRYPTO frame vs. a CRYPTO frame and an ACK,
where that CRYPTO frame caused the Initial packet number space to be
dropped.

Broken out of mozilla#1998
github-merge-queue bot pushed a commit that referenced this pull request Sep 16, 2024
If we pace, we might get the initial server flight before sending
sufficient 0-RTT data to pass the QNS check.

Broken out of #1998
larseggert added a commit to larseggert/neqo that referenced this pull request Sep 16, 2024
We don't track which packets are coalesced with others, so when we
detect a loss in one packet number space, we cannot resend any coalesced
packets in other packet number spaces. This PR tries to approximate this
behavior by scheduling un-ACKed Handshake and 0-RTT packets for RTX when
packets are lost in the Initial space.

Broken out of mozilla#1998
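A rough sketch of that approximation (hypothetical types, not the actual neqo code): when loss is detected in the Initial space, the contents of un-ACKed Handshake and 0-RTT packets also become eligible for retransmission, since any of them may have been coalesced into the same lost datagrams.

```rust
#[derive(PartialEq)]
enum Space {
    Initial,
    Handshake,
    ZeroRtt,
}

struct SentPacket {
    space: Space,
    acked: bool,
    needs_retransmit: bool,
}

fn on_initial_loss(sent: &mut [SentPacket]) {
    for p in sent.iter_mut().filter(|p| !p.acked && p.space != Space::Initial) {
        // The packet is not declared lost for congestion-control purposes;
        // its frames are simply queued for retransmission.
        p.needs_retransmit = true;
    }
}
```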
larseggert added a commit to larseggert/neqo that referenced this pull request Sep 17, 2024
Function is only used internally.

Broken out of mozilla#1998
github-merge-queue bot pushed a commit that referenced this pull request Sep 17, 2024
Function is only used internally.

Broken out of #1998
github-merge-queue bot pushed a commit that referenced this pull request Sep 17, 2024
* fix: Don't encode large RTT guesses in tickets

Because under lossy conditions (e.g., QNS `handshakeloss` test), the
guess can be multiple times the actual RTT, which when encoded in the
resumption ticket will cause an extremely slow second handshake, often
causing the test to time out.

Broken out of #1998
Fixes #2088

* Fixes & tests

* Suggestion from @mxinden

* Fix
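A minimal sketch of the idea in that commit (the threshold and names are assumptions, not the actual neqo code): skip encoding an RTT estimate into the resumption ticket when it is still a guess and implausibly large, so a lossy handshake does not slow down the next connection.

```rust
use std::time::Duration;

const MAX_TICKET_RTT: Duration = Duration::from_millis(500); // assumed cap

fn rtt_for_ticket(estimated_rtt: Duration, is_guessed: bool) -> Option<Duration> {
    if is_guessed && estimated_rtt > MAX_TICKET_RTT {
        None // don't encode a wild guess into the ticket
    } else {
        Some(estimated_rtt)
    }
}
```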
larseggert added a commit that referenced this pull request Sep 18, 2024
We don't track which packets are coalesced with others, so when we
detect a loss in one packet number space, we cannot resend any coalesced
packets in other packet number spaces. This PR tries to approximate this
behavior by scheduling un-ACKed Handshake and 0-RTT packets for RTX when
packets are lost in the Initial space.

Broken out of #1998
larseggert added a commit that referenced this pull request Sep 18, 2024
This caused some test failures, and @martinthomson discovered this
as the reason at the IETF 120 hackathon.

Broken out of #1998
Signed-off-by: Lars Eggert <lars@eggert.org>
larseggert added a commit to larseggert/neqo that referenced this pull request Sep 18, 2024
Rather than having the caller determine for which space a PTO should be calculated.

Broken out of mozilla#1998

I'm ambivalent about whether we want this change - thoughts?
larseggert added a commit to larseggert/neqo that referenced this pull request Sep 18, 2024
These should pass once the PR series replacing mozilla#1998 has landed.

Broken out of mozilla#1998
larseggert added a commit to larseggert/neqo that referenced this pull request Sep 18, 2024
larseggert added a commit to larseggert/neqo that referenced this pull request Sep 18, 2024
larseggert added a commit to larseggert/neqo that referenced this pull request Sep 19, 2024
We previously only marked one or two packets as lost when a PTO fired. That
meant we potentially didn't retransmit all the data we could have (i.e., data
in lost packets that we didn't mark as lost).

This also changes the probing code to suppress redundant keep-alives: PINGs
that we sent for other reasons now double as keep-alives, rather than having a
separate keep-alive PING queued alongside them.

Broken out of mozilla#1998
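A sketch of the keep-alive suppression part (hypothetical names, not the actual neqo code): if a PING is already queued for another reason (e.g., as a PTO probe), let it double as the keep-alive instead of scheduling a second, redundant PING.

```rust
#[derive(Default)]
struct ProbeState {
    ping_pending: bool,
}

impl ProbeState {
    fn request_keep_alive(&mut self) {
        if self.ping_pending {
            // The outstanding PING already elicits an ACK from the peer, which is
            // all a keep-alive needs; don't queue another one.
            return;
        }
        self.ping_pending = true;
    }
}
```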
github-merge-queue bot pushed a commit that referenced this pull request Oct 1, 2024
* fix: Add `confirmed` parameter to PTO calculation

Rather than having the caller determine for which space a PTO should be calculated.

Broken out of #1998

I'm ambivalent about whether we want this change - thoughts?

* More from #1998

* More from #1998
Labels: none yet
Projects: none yet

Successfully merging this pull request may close these issues:

  • zerortt QNS interop failures
  • Huge timeout in L1 QNS test
4 participants