
async-hooks.test-writewrap is flaky #54809

Open
anonrig opened this issue Sep 6, 2024 · 12 comments
Labels
async_hooks: Issues and PRs related to the async hooks subsystem.
flaky-test: Issues and PRs related to the tests with unstable failures on the CI.
linux: Issues and PRs related to the Linux platform.

Comments

@anonrig (Member)

anonrig commented Sep 6, 2024

Test

async-hooks.test-writewrap

Platform

Linux x64

Console output

14:06:24 not ok 3869 async-hooks/test-writewrap
14:06:24   ---
14:06:24   duration_ms: 120029.98300
14:06:24   severity: fail
14:06:24   exitcode: -15
14:06:24   stack: |-
14:06:24     timeout
14:06:24   ...

Build links

Additional information

No response

@anonrig anonrig added the flaky-test Issues and PRs related to the tests with unstable failures on the CI. label Sep 6, 2024
@github-actions github-actions bot added the linux Issues and PRs related to the Linux platform. label Sep 6, 2024
@RedYetiDev RedYetiDev added the async_hooks Issues and PRs related to the async hooks subsystem. label Sep 6, 2024
@aduh95 (Contributor)

aduh95 commented Sep 6, 2024

Reproducible locally (macOS ARM). It looks like a native issue (CRASHED (Signal: 11) means SIGSEGV):

tools/test.py --repeat 9999 test/async-hooks/test-writewrap.js
=== release test-writewrap ===                    
Path: async-hooks/test-writewrap
Command: out/Release/node /Users/duhamean/Documents/node/test/async-hooks/test-writewrap.js
--- CRASHED (Signal: 11) ---


=== release test-writewrap ===                    
Path: async-hooks/test-writewrap
Command: out/Release/node /Users/duhamean/Documents/node/test/async-hooks/test-writewrap.js
--- CRASHED (Signal: 11) ---


=== release test-writewrap ===                    
Path: async-hooks/test-writewrap
Command: out/Release/node /Users/duhamean/Documents/node/test/async-hooks/test-writewrap.js
--- CRASHED (Signal: 11) ---


=== release test-writewrap ===                    
Path: async-hooks/test-writewrap
Command: out/Release/node /Users/duhamean/Documents/node/test/async-hooks/test-writewrap.js
--- TIMEOUT ---


=== release test-writewrap ===                    
Path: async-hooks/test-writewrap
Command: out/Release/node /Users/duhamean/Documents/node/test/async-hooks/test-writewrap.js
--- TIMEOUT ---


=== release test-writewrap ===                    
Path: async-hooks/test-writewrap
Command: out/Release/node /Users/duhamean/Documents/node/test/async-hooks/test-writewrap.js
--- TIMEOUT ---


[08:07|% 100|+ 9993|-   6]: Done      

@lpinca (Member)

lpinca commented Sep 7, 2024

I get no failures on macOS:

$ npx envinfo --system

  System:
    OS: macOS 14.6.1
    CPU: (16) x64 Intel(R) Xeon(R) W-2140B CPU @ 3.20GHz
    Memory: 7.72 GB / 32.00 GB
    Shell: 5.2.32 - /usr/local/bin/bash

$ tools/test.py --repeat 9999 test/async-hooks/test-writewrap.js
[03:03|% 100|+ 9999|-   0]: Done                  

All tests passed.

@RedYetiDev (Member)

FWIW I can't reproduce on Linux x64:

  System:
    OS: Linux 6.8 Kali GNU/Linux Rolling 2024.3
    CPU: (12) x64 AMD Ryzen 5 5600G with Radeon Graphics
    Memory: 21.71 GB / 27.30 GB
    Container: Yes
    Shell: 5.9 - /usr/bin/zsh
$ tools/test.py --repeat 9999 test/async-hooks/test-writewrap.js
[02:35|% 100|+ 9999|-   0]: Done                  

All tests passed.

Maybe it has to do with the amount of memory available?

@jasnell (Member)

jasnell commented Sep 8, 2024

I think with many of these timeout tests we should consider moving them into the sequential suite to see if that helps

@lpinca (Member)

lpinca commented Sep 8, 2024

@jasnell no, we will likely end up hiding a real bug. Read the discussions in #52959 and #52964.

@jasnell (Member)

jasnell commented Sep 8, 2024

@lpinca ... we don't have to actually land the PR. I want to determine if moving the test would make it less flaky which can be done with a PR and a few stress test runs. The PR doesn't actually have to land. That said, we have to get the CI back to a point where we aren't having to do 10-15 CI runs on a PR just to get it landed. If that means we freeze all development on everything else until CI is green again, let's do that.

@lpinca (Member)

lpinca commented Sep 8, 2024

> I want to determine if moving the test would make it less flaky which can be done with a PR and a few stress test runs. The PR doesn't actually have to land.

Ofc it will, it will not trigger the issue... The stress test CI does not run the test in parallel.

> That said, we have to get the CI back to a point where we aren't having to do 10-15 CI runs on a PR just to get it landed.

Let's start by fixing the issues with the macOS machines which is what actually blocked development.

@lpinca (Member)

lpinca commented Sep 8, 2024

@jasnell try this

python3 tools/test.py --repeat=10000 test/parallel/test-net-write-fully-async-hex-string.js -t 10

or

python3 tools/test.py --repeat=10000 test/parallel/test-buffer-tostring-rangeerror.js -t 10

I think that if we find the reason why the process sometimes hangs we will be able to fix the timeout issue for many tests.

@jasnell (Member)

jasnell commented Sep 8, 2024

> Let's start by fixing the issues with the macOS machines which is what actually blocked development.

I can't do anything about the macOS machines right now. I can work on investigating these timeouts, which is what I'm trying to do.

> Ofc it will, it will not trigger the issue....

Has that been tested? If so, can you point out where?

@jasnell (Member)

jasnell commented Sep 8, 2024

Going through the test-net-write-fully-async-hex-string.js test... I'm able to eliminate timeouts in local testing by reducing the size of the initial buffer allocation. If I keep the allocation under the default high water mark for the stream it appears to never time out. If the buffer passed into the write operation repeatedly is over the high water mark, it times out. The higher the allocation, the more frequently the test times out.
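
As a rough illustration of that pattern (a sketch only, not the actual test; ALLOC_SIZE and the write count are made-up example values), the shape of the experiment is: open a socket, then repeatedly write a buffer sized just over the socket's writableHighWaterMark so that write() reports backpressure:

'use strict';
// Sketch only (not test-net-write-fully-async-hex-string.js): repeatedly write a
// buffer sized just over the socket's writableHighWaterMark and lean on 'drain'
// for backpressure. ALLOC_SIZE and the write count are arbitrary example values.
const net = require('net');

const ALLOC_SIZE = 65600; // the variable being adjusted in the runs described above
const chunk = Buffer.alloc(ALLOC_SIZE, 'x');

const server = net.createServer((socket) => socket.resume());

server.listen(0, () => {
  const client = net.connect(server.address().port, () => {
    let remaining = 100; // arbitrary number of writes for the sketch
    const writeLoop = () => {
      while (remaining > 0) {
        remaining--;
        // write() returns false once the buffered bytes exceed writableHighWaterMark;
        // that is the backpressure condition being discussed.
        if (!client.write(chunk)) {
          client.once('drain', writeLoop);
          return;
        }
      }
      client.end();
      server.close();
    };
    writeLoop();
  });
});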

@jasnell (Member)

jasnell commented Sep 9, 2024

Running some local tests on test-net-write-fully-async-hex-string.js ... here are the results I'm getting locally

Allocation Size    Average Timeouts per 10000 Repetitions    Runs    Timeout
65600              8                                          10      60s
65580              4                                          10      60s
65570              0                                          10      60s
65560              0                                          10      60s
65550              0                                          10      60s

So far this is the only variable that I've been able to identify that reliably alters the rate of timeout on this particular test.

The reason for choosing 65550 is that it is fairly close to (but still over) the default high water mark for the stream. Setting the highWaterMark to 65600 appears to reduce the rate of timeouts when using 65600 as the allocation size.

This makes me think that the tests that are timing out are hitting some intersection between V8 GC and streams backpressure management; I'm unable to reproduce any timeouts for allocations smaller than the default high water mark size.
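
For reference, a quick way to see the default high water mark the comparison is against (again just a sketch, not part of the test):

'use strict';
// Sketch only: print the default writableHighWaterMark of a net socket so the
// allocation sizes in the table above can be compared against it directly.
const net = require('net');

const server = net.createServer((socket) => socket.resume());
server.listen(0, () => {
  const client = net.connect(server.address().port, () => {
    console.log('default writableHighWaterMark:', client.writableHighWaterMark);
    client.end();
    server.close();
  });
});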

/cc @mcollina @ronag @anonrig for visibility

@lpinca (Member)

lpinca commented Sep 9, 2024

@jasnell in my environment the issue seems different, see #52959 (comment).
The server emits 'close', meaning that all data is flushed. It does not seem to be a backpressure issue. Anyway, let's continue the discussion on #52959 so as not to create noise here.
