Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[test-triton.sh] Add an option to unskip all pytest.skip() calls #1345

Merged
merged 3 commits into from
Jun 29, 2024

Conversation

AshburnLee
Copy link
Contributor

@AshburnLee AshburnLee commented Jun 13, 2024

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these
    rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because does not change any triton code.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

Functionality

Fix #987

Usage

./scripts/test-triton.sh default run
./scripts/test-triton.sh --unskip --ignore-errrors replace all pytest.skip with pass

PR also support unskip test on CI, you can manually trigger test from CI, and set enable_unskip as true (false as default).
If CI test is triggered frompull_request or push, TEST_UNSKIP is always false.

Result

Take test_dot as an example:

./scripts/test-triton.sh --unskip

FAILED language/test_core.py::test_dot[1-128-128-64-4-False-False-chain-dot-ieee-float8e5-float32-1] - RuntimeError: Triton Error [ZE]: 0x70000004
FAILED language/test_core.py::test_dot[1-128-128-64-4-False-False-chain-dot-ieee-float8e4nv-float32-1] - RuntimeError: Triton Error [ZE]: 0x70000004
./scripts/test-triton.sh

[gw7] [ 80%] SKIPPED language/test_core.py::test_dot[1-128-128-64-4-False-False-chain-dot-ieee-float8e5-float32-1]
[gw7] [ 80%] SKIPPED language/test_core.py::test_dot[1-128-128-64-4-False-False-chain-dot-ieee-float8e4nv-float32-1] 

Results are expected.

@AshburnLee AshburnLee self-assigned this Jun 13, 2024
@AshburnLee AshburnLee changed the title [test-triton.sh] Add an option to unskip all pytest.skip() calls [do not review] [test-triton.sh] Add an option to unskip all pytest.skip() calls Jun 13, 2024
@AshburnLee AshburnLee force-pushed the lijunhui/option branch 2 times, most recently from 4ce54b4 to 34f7043 Compare June 14, 2024 07:56
@AshburnLee AshburnLee requested a review from vlad-penkin June 14, 2024 07:56
@AshburnLee AshburnLee changed the title [do not review] [test-triton.sh] Add an option to unskip all pytest.skip() calls [test-triton.sh] Add an option to unskip all pytest.skip() calls Jun 14, 2024
@AshburnLee AshburnLee requested a review from pbchekin June 14, 2024 15:11
@pbchekin
Copy link
Contributor

When enable_unskip is true you also need to disable the skip list.

@AshburnLee
Copy link
Contributor Author

When enable_unskip is true you also need to disable the skip list.

Done, thanks~

scripts/pytest-utils.sh Outdated Show resolved Hide resolved
scripts/pytest-utils.sh Outdated Show resolved Hide resolved
@AshburnLee AshburnLee force-pushed the lijunhui/option branch 4 times, most recently from 854e7b6 to 4a09d71 Compare June 17, 2024 11:53
scripts/pytest-utils.sh Outdated Show resolved Hide resolved
@pbchekin pbchekin self-requested a review June 19, 2024 15:59
@pbchekin
Copy link
Contributor

Note that with enable_unskip = true the total number of tests is less than in a normal mode. Examples:

https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/9584339696/job/26427723595

all: passed: 16980, failed: 111, skipped: 196, xfailed: 1142, total: 18429, fixme: 15, pass rate (w/o xfailed): 98.22%

vs

https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/9583508901/job/26424953218

all: passed: 17603, failed: 0, skipped: 396, xfailed: 1240, total: 19239, fixme: 16, pass rate (w/o xfailed): 97.8%

In both modes, it must be the same total number of tests.

@AshburnLee
Copy link
Contributor Author

AshburnLee commented Jun 20, 2024

In both modes, it must be the same total number of tests.

When TEST_UNSKIP mode, deselect file is not enabled, so cases in the deselect file actually are counted twice!

Take subprogress as an example:

  1. TEST_UNSKIP mode
    subprocess: passed: 22, failed: 12, skipped: 12, xfailed: 0, total: 46 ( 22+12+12=46 check)
  2. Normal mode
    subprocess: passed: 22, failed: 0, skipped: 12, xfailed: 0, total: 34 (22+12=34 check)

However in TEST_UNSKIP mode, 12 cases in deselect file were executed failed, they are also counted as skipped. That is why, we got failed: 12, skipped: 12. I've checked the log, 12 cases appeared in both log & deselect file.

The same goes operator: 3 skiplist cases passed, and twice counted as skipped, so (TEST_UNSKIP mode)964-3=(normal mode)
961.

I have not checked the language, as there are too many of them, and need to check why TEST_UNSKIP mode is less than in a normal mode. But I think we have narrowed down the scope of the cause of the problem.

This is a bug from pass_rate.py, I think.
Let close this PR and I'll solve pass_rate.py bug in another PR?

@pbchekin
Copy link
Contributor

This is a bug from pass_rate.py, I think.

This is not a bug, pass_rate.py does not know about TEST_UNSKIP, you need to make a change to the script.

Let close this PR and I'll solve pass_rate.py bug in another PR?

We will merge this PR when the total number of tests will be the same in both modes, and the number of passed tests in TEST_UNSKIP=true mode will be equal or more than in TEST_UNSKIP=false mode. Currently the both conditions are not met.

@AshburnLee
Copy link
Contributor Author

This is a bug from pass_rate.py, I think.

This is not a bug, pass_rate.py does not know about TEST_UNSKIP, you need to make a change to the script.

Let close this PR and I'll solve pass_rate.py bug in another PR?

We will merge this PR when the total number of tests will be the same in both modes, and the number of passed tests in TEST_UNSKIP=true mode will be equal or more than in TEST_UNSKIP=false mode. Currently the both conditions are not met.

That's reasonable. Working on it.

@AshburnLee AshburnLee force-pushed the lijunhui/option branch 2 times, most recently from ddccfdf to f7b54b6 Compare June 22, 2024 03:45
@AshburnLee
Copy link
Contributor Author

That's reasonable. Working on it.

After digging in, found that, it is not calculation error, but an execution error. In unskip mode, language cases were not fully executed! you will find log:

2024-06-19T15:31:20.2909782Z [gw4] node down: Not properly terminated
2024-06-19T15:31:20.2920296Z [gw4] [ 91%] FAILED language/test_core.py::test_dot3d[int8-int8-64-64-64-1-1] 
2024-06-19T15:31:20.2922403Z 
2024-06-19T15:31:20.2946993Z maximum crashed workers reached: 32
2024-06-19T15:31:20.4845988Z 
2024-06-19T15:31:20.4848034Z =================================== FAILURES ===================================
2024-06-19T15:31:20.4850241Z _ test_precise_math[1-tl.math.sqrt_rn(x)-tl.math.sqrt(x.to(tl.float64)).to(tl.float32)] 

It stopped at 91%

While in default mode:

2024-06-19T14:29:22.2721110Z [gw4] [ 99%] PASSED language/test_standard.py::test_flip[bfloat16-512-8] 
2024-06-19T14:29:24.3475211Z [gw0] [100%] PASSED language/test_standard.py::test_flip[float16-512-8] 
2024-06-19T14:29:24.3481540Z 
2024-06-19T14:29:24.3483144Z =============================== warnings summary ===============================
2024-06-19T14:29:24.3484227Z test/unit/language/test_core.py: 99 warnings

It stopped at 100%
And (total cases in unskip/total case in normal) = 10377/11202 = 92.6%

@AshburnLee AshburnLee force-pushed the lijunhui/option branch 3 times, most recently from fddb48f to 9d2120d Compare June 26, 2024 05:57
@AshburnLee
Copy link
Contributor Author

AshburnLee commented Jun 26, 2024

In un-skip mode, ./scripts/pre-run.sh will be executed to generate a scripts/skiplist/default/coredump.txt(in test-triton.sh & build-test.yml), which will be de-selected during formal run and selected in pass_rate.py.

Actually there are two option to ensure test fully being executed:

  1. Run ./scripts/pre-run.sh before formal run. Or
  2. Upload scripts/skiplist/default/coredump.txt and remove ./scripts/pre-run.sh. This will save some time in un-skip mode

Please have a look, @pbchekin @vlad-penkin , I prefer option#2, how do you think?

@AshburnLee AshburnLee force-pushed the lijunhui/option branch 2 times, most recently from bd0247c to bea172a Compare June 26, 2024 08:58
Copy link
Contributor

@pbchekin pbchekin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The run didn't finish for LTS, please take a look. Also there are several comments for the code.

.github/workflows/build-test.yml Outdated Show resolved Hide resolved
scripts/pass_rate.py Outdated Show resolved Hide resolved
scripts/pass_rate.py Outdated Show resolved Hide resolved
scripts/pass_rate.py Outdated Show resolved Hide resolved
scripts/pass_rate.py Outdated Show resolved Hide resolved
scripts/pytest-utils.sh Outdated Show resolved Hide resolved
> "$TESTED_CASES"
> "$TMP_LOG"

while true; do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the idea to restart pytest after test crash, pytest-xdist automatically handles this, see https://pytest-xdist.readthedocs.io/en/stable/crash.html. The pre-run.sh script basically runs all core tests, then you run them again in the workflow, that is a waste of time and compute resources. Potentially you just need to set --max-worker-restart to a bigger number and do everything in one pass.

Copy link
Contributor Author

@AshburnLee AshburnLee Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--max-worker-restart works, core-dump cases are classified as FAILED.

  • unskip mode:
    language: passed: 10351, failed: 344, skipped: 1, xfailed: 521, total: 11217, fixme: 0, pass rate (w/o xfailed): 96.77%
    line_info: passed: 5, failed: 0, skipped: 0, xfailed: 0, total: 5, fixme: 0, pass rate (w/o xfailed): 100.0%
    regression: passed: 16, failed: 0, skipped: 0, xfailed: 8, total: 24, fixme: 0, pass rate (w/o xfailed): 100.0%
    runtime: passed: 39, failed: 0, skipped: 0, xfailed: 8, total: 47, fixme: 0, pass rate (w/o xfailed): 100.0%
    subprocess: passed: 22, failed: 12, skipped: 0, xfailed: 0, total: 34, fixme: 0, pass rate (w/o xfailed): 64.71%
    all: passed: 10433, failed: 356, skipped: 1, xfailed: 537, total: 11327, fixme: 0, pass rate (w/o xfailed): 96.69%

  • default mode:
    language: passed: 10349, failed: 0, skipped: 347, xfailed: 521, total: 11217, fixme: 0, pass rate (w/o xfailed): 96.76%
    line_info: passed: 5, failed: 0, skipped: 0, xfailed: 0, total: 5, fixme: 0, pass rate (w/o xfailed): 100.0%
    regression: passed: 16, failed: 0, skipped: 0, xfailed: 8, total: 24, fixme: 0, pass rate (w/o xfailed): 100.0%
    runtime: passed: 39, failed: 0, skipped: 0, xfailed: 8, total: 47, fixme: 0, pass rate (w/o xfailed): 100.0%
    subprocess: passed: 22, failed: 0, skipped: 12, xfailed: 0, total: 34, fixme: 0, pass rate (w/o xfailed): 64.71%
    all: passed: 10431, failed: 0, skipped: 359, xfailed: 537, total: 11327, fixme: 0, pass rate (w/o xfailed): 96.67%

Let's see the CI.

Copy link
Contributor Author

@AshburnLee AshburnLee Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is still 1 case missing from CI, I set -n8 --max-worker-restart=300, It stopped at 99%. 😅
In fact the above configure does not always reach 100% from local.
So I change it to 500.

@pbchekin pbchekin self-requested a review June 26, 2024 19:47
@AshburnLee
Copy link
Contributor Author

The run didn't finish for LTS, please take a look. Also there are several comments for the code.

Because the job running on runner triton-fox125-1 has exceeded the maximum execution time of 360 minutes. LTS run got canceled by CI .

@AshburnLee AshburnLee force-pushed the lijunhui/option branch 3 times, most recently from e8bc8f0 to 289d432 Compare June 27, 2024 09:47
.github/workflows/build-test-reusable.yml Outdated Show resolved Hide resolved
scripts/test-triton.sh Outdated Show resolved Hide resolved
scripts/pytest-utils.sh Outdated Show resolved Hide resolved
@pbchekin pbchekin self-requested a review June 27, 2024 15:17
@AshburnLee AshburnLee merged commit fd06146 into llvm-target Jun 29, 2024
3 checks passed
@AshburnLee AshburnLee deleted the lijunhui/option branch August 9, 2024 02:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[test-triton.sh] Add an option to unskip all pytest.skip() calls
2 participants