
[BE] consolidate 4-GPU integration tests into 8-GPU tests and reduce frequency #745

Closed
wants to merge 3 commits

Conversation

tianyu-l
Contributor

@tianyu-l tianyu-l commented Dec 17, 2024

Stack from ghstack (oldest at bottom):

Previously, CI had two sets of GPU tests: (1) a 4-GPU test that runs hourly, and (2) an 8-GPU test that runs daily.

  • I think the hourly frequency of (1) is unnecessary: the only thing outside PR changes that can break CI is a new pytorch-nightly release. This PR reduces the frequency to, e.g., every 6 hours (4 times a day); a sketch of the resulting workflow trigger follows the results below.
  • Since the frequency of (1) is now close to that of (2), we can consolidate them into one YAML workflow. The overall “cost” might be lower rather than higher, considering we only launch the container once. It is also less confusing to show a single “Integration Tests” badge on the README.

Results:

  • CI used to take ~16 min on the 4-GPU test and ~10 min on the 8-GPU test
  • It now takes 18.5 min on the consolidated 8-GPU test -- the main saving comes from pulling the Docker image only once (around 4 min)
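
For illustration, here is a minimal sketch of what the consolidated workflow trigger could look like, assuming the existing integration_test_8gpu.yaml keeps its push/pull_request triggers and gains a 6-hourly cron in place of the old hourly one (the exact trigger set and cron expression are assumptions, not taken verbatim from this PR):

```yaml
# .github/workflows/integration_test_8gpu.yaml -- sketch only, not the exact file contents
name: Integration Tests

on:
  push:
    branches: [main]
  pull_request:
  schedule:
    # assumed expression: every 6 hours, i.e. 4 runs per day instead of hourly
    - cron: "0 */6 * * *"
```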

tianyu-l added a commit that referenced this pull request Dec 17, 2024
…ency

ghstack-source-id: 6f17554a2d4996c27e386adffee6a281658a3a56
Pull Request resolved: #745
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Dec 17, 2024
@tianyu-l tianyu-l changed the title consolidate 4-GPU integration tests into 8-GPU tests and reduce frequency [BE] consolidate 4-GPU integration tests into 8-GPU tests and reduce frequency Dec 17, 2024
Contributor

@XilunWu XilunWu left a comment

Left some questions.

@@ -21,7 +21,7 @@ jobs:
     with:
       runner: linux.g5.48xlarge.nvidia.gpu
       gpu-arch-type: cuda
-      gpu-arch-version: "12.1"
+      gpu-arch-version: "12.4"
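
(For context, the job block this hunk touches would read roughly as below after the CUDA bump. Only runner, gpu-arch-type, and gpu-arch-version come from the diff above; the job name, the reusable-workflow reference, and the script placeholder are assumptions.)

```yaml
# sketch of the surrounding job definition -- assumed shape, not the exact file
jobs:
  build-test:
    uses: pytorch/test-infra/.github/workflows/linux_job.yml@main  # assumed reusable workflow
    with:
      runner: linux.g5.48xlarge.nvidia.gpu   # 8-GPU AWS g5 instance (from the diff)
      gpu-arch-type: cuda
      gpu-arch-version: "12.4"
      script: |
        # placeholder: the repo's integration test commands go here (omitted)
        echo "run integration tests"
```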
Contributor

Has someone been running local tests with 12.4 CUDA? Does it work?

README.md Outdated
@@ -1,5 +1,4 @@
-[![4 GPU Integration Test](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_4gpu.yaml/badge.svg?branch=main)](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_4gpu.yaml?query=branch%3Amain)
-[![8 GPU Integration Test](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_8gpu.yaml/badge.svg?branch=main)](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_8gpu.yaml?query=branch%3Amain)
+[![Integration Tests](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_8gpu.yaml/badge.svg?branch=main)](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_8gpu.yaml?query=branch%3Amain)
Contributor

Will we only use the 8gpu AWS instance for CI from now on?

Contributor Author

@tianyu-l tianyu-l Dec 17, 2024

Yes, that is the idea of this PR. Please share any concerns you may have!

tianyu-l added a commit that referenced this pull request Dec 17, 2024
…frequency

ghstack-source-id: 5e926c4f72bcf7d56c06f3ac0eae57fd235975ee
Pull Request resolved: #745
tianyu-l added a commit that referenced this pull request Dec 17, 2024
…frequency

ghstack-source-id: fd2ee0f4fa61861844f206135f0c6db48c51bbd0
Pull Request resolved: #745
@tianyu-l
Contributor Author

Encountered ghstack issues; reopening in #750.

@tianyu-l tianyu-l closed this Dec 18, 2024
@tianyu-l tianyu-l deleted the gh/tianyu-l/31/head branch December 18, 2024 00:42
Labels
CLA Signed This label is managed by the Meta Open Source bot.
4 participants