[BE] consolidate 4-GPU integration tests into 8-GPU tests and reduce frequency #745
…ency [ghstack-poisoned]
…ency ghstack-source-id: 6f17554a2d4996c27e386adffee6a281658a3a56 Pull Request resolved: #745
Left some questions.
@@ -21,7 +21,7 @@ jobs:
     with:
       runner: linux.g5.48xlarge.nvidia.gpu
       gpu-arch-type: cuda
-      gpu-arch-version: "12.1"
+      gpu-arch-version: "12.4"
Has someone been running local tests with 12.4 CUDA? Does it work?
I haven't. To be honest I'm not sure.
This change is mainly for consistency with https://github.com/pytorch/torchtitan/pull/745/files#diff-e327f3f247423713ee949ef4eef6b82de392abca8c53137159d82f073510c4f9R39
README.md
Outdated
@@ -1,5 +1,4 @@
-[![4 GPU Integration Test](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_4gpu.yaml/badge.svg?branch=main)](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_4gpu.yaml?query=branch%3Amain)
-[![8 GPU Integration Test](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_8gpu.yaml/badge.svg?branch=main)](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_8gpu.yaml?query=branch%3Amain)
+[![Integration Tests](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_8gpu.yaml/badge.svg?branch=main)](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_8gpu.yaml?query=branch%3Amain)
Will we only use the 8gpu AWS instance for CI from now on?
Yes, this is the idea of this PR. Please share concerns you may have!
…and reduce frequency"
Previously, CI had two sets of GPU tests: (1) a 4-GPU test that ran hourly and (2) an 8-GPU test that ran daily.
- I think the hourly frequency of (1) is unnecessary: the only thing outside PR changes that can break CI is a pytorch-nightly release. This PR reduces the frequency to, e.g., every 6 hours (4 times a day).
- Since the frequency of (1) is now close to that of (2), we can consolidate them into one yaml. The overall “cost” might be lower rather than higher, considering we only launch the container once. It is also less confusing to show a single “integration tests” badge on the README.
Results:
- CI used to take ~16 min on the 4-GPU test and ~10 min on the 8-GPU test.
- It now takes 18.5 min on the 8-GPU test. The main saving comes from pulling the Docker image only once (around 4 min).
[ghstack-poisoned]
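For illustration only, the reduced frequency described in this update could be expressed as a scheduled trigger in a GitHub Actions workflow. This is a minimal sketch, not the contents of the PR's workflow file: the update only says "e.g. per-6-hours", so the exact cron expression and the workflow name shown here are assumptions.

```yaml
# Hypothetical fragment; the real integration_test_8gpu.yaml may differ.
name: Integration Tests  # assumed name, chosen to match the single README badge

on:
  schedule:
    # Assumed expression: minute 0 of every 6th hour, evaluated in UTC
    # (00:00, 06:00, 12:00, 18:00), i.e. 4 scheduled runs per day.
    - cron: "0 */6 * * *"
```

With a schedule like this, a breaking pytorch-nightly release would surface within roughly six hours instead of one, which is the trade-off the update accepts.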
…frequency ghstack-source-id: 5e926c4f72bcf7d56c06f3ac0eae57fd235975ee Pull Request resolved: #745
…frequency ghstack-source-id: fd2ee0f4fa61861844f206135f0c6db48c51bbd0 Pull Request resolved: #745
Encountered ghstack issues; reopening in #750.
Stack from ghstack (oldest at bottom):
Previously, CI had two sets of GPU tests: (1) a 4-GPU test that ran hourly and (2) an 8-GPU test that ran daily.