
[BE] consolidate 4-GPU integration tests into 8-GPU tests and reduce frequency #745

Closed
wants to merge 3 commits

Conversation

tianyu-l
Contributor

@tianyu-l tianyu-l commented Dec 17, 2024

Stack from ghstack (oldest at bottom):

Previously, CI had two sets of GPU tests: (1) a 4-GPU test that runs hourly, and (2) an 8-GPU test that runs daily.

  • I think the hourly frequency of (1) is unnecessary: the only thing outside PR changes that can break CI is a new pytorch-nightly release. This PR reduces the frequency to, e.g., every 6 hours (4 times a day); a sketch of the resulting workflow trigger follows the results below.
  • Since the frequency of (1) is now close to that of (2), we can consolidate them into one YAML workflow. The overall “cost” might be lower rather than higher, considering we only launch the container once. It is also less confusing to show a single “Integration Tests” badge on the README.

Results:

  • CI used to take ~16 min on the 4-GPU test and ~10 min on the 8-GPU test
  • It now takes 18.5 min on the consolidated 8-GPU test -- the main saving comes from pulling the Docker image only once (around 4 min)
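
For illustration, here is a minimal sketch of what the consolidated workflow trigger could look like, assuming the existing integration_test_8gpu.yaml keeps its push/pull_request triggers and gains a 6-hourly cron in place of the old hourly one (the exact trigger set and cron expression are assumptions, not taken verbatim from this PR):

```yaml
# .github/workflows/integration_test_8gpu.yaml -- sketch only, not the exact file contents
name: Integration Tests

on:
  push:
    branches: [main]
  pull_request:
  schedule:
    # assumed expression: every 6 hours, i.e. 4 runs per day instead of hourly
    - cron: "0 */6 * * *"
```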

tianyu-l added a commit that referenced this pull request Dec 17, 2024
…ency

ghstack-source-id: 6f17554a2d4996c27e386adffee6a281658a3a56
Pull Request resolved: #745
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Dec 17, 2024
@tianyu-l tianyu-l changed the title consolidate 4-GPU integration tests into 8-GPU tests and reduce frequency [BE] consolidate 4-GPU integration tests into 8-GPU tests and reduce frequency Dec 17, 2024
Contributor

@XilunWu XilunWu left a comment

Left some questions.

@@ -21,7 +21,7 @@ jobs:
     with:
       runner: linux.g5.48xlarge.nvidia.gpu
       gpu-arch-type: cuda
-      gpu-arch-version: "12.1"
+      gpu-arch-version: "12.4"
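
(For context, the job block this hunk touches would read roughly as below after the CUDA bump. Only runner, gpu-arch-type, and gpu-arch-version come from the diff above; the job name, the reusable-workflow reference, and the script placeholder are assumptions.)

```yaml
# sketch of the surrounding job definition -- assumed shape, not the exact file
jobs:
  build-test:
    uses: pytorch/test-infra/.github/workflows/linux_job.yml@main  # assumed reusable workflow
    with:
      runner: linux.g5.48xlarge.nvidia.gpu   # 8-GPU AWS g5 instance (from the diff)
      gpu-arch-type: cuda
      gpu-arch-version: "12.4"
      script: |
        # placeholder: the repo's integration test commands go here (omitted)
        echo "run integration tests"
```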
Contributor

Has someone been running local tests with 12.4 CUDA? Does it work?

README.md Outdated
@@ -1,5 +1,4 @@
-[![4 GPU Integration Test](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_4gpu.yaml/badge.svg?branch=main)](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_4gpu.yaml?query=branch%3Amain)
-[![8 GPU Integration Test](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_8gpu.yaml/badge.svg?branch=main)](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_8gpu.yaml?query=branch%3Amain)
+[![Integration Tests](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_8gpu.yaml/badge.svg?branch=main)](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_8gpu.yaml?query=branch%3Amain)
Contributor

Will we only use the 8gpu AWS instance for CI from now on?

Contributor Author

@tianyu-l tianyu-l Dec 17, 2024

Yes, that is the idea of this PR. Please share any concerns you may have!

tianyu-l added a commit that referenced this pull request Dec 17, 2024
…frequency

ghstack-source-id: 5e926c4f72bcf7d56c06f3ac0eae57fd235975ee
Pull Request resolved: #745
tianyu-l added a commit that referenced this pull request Dec 17, 2024
…frequency

ghstack-source-id: fd2ee0f4fa61861844f206135f0c6db48c51bbd0
Pull Request resolved: #745
@tianyu-l
Contributor Author

Encountered ghstack issues; reopening in #750.

@tianyu-l tianyu-l closed this Dec 18, 2024
@tianyu-l tianyu-l deleted the gh/tianyu-l/31/head branch December 18, 2024 00:42
Labels
CLA Signed This label is managed by the Meta Open Source bot.
4 participants