Add 65k GKE benchmark #898
I noticed you mentioned that the infra setup and the cluster loader tool make this different from the other benchmarks in the benchmarks directory. It might still make sense to consolidate under the benchmarks directory: the infra setup could go under infra (maybe even in a separate 65k folder), and the CL2 and workload automation could go under the benchmark/tools directory.
Agree with @achandrasekar that it would be good to put this under the existing
I think tooling being different is not a blocker, but it would be good to consolidate eventually.
77a6829
to
cc18c3b
Compare
I had an attempt at reorganizing the benchmark, here is a summary of the changes:
Thanks for refactoring, @besher-massri! @annapendleton, can you take a look to make sure this doesn't affect the existing benchmarks?
This is awesome work! Everything LGTM, and it shouldn't break existing functionality since it's just migrating locations. The only thing is that we want to accurately represent what the infra does, so people understand that it supports both GPU and TPU infra.
Thanks for the review and comments! I pushed another commit that changes "gpu" to "accelerator" and highlights that both GPU and TPU clusters are supported. I also changed some naming and did some minor rewording. Let me know if it's okay.
2cdb9f1
to
660a458
Compare
660a458
to
ff251c6
Compare
That looks great, thanks! Please make sure to resolve anything lingering with Andrew prior to submission :)
LGTM, deferring merge to @genlu2011
/gcbrun
* Add 65k GKE benchmark
* Move 65k benchmark into benchmarks
* GPU->Accelerator rewording
* Format terraform files
This PR adds a benchmark for GKE at 65,000-node scale.
It follows up on the recent GKE announcement of support for clusters at 65,000-node scale.
It benchmarks GKE on a (simulated) AI workload using Terraform and ClusterLoader2.
I didn't include it as part of the benchmarks directory, as the benchmarking process here is different (using CL2) and has a different infra setup.