Add 65k GKE benchmark #898

Merged on Dec 9, 2024 (4 commits)

besher-massri
Contributor

This PR adds a benchmark for GKE at a scale of 65,000 nodes.

It follows up on the recent GKE announcement of support for clusters with 65,000 nodes.

It benchmarks GKE on a (simulated) AI workload using Terraform and ClusterLoader2.

I didn't include it in the benchmarks directory because the benchmarking process here is different (it uses CL2) and has a different infra setup.

Collaborator

@achandrasekar achandrasekar left a comment

I noticed you mentioned that the infra setup and the ClusterLoader tool make this different from the other benchmarks in the benchmarks directory. It might still make sense to consolidate under the benchmarks directory: the infra setup could go under infra (maybe even in a separate 65k folder), and the CL2 and workload automation could go under the benchmark/tools directory.

@andrewsykim
Collaborator

andrewsykim commented Dec 3, 2024

Agree with @achandrasekar that it would be good to put this under the existing benchmarks/ directory. Is it possible to make that change?

I didn't include it in the benchmarks directory because the benchmarking process here is different (it uses CL2) and has a different infra setup.

I think the tooling being different is not a blocker, but it would be good to consolidate eventually.

@besher-massri
Contributor Author

Agree with Ashok Chandrasekar that it would be good to put this under the existing benchmarks/ directory. Is it possible to make that change?

I didn't include it in the benchmarks directory because the benchmarking process here is different (it uses CL2) and has a different infra setup.

I think the tooling being different is not a blocker, but it would be good to consolidate eventually.

I made an attempt at reorganizing the benchmark. Here is a summary of the changes:

  • put the existing infra under infra/gpu-cluster
  • moved the 65k infra part under infra, in a 65k-cpu-cluster folder
  • moved the CL2 part under benchmark/tools
  • renamed the current README to gpu-benchmark.md
  • moved the 65k README and renamed it to 65k-nodes-simulated-ai-workload
  • added a root-level README for the benchmarks/ directory

@achandrasekar
Collaborator

Thanks for refactoring @besher-massri! @annapendleton can you take a look to make sure this doesn't affect the existing benchmarks?

Collaborator

@annapendleton annapendleton left a comment

This is awesome work! Everything LGTM, and it shouldn't break existing functionality since it's just migrating locations. The only thing is that we want to accurately represent what the infra does, so people understand that it supports both GPU and TPU infra.

@besher-massri
Contributor Author

Thanks for the review and comments! I pushed another commit that changes "gpu" to "accelerator" and highlights that both GPU and TPU clusters are supported. I also changed some naming and did some minor rewording. Let me know if it's okay.

@annapendleton
Collaborator

That looks great, thanks! Please make sure to resolve anything lingering with Andrew prior to submission :)

Collaborator

@andrewsykim andrewsykim left a comment

LGTM, deferring merge to @genlu2011

@andrewsykim
Collaborator

/gcbrun

@andrewsykim andrewsykim merged commit 6e8c342 into GoogleCloudPlatform:main Dec 9, 2024
7 checks passed
leroyjb pushed a commit to leroyjb/ai-on-gke that referenced this pull request Jan 24, 2025
* Add 65k GKE benchmark

* Move 65k benchmark into benchmarks

* GPU->Accelerator rewording

* Format terraform files