Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding CPU support in XPK #56

Merged
merged 1 commit into from
Feb 5, 2024
Merged

Adding CPU support in XPK #56

merged 1 commit into from
Feb 5, 2024

Conversation

RoshaniN
Copy link
Collaborator

@RoshaniN RoshaniN commented Jan 16, 2024

Features

  • Adding single slice CPU support for XPK, works with n2-standard-32 machines.

Major changes:

  • allow for CPU device-type in the format <"n2-standard-32">-<# of n2-standard-32 VMs>
  • to avoid confusion with cluster's default pool CPU type, deprecated the argument "--cluster-cpu-machine-type" in favor of "--default-pool-cpu-machine-type" and added error check.
  • added nodeAffinity to workload create spec, which is useful for CPUs to ensure that workload pods don't get scheduled onto default pool machines.
  • introduced CPU cluster env to inject JAX variables into GKE

Testing / Documentation

  • Tested cluster creation, CPU nodepool creation with args and env properly set.

  • Ran Maxtext workloads on CPUs using these changes.

  • Documented changes in README as well.

  • Verified that TPU training (with Maxtext workloads) run as expected.

  • [ y ] Tests pass

  • [ y ] Appropriate changes to documentation are included in the PR

@RoshaniN RoshaniN marked this pull request as draft January 16, 2024 19:33
xpk.py Show resolved Hide resolved
xpk.py Outdated Show resolved Hide resolved
xpk.py Outdated Show resolved Hide resolved
@RoshaniN RoshaniN force-pushed the xpk-cpu branch 4 times, most recently from 4ed211a to f0426ca Compare January 24, 2024 01:13
@RoshaniN RoshaniN force-pushed the xpk-cpu branch 5 times, most recently from 9c9b6e9 to d5073b0 Compare January 29, 2024 20:08
@RoshaniN RoshaniN marked this pull request as ready for review January 29, 2024 20:08
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we also document how to emulate multiple slices using a single CPU nodepool?

Copy link
Collaborator Author

@RoshaniN RoshaniN Jan 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slices are emulated using Maxtext parallelism types, hence I think it would belong in Maxtext emulator for strorage documentation.

xpk.py Show resolved Hide resolved
@RoshaniN RoshaniN force-pushed the xpk-cpu branch 2 times, most recently from 6751dbc to 759f0a6 Compare February 1, 2024 18:21
Copy link
Collaborator

@Obliviour Obliviour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Very clean set of changes. Excited to see multiple node pool support for CPUs

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
--workload xpk-test-workload \
--command="echo hello world"
```

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Description looks great!

Will you have a maxtext with xpk + cpu guide in the maxtext repo, if so can you link that here when ready?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I plan to have the documentation in Maxtext emulator for storage at the moment, but can definitely build a guide in Maxtext soon.

xpk.py Show resolved Hide resolved
xpk.py Show resolved Hide resolved
xpk.py Outdated Show resolved Hide resolved
xpk.py Outdated
# cluster_cpu_machine_type is being deprecated!
if args.cluster_cpu_machine_type != '':
xpk_print('Note that cluster-cpu-machine-type is deprecated,',
'please use default-pool-cpu-machine-type instead!',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the message you used in the argument description:

Can we rephrase to:

'Note that `--cluster-cpu-machine-type` is deprecated,',
' Please use `--default-pool-cpu-machine-type={args.cluster_cpu_machine_type}` instead,'
' to denote the machine type of the default cpu node pool. Set'
' the machine type of other cpu nodepools using `--device-type`.'

xpk.py Show resolved Hide resolved
xpk.py Show resolved Hide resolved
@RoshaniN RoshaniN merged commit 650e47e into main Feb 5, 2024
3 checks passed
@RoshaniN RoshaniN deleted the xpk-cpu branch February 5, 2024 18:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants