Merge branch 'main' into v5p
Obliviour authored Dec 6, 2023
2 parents 396aa98 + 7c7c4b6 commit b198a04
Showing 2 changed files with 91 additions and 20 deletions.
65 changes: 54 additions & 11 deletions README.md
@@ -68,6 +68,19 @@ cleanup with a `Cluster Delete`.

## Cluster Create

First set the project and zone through gcloud config or xpk arguments.

```shell
PROJECT_ID=my-project-id
ZONE=us-east5-b
# gcloud config:
gcloud config set project $PROJECT_ID
gcloud config set compute/zone $ZONE
# xpk arguments
xpk .. --zone $ZONE --project $PROJECT_ID
```


The cluster created is a regional cluster to enable the GKE control plane across
all zones.
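
Because the cluster is regional, its region has to be derived from the zone you pass in. The `xpk.py` change in this commit does this through a `zone_to_region` helper whose body is not shown here. A minimal sketch of such a helper, assuming zones follow the usual `<region>-<suffix>` naming (the name `zone_to_region_sketch` is hypothetical and not part of xpk):

```python
def zone_to_region_sketch(zone: str) -> str:
    """Return the region part of a zone name such as 'us-east5-b'.

    Assumption: a zone is its region plus one trailing '-<suffix>'.
    """
    # Split at the last '-' so regions containing '-' stay intact.
    region, sep, suffix = zone.rpartition("-")
    if not sep or not region or not suffix:
        raise ValueError(f"unexpected zone format: {zone!r}")
    return region


print(zone_to_region_sketch("us-east5-b"))  # us-east5
```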

@@ -76,7 +89,7 @@ all zones.
```shell
# Find your reservations
gcloud compute reservations list --project=$PROJECT_ID
- # Run cluster create with reservation
+ # Run cluster create with reservation.
python3 xpk.py cluster create \
--cluster xpk-test --tpu-type=v5litepod-256 \
--num-slices=2 \
@@ -91,6 +104,14 @@ all zones.
--num-slices=4 --on-demand
```

* Cluster Create (provision spot / preemptible capacity):

```shell
python3 xpk.py cluster create \
--cluster xpk-test --tpu-type=v5litepod-16 \
--num-slices=4 --spot
```

* Cluster Create can be called again with the same `--cluster name` to modify
the number of slices or retry failed steps.

@@ -99,7 +120,7 @@ all zones.
```shell
python3 xpk.py cluster create \
--cluster xpk-test --tpu-type=v5litepod-16 \
- --num-slices=4
+ --num-slices=4 --reservation=$RESERVATION_ID
```

and recreates the cluster with 8 slices. The command will rerun to create 4
@@ -108,7 +129,7 @@ all zones.
```shell
python3 xpk.py cluster create \
--cluster xpk-test --tpu-type=v5litepod-16 \
- --num-slices=8
+ --num-slices=8 --reservation=$RESERVATION_ID
```

and recreates the cluster with 6 slices. The command will rerun to delete 2
@@ -118,13 +139,13 @@ all zones.
```shell
python3 xpk.py cluster create \
--cluster xpk-test --tpu-type=v5litepod-16 \
- --num-slices=6
+ --num-slices=6 --reservation=$RESERVATION_ID
# Skip delete prompts using --force.
python3 xpk.py cluster create --force \
--cluster xpk-test --tpu-type=v5litepod-16 \
- --num-slices=6
+ --num-slices=6 --reservation=$RESERVATION_ID
```
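
The Cluster Create examples above select capacity with `--reservation`, `--on-demand`, or `--spot`. The sketch below illustrates one way these options could map to node pool arguments. The reservation string and the empty on-demand string are taken from the `xpk.py` diff in this commit. The rule that exactly one option must be set, the plain `--spot` mapping, and the name `capacity_args_sketch` are assumptions for illustration, not xpk's actual implementation.

```python
from typing import Optional, Tuple


def capacity_args_sketch(
    reservation: Optional[str] = None,
    on_demand: bool = False,
    spot: bool = False,
) -> Tuple[str, int]:
    """Return (capacity_args, return_code); return_code is 0 on success."""
    # Assumption: exactly one capacity type may be requested.
    num_types = sum([bool(reservation), on_demand, spot])
    if num_types != 1:
        return "", 1
    if reservation:
        # Argument string as it appears in the xpk.py diff.
        return (
            f"--reservation-affinity=specific --reservation={reservation}",
            0,
        )
    if spot:
        # Assumption: spot capacity maps to a plain --spot argument.
        return "--spot", 0
    # On-demand needs no extra capacity argument.
    return "", 0


print(capacity_args_sketch(reservation="my-reservation-name"))
print(capacity_args_sketch(on_demand=True, spot=True))
```

Returning a non-zero code instead of raising mirrors the `0 if successful and 1 otherwise` convention used by the functions in the diff.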
## Cluster Delete
@@ -165,6 +186,14 @@ all zones.
xpk-test --tpu-type=v5litepod-16
```

### Set `max-restarts` for production jobs

* `--max-restarts <value>`: By default, this is 0. This will restart the job
`<value>` times when the job terminates. For production jobs, we recommend
increasing this to a large number, such as 50, because real jobs can be
interrupted by hardware failures and software updates. We assume your job
implements checkpointing so that it restarts near where it was interrupted.

### Workload Priority and Preemption
* Set the priority level of your workload with `--priority=LEVEL`

@@ -299,14 +328,16 @@ workload.
# More advanced facts:
* Workload create accepts a --docker-name and --docker-image.
By using custom images you can achieve very fast boots and hence very fast
feedback.
* Workload create accepts a --env-file flag to specify the container's
environment from a file. Usage is the same as Docker's
[--env-file flag](https://docs.docker.com/engine/reference/commandline/run/#env).
Example file:
```shell
LIBTPU_INIT_ARGS=--my-flag=true --performance=high
MY_ENV_VAR=hello
```
* Workload create accepts a --debug-dump-gcs flag, which takes a path to a GCS bucket.
Passing this flag sets XLA_FLAGS='--xla_dump_to=/tmp/xla_dump/' and uploads
HLO dumps to the specified GCS bucket for each worker.
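
To make the env-file format above concrete, here is a minimal sketch of how such a file could be read into key/value pairs. It assumes one `KEY=VALUE` entry per line, split at the first `=` only, with blank lines and `#` comment lines skipped. The name `parse_env_file_sketch` is hypothetical, and this is neither xpk's nor Docker's actual parser.

```python
def parse_env_file_sketch(text: str) -> dict:
    """Parse KEY=VALUE lines into a dict, splitting at the first '='."""
    env = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            # Lines without '=' are ignored in this sketch.
            continue
        env[key] = value
    return env


EXAMPLE = """\
LIBTPU_INIT_ARGS=--my-flag=true --performance=high
MY_ENV_VAR=hello
"""

env = parse_env_file_sketch(EXAMPLE)
print(env["LIBTPU_INIT_ARGS"])  # --my-flag=true --performance=high
print(env["MY_ENV_VAR"])  # hello
```

Splitting at the first `=` only keeps values that themselves contain `=`, such as `--my-flag=true --performance=high`, intact.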
@@ -365,10 +396,22 @@ python3 xpk.py cluster create --cluster-cpu-machine-type=CPU_TYPE ...
gcloud auth login
```
### Roles needed based on permission errors:
* `requires one of ["container.*"] permission(s)`
Add [Kubernetes Engine Admin](https://cloud.google.com/iam/docs/understanding-roles#kubernetes-engine-roles) to your user.
## Reservation Troubleshooting:
### How to determine your reservation and its size / utilization:
```shell
PROJECT_ID=my-project
ZONE=us-east5-b
RESERVATION=my-reservation-name
# Find the reservations in your project
gcloud beta compute reservations list --project=$PROJECT_ID
# Find the tpu machine type and current utilization of a reservation.
gcloud beta compute reservations describe $RESERVATION --project=$PROJECT_ID --zone=$ZONE
```
46 changes: 37 additions & 9 deletions xpk.py
@@ -990,13 +990,15 @@ def run_gke_cluster_create_command(args) -> int:
0 if successful and 1 otherwise.
"""

- # Create the regional cluster with one CPU nodepool in the requested zone.
- # Set the number of cpu nodes to start a 1 and auto-scale to fit the need.
+ # Create the regional cluster with `num-nodes` CPU nodes in the same zone as
+ # TPUs. This has been tested with clusters of 300 VMs. Larger clusters will
+ # benefit from a larger initial `--num-nodes`. After the cluster is created,
+ # the auto-scaler can reduce/increase the nodes based on the load.
command = (
'gcloud beta container clusters create'
f' {args.cluster} --release-channel rapid --enable-autoscaling'
- f' --max-nodes 1000 --min-nodes 1 --node-locations={args.zone}'
- ' --num-nodes=1'
+ ' --total-min-nodes 1 --total-max-nodes 1000 --num-nodes 6'
+ f' --node-locations={args.zone}'
f' --project={args.project} --region={zone_to_region(args.zone)}'
f' --cluster-version={args.gke_version} --location-policy=BALANCED'
f' --machine-type={args.cluster_cpu_machine_type}'
@@ -1081,7 +1083,7 @@ def print_reservations(args) -> int:
0 if successful and 1 otherwise.
"""
command = (
- f'gcloud compute reservations list --project={args.project}'
+ f'gcloud beta compute reservations list --project={args.project}'
)
return_code = (
run_command_with_updates(
@@ -1093,6 +1095,30 @@ def print_reservations(args) -> int:
return 0


def verify_reservation_exists(args) -> int:
"""Verify the reservation exists.
Args:
args: user provided arguments for running the command.
Returns:
0 if successful and 1 otherwise.
"""
command = (
f'gcloud beta compute reservations describe {args.reservation}'
f' --project={args.project} --zone={args.zone}'
)
return_code = (
run_command_with_updates(
command, 'Describe reservation', args)
)
if return_code != 0:
xpk_print(f'Describe reservation returned ERROR {return_code}')
xpk_print('Please confirm that your reservation name is correct.')
return 1
return 0


def get_capacity_arguments(args) -> tuple[str, int]:
"""Determine the TPU Nodepool creation capacity arguments needed.
@@ -1112,6 +1138,9 @@ def get_capacity_arguments(args) -> tuple[str, int]:
capacity_args = ""
num_types+=1
if args.reservation:
return_code = verify_reservation_exists(args)
if return_code > 0:
return capacity_args, return_code
capacity_args = (
f'--reservation-affinity=specific --reservation={args.reservation}'
)
@@ -2175,11 +2204,10 @@ def directory_path_type(value):
cluster_create_optional_arguments.add_argument(
'--cluster-cpu-machine-type',
type=str,
- default='e2-standard-4',
+ default='e2-standard-16',
help=(
- 'Set the machine tpu within the default cpu node pool. For zonal '
- 'clusters, make sure that the zone supports the machine type, and for '
- 'regional clusters, all zones in the region supports the machine type.'
+ 'Set the machine tpu within the default cpu node pool. For'
+ ' regional clusters, all zones must support the machine type.'
)
)
cluster_create_optional_arguments.add_argument(