Commit

Merge branch 'main' into stack-trace
SurbhiJainUSC committed Dec 8, 2023
2 parents 2e74a61 + 9d74584 commit 88500f9
Showing 4 changed files with 468 additions and 26 deletions.
14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -35,11 +35,25 @@ To release a new version (e.g. from `1.0.0` -> `2.0.0`):
-->

## [Unreleased]
- Move away from static GKE version and use RAPID release default.

## [0.2.0] - 2023-12-07

### Added
- Add a reservation exists check and provide help if this errors
- Add error message and self-help instructions to readme for troubleshooting problems
- Add v5p support
- Add xpk cluster create flags for reservation/on-demand/spot
- Change GKE version to 1.28.3-gke.1286000
- Change cpu node pool defaults to be better adapted to demand
- Fix empty results from filter-by-status=QUEUED / FAILED / RUNNING
- Fix parallel execution of node pool commands (concurrent ops)
- Fix pip-changelog to the wrong package

## [0.1.0] - 2023-11-17

### Added
- Initial release of xpk PyPI package

[0.1.0]: https://github.com/google/xpk/releases/tag/v0.1.0
[0.2.0]: https://github.com/google/xpk/compare/v0.1.0...v0.2.0
70 changes: 59 additions & 11 deletions README.md
@@ -36,6 +36,11 @@ return the hardware back to the shared pool when they complete, developers can
achieve better use of finite hardware resources. And automated tests can run
overnight while resources tend to be underutilized.

xpk supports the following TPU types:
* v4
* v5e
* v5p

# Installation
To install xpk, run the following command:

@@ -63,6 +68,19 @@ cleanup with a `Cluster Delete`.

## Cluster Create

First set the project and zone through gcloud config or xpk arguments.

```shell
PROJECT_ID=my-project-id
ZONE=us-east5-b
# gcloud config:
gcloud config set project $PROJECT_ID
gcloud config set compute/zone $ZONE
# xpk arguments
xpk .. --zone $ZONE --project $PROJECT_ID
```


The cluster created is a regional cluster to enable the GKE control plane across
all zones.

@@ -71,7 +89,7 @@ all zones.
```shell
# Find your reservations
gcloud compute reservations list --project=$PROJECT_ID
- # Run cluster create with reservation
+ # Run cluster create with reservation.
python3 xpk.py cluster create \
--cluster xpk-test --tpu-type=v5litepod-256 \
--num-slices=2 \
@@ -86,6 +104,14 @@ all zones.
--num-slices=4 --on-demand
```

* Cluster Create (provision spot / preemptible capacity):

```shell
python3 xpk.py cluster create \
--cluster xpk-test --tpu-type=v5litepod-16 \
--num-slices=4 --spot
```

* Cluster Create can be called again with the same `--cluster name` to modify
the number of slices or retry failed steps.

@@ -94,7 +120,7 @@ all zones.
```shell
python3 xpk.py cluster create \
--cluster xpk-test --tpu-type=v5litepod-16 \
- --num-slices=4
+ --num-slices=4 --reservation=$RESERVATION_ID
```

and recreates the cluster with 8 slices. The command will rerun to create 4
@@ -103,7 +129,7 @@ all zones.
```shell
python3 xpk.py cluster create \
--cluster xpk-test --tpu-type=v5litepod-16 \
- --num-slices=8
+ --num-slices=8 --reservation=$RESERVATION_ID
```

and recreates the cluster with 6 slices. The command will rerun to delete 2
@@ -113,13 +139,13 @@ all zones.
```shell
python3 xpk.py cluster create \
--cluster xpk-test --tpu-type=v5litepod-16 \
- --num-slices=6
+ --num-slices=6 --reservation=$RESERVATION_ID
# Skip delete prompts using --force.
python3 xpk.py cluster create --force \
--cluster xpk-test --tpu-type=v5litepod-16 \
- --num-slices=6
+ --num-slices=6 --reservation=$RESERVATION_ID
```
## Cluster Delete
@@ -160,6 +186,14 @@ all zones.
xpk-test --tpu-type=v5litepod-16
```

### Set `max-restarts` for production jobs

* `--max-restarts <value>`: By default, this is 0. This will restart the job "<value>"
times when the job terminates. For production jobs, it is recommended to
increase this to a large number, say 50. Real jobs can be interrupted due to
hardware failures and software updates. We assume your job has implemented
checkpointing so the job restarts near where it was interrupted.
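
As a rough sketch of how the flag might be passed: the snippet below assembles a `workload create` command with `--max-restarts 50` and prints it for review before it is run. The workload name, the `--workload` and `--command` flags, and the `echo goodbye` command are assumptions made for illustration and are not shown in this hunk; adjust them to match your own invocation.

```shell
# Placeholder value suggested by the text above for production jobs.
MAX_RESTARTS=50

# Build the command as a string so it can be inspected before submission.
# The workload name and command below are illustrative placeholders.
XPK_CMD="python3 xpk.py workload create --cluster xpk-test --workload xpk-test-workload --command 'echo goodbye' --tpu-type=v5litepod-16 --max-restarts $MAX_RESTARTS"

# Dry run: print the command instead of executing it.
echo "$XPK_CMD"

# To actually submit the workload, uncomment the next line.
# eval "$XPK_CMD"
```

Printing the command first makes it easy to confirm the restart count before a long-running job is submitted.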

### Workload Priority and Preemption
* Set the priority level of your workload with `--priority=LEVEL`

@@ -326,14 +360,16 @@ workload.
# More advanced facts:
* Workload create accepts a --docker-name and --docker-image.
By using custom images you can achieve very fast boots and hence very fast
feedback.
* Workload create accepts a --env-file flag to allow specifying the container's
environment from a file. Usage is the same as Docker's
[--env-file flag](https://docs.docker.com/engine/reference/commandline/run/#env).
Example file:
```shell
LIBTPU_INIT_ARGS=--my-flag=true --performance=high
MY_ENV_VAR=hello
```
* Workload create accepts a --debug-dump-gcs flag, which is a path to a GCS bucket.
Passing this flag sets XLA_FLAGS='--xla_dump_to=/tmp/xla_dump/' and uploads
HLO dumps to the specified GCS bucket for each worker.
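
To illustrate the `--env-file` format described above, here is a minimal sketch that writes the example file and splits each line into a name and a value. It assumes Docker-style parsing, where a line is split at the first `=` only, so a value such as `--my-flag=true --performance=high` keeps its own `=` signs and spaces. The temporary file path and the loop are illustrative and not part of xpk.

```shell
# Write the example env file to a temporary location (illustrative path).
ENV_FILE=$(mktemp)
cat > "$ENV_FILE" <<'EOF'
LIBTPU_INIT_ARGS=--my-flag=true --performance=high
MY_ENV_VAR=hello
EOF

# Split each line at the first '=' only: everything before it is the
# variable name, everything after it (including further '=') is the value.
while IFS= read -r line; do
  key=${line%%=*}
  value=${line#*=}
  printf '%s => %s\n' "$key" "$value"
done < "$ENV_FILE"

# Clean up the temporary file.
rm -f "$ENV_FILE"
```

Note that the values are not quoted in the file; under this assumption, quotes would become part of the value.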
@@ -392,10 +428,22 @@ python3 xpk.py cluster create --cluster-cpu-machine-type=CPU_TYPE ...
gcloud auth login
```
### Roles needed based on permission errors:
* `requires one of ["container.*"] permission(s)`
Add [Kubernetes Engine Admin](https://cloud.google.com/iam/docs/understanding-roles#kubernetes-engine-roles) to your user.
## Reservation Troubleshooting:
### How to determine your reservation and its size / utilization:
```shell
PROJECT_ID=my-project
ZONE=us-east5-b
RESERVATION=my-reservation-name
# Find the reservations in your project
gcloud beta compute reservations list --project=$PROJECT_ID
# Find the tpu machine type and current utilization of a reservation.
gcloud beta compute reservations describe $RESERVATION --project=$PROJECT_ID --zone=$ZONE
```
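
The `describe` command above can also serve as a quick existence check before running `cluster create`. The helper below is a hypothetical sketch, not part of xpk: `reservation_exists` is an invented name, and it relies only on the exit status of the `gcloud beta compute reservations describe` command shown above.

```shell
# Hypothetical helper (not part of xpk): succeeds if the reservation can be
# described in the given project and zone, fails otherwise.
# Arguments: $1 = reservation name, $2 = project ID, $3 = zone
reservation_exists() {
  gcloud beta compute reservations describe "$1" \
    --project="$2" --zone="$3" >/dev/null 2>&1
}

# Usage sketch:
# if reservation_exists "$RESERVATION" "$PROJECT_ID" "$ZONE"; then
#   echo "Reservation $RESERVATION found in $ZONE"
# else
#   echo "Reservation $RESERVATION not found; check the name, project, and zone" >&2
# fi
```

A failure here can also mean missing credentials or permissions, so the error output of the plain `describe` command is worth checking when the helper reports a miss.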
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -14,7 +14,7 @@

[project]
name = "xpk"
- version = "0.1.0"
+ version = "0.2.0"
authors = [
{ name="Cloud TPU Team", email="cloud-tpu-eng@google.com" },
]