Commit
CRAYSAT-1882: Improve "Power On Compute Cabinets"
Improve the procedure to power on compute cabinets as follows:

* Minor wording adjustments
* Note that it is safe to re-run `sat bootsys boot --stage
  cabinet-power` as needed until it's successful.
* Change CAPMC references and commands to PCS. Note that these
  instructions were taken from the document of the same name in the
  `Power_Control_Service` directory, and that copy of the procedure will
  be removed in CRAYSAT-1891.
haasken-hpe authored and rustydb committed Aug 6, 2024
1 parent 7498061 commit d9624a3
Showing 1 changed file with 19 additions and 13 deletions.
32 changes: 19 additions & 13 deletions operations/power_management/Power_On_Compute_Cabinets.md
@@ -47,23 +47,29 @@ power-on command from Cray System Management \(CSM\) software.
 sat bootsys boot --stage cabinet-power
 ```

-This command first resumes the `hms-discovery` Kubernetes cronjob and waits for it to be
-scheduled. Then, the `hms-discovery` job initiates power-on of the liquid-cooled cabinets.
-Finally, the `sat bootsys` command waits for the components in the liquid-cooled cabinets to be
-powered on. The `sat bootsys` command controls power only to liquid-cooled cabinets.
+This command resumes the `hms-discovery` Kubernetes cronjob and waits for it to be scheduled.
+Once scheduled, the `hms-discovery` job initiates power-on of the liquid-cooled cabinets, and the
+`sat bootsys` command waits for the components in the liquid-cooled cabinets to be powered on.
+The `sat bootsys` command only powers on liquid-cooled cabinets.

-If the `hms-discovery` cronjob fails to be scheduled after it is resumed, then SAT will delete
-and re-create the cronjob, and will wait for it to run. After the cronjob has been scheduled
-within the time expected based on its cron schedule, execute the `sat bootsys boot --stage
-cabinet-power` command again.
+If the `hms-discovery` cronjob fails to be scheduled after it is resumed, then `sat bootsys` will
+delete and re-create the cronjob and wait again for it to be scheduled. If this command fails, it is safe to run it again until it succeeds.

-If `sat bootsys` fails to power on the cabinets through `hms-discovery`, then use CAPMC to manually power on the cabinet chassis,
-compute blade slots, and all populated switch blade slots \(1, 3, 5, and 7\). This example shows cabinets 1000-1003.
+If `sat bootsys` fails to power on the cabinets through `hms-discovery`, then components can be
+manually powered on directly with PCS. The example below will power on the cabinet chassis,
+compute blade slots, and all populated switch blade slots (1, 3, 5, and 7) in cabinets 1000-1003.
+Adjust the example as needed for the system.

 ```bash
-cray capmc xname_on create --xnames x[1000-1003]c[0-7] --format json
-cray capmc xname_on create --xnames x[1000-1003]c[0-7]s[0-7] --format json
-cray capmc xname_on create --xnames x[1000-1003]c[0-7]r[1,3,5,7] --format json
+cray power transition on --xnames "x[1000-1003]c[0-7]" --format json
+cray power transition on --xnames "x[1000-1003]c[0-7]s[0-7]" --format json
+cray power transition on --xnames "x[1000-1003]c[0-7]r[1,3,5,7]" --format json
 ```

+Verify the status of each of the power operations.
+
+```bash
+cray power transition describe TRANSITION_ID --format json
+```
+
 ### Power On Standard Rack PDU Circuit Breakers
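
The updated procedure states that it is safe to re-run `sat bootsys boot --stage cabinet-power` until it succeeds. The following is a minimal sketch of how an operator might automate that re-run. The `retry` helper is hypothetical and is not part of SAT or CSM, and the attempt count and delay are illustrative assumptions. It also assumes the command exits non-zero on failure.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of SAT): run a command until it succeeds
# or the attempt limit is reached.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
    local max_attempts=$1
    local delay=$2
    local attempt=1
    shift 2

    # "$@" is the command to run; loop until it exits with status 0.
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "Command failed after ${attempt} attempts: $*" >&2
            return 1
        fi
        echo "Attempt ${attempt} failed; retrying in ${delay}s..." >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example (assumed values: up to 3 attempts, 60 seconds apart):
# retry 3 60 sat bootsys boot --stage cabinet-power
```

Bounding the number of attempts keeps a persistent failure visible, so that the operator falls back to the manual PCS commands shown in the diff rather than looping indefinitely.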
