CRAYSAT-1711: Improve procedure to get BOS session templates
As part of the system power off/on procedures, the admin must use BOS to
shut down and boot the nodes. To do so, they must find the right BOS
session templates to use.

Currently this procedure is duplicated in three places in the
documentation. Consolidate and improve the documentation in one place,
the "Prepare the System for Power Off" section, and refer to it from the
other two documents which need to reference the procedure for finding
the appropriate BOS session templates.

Also rename the procedures for booting and shutting down compute nodes
and user access nodes to use the more general term "Managed Nodes" instead, which is consistent
with the IUF's terminology. Update all locations to use the new titles
and markdown file names.

Also improve and streamline the "Power On and Boot Managed Nodes" and
"Shut Down and Power Off Managed Nodes" procedures.
haasken-hpe committed Jul 29, 2024
1 parent 5f8c7af commit a783c3f
Showing 12 changed files with 154 additions and 165 deletions.
2 changes: 1 addition & 1 deletion install/re-installation.md
@@ -21,7 +21,7 @@ the NCNs have been deployed (e.g. there is no more PIT node).
The application and compute nodes must be shut down prior to a reinstallation. If they are left on, then they will
potentially end up in an undesirable state.

See [Shut Down and Power Off Compute and User Access Nodes](../operations/power_management/Shut_Down_and_Power_Off_Compute_and_User_Access_Nodes.md).
See [Shut Down and Power Off Managed Nodes](../operations/power_management/Shut_Down_and_Power_Off_Managed_Nodes.md).

## Disable DHCP service

4 changes: 2 additions & 2 deletions operations/README.md
@@ -149,7 +149,7 @@ Procedures required for a full power off of an HPE Cray EX system.
Additional links to power off sub-procedures provided for reference. Refer to the main procedure linked above before using any of these sub-procedures:

- [Prepare the System for Power Off](power_management/Prepare_the_System_for_Power_Off.md)
- [Shut Down and Power Off Compute and User Access Nodes](power_management/Shut_Down_and_Power_Off_Compute_and_User_Access_Nodes.md)
- [Shut Down and Power Off Managed Nodes](power_management/Shut_Down_and_Power_Off_Managed_Nodes.md)
- [Save Management Network Switch Configuration Settings](power_management/Save_Management_Network_Switch_Configurations.md)
- Power Off Compute Cabinets
- [Power Off Compute Cabinets](power_management/Power_Off_Compute_Cabinets.md) using CAPMC
@@ -170,7 +170,7 @@ Additional links to power on sub-procedures provided for reference. Refer to the
- [Power On Compute Cabinets](power_management/Power_On_Compute_Cabinets.md) using CAPMC
- [Power On Compute Cabinets](power_management/Power_Control_Service/Power_On_Compute_Cabinets.md) using PCS
- [Power On the External Lustre File System](power_management/Power_On_the_External_Lustre_File_System.md)
- [Power On and Boot Compute and User Access Nodes](power_management/Power_On_and_Boot_Compute_Nodes_and_User_Access_Nodes.md)
- [Power On and Boot Managed Nodes](power_management/Power_On_and_Boot_Managed_Nodes.md)
- Recover from a Liquid Cooled Cabinet EPO Event
- [Recover from a Liquid Cooled Cabinet EPO Event](power_management/Recover_from_a_Liquid_Cooled_Cabinet_EPO_Event.md) using CAPMC
- [Recover from a Liquid Cooled Cabinet EPO Event](power_management/Power_Control_Service/Recover_from_a_Liquid_Cooled_Cabinet_EPO_Event.md) using PCS
@@ -24,7 +24,7 @@ HPE Cray standard EIA racks typically include two redundant PDUs. Some PDU model
* An authentication token is required to access the API gateway and to use the `sat` command. See the "SAT Authentication" section of the HPE Cray EX System Admin Toolkit (SAT) product stream
documentation (`S-8031`) for instructions on how to acquire a SAT authentication token.
* This procedure assumes all system software and user jobs were shut down. See
[Shut Down and Power Off Compute and User Access Nodes (UAN)](../Shut_Down_and_Power_Off_Compute_and_User_Access_Nodes.md).
[Shut Down and Power Off Managed Nodes](../Shut_Down_and_Power_Off_Managed_Nodes.md).

## Procedure

@@ -177,4 +177,4 @@ If a Cray EX liquid-cooled cabinet or cooling group experiences an EPO event, th

8. After the components have powered on, boot the nodes using the Boot Orchestration Services \(BOS\).

See [Power On and Boot Compute and User Access Nodes](../Power_On_and_Boot_Compute_Nodes_and_User_Access_Nodes.md).
See [Power On and Boot Managed Nodes](../Power_On_and_Boot_Managed_Nodes.md).
2 changes: 1 addition & 1 deletion operations/power_management/Power_Off_Compute_Cabinets.md
@@ -24,7 +24,7 @@ HPE Cray standard EIA racks typically include two redundant PDUs. Some PDU model
* An authentication token is required to access the API gateway and to use the `sat` command. See the "SAT Authentication" section of the HPE Cray EX System Admin Toolkit (SAT) product stream
documentation (`S-8031`) for instructions on how to acquire a SAT authentication token.
* This procedure assumes all system software and user jobs were shut down. See
[Shut Down and Power Off Compute and User Access Nodes (UAN)](Shut_Down_and_Power_Off_Compute_and_User_Access_Nodes.md).
[Shut Down and Power Off Managed Nodes](Shut_Down_and_Power_Off_Managed_Nodes.md).

## Procedure

@@ -1,8 +1,9 @@
# Power On and Boot Compute and User Access Nodes
# Power On and Boot Managed Nodes

Use Boot Orchestration Service \(BOS\) and choose the appropriate session template to power on and boot compute and UANs.
Use the Boot Orchestration Service (BOS) and choose the appropriate session template to power on and
boot managed nodes, e.g. compute nodes and User Access Nodes (UANs).

This procedure boots all compute nodes and user access nodes \(UANs\) in the context of a full system power-up.
This procedure boots all managed nodes in the context of a full system power-up.

## Prerequisites

@@ -99,31 +100,17 @@ This procedure boots all compute nodes and user access nodes \(UANs\) in the con
Offline Switches:
```

1. (`ncn-m001#`) List detailed information about the available boot orchestration service \(BOS\) session template names.
1. (`ncn-m001#`) Set a variable to contain a comma-separated list of the BOS session templates to
use to boot managed nodes. For example:

Identify the BOS session template names (such as `compute-23.7.0` or `uan-23.7.0`), and choose the appropriate compute and UAN node templates for the power on and boot.
```bash
SESSION_TEMPLATES="compute-23.7.0,uan-23.7.0"
```

```bash
cray bos sessiontemplates list --format json | jq -r '.[].name' | sort
```
See [Identify BOS Session Templates for Managed Nodes](Prepare_the_System_for_Power_Off.md#identify-bos-session-templates-for-managed-nodes)
for instructions on obtaining the appropriate BOS session templates.

Example output excerpts:

```text
compute-23.7.0
[...]
uan-23.7.0
```

1. (`ncn-m001#`) To display more information about a session template, for example `compute-23.7.0`, use the `describe` option.

```bash
cray bos sessiontemplates describe compute-23.7.0
```

1. (`ncn-m001#`) Use `sat bootsys boot` to power on and boot UANs and compute nodes.

**Attention:** Specify the required session template name for `COS_SESSION_TEMPLATE` and `UAN_SESSION_TEMPLATE` in the following command line.
1. (`ncn-m001#`) Use `sat bootsys boot` to power on and boot the managed nodes.

**Important:** The default timeout for the `sat bootsys boot --stage bos-operations` command is 900 seconds.
If it is known that the nodes take longer than this amount of time to boot, then a different value
@@ -138,7 +125,7 @@ This procedure boots all compute nodes and user access nodes \(UANs\) in the con

```bash
sat bootsys boot --stage bos-operations --bos-boot-timeout BOS_BOOT_TIMEOUT \
--bos-templates COS_SESSION_TEMPLATE,UAN_SESSION_TEMPLATE
--bos-templates $SESSION_TEMPLATES
```

Example output:
@@ -178,19 +165,20 @@ This procedure boots all compute nodes and user access nodes \(UANs\) in the con
boot and to verify that the nodes reached the expected state using `sat status` commands. Both of these recommendations are shown
in the remaining steps.
1. Monitor status of the booting process.
1. If desired, monitor status of the booting process for each BOS session.
1. (`ncn-m001#`) Use the BOS session ID to monitor the progress of the compute node boot session.
1. (`ncn-m001#`) Use the BOS session ID to monitor the progress of each boot session.
In the example above the compute node BOS session had the ID `76d4d98e-814d-4235-b756-4bdfaf3a2cb3`.
For example, to monitor the compute node boot session from the previous example, use the
session ID `76d4d98e-814d-4235-b756-4bdfaf3a2cb3`.
```bash
cray bos sessions status list --format json 76d4d98e-814d-4235-b756-4bdfaf3a2cb3
```
Example output:
The following example output shows a session in which all nodes successfully booted:
```text
```json
{
"error_summary": {},
"managed_components_count": 12,
Expand All @@ -212,12 +200,10 @@ This procedure boots all compute nodes and user access nodes \(UANs\) in the con
}
```
1. (`ncn-m001#`) In another shell window, use a similar command to monitor the UAN boot session.
In the example above the UAN BOS session had the ID `dacad888-e077-41f3-9ab0-65a5a45c64e5`.
In the following example, 33% of the 6 nodes had an issue and stayed in the powering_off phase
of the boot. See below for another way to determine which nodes had this issue.
```bash
cray bos sessions status list --format json dacad888-e077-41f3-9ab0-65a5a45c64e5
```

```json
{
"error_summary": {
"The retry limit has been hit for this component, but no services have reported specific errors": {
@@ -244,10 +230,7 @@ This procedure boots all compute nodes and user access nodes \(UANs\) in the con
}
```
In this example, 33% of the 6 nodes had an issue and stayed in the powering_off phase of the boot. See
below for another way to determine which nodes had this issue.
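
When a session reports errors like this, one quick way to pull out just the failure information
is to filter the status output with `jq`. This is a convenience sketch, not part of the documented
procedure; it reuses the session ID and the `error_summary` field shown in the example above:

```bash
# Show only the error summary for the session that reported failures.
cray bos sessions status list --format json dacad888-e077-41f3-9ab0-65a5a45c64e5 | jq '.error_summary'
```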
1. (`ncn-m001#`) Check the HSM state from `sat status` of the compute and application nodes, but not the management nodes.
1. (`ncn-m001#`) Check the HSM state from `sat status` of the non-management nodes.
A node will progress through HSM states in this order: `Off`, `On`, `Ready`. If a node fails to leave `Off` state or
moves from `On` to `Off` state, it needs to be investigated. If nodes are in `Standby`, that means they had been in `Ready`,
@@ -355,7 +338,7 @@ This procedure boots all compute nodes and user access nodes \(UANs\) in the con
In this example, two of the application nodes have an older `Desired Config` version than the other UANs and have a last reported `Configuration Status` of pending, meaning they have not begun their CFS configuration.
1. (`ncn-m001#`) For any compute nodes or UANs which booted but failed the CFS configuration, check the CFS Ansible log for errors.
1. (`ncn-m001#`) For any managed nodes which booted but failed the CFS configuration, check the CFS Ansible log for errors.
```bash
kubectl -n services --sort-by=.metadata.creationTimestamp get pods | grep cfs
```
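
As a hypothetical follow-up, assuming one of the listed pods is named `cfs-1234abcd-xyz` and that
its Ansible container is named `ansible` (both names are illustrative assumptions), the failed
tasks could be located with something like:

```bash
# Search the CFS session pod's Ansible output for failed tasks.
kubectl -n services logs cfs-1234abcd-xyz -c ansible | grep -i -A 2 'fatal:'
```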