CRAYSAT-1711: Improve procedure to get BOS session templates
As part of the system power off/on procedures, the admin must use BOS to
shut down and boot the nodes. To do so, they must find the right BOS
session templates to use.

Currently this procedure is duplicated in three places in the
documentation. Consolidate and improve the documentation in one place,
the "Prepare the System for Power Off" section, and refer to it from the
other two documents which need to reference the procedure for finding
the appropriate BOS session templates.

Also rename the procedures for booting and shutting down compute nodes
and user access nodes to use the more general term "Managed Nodes" instead, which is consistent
with the IUF's terminology. Update all locations to use the new titles
and markdown file names.

Also improve and streamline the "Power On and Boot Managed Nodes" and
"Shut Down and Power Off Managed Nodes" procedures.
haasken-hpe committed Jul 29, 2024
1 parent 5f8c7af commit a783c3f
Showing 12 changed files with 154 additions and 165 deletions.
2 changes: 1 addition & 1 deletion install/re-installation.md
@@ -21,7 +21,7 @@ the NCNs have been deployed (e.g. there is no more PIT node).
The application and compute nodes must be shut down prior to a reinstallation. If they are left on, then they will
potentially end up in an undesirable state.

See [Shut Down and Power Off Compute and User Access Nodes](../operations/power_management/Shut_Down_and_Power_Off_Compute_and_User_Access_Nodes.md).
See [Shut Down and Power Off Managed Nodes](../operations/power_management/Shut_Down_and_Power_Off_Managed_Nodes.md).

## Disable DHCP service

4 changes: 2 additions & 2 deletions operations/README.md
@@ -149,7 +149,7 @@ Procedures required for a full power off of an HPE Cray EX system.
Additional links to power off sub-procedures provided for reference. Refer to the main procedure linked above before using any of these sub-procedures:

- [Prepare the System for Power Off](power_management/Prepare_the_System_for_Power_Off.md)
- [Shut Down and Power Off Compute and User Access Nodes](power_management/Shut_Down_and_Power_Off_Compute_and_User_Access_Nodes.md)
- [Shut Down and Power Off Managed Nodes](power_management/Shut_Down_and_Power_Off_Managed_Nodes.md)
- [Save Management Network Switch Configuration Settings](power_management/Save_Management_Network_Switch_Configurations.md)
- Power Off Compute Cabinets
- [Power Off Compute Cabinets](power_management/Power_Off_Compute_Cabinets.md) using CAPMC
@@ -170,7 +170,7 @@ Additional links to power on sub-procedures provided for reference. Refer to the
- [Power On Compute Cabinets](power_management/Power_On_Compute_Cabinets.md) using CAPMC
- [Power On Compute Cabinets](power_management/Power_Control_Service/Power_On_Compute_Cabinets.md) using PCS
- [Power On the External Lustre File System](power_management/Power_On_the_External_Lustre_File_System.md)
- [Power On and Boot Compute and User Access Nodes](power_management/Power_On_and_Boot_Compute_Nodes_and_User_Access_Nodes.md)
- [Power On and Boot Managed Nodes](power_management/Power_On_and_Boot_Managed_Nodes.md)
- Recover from a Liquid Cooled Cabinet EPO Event
- [Recover from a Liquid Cooled Cabinet EPO Event](power_management/Recover_from_a_Liquid_Cooled_Cabinet_EPO_Event.md) using CAPMC
- [Recover from a Liquid Cooled Cabinet EPO Event](power_management/Power_Control_Service/Recover_from_a_Liquid_Cooled_Cabinet_EPO_Event.md) using PCS
@@ -24,7 +24,7 @@ HPE Cray standard EIA racks typically include two redundant PDUs. Some PDU model
* An authentication token is required to access the API gateway and to use the `sat` command. See the "SAT Authentication" section of the HPE Cray EX System Admin Toolkit (SAT) product stream
documentation (`S-8031`) for instructions on how to acquire a SAT authentication token.
* This procedure assumes all system software and user jobs were shut down. See
[Shut Down and Power Off Compute and User Access Nodes (UAN)](../Shut_Down_and_Power_Off_Compute_and_User_Access_Nodes.md).
[Shut Down and Power Off Managed Nodes](../Shut_Down_and_Power_Off_Managed_Nodes.md).

## Procedure

@@ -177,4 +177,4 @@ If a Cray EX liquid-cooled cabinet or cooling group experiences an EPO event, th

8. After the components have powered on, boot the nodes using the Boot Orchestration Services \(BOS\).

See [Power On and Boot Compute and User Access Nodes](../Power_On_and_Boot_Compute_Nodes_and_User_Access_Nodes.md).
See [Power On and Boot Managed Nodes](../Power_On_and_Boot_Managed_Nodes.md).
2 changes: 1 addition & 1 deletion operations/power_management/Power_Off_Compute_Cabinets.md
@@ -24,7 +24,7 @@ HPE Cray standard EIA racks typically include two redundant PDUs. Some PDU model
* An authentication token is required to access the API gateway and to use the `sat` command. See the "SAT Authentication" section of the HPE Cray EX System Admin Toolkit (SAT) product stream
documentation (`S-8031`) for instructions on how to acquire a SAT authentication token.
* This procedure assumes all system software and user jobs were shut down. See
[Shut Down and Power Off Compute and User Access Nodes (UAN)](Shut_Down_and_Power_Off_Compute_and_User_Access_Nodes.md).
[Shut Down and Power Off Managed Nodes](Shut_Down_and_Power_Off_Managed_Nodes.md).

## Procedure

@@ -1,8 +1,9 @@
# Power On and Boot Compute and User Access Nodes
# Power On and Boot Managed Nodes

Use Boot Orchestration Service \(BOS\) and choose the appropriate session template to power on and boot compute and UANs.
Use the Boot Orchestration Service (BOS) and choose the appropriate session template to power on and
boot managed nodes, e.g. compute nodes and User Access Nodes (UANs).

This procedure boots all compute nodes and user access nodes \(UANs\) in the context of a full system power-up.
This procedure boots all managed nodes in the context of a full system power-up.

## Prerequisites

@@ -99,31 +100,17 @@ This procedure boots all compute nodes and user access nodes \(UANs\) in the con
Offline Switches:
```

1. (`ncn-m001#`) List detailed information about the available boot orchestration service \(BOS\) session template names.
1. (`ncn-m001#`) Set a variable to contain a comma-separated list of the BOS session templates to
use to boot managed nodes. For example:

Identify the BOS session template names (such as `compute-23.7.0` or `uan-23.7.0`), and choose the appropriate compute and UAN node templates for the power on and boot.
```bash
SESSION_TEMPLATES="compute-23.7.0,uan-23.7.0"
```

```bash
cray bos sessiontemplates list --format json | jq -r '.[].name' | sort
```
See [Identify BOS Session Templates for Managed Nodes](Prepare_the_System_for_Power_Off.md#identify-bos-session-templates-for-managed-nodes)
for instructions on obtaining the appropriate BOS session templates.

Example output excerpts:

```text
compute-23.7.0
[...]
uan-23.7.0
```

1. (`ncn-m001#`) To display more information about a session template, for example `compute-23.7.0`, use the `describe` option.

```bash
cray bos sessiontemplates describe compute-23.7.0
```

1. (`ncn-m001#`) Use `sat bootsys boot` to power on and boot UANs and compute nodes.

**Attention:** Specify the required session template name for `COS_SESSION_TEMPLATE` and `UAN_SESSION_TEMPLATE` in the following command line.
1. (`ncn-m001#`) Use `sat bootsys boot` to power on and boot the managed nodes.

**Important:** The default timeout for the `sat bootsys boot --stage bos-operations` command is 900 seconds.
If it is known that the nodes take longer than this amount of time to boot, then a different value
@@ -138,7 +125,7 @@ This procedure boots all compute nodes and user access nodes \(UANs\) in the con

```bash
sat bootsys boot --stage bos-operations --bos-boot-timeout BOS_BOOT_TIMEOUT \
--bos-templates COS_SESSION_TEMPLATE,UAN_SESSION_TEMPLATE
--bos-templates $SESSION_TEMPLATES
```

Example output:
@@ -178,19 +165,20 @@ This procedure boots all compute nodes and user access nodes \(UANs\) in the con
boot and to verify that the nodes reached the expected state using `sat status` commands. Both of these recommendations are shown
in the remaining steps.
1. Monitor status of the booting process.
1. If desired, monitor status of the booting process for each BOS session.
1. (`ncn-m001#`) Use the BOS session ID to monitor the progress of the compute node boot session.
1. (`ncn-m001#`) Use the BOS session ID to monitor the progress of each boot session.
In the example above the compute node BOS session had the ID `76d4d98e-814d-4235-b756-4bdfaf3a2cb3`.
For example, to monitor the compute node boot session from the previous example, use the
session ID `76d4d98e-814d-4235-b756-4bdfaf3a2cb3`.
```bash
cray bos sessions status list --format json 76d4d98e-814d-4235-b756-4bdfaf3a2cb3
```
Example output:
The following example output shows a session in which all nodes successfully booted:
```text
```json
{
"error_summary": {},
"managed_components_count": 12,
Expand All @@ -212,12 +200,10 @@ This procedure boots all compute nodes and user access nodes \(UANs\) in the con
}
```
1. (`ncn-m001#`) In another shell window, use a similar command to monitor the UAN boot session.
In the example above the UAN BOS session had the ID `dacad888-e077-41f3-9ab0-65a5a45c64e5`.
In the following example, 33% of the 6 nodes had an issue and stayed in the powering_off phase
of the boot. See below for another way to determine which nodes had this issue.
```bash
cray bos sessions status list --format json dacad888-e077-41f3-9ab0-65a5a45c64e5
```

```json
{
"error_summary": {
"The retry limit has been hit for this component, but no services have reported specific errors": {
@@ -244,10 +230,7 @@ This procedure boots all compute nodes and user access nodes \(UANs\) in the con
}
```
In this example, 33% of the 6 nodes had an issue and stayed in the powering_off phase of the boot. See
below for another way to determine which nodes had this issue.
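
When a session reports errors like this, one quick way to pull out just the failure information
is to filter the status output with `jq`. This is a convenience sketch, not part of the documented
procedure; it reuses the session ID and the `error_summary` field shown in the example above:

```bash
# Show only the error summary for the session that reported failures.
cray bos sessions status list --format json dacad888-e077-41f3-9ab0-65a5a45c64e5 | jq '.error_summary'
```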
1. (`ncn-m001#`) Check the HSM state from `sat status` of the compute and application nodes, but not the management nodes.
1. (`ncn-m001#`) Check the HSM state from `sat status` of the non-management nodes.
A node will progress through HSM states in this order: `Off`, `On`, `Ready`. If a node fails to leave `Off` state or
moves from `On` to `Off` state, it needs to be investigated. If nodes are in `Standby`, that means they had been in `Ready`,
@@ -355,7 +338,7 @@ This procedure boots all compute nodes and user access nodes \(UANs\) in the con
In this example, two of the application nodes have an older `Desired Config` version than the other UANs and have a last reported `Configuration Status` of pending, meaning they have not begun their CFS configuration.
1. (`ncn-m001#`) For any compute nodes or UANs which booted but failed the CFS configuration, check the CFS Ansible log for errors.
1. (`ncn-m001#`) For any managed nodes which booted but failed the CFS configuration, check the CFS Ansible log for errors.
```bash
kubectl -n services --sort-by=.metadata.creationTimestamp get pods | grep cfs
```
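
As a hypothetical follow-up, assuming one of the listed pods is named `cfs-1234abcd-xyz` and that
its Ansible container is named `ansible` (both names are illustrative assumptions), the failed
tasks could be located with something like:

```bash
# Search the CFS session pod's Ansible output for failed tasks.
kubectl -n services logs cfs-1234abcd-xyz -c ansible | grep -i -A 2 'fatal:'
```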