
Prepare the System for Power Off

This procedure prepares the system to remove power from all system cabinets. Be sure the system is healthy and ready to be shut down and powered off.

The sat bootsys shutdown command is used to shut down the system; the sat bootsys boot command is used later to boot it back up.

Prerequisites

An authentication token is required to access the API gateway and to use the sat command. See System Security and Authentication and "SAT Authentication" in the Shasta Admin Toolkit (SAT) product documentation.

Procedure

  1. Obtain the user IDs and passwords for system components:

    1. Obtain the user IDs and passwords for all of the system management network switches. For example:

      sw-leaf-001
      sw-leaf-002
      sw-spine-001.nmn
      sw-spine-002.nmn
      sw-cdu-001
      sw-cdu-002

      User ID: admin

      Password: PASSWORD

    2. If necessary, obtain the user ID and password for the ClusterStor primary management node. For example, cls01053n00.

      User ID: admin

      Password: PASSWORD

    3. If the Slingshot network includes edge switches, obtain the user ID and password for these switches.

  2. Determine which Boot Orchestration Service (BOS) session templates are used to shut down compute nodes and UANs (User Access Nodes). List all of the session templates with cray bos v1 sessiontemplate list. If unsure which template is in use for a node, use sat status to find the node's xname, use cray cfs components describe XNAME to find its bos_session, use cray bos v1 session describe BOS_SESSION to find the templateUuid, and finally use cray bos v1 sessiontemplate describe TEMPLATE_UUID to see the list of xnames associated with that template. For example (a scripted version of this lookup chain follows the example output):

    ncn-m001# sat status | grep "Compute\|Application"
    
    | x3000c0s19b1n0 | Node | 1        | On    | OK   | True    | X86  | River | Compute     | Sling    |
    | x3000c0s19b2n0 | Node | 2        | On    | OK   | True    | X86  | River | Compute     | Sling    |
    | x3000c0s19b3n0 | Node | 3        | On    | OK   | True    | X86  | River | Compute     | Sling    |
    | x3000c0s19b4n0 | Node | 4        | On    | OK   | True    | X86  | River | Compute     | Sling    |
    | x3000c0s27b0n0 | Node | 49169248 | On    | OK   | True    | X86  | River | Application | Sling    |
    
    ncn-m001# cray cfs components describe x3000c0s19b1n0 | grep bos_session
    bos_session = "e98cdc5d-3f2d-4fc8-a6e4-1d301d37f52f"
    
    ncn-m001# cray bos v1 session describe e98cdc5d-3f2d-4fc8-a6e4-1d301d37f52f | grep templateUuid
    templateUuid = "compute-nid1-4-sessiontemplate"
    
    ncn-m001# cray bos v1 sessiontemplate describe compute-nid1-4-sessiontemplate | grep node_list
    node_list = [ "x3000c0s19b1n0", "x3000c0s19b2n0", "x3000c0s19b3n0", "x3000c0s19b4n0",]
    
    ncn-m001# cray cfs components describe x3000c0s27b0n0 | grep bos_session
    bos_session = "b969c25a-3811-4a61-91d5-f1c194625748"
    
    ncn-m001# cray bos v1 session describe b969c25a-3811-4a61-91d5-f1c194625748 | grep templateUuid
    templateUuid = "uan-sessiontemplate"

    Compute nodes: compute-nid1-4-sessiontemplate

    UANs: uan-sessiontemplate
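
    When many nodes are involved, the same lookup chain can be scripted. The following is a minimal sketch that mirrors the commands above; it assumes the table layout shown in the sat status example and the quoted-value output of the cray commands, so adjust the parsing if the output differs on the local system.

    ncn-m001# sat status | grep "Compute\|Application" | awk '{print $2}' | while read xname; do \
      bos_session=$(cray cfs components describe "$xname" | grep bos_session | cut -d '"' -f 2); \
      template=$(cray bos v1 session describe "$bos_session" | grep templateUuid | cut -d '"' -f 2); \
      echo "$xname $template"; \
    done | tee xname.templates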

  3. Use sat auth to authenticate to the API gateway within SAT.

    See System Security and Authentication, Authenticate an Account with the Command Line, and "SAT Authentication" in the Shasta Admin Toolkit (SAT) product documentation.
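
    For example, run sat auth and enter the credentials when prompted (the exact prompts depend on the SAT release, and the username shown is a placeholder):

    ncn-m001# sat auth
    Username: admin
    Password: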

  4. Use sat to capture the state of the system before the shutdown.

    ncn-m001# sat bootsys shutdown --stage capture-state | tee sat.capture-state
  5. Perform optional system health checks:

    1. Use the System Dump Utility (SDU) to capture the current state of the system before the shutdown.

      Important: SDU takes about 15 minutes to run on a small system (longer for large systems).

      ncn-m001# sdu --scenario triage --start_time '-4 hours' \
      --reason "saving state before powerdown/up"
    2. Capture the state of all nodes.

      ncn-m001# sat status | tee sat.status.all
    3. Capture the list of disabled nodes.

      ncn-m001# sat status --filter Enabled=false | tee sat.status.disabled
    4. Capture the list of nodes that are off.

      ncn-m001# sat status --filter State=Off | tee sat.status.off
    5. Capture the state of nodes in the workload manager. For example, if the system uses Slurm:

      ncn-m001# ssh uan01 sinfo | tee uan01.sinfo
    6. Capture the list of down nodes in the workload manager and the reason each node is down.

      ncn-m001# ssh nid000001-nmn sinfo --list-reasons | tee sinfo.reasons
    7. Check Ceph status.

      ncn-m001# ceph -s | tee ceph.status
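
      Optionally, capture additional detail on any Ceph health warnings (ceph health detail is part of the standard Ceph CLI):

      ncn-m001# ceph health detail | tee ceph.health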
    8. Check k8s pod status for all pods.

      ncn-m001# kubectl get pods -o wide -A | tee k8s.pods

      Additional k8s status check examples:

      ncn-m001# kubectl get pods -o wide -A | egrep  "CrashLoopBackOff" > k8s.pods.CLBO
      ncn-m001# kubectl get pods -o wide -A | egrep  "ContainerCreating" > k8s.pods.CC
      ncn-m001# kubectl get pods -o wide -A | egrep -v "Run|Completed" > k8s.pods.errors
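
      Optionally, summarize pod counts by status (this assumes the default kubectl column layout, in which STATUS is the fourth column):

      ncn-m001# kubectl get pods -A --no-headers | awk '{print $4}' | sort | uniq -c | tee k8s.pods.summary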
    9. Check HSN status.

      Determine the name of the slingshot-fabric-manager pod:

      ncn-m001# kubectl get pods -l app.kubernetes.io/name=slingshot-fabric-manager -n services
      NAME                                        READY   STATUS    RESTARTS   AGE
      slingshot-fabric-manager-5dc448779c-d8n6q   2/2     Running   0          4d21h

      Run fmn_status in the slingshot-fabric-manager pod and save the output to a file:

      ncn-m001# kubectl exec -it -n services slingshot-fabric-manager-5dc448779c-d8n6q \
      -c slingshot-fabric-manager -- fmn_status --details | tee fabric.status
    10. Check management switches to verify they are reachable (switch host names depend on system configuration).

      ncn-m001# for switch in sw-leaf-00{1,2}.mtl sw-spine-00{1,2}.mtl sw-cdu-00{1,2}.mtl; \
      do while true; do ping -c 1 $switch > /dev/null; if [[ $? == 0 ]]; then echo \
      "switch $switch is up"; break; else echo "switch $switch is not yet up"; fi; sleep 5; done; done | tee switches
    11. Check Lustre server health.

      ncn-m001# ssh admin@cls01234n00.us.cray.com
      [admin@cls01234n00 ~]$ cscli show_nodes
    12. Check Lustre file system health from a node which has the Lustre file system mounted.

      uan01:~ # lfs check servers
      uan01:~ # lfs df
  6. Check for running sessions.

    ncn-m001# sat bootsys shutdown --stage session-checks | tee sat.session-checks
    Checking for active BOS sessions.
    Found no active BOS sessions.
    Checking for active CFS sessions.
    Found no active CFS sessions.
    Checking for active CRUS upgrades.
    Found no active CRUS upgrades.
    Checking for active FAS actions.
    Found no active FAS actions.
    Checking for active NMD dumps.
    Found no active NMD dumps.
    Checking for active SDU sessions.
    Found no active SDU sessions.
    No active sessions exist. It is safe to proceed with the shutdown procedure.

    If any active sessions are running, either wait for them to complete or shut down, cancel, or delete them before proceeding (see the examples below).
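
    For example, active BOS and CFS sessions can be listed and, if appropriate, deleted with the cray CLI. This is a sketch only; SESSION_ID and SESSION_NAME are placeholders for values returned by the list commands.

    ncn-m001# cray bos v1 session list
    ncn-m001# cray bos v1 session delete SESSION_ID
    ncn-m001# cray cfs sessions list
    ncn-m001# cray cfs sessions delete SESSION_NAME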

  7. Coordinate with the site to prevent new sessions from starting in the services listed above (BOS, CFS, CRUS, FAS, NMD, and SDU).

    In version 1.4.x, there is no method to prevent new sessions from being created as long as the service APIs are accessible on the API gateway.

  8. Follow the vendor workload manager documentation to drain processes running on compute nodes. For Slurm, see the scontrol man page; for PBS Professional, see the pbsnodes man page.
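
    For example, with Slurm the compute nodes can be drained using scontrol, and with PBS Professional they can be marked offline using pbsnodes. This is a sketch only; the node names are placeholders and the drain reason is arbitrary.

    uan01:~ # scontrol update NodeName=nid[000001-000004] State=DRAIN Reason="system power off"
    uan01:~ # pbsnodes -o PBS_NODE_NAME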