From df463872600b859a51eeb3d8acb19160a00152eb Mon Sep 17 00:00:00 2001
From: Harold Longley
Date: Thu, 25 Jul 2024 14:16:30 -0500
Subject: [PATCH 1/6] CASMTRIAGE-7185: Add Power Off Management Cabinets procedure

---
 .../Power_Off_Management_Cabinets.md          | 27 +++++++++++++++++++
 ...r_Off_the_Management_Kubernetes_Cluster.md |  9 -------
 .../System_Power_Off_Procedures.md            |  4 +++
 3 files changed, 31 insertions(+), 9 deletions(-)
 create mode 100644 operations/power_management/Power_Off_Management_Cabinets.md

diff --git a/operations/power_management/Power_Off_Management_Cabinets.md b/operations/power_management/Power_Off_Management_Cabinets.md
new file mode 100644
index 000000000000..14c03a02462c
--- /dev/null
+++ b/operations/power_management/Power_Off_Management_Cabinets.md
@@ -0,0 +1,27 @@
+# Power Off Management Cabinets
+
+Power off PDUs and any remaining components in management cabinets which are powered on, such as Slingshot switches, management switches, and a KVM device.
+
+## Power Off Management Cabinet PDU circuit breakers
+
+**CAUTION:** The nodes and switches in management cabinets should only
+be powered off once it has been confirmed that the management Kuberenets cluster and any Lustre or Spectrum Scale filesystems in the cabinets have been cleanly shut down. See the procedures in
+[Power Off the External File Systems](System_Power_Off_Procedures.md#Power_off_the_External_File_systems.md)
+and
+[Shut Down and Power Off the Management Kubernetes Cluster](file:///Users/htg/git/shasta/20240717/docs-csm-1.4/operations/power_management/Shut_Down_and_Power_Off_the_Management_Kubernetes_Cluster.md).
+
+1. (Optional) Power down Modular coolant distribution unit (MDCU) in a liquid-cooled HPE Cray EX2000 cabinet.
+
+    CAUTION: The modular coolant distribution unit (MDCU) in a liquid-cooled HPE Cray EX2000 cabinet (also referred to as a Hill or TDS cabinet) typically receives power from its management cabinet PDUs. If the system includes an EX2000 cabinet, then do not power off the management cabinet PDUs until the MDCU has been powered off. Powering off the MDCU will cause an emergency power off (EPO) of the cabinet and may result in data loss or equipment damage.
+
+1. Set each management cabinet PDU circuit breaker to `OFF`.
+
+    A slotted screwdriver may be required to open PDU circuit breakers.
+
+1. To power off Motivair liquid-cooled chilled doors and CDUs, locate the power off switch on the CDU control panel and set it to `OFF`.
+
+    Refer to vendor documentation for the chilled-door cooling system for power control procedures when chilled doors are installed on standard racks.
+
+## Next step
+
+Return to [System Power Off Procedures](System_Power_Off_Procedures.md) and continue with next step.
diff --git a/operations/power_management/Shut_Down_and_Power_Off_the_Management_Kubernetes_Cluster.md b/operations/power_management/Shut_Down_and_Power_Off_the_Management_Kubernetes_Cluster.md
index 3f41bb0eb876..711dd81a6b58 100644
--- a/operations/power_management/Shut_Down_and_Power_Off_the_Management_Kubernetes_Cluster.md
+++ b/operations/power_management/Shut_Down_and_Power_Off_the_Management_Kubernetes_Cluster.md
@@ -421,15 +421,6 @@ documentation (`S-8031`) for instructions on how to acquire a SAT authentication
         ipmitool -I lanplus -U "${USERNAME}" -E -H NCN-M001_BMC_HOSTNAME chassis power status
         ```
 
-1. (Optional) Power down Modular coolant distribution unit (MDCU) in a liquid-cooled HPE Cray EX20000 cabinet.
-
-    **CAUTION:** The modular coolant distribution unit \(MDCU\) in a liquid-cooled HPE Cray EX2000 cabinet (also referred to as a Hill or TDS cabinet) typically receives power from its management
-    cabinet PDUs. If the system includes an EX2000 cabinet, then **do not power off** the management cabinet PDUs. Powering off the MDCU will cause an emergency power off \(EPO\) of the cabinet and
-    may result in data loss or equipment damage.
-
-    1. (Optional) If a liquid-cooled EX2000 cabinet is not receiving MCDU power from this management cabinet, then power off the PDU circuit breakers or disconnect the PDUs from facility power and
-       follow lock out/tag out procedures for the site.
-
 ## Next step
 
 Return to [System Power Off Procedures](System_Power_Off_Procedures.md) and continue with next step.
diff --git a/operations/power_management/System_Power_Off_Procedures.md b/operations/power_management/System_Power_Off_Procedures.md
index 71df07ff596e..01ca58094fa4 100644
--- a/operations/power_management/System_Power_Off_Procedures.md
+++ b/operations/power_management/System_Power_Off_Procedures.md
@@ -41,6 +41,10 @@ To power off standard racks which have only storage nodes and switches, refer to
 
 To shut down the management Kubernetes cluster, refer to [Shut Down and Power Off the Management Kubernetes Cluster](Shut_Down_and_Power_Off_the_Management_Kubernetes_Cluster.md).
 
+## Power Off Management Cabinets
+
+To power off management cabinets, refer to [Power Off Management Cabinets](Power_Off_Management_Cabinets.md).
+
 ## `Lockout Tagout` Facility Power
 
 If facility power must be removed from a single cabinet or cabinet group for maintenance, follow proper `lockout-tagout` procedures for the site.
From 0fc90aa7b4d6258deb206b2341ed1d1e613e1179 Mon Sep 17 00:00:00 2001
From: Harold Longley
Date: Thu, 25 Jul 2024 15:12:12 -0500
Subject: [PATCH 2/6] CASMTRIAGE-7185 fix link and spelling errors

---
 .../Power_Off_Management_Cabinets.md | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/operations/power_management/Power_Off_Management_Cabinets.md b/operations/power_management/Power_Off_Management_Cabinets.md
index 14c03a02462c..1726a479bb90 100644
--- a/operations/power_management/Power_Off_Management_Cabinets.md
+++ b/operations/power_management/Power_Off_Management_Cabinets.md
@@ -5,14 +5,17 @@ Power off PDUs and any remaining components in management cabinets which are pow
 ## Power Off Management Cabinet PDU circuit breakers
 
 **CAUTION:** The nodes and switches in management cabinets should only
-be powered off once it has been confirmed that the management Kuberenets cluster and any Lustre or Spectrum Scale filesystems in the cabinets have been cleanly shut down. See the procedures in
-[Power Off the External File Systems](System_Power_Off_Procedures.md#Power_off_the_External_File_systems.md)
-and
-[Shut Down and Power Off the Management Kubernetes Cluster](file:///Users/htg/git/shasta/20240717/docs-csm-1.4/operations/power_management/Shut_Down_and_Power_Off_the_Management_Kubernetes_Cluster.md).
+be powered off once it has been confirmed that the management Kubernetes cluster and any Lustre or Spectrum Scale filesystems in the cabinets have been cleanly shut down. See the procedures in
+[Power Off the External File Systems](System_Power_Off_Procedures.md#Power_off_the_External_File_systems)
+and [Shut Down and Power Off the Management Kubernetes Cluster](file:///Users/htg/git/shasta/20240717/docs-csm-1.4/operations/power_management/Shut_Down_and_Power_Off_the_Management_Kubernetes_Cluster.md).
 
 1. (Optional) Power down Modular coolant distribution unit (MDCU) in a liquid-cooled HPE Cray EX2000 cabinet.
 
-    CAUTION: The modular coolant distribution unit (MDCU) in a liquid-cooled HPE Cray EX2000 cabinet (also referred to as a Hill or TDS cabinet) typically receives power from its management cabinet PDUs. If the system includes an EX2000 cabinet, then do not power off the management cabinet PDUs until the MDCU has been powered off. Powering off the MDCU will cause an emergency power off (EPO) of the cabinet and may result in data loss or equipment damage.
+    CAUTION: The modular coolant distribution unit (MDCU) in a liquid-cooled HPE Cray EX2000 cabinet (also
+referred to as a Hill or TDS cabinet) typically receives power from its management cabinet PDUs. If the
+system includes an EX2000 cabinet, then do not power off the management cabinet PDUs until the MDCU has
+been powered off. Powering off the MDCU will cause an emergency power off (EPO) of the cabinet and may
+result in data loss or equipment damage.
 
 1. Set each management cabinet PDU circuit breaker to `OFF`.
 
From eea80d4021795403dc044fdfc0e654c242d07ae5 Mon Sep 17 00:00:00 2001
From: Harold Longley
Date: Thu, 25 Jul 2024 15:35:01 -0500
Subject: [PATCH 3/6] CASMTRIAGE-7185: correct link

---
 operations/power_management/Power_Off_Management_Cabinets.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/operations/power_management/Power_Off_Management_Cabinets.md b/operations/power_management/Power_Off_Management_Cabinets.md
index 1726a479bb90..3e520270e206 100644
--- a/operations/power_management/Power_Off_Management_Cabinets.md
+++ b/operations/power_management/Power_Off_Management_Cabinets.md
@@ -7,7 +7,7 @@ Power off PDUs and any remaining components in management cabinets which are pow
 **CAUTION:** The nodes and switches in management cabinets should only
 be powered off once it has been confirmed that the management Kubernetes cluster and any Lustre or Spectrum Scale filesystems in the cabinets have been cleanly shut down. See the procedures in
 [Power Off the External File Systems](System_Power_Off_Procedures.md#Power_off_the_External_File_systems)
-and [Shut Down and Power Off the Management Kubernetes Cluster](file:///Users/htg/git/shasta/20240717/docs-csm-1.4/operations/power_management/Shut_Down_and_Power_Off_the_Management_Kubernetes_Cluster.md).
+and [Shut Down and Power Off the Management Kubernetes Cluster](Shut_Down_and_Power_Off_the_Management_Kubernetes_Cluster.md).
 
 1. (Optional) Power down Modular coolant distribution unit (MDCU) in a liquid-cooled HPE Cray EX2000 cabinet.
 
From 6165d4ed3f9761f93e64545669f15e4eacaa9634 Mon Sep 17 00:00:00 2001
From: Harold Longley
Date: Tue, 30 Jul 2024 08:10:04 -0500
Subject: [PATCH 4/6] CASMTRIAGE-7185: changed once to when

---
 operations/power_management/Power_Off_Management_Cabinets.md | 2 +-
 operations/power_management/Power_Off_Storage_Cabinets.md    | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/operations/power_management/Power_Off_Management_Cabinets.md b/operations/power_management/Power_Off_Management_Cabinets.md
index 3e520270e206..fd22f96a465b 100644
--- a/operations/power_management/Power_Off_Management_Cabinets.md
+++ b/operations/power_management/Power_Off_Management_Cabinets.md
@@ -5,7 +5,7 @@ Power off PDUs and any remaining components in management cabinets which are pow
 ## Power Off Management Cabinet PDU circuit breakers
 
 **CAUTION:** The nodes and switches in management cabinets should only
-be powered off once it has been confirmed that the management Kubernetes cluster and any Lustre or Spectrum Scale filesystems in the cabinets have been cleanly shut down. See the procedures in
+be powered off when it has been confirmed that the management Kubernetes cluster and any Lustre or Spectrum Scale filesystems in the cabinets have been cleanly shut down. See the procedures in
 [Power Off the External File Systems](System_Power_Off_Procedures.md#Power_off_the_External_File_systems)
 and [Shut Down and Power Off the Management Kubernetes Cluster](Shut_Down_and_Power_Off_the_Management_Kubernetes_Cluster.md).
 
diff --git a/operations/power_management/Power_Off_Storage_Cabinets.md b/operations/power_management/Power_Off_Storage_Cabinets.md
index 43720c6b3b21..23b335e87fe4 100644
--- a/operations/power_management/Power_Off_Storage_Cabinets.md
+++ b/operations/power_management/Power_Off_Storage_Cabinets.md
@@ -5,7 +5,7 @@ Power off storage nodes and management switches in standard racks.
 ## Power off standard rack PDU circuit breakers
 
 **CAUTION:** The Lustre or Spectrum Scale (GPFS) file systems on nodes and switches in storage cabinets should only
-be powered off once it has been confirmed that the filesystems have been cleanly shut down. See the procedures in
+be powered off when it has been confirmed that the filesystems have been cleanly shut down. See the procedures in
 [Power Off the External File Systems](System_Power_Off_Procedures.md#Power_off_the_External_File_systems).
 
 1. Set each cabinet PDU circuit breaker to `OFF`.
From 9324207dfaed7652d40ebd0f883ccd17cef6b562 Mon Sep 17 00:00:00 2001
From: Harold Longley
Date: Thu, 1 Aug 2024 16:06:03 -0500
Subject: [PATCH 5/6] CASMTRIAGE-7185: address review comments for MCDU and Warning

---
 .../Power_Off_Management_Cabinets.md | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/operations/power_management/Power_Off_Management_Cabinets.md b/operations/power_management/Power_Off_Management_Cabinets.md
index fd22f96a465b..097ade3dae03 100644
--- a/operations/power_management/Power_Off_Management_Cabinets.md
+++ b/operations/power_management/Power_Off_Management_Cabinets.md
@@ -1,6 +1,6 @@
 # Power Off Management Cabinets
 
-Power off PDUs and any remaining components in management cabinets which are powered on, such as Slingshot switches, management switches, and a KVM device.
+Power off PDUs and any remaining components in management cabinets which are powered on, such as HPE Slingshot switches, management switches, and a KVM device.
 
 ## Power Off Management Cabinet PDU circuit breakers
 
@@ -9,12 +9,14 @@ be powered off when it has been confirmed that the management Kubernetes cluster
 [Power Off the External File Systems](System_Power_Off_Procedures.md#Power_off_the_External_File_systems)
 and [Shut Down and Power Off the Management Kubernetes Cluster](Shut_Down_and_Power_Off_the_Management_Kubernetes_Cluster.md).
 
-1. (Optional) Power down Modular coolant distribution unit (MDCU) in a liquid-cooled HPE Cray EX2000 cabinet.
+1. (Optional) Power down the modular coolant distribution unit (MCDU) in a liquid-cooled HPE Cray EX2000 cabinet.
 
-    CAUTION: The modular coolant distribution unit (MDCU) in a liquid-cooled HPE Cray EX2000 cabinet (also
+    The MCDU in a liquid-cooled HPE Cray EX2000 cabinet (also
 referred to as a Hill or TDS cabinet) typically receives power from its management cabinet PDUs. If the
-system includes an EX2000 cabinet, then do not power off the management cabinet PDUs until the MDCU has
-been powered off. Powering off the MDCU will cause an emergency power off (EPO) of the cabinet and may
+system includes an EX2000 cabinet, then do not power off the management cabinet PDUs until the MCDU has
+been powered off. 
+
+    **WARNING:** Dropping power to the management cabinet PDUs without powering off the MCDU will cause an emergency power off (EPO) of the cabinet and may
 result in data loss or equipment damage.
 
 1. Set each management cabinet PDU circuit breaker to `OFF`.
From 551a725785edf3a942afcfe6158c26453b271811 Mon Sep 17 00:00:00 2001
From: Harold Longley
Date: Thu, 1 Aug 2024 16:09:36 -0500
Subject: [PATCH 6/6] CASMTRIAGE-7185: fix markdown lint issue

---
 operations/power_management/Power_Off_Management_Cabinets.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/operations/power_management/Power_Off_Management_Cabinets.md b/operations/power_management/Power_Off_Management_Cabinets.md
index 097ade3dae03..1503163884d2 100644
--- a/operations/power_management/Power_Off_Management_Cabinets.md
+++ b/operations/power_management/Power_Off_Management_Cabinets.md
@@ -14,7 +14,7 @@ and [Shut Down and Power Off the Management Kubernetes Cluster](Shut_Down_and_Po
     The MCDU in a liquid-cooled HPE Cray EX2000 cabinet (also
 referred to as a Hill or TDS cabinet) typically receives power from its management cabinet PDUs. If the
 system includes an EX2000 cabinet, then do not power off the management cabinet PDUs until the MCDU has
-been powered off. 
+been powered off.
 
     **WARNING:** Dropping power to the management cabinet PDUs without powering off the MCDU will cause an emergency power off (EPO) of the cabinet and may
 result in data loss or equipment damage.