CRAYSAT-1882: Update power off/on procedures for `sat bootsys` improvements #5282

haasken-hpe · 2024-08-05T13:45:23Z

Description

Changes were made to the sat bootsys command to improve the full-system power off/on procedures in docs-csm. This pull request updates the documentation to match the current behavior of sat bootsys.

Details from commit messages

CRAYSAT-1882: Improve "Power On Compute Cabinets"

Improve the procedure to power on compute cabinets as follows:
- Minor wording adjustments
- Note that it is safe to re-run sat bootsys boot --stage cabinet-power as needed until it's successful.
- Change CAPMC references and commands to PCS. Note that these
  instructions were taken from the document of the same name in the
  Power_Control_Service directory, and that copy of the procedure will
  be removed in CRAYSAT-1891.
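
The re-run guidance above can be sketched as a small retry loop. This is a minimal sketch, not part of `sat`: the `retry_until_success` helper, the attempt limit, and the delay are hypothetical, and it assumes the stage command exits non-zero on failure and can run without operator interaction.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not provided by sat): re-run a command until it
# exits 0 or the attempt limit is reached.
# Usage: retry_until_success MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_until_success() {
    local max_attempts=$1 delay=$2
    shift 2
    local attempt
    for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
        if "$@"; then
            echo "Succeeded on attempt ${attempt}"
            return 0
        fi
        echo "Attempt ${attempt} of ${max_attempts} failed" >&2
        # Do not sleep after the final attempt.
        (( attempt < max_attempts )) && sleep "${delay}"
    done
    return 1
}

# Example invocation, commented out so that sourcing this sketch has no
# side effects. The limit of 3 attempts and 60 second delay are arbitrary.
# retry_until_success 3 60 sat bootsys boot --stage cabinet-power
```

In practice an admin would likely inspect the output between attempts instead of looping blindly; the loop only illustrates that repeating the stage is expected to be harmless.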

CRAYSAT-1882: Improve steps for shutting down K8s cluster

Improve the steps in the "Shut Down and Power Off the Management
Kubernetes Cluster" section of the "System Power Off Procedures". This
includes the following changes:
- Include more complete example output from the ncn-power stage of
  sat bootsys shutdown, so that the admin knows what to expect.
- Add sub-steps describing what to do in known exceptional circumstances
  that can occur during the ncn-power stage of shutdown.
- Improve console monitoring instructions, similarly to how they were
  improved for the boot process.
- Remove unnecessary ipmitool commands that query the power status of
  NCNs. The sat bootsys shutdown --stage ncn-power command already does this.

CRAYSAT-1882: Improve steps for starting Kubernetes cluster

Improve the steps in the "Power On and Start the Management Kubernetes
Cluster" section of the "System Power On Procedures". This includes the
following changes:
- Show the successful output of the ncn-power stage first. Then show
  exceptional conditions in sub-steps. Improve the wording.
- Ensure all log messages in example output are appropriately prefixed
  with the log level as they will be with the latest version of sat.
- Improve documentation around monitoring node console logs to include a
  tail command and describe using screen in more detail.
- Remove the Ceph troubleshooting from the platform-services stage
  since it has been moved to the ncn-power stage.
- Add a general reminder to re-run the platform-services stage if any
  unexpected errors occur.

CRAYSAT-1877: Improvement on platform services stage

CRAYSAT-1882: sat bootsys shutdown and boot improvements doc changes (#5262)

Improvements were made to sat bootsys shutdown and sat bootsys boot. The
documentation changes are as follows:
- Change the order in which the NCNs are booted.
- Add a prompt to bypass the Ceph health wait.
- Update the SAT command to shut down cabinets using PCS.
- Use PCS commands in place of CAPMC commands.
- Move the Ceph troubleshooting under the ncn-power stage.

CRAYSAT-1711: Improve procedure to get BOS session templates (#5260)

As part of the system power off/on procedure, the admin must use BOS to
shut down the nodes and to boot the nodes. To do so, they must find the
right BOS session templates to use.

Currently this procedure is duplicated in three places in the
documentation. Consolidate and improve the documentation in one place,
the "Prepare the System for Power Off" section, and refer to it from the
other two documents which need to reference the procedure for finding
the appropriate BOS session templates.

Also rename the procedures for booting and shutting down compute nodes
and user access nodes to use the more general term "Managed Nodes" instead, which is consistent
with the IUF's terminology. Update all locations to use the new titles
and markdown file names.

Improve and streamline the "Power On and Boot Managed Nodes" and
"Shut Down and Power Off Managed Nodes" procedures.

CRAYSAT-1715 Removing the manual process there to mount as it is automated. (#5143)

IM:CRAYSAT-1715
Reviewer:Ryan

Previously, the user mounted the s3fs file systems manually. This has been automated.

CRAYSAT-1852 automates this by mounting the file systems on m001 before booting the
other master nodes (m002, m003) and the worker nodes. As a result, the mount points
are readily available on the nodes after boot, so the manual procedure has been
removed from the documentation.
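
One way to confirm that the automated mounts are present after boot is to look the expected paths up in the kernel mount table. The following is a minimal sketch under stated assumptions: the `check_mount_points` helper is hypothetical, it only works on Linux, and the actual s3fs mount point paths are not given in this PR, so the example path is a placeholder.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report whether each given path is currently a mount
# point on this node, using the kernel mount table (Linux only).
# Returns 0 if every path is mounted, 1 otherwise.
check_mount_points() {
    local missing=0 path
    for path in "$@"; do
        # Field 2 of each /proc/self/mounts entry is the mount point.
        if awk -v p="${path}" '$2 == p { found = 1 } END { exit !found }' /proc/self/mounts; then
            echo "mounted: ${path}"
        else
            echo "NOT mounted: ${path}"
            missing=1
        fi
    done
    return "${missing}"
}

# Example invocation with a placeholder path (assumption, not from the docs):
# check_mount_points /path/to/s3fs/mount
```

The exact-match comparison means paths must be given without a trailing slash, and mount points containing spaces are not handled by this sketch.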

Checklist

If I added any command snippets, the steps they belong to follow the prompt conventions (see example).
If I added a new directory, I also updated .github/CODEOWNERS with the corresponding team in Cray-HPE.
My commits or Pull-Request Title contain my JIRA information, or I do not have a JIRA.

…mated. (#5143)

Co-authored-by: Shivaprasad Ashok Metimath <shivaprasad-ashok.metimath@hpe.com>

…5262)

* Apply suggestions from code review

Signed-off-by: Nathan Rockershousen <nathan.rockershousen@hpe.com>

---------

Signed-off-by: Nathan Rockershousen <nathan.rockershousen@hpe.com>
Co-authored-by: Shivaprasad Ashok Metimath <shivaprasad-ashok.metimath@hpe.com>
Co-authored-by: Nathan Rockershousen <nathan.rockershousen@hpe.com>

haasken-hpe · 2024-08-05T16:21:06Z

The markdown-link-check failures appear to be existing bad links unrelated to the changes made in this PR.

shivaprasad-metimath and others added 7 commits August 5, 2024 08:40

CRAYSAT-1877: Improvement on platform services stage

6234cd4

haasken-hpe requested review from haroldlongley, shivaprasad-metimath and annapoorna-s-alt August 5, 2024 13:45

haasken-hpe requested review from a team as code owners August 5, 2024 13:45

haasken-hpe requested review from jnowicki-hpe, nrockershousen, shunr-hpe, GunashekarKM, SavioJoshua, sfkramer, alexanderkingh, brookshire, kkelling-hpe, don-bahls-hpe, rustydb and spillerc-hpe August 5, 2024 13:45

annapoorna-s-alt approved these changes Aug 6, 2024

shivaprasad-metimath approved these changes Aug 6, 2024

rustydb approved these changes Aug 6, 2024

rustydb merged commit d9624a3 into release/1.6 Aug 6, 2024
7 of 8 checks passed

rustydb deleted the feature/CRAYSAT-1740 branch August 6, 2024 19:22
