CRAYSAT-1882: Update power off/on procedures for sat bootsys improvements #5282

Merged
rustydb merged 7 commits into release/1.6 from feature/CRAYSAT-1740
Aug 6, 2024

Conversation

@haasken-hpe (Contributor) commented Aug 5, 2024

Description

Changes were made to the sat bootsys command to improve the full-system power off/on procedures in docs-csm. This pull request updates the documentation to match the current behavior of sat bootsys.

Details from commit messages

  • CRAYSAT-1882: Improve "Power On Compute Cabinets"

    Improve the procedure to power on compute cabinets as follows:

    • Minor wording adjustments
    • Note that it is safe to re-run sat bootsys boot --stage cabinet-power as needed until it's successful.
    • Change CAPMC references and commands to PCS. Note that these
      instructions were taken from the document of the same name in the
      Power_Control_Service directory, and that copy of the procedure will
      be removed in CRAYSAT-1891.
  • CRAYSAT-1882: Improve steps for shutting down K8s cluster

    Improve the steps in the "Shut Down and Power Off the Management
    Kubernetes Cluster" section of the "System Power Off Procedures". This
    includes the following changes:

    • Include more complete example output from the ncn-power stage of
      sat bootsys shutdown, so that the admin knows what to expect.
    • Add sub-steps describing what to do in known exceptional circumstances
      that can occur during the ncn-power stage of shutdown.
    • Improve console monitoring instructions, similarly to how they were
      improved for the boot process.
    • Remove unnecessary ipmitool commands that query the power status of
      NCNs. The sat bootsys shutdown --stage ncn-power command already does this.
  • CRAYSAT-1882: Improve steps for starting Kubernetes cluster

    Improve the steps in the "Power On and Start the Management Kubernetes
    Cluster" section of the "System Power On Procedures". This includes the
    following changes:

    • Show the successful output of the ncn-power stage first. Then show
      exceptional conditions in sub-steps. Improve the wording.
    • Ensure all log messages in example output are appropriately prefixed
      with the log level as they will be with the latest version of sat.
    • Improve documentation around monitoring node console logs to include a
      tail command and describe using screen in more detail.
    • Remove the Ceph troubleshooting from the platform-services stage
      since it has been moved to the ncn-power stage.
    • Add a general reminder to re-run the platform-services stage if any
      unexpected errors occur.
  • CRAYSAT-1877: Improvement on platform services stage

  • CRAYSAT-1882: sat bootsys shutdown and boot improvements doc changes (#5262)

    Document the improvements made to sat bootsys shutdown and boot:

    • Change the order of booting the NCNs.
    • Add a prompt to bypass the Ceph health wait.
    • Update the SAT command to shut down cabinets using PCS.
    • Use PCS commands in place of CAPMC commands.
    • Move the Ceph troubleshooting under the ncn-power stage.

  • CRAYSAT-1711: Improve procedure to get BOS session templates (#5260)

    As part of the system power off/on procedure, the admin must use BOS to
    shut down the nodes and to boot the nodes. To do so, they must find the
    right BOS session templates to use.

    Currently this procedure is duplicated in three places in the
    documentation. Consolidate and improve the documentation in one place,
    the "Prepare the System for Power Off" section, and refer to it from the
    other two documents which need to reference the procedure for finding
    the appropriate BOS session templates.

    Also rename the procedures for booting and shutting down compute nodes
    and user access nodes to use the more general term "Managed Nodes" instead, which is consistent
    with the IUF's terminology. Update all locations to use the new titles
    and markdown file names.

    Improve and streamline the "Power On and Boot Managed Nodes" and
    "Shut Down and Power Off Managed Nodes" procedures.

  • CRAYSAT-1715 Removing the manual process there to mount as it is automated. (#5143)

    IM:CRAYSAT-1715
    Reviewer:Ryan

    Previously, the user mounted the s3fs file systems manually. This has been automated.

    CRAYSAT-1852 automates this by mounting the file systems on m001 before booting the
    other master nodes (m002, m003) and the worker nodes, so the mount points are readily
    available on the nodes after boot. The documented manual process is therefore removed.
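
The note above that it is safe to re-run sat bootsys boot --stage cabinet-power until it succeeds can be sketched as a small retry wrapper. This is only an illustration and not part of the documented procedure: retry_until_success is a hypothetical helper name, the attempt count and delay are arbitrary, and the sketch assumes the wrapped command exits non-zero on failure and needs no interactive input.

```shell
# Hypothetical helper (not part of sat): re-run a command until it succeeds
# or the attempt limit is reached.
# Usage: retry_until_success MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_until_success() {
    local max_attempts="$1"
    local delay="$2"
    shift 2
    local attempt=1

    while [ "$attempt" -le "$max_attempts" ]; do
        echo "Attempt ${attempt}/${max_attempts}: $*" >&2
        # Stop as soon as the command reports success.
        if "$@"; then
            return 0
        fi
        # Pause before the next attempt, but not after the final one.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done

    echo "Command did not succeed after ${max_attempts} attempts: $*" >&2
    return 1
}
```

For example, `retry_until_success 3 60 sat bootsys boot --stage cabinet-power` would try the stage up to three times, one minute apart. Whether the stage's exit status reliably reflects failure, and whether it prompts for input, are assumptions to check before relying on a wrapper like this.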

Checklist

  • If I added any command snippets, the steps they belong to follow the prompt conventions (see example).
  • If I added a new directory, I also updated .github/CODEOWNERS with the corresponding team in Cray-HPE.
  • My commits or Pull-Request Title contain my JIRA information, or I do not have a JIRA.

shivaprasad-metimath and others added 7 commits August 5, 2024 08:40
…mated. (#5143)

Co-authored-by: Shivaprasad Ashok Metimath <shivaprasad-ashok.metimath@hpe.com>
…5262)

* Apply suggestions from code review

Signed-off-by: Nathan Rockershousen <nathan.rockershousen@hpe.com>

---------

Signed-off-by: Nathan Rockershousen <nathan.rockershousen@hpe.com>
Co-authored-by: Shivaprasad Ashok Metimath <shivaprasad-ashok.metimath@hpe.com>
Co-authored-by: Nathan Rockershousen <nathan.rockershousen@hpe.com>
@haasken-hpe (Contributor, Author)

The markdown-link-check failures appear to be existing bad links unrelated to the changes made in this PR.

@rustydb rustydb merged commit d9624a3 into release/1.6 Aug 6, 2024
7 of 8 checks passed
@rustydb rustydb deleted the feature/CRAYSAT-1740 branch August 6, 2024 19:22