generated from Cray-HPE/metal-template
CRAYSAT-1882: Update power off/on procedures for sat bootsys improvements #5282
Merged
…mated. (#5143) Removing the manual process there to mount as it is automated. IM:CRAYSAT-1715 Reviewer:Ryan Currently the user mounts the s3fs filesystem manually. This has been automated. CRAYSAT-1852 automates this by mounting the file systems on m001 before booting the other master nodes (m002, m003) and the worker nodes, so the mount points are readily available on the nodes after boot. The documented manual process is therefore removed. Co-authored-by: Shivaprasad Ashok Metimath <shivaprasad-ashok.metimath@hpe.com>
As part of the system power off/on procedure, the admin must use BOS to shut down the nodes and to boot the nodes. To do so, they must find the right BOS session templates to use. Currently this procedure is duplicated in three places in the documentation. Consolidate and improve the documentation in one place, the "Prepare the System for Power Off" section, and refer to it from the other two documents which need to reference the procedure for finding the appropriate BOS session templates. Also rename the procedures for booting and shutting down compute nodes and user access nodes to use the more general term "Managed Nodes" instead, which is consistent with the IUF's terminology. Update all locations to use the new titles and markdown file names. Improve and streamline the "Power On and Boot Managed Nodes" and "Shut Down and Power Off Managed Nodes" procedures.
…5262) * As part of sat bootsys shutdown and boot, improvements have been made. Changing the order of booting the NCNs. Adding a prompt to bypass the Ceph health wait. Updating the SAT command to shut down cabinets using PCS. PCS command in place of CAPMC command. Move the Ceph troubleshooting under the ncn-power stage. * Apply suggestions from code review Signed-off-by: Nathan Rockershousen <nathan.rockershousen@hpe.com> --------- Signed-off-by: Nathan Rockershousen <nathan.rockershousen@hpe.com> Co-authored-by: Shivaprasad Ashok Metimath <shivaprasad-ashok.metimath@hpe.com> Co-authored-by: Nathan Rockershousen <nathan.rockershousen@hpe.com>
Improve the steps in the "Power On and Start the Management Kubernetes Cluster" section of the "System Power On Procedures". This includes the following changes:
* Show the successful output of the `ncn-power` stage first. Then show exceptional conditions in sub-steps. Improve the wording.
* Ensure all log messages in example output are appropriately prefixed with the log level as they will be with the latest version of `sat`.
* Improve documentation around monitoring node console logs to include a `tail` command and describe using `screen` in more detail.
* Remove the Ceph troubleshooting from the `platform-services` stage since it has been moved to the `ncn-power` stage.
* Add a general reminder to re-run the `platform-services` stage if any unexpected errors occur.
Improve the steps in the "Shut Down and Power Off the Management Kubernetes Cluster" section of the "System Power Off Procedures". This includes the following changes:
* Include more complete example output from the `ncn-power` stage of `sat bootsys shutdown`, so that the admin knows what to expect.
* Add sub-steps describing what to do in known exceptional circumstances that can occur during the `ncn-power` stage of shutdown.
* Improve console monitoring instructions, similarly to how they were improved for the boot process.
* Remove unnecessary `ipmitool` commands that query for the power status of NCNs. This is already done by the `sat bootsys shutdown --stage ncn-power` command.
Improve the procedure to power on compute cabinets as follows:
* Minor wording adjustments.
* Note that it is safe to re-run `sat bootsys boot --stage cabinet-power` as needed until it's successful.
* Change CAPMC references and commands to PCS.
Note that these instructions were taken from the document of the same name in the `Power_Control_Service` directory, and that copy of the procedure will be removed in CRAYSAT-1891.
haasken-hpe requested review from haroldlongley, shivaprasad-metimath and annapoorna-s-alt August 5, 2024 13:45
haasken-hpe requested review from jnowicki-hpe, nrockershousen, shunr-hpe, GunashekarKM, SavioJoshua, sfkramer, alexanderkingh, brookshire, kkelling-hpe, don-bahls-hpe, rustydb and spillerc-hpe August 5, 2024 13:45
The markdown-link-check failures appear to be existing bad links unrelated to the changes made in this PR.
annapoorna-s-alt approved these changes Aug 6, 2024
shivaprasad-metimath approved these changes Aug 6, 2024
rustydb approved these changes Aug 6, 2024
Description
Changes were made to the `sat bootsys` command to improve the full-system power off/on procedures in docs-csm. This pull request updates the documentation to match the current behavior of `sat bootsys`.
Details from commit messages
CRAYSAT-1882: Improve "Power On Compute Cabinets"
Improve the procedure to power on compute cabinets as follows:
* Minor wording adjustments.
* Note that it is safe to re-run `sat bootsys boot --stage cabinet-power` as needed until it's successful.
* Change CAPMC references and commands to PCS.
Note that these instructions were taken from the document of the same name in the `Power_Control_Service` directory, and that copy of the procedure will be removed in CRAYSAT-1891.
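
Because the cabinet-power stage is described as safe to re-run until it succeeds, the re-run can be wrapped in a small retry helper. The following is only a sketch under stated assumptions: `retry_until_success` is a hypothetical helper (not part of `sat`), it assumes the command exits non-zero on failure, and it assumes the stage can run without interactive input. The attempt count and delay are arbitrary illustrative values.

```bash
#!/usr/bin/env bash
# Hypothetical helper: re-run a command until it exits 0, up to a maximum
# number of attempts, sleeping between attempts.
# Usage: retry_until_success MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_until_success() {
    local max_attempts="$1"; shift
    local delay="$1"; shift
    local attempt=1

    # Assumption: the command signals failure with a non-zero exit status.
    until "$@"; do
        if (( attempt >= max_attempts )); then
            echo "Command failed after ${attempt} attempts: $*" >&2
            return 1
        fi
        attempt=$(( attempt + 1 ))
        sleep "$delay"
    done
}

# Example (commented out so the sketch has no side effects when sourced):
# retry_until_success 3 60 sat bootsys boot --stage cabinet-power
```

A bounded attempt count is used so that a persistent failure surfaces to the admin instead of looping forever; the troubleshooting steps in the procedure still apply when the helper gives up.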
CRAYSAT-1882: Improve steps for shutting down K8s cluster
Improve the steps in the "Shut Down and Power Off the Management Kubernetes Cluster" section of the "System Power Off Procedures". This includes the following changes:
* Include more complete example output from the `ncn-power` stage of `sat bootsys shutdown`, so that the admin knows what to expect.
* Add sub-steps describing what to do in known exceptional circumstances that can occur during the `ncn-power` stage of shutdown.
* Improve console monitoring instructions, similarly to how they were improved for the boot process.
* Remove unnecessary `ipmitool` commands that query for the power status of NCNs. This is already done by the `sat bootsys shutdown --stage ncn-power` command.
CRAYSAT-1882: Improve steps for starting Kubernetes cluster
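Improve the steps in the "Power On and Start the Management Kubernetes Cluster" section of the "System Power On Procedures". This includes the following changes:
* Show the successful output of the `ncn-power` stage first. Then show exceptional conditions in sub-steps. Improve the wording.
* Ensure all log messages in example output are appropriately prefixed with the log level as they will be with the latest version of `sat`.
* Improve documentation around monitoring node console logs to include a `tail` command and describe using `screen` in more detail.
* Remove the Ceph troubleshooting from the `platform-services` stage since it has been moved to the `ncn-power` stage.
* Add a general reminder to re-run the `platform-services` stage if any unexpected errors occur.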
CRAYSAT-1877: Improvement on platform services stage
CRAYSAT-1882: sat bootsys shutdown and boot improvements doc changes (#5262)
* Change the order of booting the NCNs.
* Add a prompt to bypass the Ceph health wait.
* Update the SAT command to shut down cabinets using PCS.
* Use PCS commands in place of CAPMC commands.
* Move the Ceph troubleshooting under the `ncn-power` stage.
CRAYSAT-1711: Improve procedure to get BOS session templates (#5260)
As part of the system power off/on procedure, the admin must use BOS to
shut down the nodes and to boot the nodes. To do so, they must find the
right BOS session templates to use.
Currently this procedure is duplicated in three places in the
documentation. Consolidate and improve the documentation in one place,
the "Prepare the System for Power Off" section, and refer to it from the
other two documents which need to reference the procedure for finding
the appropriate BOS session templates.
Also rename the procedures for booting and shutting down compute nodes
and user access nodes to use the more general term "Managed Nodes" instead, which is consistent
with the IUF's terminology. Update all locations to use the new titles
and markdown file names.
Improve and streamline the "Power On and Boot Managed Nodes" and
"Shut Down and Power Off Managed Nodes" procedures.
CRAYSAT-1715 Removing the manual process there to mount as it is automated. (#5143)
Removing the manual process there to mount as it is automated.
IM:CRAYSAT-1715
Reviewer:Ryan
Currently the user mounts the s3fs filesystem manually. This has been automated.
CRAYSAT-1852 automates this by mounting the file systems on m001 before booting the
other master nodes (m002, m003) and the worker nodes, so the mount points are
readily available on the nodes after boot. The documented manual process is therefore removed.
Checklist
.github/CODEOWNERS
with the corresponding team in Cray-HPE.