Validate CSM Health

Anytime after the installation of the CSM services, the health of the management nodes and all CSM services can be validated.

The following are examples of when to run health checks:

  • After CSM install.sh completes
  • Before and after NCN reboots
  • After the system is brought back up
  • Any time there is unexpected behavior observed
  • In order to provide relevant information to create support tickets

The areas should be tested in the order they are listed on this page. Errors in an earlier check may cause errors in later checks because of dependencies.

Topics:

  1. Platform Health Checks
  2. Hardware Management Services Health Checks
  3. Software Management Services Health Checks
  4. Booting CSM Barebones Image
  5. UAS / UAI Tests

1. Platform Health Checks

The scripts do not verify results themselves; their output includes the information needed to determine pass or fail for each check. All health checks are expected to pass.

Health Check scripts can be run:

  • After CSM install.sh has been run (not before)
  • Before and after one of the NCNs reboots
  • After the system or a single node goes down unexpectedly
  • After the system is gracefully shut down and brought up
  • Any time there is unexpected behavior on the system to get a baseline of data for CSM services and components
  • In order to provide relevant information to support tickets that are being opened after CSM install.sh has been run

Available Platform Health Checks:

  1. ncnHealthChecks
  2. ncnPostgresHealthChecks
  3. BGP Peering Status and Reset
  4. KEA / DHCP
  5. External DNS
  6. Spire Agent
  7. Vault Cluster
  8. Automated Goss Testing

1.1 ncnHealthChecks

Health check scripts can be found and run on any worker or master node (not on the PIT node), from any directory.

ncn# /opt/cray/platform-utils/ncnHealthChecks.sh

The ncnHealthChecks script reports the following health information:

  • Kubernetes status for master and worker NCNs
  • Ceph health status
  • Health of etcd clusters
  • Number of pods on each worker node for each etcd cluster
  • Alarms set for any of the etcd clusters
  • Health of each etcd cluster's database
  • List of automated etcd backups for the Boot Orchestration Service (BOS), Boot Script Service (BSS), Compute Rolling Upgrade Service (CRUS), Domain Name Service (DNS), and Firmware Action Service (FAS) clusters
  • NCN node uptimes
  • NCN master and worker node resource consumption
  • NCN node xnames and metal.no-wipe status
  • NCN worker node pod counts
  • Pods yet to reach the running state

Execute the ncnHealthChecks script and analyze the output of each individual check.
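If a particular check needs to be re-examined, the underlying state can also be spot-checked directly. For example, a minimal manual check of Kubernetes node status and Ceph health (assuming the Ceph admin keyring is available on the node, as is typical for master and storage NCNs):

ncn# kubectl get nodes -o wide
ncn# ceph -s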

IMPORTANT: When the PIT node is booted, the NCN node metal.no-wipe status is not available and is correctly reported as 'unavailable'. Once ncn-m001 has been booted, the NCN metal.no-wipe status is expected to be reported as metal.no-wipe=1.

IMPORTANT: Once ncn-m001 has been booted, if the output of the ncnHealthChecks.sh script shows nodes that do not have the metal.no-wipe=1 status, then run the following for each such node:

ncn# csi handoff bss-update-param --set metal.no-wipe=1 --limit <SERVER_XNAME>
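The <SERVER_XNAME> value can be taken from the xname listing in the ncnHealthChecks output, or read directly from the node in question; on CSM NCNs the xname is typically recorded in /etc/cray/xname, for example:

ncn# ssh ncn-w003 cat /etc/cray/xname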

IMPORTANT: If the output of pod statuses indicates that there are pods in the Evicted state, it may be because the /root file system has filled up on the Kubernetes node in question. Kubernetes begins evicting pods once root file system usage reaches 85% and continues until usage is back under 80%. This commonly happens on ncn-m001 because install and documentation files are often downloaded there. It may be necessary to clean up space in the /root directory if this is the root cause of pod evictions. The following commands can be used to determine whether analysis of files under /root is needed to free up space.

ncn# df -h /root
Filesystem      Size  Used Avail Use% Mounted on
LiveOS_rootfs   280G  245G   35G  88% /
ncn# du -h -s /root/
225G  /root/
ncn# du -ah -B 1024M /root | sort -n -r | head -n 10

Note: The cray-crus- pod is expected to be in the Init state until slurm and munge are installed. In particular, this will be the case when running this validation after completing the Install CSM Services procedure. If in doubt, validate the CRUS service using the CMS Validation Tool. If the CRUS check passes using that tool, do not worry about the cray-crus- pod state.

Additionally, hmn-discovery and unbound manager cronjob pods may be in a 'NotReady' state. This is expected as these pods are periodically started and transition to the completed state.
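To quickly re-check only the pods that have not reached a Running or Completed state (the same information the script summarizes), the following can be used:

ncn# kubectl get pods -A -o wide | grep -Ev 'Running|Completed'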

1.2 ncnPostgresHealthChecks

Postgres health check scripts can be found and run on any worker or master node (not on the PIT node), from any directory. The ncnPostgresHealthChecks script reports the following Postgres health information:

  • The status of each postgresql resource
  • The number of cluster members
  • The node which is the Leader
  • The state of each cluster member
  • Replication Lag for any cluster member
  • Kubernetes postgres pod status

Execute ncnPostgresHealthChecks script and analyze the output of each individual check.

ncn# /opt/cray/platform-utils/ncnPostgresHealthChecks.sh
  1. Check the STATUS of the postgresql resources which are managed by the operator:

    NAMESPACE   NAME                         TEAM                VERSION   PODS   VOLUME   CPU-REQUEST   MEMORY-REQUEST   AGE   STATUS
    services    cray-sls-postgres            cray-sls            11        3      1Gi                                     12d   Running

    If any postgresql resources remain in a STATUS other than Running (such as SyncFailed), refer to Troubleshoot Postgres Database.
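
    The same STATUS information can be gathered directly from the postgresql custom resources managed by the operator, for example:

    ncn# kubectl get postgresql -A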

  2. For a particular Postgres cluster, the expected output is similar to the following:

    --- patronictl, version 1.6.5, list for services leader pod cray-sls-postgres-0 ---
    + Cluster: cray-sls-postgres (6938772644984361037) ---+----+-----------+
    |        Member       |    Host    |  Role  |  State  | TL | Lag in MB |
    +---------------------+------------+--------+---------+----+-----------+
    | cray-sls-postgres-0 | 10.47.0.35 | Leader | running |  1 |           |
    | cray-sls-postgres-1 | 10.36.0.33 |        | running |  1 |         0 |
    | cray-sls-postgres-2 | 10.44.0.42 |        | running |  1 |         0 |
    +---------------------+------------+--------+---------+----+-----------+

    The points below cover the Member, Role, State, and Lag in MB columns from the table above.

    For each Postgres cluster:

    • Verify there are three cluster members (with the exception of sma-postgres-cluster where there should be only two cluster members). If the number of cluster members is not correct, refer to Troubleshoot Postgres Database.

    • Verify there is one cluster member with the Leader Role and that the log output indicates the expected status, such as:

      i am the leader with the lock

      For example:

      --- Logs for services Leader Pod cray-sls-postgres-0 ---
         ERROR: get_cluster
         INFO: establishing a new patroni connection to the postgres cluster
         INFO: initialized a new cluster
         INFO: Lock owner: cray-sls-postgres-0; I am cray-sls-postgres-0
         INFO: Lock owner: None; I am cray-sls-postgres-0
         INFO: no action. i am the leader with the lock
         INFO: No PostgreSQL configuration items changed, nothing to reload.
         INFO: postmaster pid=87
         INFO: running post_bootstrap
         INFO: trying to bootstrap a new cluster

      Errors reported prior to the lock status, such as ERROR: get_cluster or ERROR: ObjectCache.run ProtocolError('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read)) can be ignored. If there is no Leader, refer to Troubleshoot Postgres Database.

    • Verify the State of each cluster member is 'running'. If any cluster members are found to be in a non 'running' state (such as 'start failed'), refer to Troubleshoot Postgres Database.

    • Verify there is no large or growing lag. If any cluster members are found to have lag or lag is 'unknown', refer to Troubleshoot Postgres Database.
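
    To re-run this member/role/state/lag check manually for a single cluster, patronictl can be invoked inside the leader pod. A sketch, assuming the Spilo container is named postgres (the usual convention for the Zalando operator):

    ncn# kubectl exec -it -n services cray-sls-postgres-0 -c postgres -- patronictl list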

  3. Check that all Kubernetes Postgres pods have a STATUS of Running.

    ncn# kubectl get pods -A -o wide -l application=spilo
    NAMESPACE           NAME                                                              READY   STATUS             RESTARTS   AGE     IP            NODE       NOMINATED NODE   READINESS GATES
    services            cray-sls-postgres-0                                               3/3     Running            3          6d      10.38.0.102   ncn-w002   <none>           <none>
    services            cray-sls-postgres-1                                               3/3     Running            3          5d20h   10.42.0.89    ncn-w001   <none>           <none>
    services            cray-sls-postgres-2                                               3/3     Running            0          5d20h   10.36.0.31    ncn-w003   <none>           <none>

    If any Postgres pods have a STATUS other than Running, gather more information from the pod and refer to Troubleshoot Postgres Database.

    ncn# kubectl describe pod <pod name> -n <pod namespace>
    ncn# kubectl logs <pod name> -n <pod namespace> -c <pod container name>

1.3 BGP Peering Status and Reset

Verify that Border Gateway Protocol (BGP) peering sessions are established for each worker node on the system.

Check the Border Gateway Protocol (BGP) status on the Aruba or Mellanox switches. Verify that all sessions are in an Established state. If the state of any session in the table is Idle, reset the BGP sessions.

On an NCN, determine the IP addresses of switches:

ncn-m001# kubectl get cm config -n metallb-system -o yaml | head -12

Expected output looks similar to the following:

apiVersion: v1
data:
  config: |
    peers:
    - peer-address: 10.252.0.2
      peer-asn: 65533
      my-asn: 65533
    - peer-address: 10.252.0.3
      peer-asn: 65533
      my-asn: 65533
    address-pools:
    - name: customer-access

Using the first peer-address (10.252.0.2 here), log in using ssh as the administrator to the first switch and note in the returned output if a Mellanox or Aruba switch is indicated.

ncn-m001# ssh admin@10.252.0.2
  • On a Mellanox switch, Mellanox Onyx Switch Management or Mellanox Switch may be displayed after logging in to the switch with ssh. In this case, proceed to the Mellanox steps.
  • On an Aruba switch, Please register your products now at: https://asp.arubanetworks.com may be displayed after logging in to the switch with ssh. In this case, proceed to the Aruba steps.

1.3.1 Mellanox Switch

  1. Enable:

    sw-spine-001# enable
    
  2. Verify BGP is enabled:

    sw-spine-001# show protocols | include bgp
    

    Expected output looks similar to the following:

    bgp:                    enabled
    
  3. Check peering status:

    sw-spine-001# show ip bgp summary
    

    Expected output looks similar to the following:

    VRF name                  : default
    BGP router identifier     : 10.252.0.2
    local AS number           : 65533
    BGP table version         : 3
    Main routing table version: 3
    IPV4 Prefixes             : 59
    IPV6 Prefixes             : 0
    L2VPN EVPN Prefixes       : 0
    
    ------------------------------------------------------------------------------------------------------------------
    Neighbor          V    AS           MsgRcvd   MsgSent   TblVer    InQ    OutQ   Up/Down       State/PfxRcd
    ------------------------------------------------------------------------------------------------------------------
    10.252.1.10       4    65533        2945      3365      3         0      0      1:00:21:33    ESTABLISHED/20
    10.252.1.11       4    65533        2942      3356      3         0      0      1:00:20:49    ESTABLISHED/19
    10.252.1.12       4    65533        2945      3363      3         0      0      1:00:21:33    ESTABLISHED/20
    
  4. If one or more BGP session is reported in an Idle state, reset BGP to re-establish the sessions:

    sw-spine-001# clear ip bgp all
    
    • It may take several minutes for all sessions to become Established. Wait a minute or so, and then verify that all sessions now are all reported as Established. If some sessions remain in an Idle state, re-run the clear ip bgp all command and check again.

    • If after several tries one or more BGP session remains Idle, see Check BGP Status and Reset Sessions.

  5. Repeat the above Mellanox procedure using the second peer-address (10.252.0.3 here).

1.3.2 Aruba Switch

On an Aruba switch, the prompt may include sw-spine or sw-agg.

  1. Check BGP peering status.

    sw-agg01# show bgp ipv4 unicast summary
    

    Expected output looks similar to the following:

    VRF : default
    BGP Summary
    -----------
     Local AS               : 65533        BGP Router Identifier  : 10.252.0.4
     Peers                  : 7            Log Neighbor Changes   : No
     Cfg. Hold Time         : 180          Cfg. Keep Alive        : 60
     Confederation Id       : 0
    
     Neighbor        Remote-AS MsgRcvd MsgSent   Up/Down Time State        AdminStatus
     10.252.0.5      65533       19579   19588   20h:40m:30s  Established   Up
     10.252.1.7      65533       34137   39074   20h:41m:53s  Established   Up
     10.252.1.8      65533       34134   39036   20h:36m:44s  Established   Up
     10.252.1.9      65533       34104   39072   00m:01w:04d  Established   Up
     10.252.1.10     65533       34105   39029   00m:01w:04d  Established   Up
     10.252.1.11     65533       34099   39042   00m:01w:04d  Established   Up
     10.252.1.12     65533       34101   39012   00m:01w:04d  Established   Up
    
  2. If one or more BGP session is reported in an Idle state, reset BGP to re-establish the sessions:

    sw-agg01# clear bgp *
    
    • It may take several minutes for all sessions to become Established. Wait a minute or so, and then verify that all sessions now are reported as Established. If some sessions remain in an Idle state, re-run the clear bgp * command and check again.

    • If after several tries one or more BGP session remains Idle, see Check BGP Status and Reset Sessions

  3. Repeat the above Aruba procedure using the second peer-address (10.252.0.5 in this example).

1.4 Verify that KEA has active DHCP leases

Verify that KEA has active DHCP leases. Right after a fresh install of CSM, it is important to verify that KEA is currently handing out DHCP leases on the system. The following commands can be run on any of the master nodes or worker nodes.

Get an API Token:

ncn# export TOKEN=$(curl -s -S -d grant_type=client_credentials \
                 -d client_id=admin-client \
                 -d client_secret=`kubectl get secrets admin-client-auth \
                 -o jsonpath='{.data.client-secret}' | base64 -d` \
                          https://api-gw-service-nmn.local/keycloak/realms/shasta/protocol/openid-connect/token | jq -r '.access_token')

Retrieve all the leases currently in KEA:

ncn# curl -H "Authorization: Bearer ${TOKEN}" -X POST -H "Content-Type: application/json" -d '{ "command": "lease4-get-all", "service": [ "dhcp4" ] }' https://api-gw-service-nmn.local/apis/dhcp-kea | jq

If a non-zero number of DHCP leases for air-cooled hardware is returned, that is a good indication that KEA is working.
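
To simply count the leases rather than list them all, the same request can be piped through jq (assuming the standard Kea control-agent response format, where the lease list is under .arguments.leases of the first response element):

ncn# curl -s -H "Authorization: Bearer ${TOKEN}" -X POST -H "Content-Type: application/json" -d '{ "command": "lease4-get-all", "service": [ "dhcp4" ] }' https://api-gw-service-nmn.local/apis/dhcp-kea | jq '.[0].arguments.leases | length'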

1.5 Verify ability to resolve external DNS

If unbound is configured to resolve outside hostnames, then the following check should be performed. If unbound is not configured to resolve outside hostnames, then this check may be skipped.

Run the following on one of the master or worker nodes (not the PIT node):

ncn# nslookup cray.com ; echo "Exit code is $?"

Expected output looks similar to the following:

Server:         10.92.100.225
Address:        10.92.100.225#53

Non-authoritative answer:
Name:   cray.com
Address: 52.36.131.229

Exit code is 0

Verify that the command has exit code 0, reports no errors, and resolves the address.

1.6 Verify Spire Agent is Running on Kubernetes NCNs

Execute the following command on all Kubernetes NCNs (i.e. all worker nodes and master nodes, excluding the PIT):

ncn# goss -g /opt/cray/tests/install/ncn/tests/goss-spire-agent-service-running.yaml validate

Known failures and how to recover:

  • K8S Test: Verify spire-agent is enabled and running

    • The spire-agent service may fail to start on Kubernetes NCNs, logging errors (visible via journalctl) similar to "join token does not exist or has already been used", or the last log entries may contain multiple lines of "systemd[1]: spire-agent.service: Start request repeated too quickly.". Deleting the request-ncn-join-token daemonset pod running on the node may clear the issue. Even though the spire-agent systemctl service on the Kubernetes node should eventually restart cleanly, the user may have to log in to the impacted nodes and restart the service. The following recovery procedure can be run from any Kubernetes node in the cluster.

      1. Set NODE to the NCN which is experiencing the issue. In this example, ncn-w002.

        ncn# export NODE=ncn-w002

      2. Define the following function:

        ncn# function renewncnjoin() { for pod in $(kubectl get pods -n spire | grep request-ncn-join-token | awk '{print $1}'); do if kubectl describe -n spire pods $pod | grep -q "Node:.*$1"; then echo "Restarting $pod running on $1"; kubectl delete -n spire pod "$pod"; fi; done; }

      3. Run the function as follows:

        ncn# renewncnjoin $NODE
    • The spire-agent service may also fail if an NCN was powered off for too long and its tokens expired. If this happens, delete /root/spire/agent_svid.der, /root/spire/bundle.der, and /root/spire/data/svid.key off the NCN before deleting the request-ncn-join-token daemonset pod.
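
      A minimal sketch of that expired-token recovery, assuming ncn-w002 is the impacted NCN and using the renewncnjoin function defined above:

      ncn-w002# rm -f /root/spire/agent_svid.der /root/spire/bundle.der /root/spire/data/svid.key
      ncn# renewncnjoin ncn-w002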

1.7 Verify the Vault Cluster is Healthy

Execute the following commands on ncn-m002:

ncn-m002# goss -g /opt/cray/tests/install/ncn/tests/goss-k8s-vault-cluster-health.yaml validate

Check the output to verify no failures are reported:

Count: 2, Failed: 0, Skipped: 0
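
If the test reports failures, the state of the Vault pods can be inspected directly (assuming the default vault namespace used by CSM):

ncn-m002# kubectl get pods -n vault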

1.8 Automated Goss Testing

There are multiple Goss test suites available that cover a variety of sub-systems.

Run the NCN health checks against the three different types of nodes with the following commands:

IMPORTANT: These tests may only be successful while booted into the PIT node. Do not run these as part of upgrade testing. This includes the Kubernetes check in the next block.

pit# /opt/cray/tests/install/ncn/automated/ncn-healthcheck-master
pit# /opt/cray/tests/install/ncn/automated/ncn-healthcheck-worker
pit# /opt/cray/tests/install/ncn/automated/ncn-healthcheck-storage

And the Kubernetes test suite via:

pit# /opt/cray/tests/install/ncn/automated/ncn-kubernetes-checks

1.8.1 Known Test Issues

  • These tests can only reliably be executed from the PIT node. This should be addressed in a future release.
  • K8S Test: Kubernetes Query BSS Cloud-init for ca-certs
    • May fail immediately after platform install. Should pass after the TrustedCerts Operator has updated BSS (Global cloud-init meta) with CA certificates.
  • K8S Test: Kubernetes Velero No Failed Backups
    • Because of a known issue with Velero, a backup may be attempted immediately upon the deployment of a backup schedule (for example, vault). It may be necessary to use the velero command to delete backups from a Kubernetes node to clear this situation.
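
      A sketch of clearing such a failed backup with the velero CLI (assuming the CLI is installed on the node; the backup name is a placeholder taken from the velero backup get output):

      ncn# velero backup get
      ncn# velero backup delete <backup-name> --confirm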

1.9 Optional Check of System Management Monitoring Tools

If all designated prerequisites are met, the availability of system management health services may optionally be validated by accessing the URLs listed in Access System Management Health Services. It is very important to check the Prerequisites section of this document.

If one or more of the URLs listed in the procedure are inaccessible, it does not necessarily mean that the system is not healthy. It may simply mean that not all of the prerequisites have been met to allow access to the system management health tools via URL.

Information to assist with troubleshooting some of the components mentioned in the prerequisites can be accessed here:

2. Hardware Management Services Health Checks

Execute the HMS smoke and functional tests after the CSM install to confirm that the Hardware Management Services are running and operational.

2.1 HMS CT Test Execution

These tests should be executed as root on at least one worker NCN and one master NCN (but not ncn-m001 if it is still the PIT node).

Run the HMS CT smoke tests. This is done by running the run_hms_ct_tests.sh script:

ncn# /opt/cray/csm/scripts/hms_verification/run_hms_ct_tests.sh

The return value of the script is 0 if all CT tests ran successfully, non-zero if not.

Running CT Tests Manually

To run the tests manually:

ncn# /opt/cray/tests/ncn-resources/hms/hms-test/hms_run_ct_smoke_tests_ncn-resources.sh

Examine the output. If one or more failures occur, investigate the cause of each failure. See the interpreting_hms_health_check_results documentation for more information.

Otherwise, run the HMS functional tests.

ncn# /opt/cray/tests/ncn-resources/hms/hms-test/hms_run_ct_functional_tests_ncn-resources.sh

Examine the output. If one or more failures occur, investigate the cause of each failure. See the interpreting_hms_health_check_results documentation for more information.

2.2 Aruba Switch SNMP Fixup

Systems with Aruba leaf switches sometimes have issues with a known SNMP bug which prevents HSM discovery from discovering all HW. At this stage of the installation process, a script can be run to detect if this issue is currently affecting the system, and if so, correct it.

Refer to Air cooled hardware is not getting properly discovered with Aruba leaf switches for details.

2.3 Hardware State Manager Discovery Validation

By this point in the installation process, the Hardware State Manager (HSM) should have done its discovery of the system.

The foundational information for this discovery comes from the System Layout Service (SLS). Thus, a comparison needs to be done to verify that what is specified in SLS (focusing on BMC components and Redfish endpoints) is present in HSM.

To perform this comparison execute the verify_hsm_discovery.py script on a Kubernetes master or worker NCN. The result is pass/fail (returns 0 or non-zero):

ncn# /opt/cray/csm/scripts/hms_verification/verify_hsm_discovery.py

The output will ideally appear as follows. If there are mismatches, they will be displayed in the appropriate section of the output. Refer to 2.3.1 Interpreting results and 2.3.2 Known Issues below to troubleshoot any mismatched BMCs.

ncn# /opt/cray/csm/scripts/hms_verification/verify_hsm_discovery.py

HSM Cabinet Summary
===================
x1000 (Mountain)
  Discovered Nodes:          50
  Discovered Node BMCs:      25
  Discovered Router BMCs:    32
  Discovered Chassis BMCs:    8
x3000 (River)
  Discovered Nodes:          23 (12 Mgmt, 7 Application, 4 Compute)
  Discovered Node BMCs:      24
  Discovered Router BMCs:     2
  Discovered Cab PDU Ctlrs:   0

River Cabinet Checks
====================
x3000
  Nodes: PASS
  NodeBMCs: PASS
  RouterBMCs: PASS
  ChassisBMCs: PASS
  CabinetPDUControllers: PASS

Mountain/Hill Cabinet Checks
============================
x1000 (Mountain)
  ChassisBMCs: PASS
  Nodes: PASS
  NodeBMCs: PASS
  RouterBMCs: PASS

The script will have an exit code of 0 if there are no failures. If there is any FAIL information displayed, the script will exit with a non-zero exit code. Failure information interpretation is described in the next section.

2.3.1 Interpreting results

The Cabinet Checks output is divided into three sections:

  • Summary information for each cabinet
  • Detail information for River cabinets
  • Detail information for Mountain/Hill cabinets

In the River section, any hardware found in SLS and not discovered by HSM is considered a failure, with the exception of PDU controllers, which is a warning. Also, the BMC of one of the management NCNs (typically 'ncn-m001') will not be connected to the HSM HW network and thus will show up as being not discovered and/or not having any mgmt network connection. This is treated as a warning.

In the Mountain section, the only items considered failures are Chassis BMCs that are not discovered in HSM. All other items (nodes, node BMCs, and router BMCs) that are not discovered are considered warnings.

Any failures need to be investigated by the admin for rectification. Any warnings should also be examined by the admin to ensure they are accurate and expected.

For each of the BMCs that show up as not being present in HSM components or Redfish Endpoints use the following notes to determine if the issue with the BMC can be safely ignored, or if there is a legitimate issue with the BMC.

  • The node BMC of 'ncn-m001' will not typically be present in HSM component data, as it is typically connected to the site network instead of the HMN network.

  • Chassis Management Controllers (CMC) may show up as not being present in HSM. CMCs for Intel server blades can be ignored. Gigabyte server blade CMCs not found in HSM is not normal and should be investigated. If a Gigabyte CMC is expected to not be connected to the HMN network, then it can be ignored.

    CMCs have xnames in the form of xXc0sSb999, where X is the cabinet and S is the rack U of the compute node chassis.

    Example mismatch for a CMC on an Intel server blade:

...
  ChassisBMCs/CMCs: FAIL
    - x3000c0s10b999 - Not found in HSM Components; Not found in HSM Redfish Endpoints; No mgmt port connection.
...
  • HPE PDUs are not supported at this time and will likely show up as not being found in HSM. They can be ignored.

    Cabinet PDU Controllers have xnames in the form of xXmM, where X is the cabinet and M is the ordinal of the Cabinet PDU Controller.

    Example mismatch for an HPE PDU:

...
  CabinetPDUControllers: WARNING
    - x3000m0 - Not found in HSM Components ; Not found in HSM Redfish Endpoints
...
  • BMCs having no association with a management switch port will be annotated as such, and should be investigated. Exceptions to this are in Mountain or Hill configurations where Mountain BMCs will show this condition on SLS/HSM mismatches, which is normal.

  • In Hill configurations SLS assumes BMCs in chassis 1 and 3 are fully populated (32 Node BMCs), and in Mountain configurations SLS assumes all BMCs are fully populated (128 Node BMCs). Any non-populated BMCs will have no HSM data and will show up in the mismatch list.
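
To investigate a specific mismatched BMC, HSM can be queried for it directly with the Cray CLI (the xname below is a placeholder; substitute the one reported in the mismatch output):

ncn# cray hsm state components describe <xname>
ncn# cray hsm inventory redfishEndpoints describe <xname>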

If it is determined that the mismatch cannot be ignored, then proceed to 2.3.2 Known Issues below to troubleshoot the mismatched BMCs.

2.3.2 Known Issues

Known issues that may prevent hardware from getting discovered by Hardware State Manager:

3. Software Management Services Health Checks

The Software Management Services health checks are run using /usr/local/bin/cmsdev.

  • The tool logs to /opt/cray/tests/cmsdev.log
  • The -q (quiet) and -v (verbose) flags can be used to decrease or increase the amount of information sent to the screen.
    • The same amount of data is written to the log file in either case.
  1. SMS Test Execution
  2. Interpreting cmsdev Results

3.1 SMS Test Execution

The following test can be run on any Kubernetes node (any master or worker node, but not the PIT node).

ncn# /usr/local/bin/cmsdev test -q all

3.2 Interpreting cmsdev Results

If all checks passed:

  • The return code will be 0
  • The final line of output will begin with SUCCESS
  • For example:
    ncn# /usr/local/bin/cmsdev test -q all
    ...
    SUCCESS: All 7 service tests passed: bos, cfs, conman, crus, ims, tftp, vcs
    ncn# echo $?
    0

If one or more checks failed:

  • The return code will be non-zero
  • The final line of output will begin with FAILURE and will list which checks failed
  • For example:
    ncn# /usr/local/bin/cmsdev test -q all
    ...
    FAILURE: 2 service tests FAILED (conman, ims), 5 passed (bos, cfs, crus, tftp, vcs)
    ncn# echo $?
    1

Additional test execution details can be found in /opt/cray/tests/cmsdev.log.

4. Booting CSM Barebones Image

Included with the Cray System Management (CSM) release is a pre-built node image that can be used to validate that core CSM services are available and responding as expected. The CSM barebones image contains only the minimal set of RPMs and configuration required to boot an image and is not suitable for production usage. To run production workloads, it is suggested that an image from the Cray OS (COS) product, or similar, be used.


NOTES

  • The CSM Barebones image included with the release will not successfully complete beyond the dracut stage of the boot process. However, if the dracut stage is reached, the boot can be considered successful and shows that the necessary CSM services needed to boot a node are up and available.
    • This inability to boot the barebones image fully will be resolved in future releases of the CSM product.
  • In addition to the CSM Barebones image, the release also includes an IMS Recipe that can be used to build the CSM Barebones image. However, the CSM Barebones recipe currently requires RPMs that are not installed with the CSM product. The CSM Barebones recipe can be built after the Cray OS (COS) product stream is also installed on to the system.
    • In future releases of the CSM product, work will be undertaken to resolve these dependency issues.
  • This procedure can be followed on any NCN or the PIT node.
  • The Cray CLI must be configured on the node where this procedure is being performed. See Configure the Cray Command Line Interface for details on how to do this.

  1. Locate CSM Barebones Image in IMS
  2. Create a BOS Session Template for the CSM Barebones Image
  3. Find an available compute node
  4. Reboot the node using a BOS session template
  5. Watch Boot on Console

4.1 Locate CSM Barebones Image in IMS

Locate the CSM Barebones image and note the etag and path fields in the output.

ncn# cray ims images list --format json | jq '.[] | select(.name | contains("barebones"))'

Expected output is similar to the following:

{
  "created": "2021-01-14T03:15:55.146962+00:00",
  "id": "293b1e9c-2bc4-4225-b235-147d1d611eef",
  "link": {
    "etag": "6d04c3a4546888ee740d7149eaecea68",
    "path": "s3://boot-images/293b1e9c-2bc4-4225-b235-147d1d611eef/manifest.json",
    "type": "s3"
  },
  "name": "cray-shasta-csm-sles15sp1-barebones.x86_64-shasta-PRODUCT_VERSION"
}
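
Since the etag and path values are needed when creating the BOS session template in the next step, it can be convenient to capture them into shell variables with jq. A sketch (the variable names ETAG and S3PATH are illustrative):

ncn# ETAG=$(cray ims images list --format json | jq -r '[.[] | select(.name | contains("barebones"))][0].link.etag')
ncn# S3PATH=$(cray ims images list --format json | jq -r '[.[] | select(.name | contains("barebones"))][0].link.path')
ncn# echo "${ETAG} ${S3PATH}"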

4.2 Create a BOS Session Template for the CSM Barebones Image

The session template below can be copied and used as the basis for the BOS Session Template. As noted below, make sure the S3 path for the manifest matches the S3 path shown in the Image Management Service (IMS).

  1. Create sessiontemplate.json

    ncn# vi sessiontemplate.json

    The session template should contain the following:

    {
      "boot_sets": {
        "compute": {
          "boot_ordinal": 2,
          "etag": "etag_value_from_cray_ims_command",
          "kernel_parameters": "console=ttyS0,115200 bad_page=panic crashkernel=340M hugepagelist=2m-2g intel_iommu=off intel_pstate=disable iommu=pt ip=dhcp numa_interleave_omit=headless numa_zonelist_order=node oops=panic pageblock_order=14 pcie_ports=native printk.synchronous=y rd.neednet=1 rd.retry=10 rd.shell turbo_boost_limit=999 spire_join_token=${SPIRE_JOIN_TOKEN}",
          "network": "nmn",
          "node_roles_groups": [
            "Compute"
          ],
          "path": "path_value_from_cray_ims_command",
          "rootfs_provider": "cpss3",
          "rootfs_provider_passthrough": "dvs:api-gw-service-nmn.local:300:nmn0",
          "type": "s3"
        }
      },
      "enable_cfs": false,
      "name": "shasta-PRODUCT_VERSION-csm-bare-bones-image"
    }

    NOTE: The rootfs provider shown above references the dvs provider. DVS is not provided as part of the CSM distribution and is not expected to work until the COS product is installed and configured. As noted above, the barebones image is not expected to boot at this time. Work is being done to enable a fully functional and bootable barebones image in a future release of the CSM product. Until that work is complete, the use of the dvs rootfs provider is suggested.

    NOTE: Be sure to replace the values of the etag and path fields with the ones you noted earlier in the cray ims images list command.
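
    If the etag and path were captured into the ETAG and S3PATH shell variables as sketched in section 4.1, one way to substitute them into the file (a sketch; it relies on the placeholder strings shown in the template above) is:

    ncn# sed -i "s|etag_value_from_cray_ims_command|${ETAG}|; s|path_value_from_cray_ims_command|${S3PATH}|" sessiontemplate.json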

  2. Create the BOS session template using the following file as input:

    ncn# cray bos sessiontemplate create --file sessiontemplate.json --name shasta-PRODUCT_VERSION-csm-bare-bones-image
    

    The expected output is:

    /sessionTemplate/shasta-PRODUCT_VERSION-csm-bare-bones-image
    

4.3 Find an available compute node

ncn# cray hsm state components list --role Compute --enabled true

Example output:

[[Components]]
ID = "x3000c0s17b1n0"
Type = "Node"
State = "On"
Flag = "OK"
Enabled = true
Role = "Compute"
NID = 1
NetType = "Sling"
Arch = "X86"
Class = "River"

[[Components]]
ID = "x3000c0s17b2n0"
Type = "Node"
State = "On"
Flag = "OK"
Enabled = true
Role = "Compute"
NID = 2
NetType = "Sling"
Arch = "X86"
Class = "River"

If it is noticed that compute nodes are missing from Hardware State Manager, refer to 2.3.2 Known Issues to troubleshoot any Node BMCs that have not been discovered.

Choose a node from those listed and set XNAME to its ID. In this example, x3000c0s17b2n0:

ncn# export XNAME=x3000c0s17b2n0

4.4 Reboot the node using a BOS session template

Create a BOS session to reboot the chosen node using the BOS session template that was created:

ncn# cray bos session create --template-uuid shasta-PRODUCT_VERSION-csm-bare-bones-image --operation reboot --limit $XNAME

Expected output looks similar to the following:

limit = "x3000c0s17b2n0"
operation = "reboot"
templateUuid = "shasta-PRODUCT_VERSION-csm-bare-bones-image"
[[links]]
href = "/v1/session/8f2fc013-7817-4fe2-8e6f-c2136a5e3bd1"
jobId = "boa-8f2fc013-7817-4fe2-8e6f-c2136a5e3bd1"
rel = "session"
type = "GET"

[[links]]
href = "/v1/session/8f2fc013-7817-4fe2-8e6f-c2136a5e3bd1/status"
rel = "status"
type = "GET"

4.5 Connect to the node's console and watch the boot

See Manage Node Consoles for information on how to connect to the node's console (and for instructions on how to close it later).

The boot may take up to 10 or 15 minutes. The image being booted does not support a complete boot, so the node will not boot fully into an operating system. This test is merely to verify that the CSM services needed to boot a node are available and working properly.

This boot test is considered successful if the boot reaches the dracut stage. You know this has happened if the console output has something similar to the following somewhere within the final 20 lines of its output:

[    7.876909] dracut: FATAL: Don't know how to handle 'root=craycps-s3:s3://boot-images/e3ba09d7-e3c2-4b80-9d86-0ee2c48c2214/rootfs:c77c0097bb6d488a5d1e4a2503969ac0-27:dvs:api-gw-service-nmn.local:300:nmn0'
[    7.898169] dracut: Refusing to continue

NOTE: As long as the preceding text is found near the end of the console output, the test is considered successful. It is normal (and not indicative of a test failure) to see something similar to the following at the very end of the console output:

         Starting Dracut Emergency Shell...
[   11.591948] device-mapper: uevent: version 1.0.3
[   11.596657] device-mapper: ioctl: 4.40.0-ioctl (2019-01-18) initialised: dm-devel@redhat.com
Warning: dracut: FATAL: Don't know how to handle
Press Enter for maintenance
(or press Control-D to continue):

After the node has reached this point, close the console session. The test is complete.

5. UAS / UAI Tests

The procedures below use the CLI as an authorized user and run on two separate node types. The first part runs on the LiveCD node, while the second part runs on a non-LiveCD Kubernetes master or worker node. When using the CLI on either node, the CLI configuration needs to be initialized and the user running the procedure needs to be authorized.

The following procedures run on separate nodes of the system. They are, therefore, separated into separate sub-sections.

  1. Validate Basic UAS Installation
  2. Validate UAI Creation
  3. UAS/UAI Troubleshooting
    1. Authorization Issues
    2. UAS Cannot Access Keycloak
    3. UAI Images not in Registry
    4. Missing Volumes and Other Container Startup Issues

5.1 Validate the Basic UAS Installation

This section can be run on any NCN or the PIT node.

  1. Initialize the Cray CLI on the node where you are running this section. See Configure the Cray Command Line Interface for details on how to do this.

  2. Validate the basic UAS installation using the following commands:

    ncn# cray uas mgr-info list

    Expected output looks similar to the following:

    service_name = "cray-uas-mgr"
    version = "1.11.5"
    

    This example output shows that UAS is installed and running version 1.11.5.

    ncn# cray uas list

    Expected output looks similar to the following:

    results = []
    

    This example output shows that there are no currently running UAIs. It is possible, if someone else has been using the UAS, that there could be UAIs in the list. That is acceptable too from a validation standpoint.

  3. Verify that the pre-made UAI images are registered with UAS

    ncn# cray uas images list

    Expected output looks similar to the following:

    default_image = "dtr.dev.cray.com/cray/cray-uai-sles15sp1:latest"
    image_list = [ "dtr.dev.cray.com/cray/cray-uai-sles15sp1:latest",]
    

    This example output shows that the pre-made end-user UAI image (cray/cray-uai-sles15sp1:latest) is registered with UAS. This does not necessarily mean this image is installed in the container image registry, but it is configured for use. If other UAI images have been created and registered, they may also show up here, which is acceptable.

5.2 Validate UAI Creation

IMPORTANT: If you are upgrading CSM and your site does not use UAIs, skip UAS and UAI validation. If you do use UAIs, there are products that configure UAS like Cray Analytics and Cray Programming Environment. These must be working correctly with UAIs and should be validated and corrected (the procedures for this are beyond the scope of this document) prior to validating UAS and UAI. Failures in UAI creation that result from incorrect or incomplete installation of these products will generally take the form of UAIs stuck in 'waiting' state trying to set up volume mounts. See the UAI Troubleshooting section for more information.

This procedure must be run on a master or worker node (not the PIT node and not ncn-w001) on the system. (It is also possible to do this from an external host, but that procedure is not covered here.)

  1. Initialize the Cray CLI on the node where you are running this section. See Configure the Cray Command Line Interface for details on how to do this.

  2. Verify that a UAI can be created:

    ncn# cray uas create --publickey ~/.ssh/id_rsa.pub

    Expected output looks similar to the following:

    uai_connect_string = "ssh vers@10.16.234.10"
    uai_host = "ncn-w001"
    uai_img = "registry.local/cray/cray-uai-sles15sp1:latest"
    uai_ip = "10.16.234.10"
    uai_msg = ""
    uai_name = "uai-vers-a00fb46b"
    uai_status = "Pending"
    username = "vers"
    
    [uai_portmap]
    

    The UAI has been created and is currently initializing and starting up.

  3. Set UAINAME to the value of the uai_name field in the previous command output (uai-vers-a00fb46b in our example):

    ncn# export UAINAME=uai-vers-a00fb46b
  4. Check the current status of the UAI:

    ncn# cray uas list

    Expected output looks similar to the following:

    [[results]]
    uai_age = "0m"
    uai_connect_string = "ssh vers@10.16.234.10"
    uai_host = "ncn-w001"
    uai_img = "registry.local/cray/cray-uai-sles15sp1:latest"
    uai_ip = "10.16.234.10"
    uai_msg = ""
    uai_name = "uai-vers-a00fb46b"
    uai_status = "Running: Ready"
    username = "vers"
    

    If the uai_status field is Running: Ready, proceed to the next step. Otherwise, wait and repeat this command until that is the case. It normally should not take more than a minute or two.

  5. The UAI is ready for use. Log into it with the command in the uai_connect_string field in the previous command output:

    ncn# ssh vers@10.16.234.10
    vers@uai-vers-a00fb46b-6889b666db-4dfvn:~>
  6. Run a command on the UAI:

    vers@uai-vers-a00fb46b-6889b666db-4dfvn:~> ps -afe

    Expected output looks similar to the following:

    UID          PID    PPID  C STIME TTY          TIME CMD
    root           1       0  0 18:51 ?        00:00:00 /bin/bash /usr/bin/uai-ssh.sh
    munge         36       1  0 18:51 ?        00:00:00 /usr/sbin/munged
    root          54       1  0 18:51 ?        00:00:00 su vers -c /usr/sbin/sshd -e -f /etc/uas/ssh/sshd_config -D
    vers          55      54  0 18:51 ?        00:00:00 /usr/sbin/sshd -e -f /etc/uas/ssh/sshd_config -D
    vers          62      55  0 18:51 ?        00:00:00 sshd: vers [priv]
    vers          67      62  0 18:51 ?        00:00:00 sshd: vers@pts/0
    vers          68      67  0 18:51 pts/0    00:00:00 -bash
    vers         120      68  0 18:52 pts/0    00:00:00 ps -afe
    
  7. Log out from the UAI

    vers@uai-vers-a00fb46b-6889b666db-4dfvn:~> exit
    ncn#
  8. Clean up the UAI.

    ncn# cray uas delete --uai-list $UAINAME

    Expected output looks similar to the following:

    results = [ "Successfully deleted uai-vers-a00fb46b",]
    

If the commands ran with similar results, then the basic functionality of the UAS and UAI is working.

5.3 UAS/UAI Troubleshooting

The following subsections include common failure modes seen with UAS / UAI operations and how to resolve them.

5.3.1 Authorization Issues

An error will be returned when running CLI commands if the user is not logged in as a valid Keycloak user or if the CRAY_CREDENTIALS environment variable is accidentally set; when set, this variable takes precedence regardless of the user credentials being used.

For example:

ncn# cray uas list

The symptom of this problem is output similar to the following:

Usage: cray uas list [OPTIONS]
Try 'cray uas list --help' for help.

Error: Bad Request: Token not valid for UAS. Attributes missing: ['gidNumber', 'loginShell', 'homeDirectory', 'uidNumber', 'name']

Fix this by logging in as a real user (someone with actual Linux credentials) and making sure that CRAY_CREDENTIALS is unset.
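
A minimal sketch of that fix (the username is a placeholder for a real Keycloak user with Linux credentials):

ncn# unset CRAY_CREDENTIALS
ncn# cray auth login --username <username>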

5.3.2 UAS Cannot Access Keycloak

When running CLI commands, a Keycloak error may be returned.

For example:

ncn# cray uas list

The symptom of this problem is output similar to the following:

Usage: cray uas list [OPTIONS]
Try 'cray uas list --help' for help.

Error: Internal Server Error: An error was encountered while accessing Keycloak

To check whether the wrong hostname was used to reach the API gateway, re-run the CLI initialization steps above and try again. There may also be a problem with the Istio service mesh inside the system. Troubleshooting that is beyond the scope of this section, but there may be useful information in the UAS pod logs in Kubernetes. There are generally two UAS pods, so the user may need to look at the logs from both to find the specific failure. The logs tend to contain a very large number of GET events generated by liveness checking.

The following shows an example of looking at UAS logs effectively (this example shows only one UAS manager, normally there would be two):

  1. Determine the pod name of the uas-mgr pod

    ncn# kubectl get po -n services | grep "^cray-uas-mgr" | grep -v etcd

    Expected output looks similar to:

    cray-uas-mgr-6bbd584ccb-zg8vx                                    2/2     Running            0          12d
    
  2. Set PODNAME to the name of the manager pod whose logs are being viewed.

    ncn# export PODNAME=cray-uas-mgr-6bbd584ccb-zg8vx
  3. View the last 25 log entries of the cray-uas-mgr container in that pod, excluding GET events:

    ncn# kubectl logs -n services $PODNAME cray-uas-mgr | grep -v 'GET ' | tail -25

    Example output:

    2021-02-08 15:32:41,211 - uas_mgr - INFO - getting deployment uai-vers-87a0ff6e in namespace user
    2021-02-08 15:32:41,225 - uas_mgr - INFO - creating deployment uai-vers-87a0ff6e in namespace user
    2021-02-08 15:32:41,241 - uas_mgr - INFO - creating the UAI service uai-vers-87a0ff6e-ssh
    2021-02-08 15:32:41,241 - uas_mgr - INFO - getting service uai-vers-87a0ff6e-ssh in namespace user
    2021-02-08 15:32:41,252 - uas_mgr - INFO - creating service uai-vers-87a0ff6e-ssh in namespace user
    2021-02-08 15:32:41,267 - uas_mgr - INFO - getting pod info uai-vers-87a0ff6e
    2021-02-08 15:32:41,360 - uas_mgr - INFO - No start time provided from pod
    2021-02-08 15:32:41,361 - uas_mgr - INFO - getting service info for uai-vers-87a0ff6e-ssh in namespace user
    127.0.0.1 - - [08/Feb/2021 15:32:41] "POST /v1/uas?imagename=registry.local%2Fcray%2Fno-image-registered%3Alatest HTTP/1.1" 200 -
    2021-02-08 15:32:54,455 - uas_auth - INFO - UasAuth lookup complete for user vers
    2021-02-08 15:32:54,455 - uas_mgr - INFO - UAS request for: vers
    2021-02-08 15:32:54,455 - uas_mgr - INFO - listing deployments matching: host None, labels uas=managed,user=vers
    2021-02-08 15:32:54,484 - uas_mgr - INFO - getting pod info uai-vers-87a0ff6e
    2021-02-08 15:32:54,596 - uas_mgr - INFO - getting service info for uai-vers-87a0ff6e-ssh in namespace user
    2021-02-08 15:40:25,053 - uas_auth - INFO - UasAuth lookup complete for user vers
    2021-02-08 15:40:25,054 - uas_mgr - INFO - UAS request for: vers
    2021-02-08 15:40:25,054 - uas_mgr - INFO - listing deployments matching: host None, labels uas=managed,user=vers
    2021-02-08 15:40:25,085 - uas_mgr - INFO - getting pod info uai-vers-87a0ff6e
    2021-02-08 15:40:25,212 - uas_mgr - INFO - getting service info for uai-vers-87a0ff6e-ssh in namespace user
    2021-02-08 15:40:51,210 - uas_auth - INFO - UasAuth lookup complete for user vers
    2021-02-08 15:40:51,210 - uas_mgr - INFO - UAS request for: vers
    2021-02-08 15:40:51,210 - uas_mgr - INFO - listing deployments matching: host None, labels uas=managed,user=vers
    2021-02-08 15:40:51,261 - uas_mgr - INFO - deleting service uai-vers-87a0ff6e-ssh in namespace user
    2021-02-08 15:40:51,291 - uas_mgr - INFO - delete deployment uai-vers-87a0ff6e in namespace user
    127.0.0.1 - - [08/Feb/2021 15:40:51] "DELETE /v1/uas?uai_list=uai-vers-87a0ff6e HTTP/1.1" 200 -
    

5.3.3 UAI Images not in Registry

When listing or describing a UAI, an error in the uai_msg field may be returned. For example:

ncn# cray uas list

There may be something similar to the following output:

[[results]]
uai_age = "0m"
uai_connect_string = "ssh vers@10.103.13.172"
uai_host = "ncn-w001"
uai_img = "registry.local/cray/cray-uai-sles15sp1:latest"
uai_ip = "10.103.13.172"
uai_msg = "ErrImagePull"
uai_name = "uai-vers-87a0ff6e"
uai_status = "Waiting"
username = "vers"

This means the pre-made end-user UAI image is not in the local registry (or whatever registry it is being pulled from; see the uai_img value for details). To correct this, locate and push/import the image to the registry.

5.3.4 Missing Volumes and other Container Startup Issues

Various packages install volumes in the UAS configuration. All of those volumes must also have the underlying resources available, sometimes on the host node where the UAI is running and sometimes from within Kubernetes. If a UAI gets stuck with a ContainerCreating uai_msg field for an extended time, this is a likely cause. UAIs run in the user Kubernetes namespace and are pods that can be examined using kubectl describe.

  1. Locate the pod.

    ncn# kubectl get po -n user | grep <uai-name>
  2. Investigate the problem using the pod name from the previous step.

    ncn# kubectl describe pod -n user <pod-name>

    If volumes are missing, they will show up in the Events: section of the output. Other problems may show up there as well. The names of the missing volumes or other issues should indicate what needs to be fixed to make the UAI run.