The Cray System Management (CSM) operational activities are administrative procedures required to operate an HPE Cray EX system with CSM software installed.
The following administrative topics can be found in this guide:
- CSM Operational Activities
- CSM Product Management
- Image Management
- Boot Orchestration
- System Power Off Procedures
- System Power On Procedures
- Power Management
- Artifact Management
- Compute Rolling Upgrades
- Configuration Management
- Kubernetes
- Package Repository Management
- Security and Authentication
- Resiliency
- ConMan
- Utility Storage
- System Management Health
- System Layout Service (SLS)
- System Configuration Service
- Hardware State Manager (HSM)
- Hardware Management (HM) Collector
- Node Management
- River Endpoint Discovery Service (REDS)
- Network
- Update Firmware with FAS
- User Access Service (UAS)
- Validate CSM Health
- Configure Keycloak Account
- Configure the Cray Command Line Interface (cray CLI)
- Change Passwords and Credentials
- Manage a Configuration with CFS
- Access the LiveCD USB Device After Reboot
- Post-Install Customizations
- Validate Signed RPMs
Build and customize image recipes with the Image Management Service (IMS).
- Image Management
- Image Management Workflows
- Upload and Register an Image Recipe
- Build a New UAN Image Using the Default Recipe
- Build an Image Using IMS REST Service
- Customize an Image Root Using IMS
- Delete or Recover Deleted IMS Content
- Configure IMS to Validate RPMs
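To make the recipe workflow concrete, the sketch below assembles a minimal body for registering a recipe with IMS. The field names follow the IMS recipe schema, but the recipe name and distribution are hypothetical; verify the schema against the IMS API on your system, and note that the actual recipe archive is uploaded to S3 separately.

```python
import json

# Hypothetical recipe-registration body; verify field names against
# the IMS API on your system before use.
recipe = {
    "name": "my-compute-recipe",      # hypothetical recipe name
    "recipe_type": "kiwi-ng",         # IMS builds images with kiwi-ng
    "linux_distribution": "sles15",   # distribution the recipe targets
}

payload = json.dumps(recipe)
```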
Use the Boot Orchestration Service (BOS) to boot, configure, and shut down collections of nodes.
- Boot Orchestration Service (BOS)
- BOS Workflows
- BOS Session Templates
- BOS Sessions
- Manage a BOS Session
- View the Status of a BOS Session
- Limit the Scope of a BOS Session
- Configure the BOS Timeout When Booting Compute Nodes
- Kernel Boot Parameters
- Check the Progress of BOS Session Operations
- Clean Up Logs After a BOA Kubernetes Job
- Clean Up After a BOS/BOA Job is Completed or Cancelled
- Troubleshoot UAN Boot Issues
- Troubleshoot Booting Nodes with Hardware Issues
- BOS Limitations for Gigabyte BMC Hardware
- Stage Changes without BOS
- Compute Node Boot Sequence
- Healthy Compute Node Boot Process
- Node Boot Root Cause Analysis
- Compute Node Boot Issue Symptom: Duplicate Address Warnings and Declined DHCP Offers in Logs
- Compute Node Boot Issue Symptom: Node is Not Able to Download the Required Artifacts
- Compute Node Boot Issue Symptom: Message About Invalid EEPROM Checksum in Node Console or Log
- Compute Node Boot Issue Symptom: Node HSN Interface Does Not Appear or Does Not Show Links Detected
- Compute Node Boot Issue Symptom: Node Console or Logs Indicate that the Server Response has Timed Out
- Tools for Resolving Compute Node Boot Issues
- Troubleshoot Compute Node Boot Issues Related to Unified Extensible Firmware Interface (UEFI)
- Troubleshoot Compute Node Boot Issues Related to Dynamic Host Configuration Protocol (DHCP)
- Troubleshoot Compute Node Boot Issues Related to the Boot Script Service
- Troubleshoot Compute Node Boot Issues Related to Trivial File Transfer Protocol (TFTP)
- Troubleshoot Compute Node Boot Issues Using Kubernetes
- Log File Locations and Ports Used in Compute Node Boot Troubleshooting
- Troubleshoot Compute Node Boot Issues Related to Slow Boot Times
- Edit the iPXE Embedded Boot Script
- Redeploy the iPXE and TFTP Services
- Upload Node Boot Information to Boot Script Service (BSS)
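As a sketch of what a BOS session template contains, the Python below builds a minimal template body. The template name, CFS configuration name, role group, and kernel parameters are placeholders; check the BOS session template schema on your system before creating a real template.

```python
import json

# Minimal BOS session template sketch; all values are placeholders.
session_template = {
    "name": "compute-template",
    "enable_cfs": True,                           # run CFS configuration after boot
    "cfs": {"configuration": "compute-config"},   # hypothetical CFS configuration name
    "boot_sets": {
        "compute": {
            "node_roles_groups": ["Compute"],     # target nodes by HSM role
            "kernel_parameters": "console=ttyS0,115200",
        }
    },
}

body = json.dumps(session_template, indent=2)
```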
Procedures required for a full power off of an HPE Cray EX system.
- System Power Off Procedures
- Prepare the System for Power Off
- Shut Down and Power Off Compute and User Access Nodes
- Save Management Network Switch Configuration Settings
- Power Off Compute and IO Cabinets
- Shut Down and Power Off the Management Kubernetes Cluster
- Power Off the External Lustre File System
Procedures required for a full power on of an HPE Cray EX system.
- System Power On Procedures
- Power On and Start the Management Kubernetes Cluster
- Power On the External Lustre File System
- Power On Compute and IO Cabinets
- Bring Up the Slingshot Fabric
- Power On and Boot Compute and User Access Nodes
- Recover from a Liquid Cooled Cabinet EPO Event
HPE Cray System Management (CSM) software manages and controls power out-of-band through Redfish APIs.
- Power Management
- Cray Advanced Platform Monitoring and Control (CAPMC)
- Liquid Cooled Node Power Management
- Standard Rack Node Power Management
- Ignore Nodes with CAPMC
- Set the Turbo Boost Limit
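As an illustration of out-of-band power control, a CAPMC power-off request is a small JSON body naming the target components by xname. The xname below is hypothetical; verify the request schema against the CAPMC API on your system.

```python
import json

# Sketch of a request body for a CAPMC power-off call; the xname is hypothetical.
power_off = {
    "xnames": ["x1000c0s0b0n0"],       # components to power off
    "force": False,                    # graceful shutdown rather than immediate power off
    "reason": "scheduled maintenance",
}

body = json.dumps(power_off)
```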
Use the Ceph Object Gateway Simple Storage Service (S3) API to manage artifacts on the system.
- Artifact Management
- Manage Artifacts with the Cray CLI
- Use S3 Libraries and Clients
- Generate Temporary S3 Credentials
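Because the artifact repository is S3-compatible, any standard S3 client can be pointed at it once the endpoint and credentials are known. The sketch below collects those pieces into a configuration dictionary; the endpoint hostname, bucket, and credential values are all placeholders obtained from the temporary-credentials procedure. A client library such as boto3 could then be constructed from this dictionary.

```python
# Placeholder S3 client settings for the system's Ceph RGW endpoint;
# substitute the endpoint and credentials from your own system.
s3_config = {
    "endpoint_url": "https://rgw-vip.local",   # hypothetical RGW endpoint hostname
    "aws_access_key_id": "TEMP-ACCESS-KEY",    # from the temporary-credentials procedure
    "aws_secret_access_key": "TEMP-SECRET",
}

# Hypothetical artifact location within the repository.
artifact = {"bucket": "boot-images", "key": "my-image/rootfs"}
```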
Upgrade sets of compute nodes with the Compute Rolling Upgrade Service (CRUS) without requiring an entire set of nodes to be out of service at once. CRUS enables administrators to limit the impact on production caused by upgrading compute nodes by working through one step of the upgrade process at a time.
- Compute Rolling Upgrade Service (CRUS)
- CRUS Workflow
- Upgrade Compute Nodes with CRUS
- Troubleshoot Nodes Failing to Upgrade in a CRUS Session
- Troubleshoot a Failed CRUS Session Because of Unmet Conditions
- Troubleshoot a Failed CRUS Session Because of Bad Parameters
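The stepwise workflow above is driven by a CRUS session, sketched below as a Python dictionary. The label names, template ID, and step size are hypothetical; confirm the session fields against the CRUS API on your system.

```python
# Sketch of a CRUS session body; label names and template ID are hypothetical.
crus_session = {
    "upgrade_template_id": "compute-template",  # BOS session template for upgraded nodes
    "workload_manager_type": "slurm",           # CRUS drains nodes through the WLM
    "starting_label": "slurm-nodes",            # WLM group holding nodes awaiting upgrade
    "upgrading_label": "upgrading-nodes",       # nodes currently being upgraded
    "failed_label": "failed-nodes",             # nodes that failed the upgrade
    "upgrade_step_size": 50,                    # how many nodes to upgrade at a time
}
```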
The Configuration Framework Service (CFS) is available on systems for remote execution and configuration management of nodes and boot images.
- Configuration Management
- Configuration Layers
- Ansible Inventory
- Configuration Sessions
- Create a CFS Session with Dynamic Inventory
- Create an Image Customization CFS Session
- Set Limits for a Configuration Session
- Use a Specific Inventory for a Configuration Session
- Change the Ansible Verbosity Logs
- Set the ansible.cfg for a Session
- Delete CFS Sessions
- Automatic Session Deletion with sessionTTL
- Track the Status of a Session
- View Configuration Session Logs
- Troubleshoot Ansible Play Failures in CFS Sessions
- Troubleshoot CFS Session Failing to Complete
- Configuration Management with the CFS Batcher
- Configuration Management of System Components
- Ansible Execution Environments
- CFS Global Options
- Version Control Service (VCS)
- Write Ansible Code for CFS
  - Target Ansible Tasks for Image Customization
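To illustrate the dynamic-inventory workflow, the sketch below builds a CFS session-creation body in Python. The session and configuration names are hypothetical, and the field names shown are illustrative; verify them against the CFS API version on your system.

```python
# Sketch of a CFS session-creation body using dynamic inventory;
# names and field spellings are illustrative, not authoritative.
cfs_session = {
    "name": "configure-computes-1",
    "configurationName": "compute-config",   # hypothetical CFS configuration
    "target": {"definition": "dynamic"},     # CFS builds the inventory from HSM state
}
```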
The system management components are broken down into a series of microservices. Each service is independently deployable, fine-grained, and uses lightweight protocols. As a result, the system's microservices are modular and resilient, and they can be updated independently. Services within the Kubernetes architecture communicate via REST APIs.
- Kubernetes Architecture
- About kubectl
- About Kubernetes Taints and Labels
- Kubernetes Storage
- Kubernetes Networking
- Retrieve Cluster Health Information Using Kubernetes
- Pod Resource Limits
- About etcd
- Check the Health and Balance of etcd Clusters
- Rebuild Unhealthy etcd Clusters
- Backups for etcd-operator Clusters
- Create a Manual Backup of a Healthy etcd Cluster
- Restore an etcd Cluster from a Backup
- Repopulate Data in etcd Clusters When Rebuilding Them
- Restore Bare-Metal etcd Clusters from an S3 Snapshot
- Rebalance Healthy etcd Clusters
- Check for and Clear etcd Cluster Alarms
- Report the Endpoint Status for etcd Clusters
- Clear Space in an etcd Cluster Database
- About Postgres
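Because the services communicate over REST, a management API call is just an HTTPS request through the API gateway with a bearer token. The sketch below builds such a request with the Python standard library; the gateway hostname is the common CSM default, the token is a placeholder, and the path shown targets the Hardware State Manager as one example.

```python
from urllib.request import Request

# Build (but do not send) a REST request to a service behind the API gateway.
# The hostname is the CSM default; the token is a placeholder, not a real credential.
def csm_api_request(token: str, path: str) -> Request:
    return Request(
        f"https://api-gw-service-nmn.local/apis{path}",
        headers={"Authorization": f"Bearer {token}"},
    )

req = csm_api_request("PLACEHOLDER-TOKEN", "/smd/hsm/v2/State/Components")
```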
Repositories are added to systems to extend the system functionality beyond what is initially delivered. The Sonatype Nexus Repository Manager is the primary method for repository management. Nexus hosts the Yum, Docker, raw, and Helm repositories for software and firmware content.
- Package Repository Management
- Package Repository Management with Nexus
- Manage Repositories with Nexus
- Nexus Configuration
- Nexus Deployment
- Restrict Admin Privileges in Nexus
- Repair Yum Repository Metadata
Mechanisms used by the system to ensure the security and authentication of internal and external requests.
- System Security and Authentication
- Manage System Passwords
  - Update NCN Passwords
  - Change Root Passwords for Compute Nodes
  - Change NCN Image Root Password and SSH Keys
- SSH Keys
- Authenticate an Account with the Command Line
- Default Keycloak Realms, Accounts, and Clients
- Certificate Types
- Change the Keycloak Admin Password
- Create a Service Account in Keycloak
- Retrieve the Client Secret for Service Accounts
- Get a Long-Lived Token for a Service Account
- Access the Keycloak User Management UI
- Create Internal User Accounts in the Keycloak Shasta Realm
- Delete Internal User Accounts in the Keycloak Shasta Realm
- Create Internal User Groups in the Keycloak Shasta Realm
- Remove Internal Groups from the Keycloak Shasta Realm
- Remove the Email Mapper from the LDAP User Federation
- Re-Sync Keycloak Users to Compute Nodes
- Keycloak Operations
- Configure Keycloak for LDAP/AD authentication
- Configure the RSA Plugin in Keycloak
- Preserve Username Capitalization for Users Exported from Keycloak
- Change the LDAP Server IP Address for Existing LDAP Server Content
- Change the LDAP Server IP Address for New LDAP Server Content
- Remove the LDAP User Federation from Keycloak
- Add LDAP User Federation
- Public Key Infrastructure (PKI)
- Troubleshoot SPIRE Failing to Start on NCNs
- API Authorization
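As a sketch of command-line authentication, the function below assembles the form body for Keycloak's password grant. The realm name and client_id follow common CSM defaults, and the gateway hostname and credentials are placeholders; confirm all of them on your system before use.

```python
from urllib.parse import urlencode

# Build the URL and form body for Keycloak's password grant.
# Realm and client_id follow CSM defaults; values here are placeholders.
def keycloak_token_request(api_gw: str, username: str, password: str):
    url = f"{api_gw}/keycloak/realms/shasta/protocol/openid-connect/token"
    form = urlencode({
        "grant_type": "password",
        "client_id": "shasta",
        "username": username,
        "password": password,
    })
    return url, form

url, form = keycloak_token_request(
    "https://api-gw-service-nmn.local", "user", "secret"
)
```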
HPE Cray EX systems are designed so that system management services (SMS) are fully resilient and that there is no single point of failure.
- Resiliency
- Resilience of System Management Services
- Restore System Functionality if a Kubernetes Worker Node is Down
- Recreate StatefulSet Pods on Another Node
- NTP Resiliency
ConMan is a tool used for connecting to remote consoles and collecting console logs. These node logs can then be used for various administrative purposes, such as troubleshooting node boot issues.
- Access Compute Node Logs
- Access Console Log Data Via the System Monitoring Framework (SMF)
- Manage Node Consoles
- Log in to a Node Using ConMan
- Establish a Serial Connection to NCNs
- Disable ConMan After System Software Installation
- Troubleshoot ConMan Blocking Access to a Node BMC
- Troubleshoot ConMan Failing to Connect to a Console
- Troubleshoot ConMan Asking for Password on SSH Connection
Ceph is the utility storage platform that is used to enable pods to store persistent data. It is deployed to provide block, object, and file storage to the management services running on Kubernetes, as well as for telemetry data coming from the compute nodes.
- Utility Storage
- Collect Information about the Ceph Cluster
- Manage Ceph Services
- Adjust Ceph Pool Quotas
- Add Ceph OSDs
- Shrink Ceph OSDs
- Ceph Health States
- Dump Ceph Crash Data
- Identify Ceph Latency Issues
- Cephadm Reference Material
- Restore Nexus Data After Data Corruption
- Troubleshoot Failure to Get Ceph Health
- Troubleshoot a Down OSD
- Troubleshoot Ceph OSDs Reporting Full
- Troubleshoot System Clock Skew
- Troubleshoot an Unresponsive S3 Endpoint
- Troubleshoot Ceph-Mon Processes Stopping and Exceeding Max Restarts
- Troubleshoot Pods Failing to Restart on Other Worker Nodes
- Troubleshoot Large Object Map Objects in Ceph Health
- Troubleshoot Failure of RGW Health Check
Enable system administrators to assess the health of their system. Operators need to quickly and efficiently troubleshoot system issues as they occur and be confident that a lack of issues indicates the system is operating normally.
- System Management Health
- System Management Health Checks and Alerts
- Access System Management Health Services
- Configure Prometheus Email Alert Notifications
The System Layout Service (SLS) holds information about the system design, such as the physical locations of network hardware, compute nodes, and cabinets. It also stores information about the network, such as which port on which switch should be connected to each compute node.
- System Layout Service (SLS)
- Dump SLS Information
- Load SLS Database with Dump File
- Add UAN CAN IP Addresses to SLS
- Create a Backup of the SLS Postgres Database
- Restore SLS Postgres Database from Backup
- Restore SLS Postgres without an Existing Backup
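An SLS dump is a JSON document whose `Hardware` section maps xnames to their descriptions. The abbreviated, hypothetical dump below shows how that structure can be filtered, here picking out nodes with the Compute role; real dumps carry many more fields per entry.

```python
# Abbreviated, hypothetical SLS dump; real dumps have many more entries and fields.
sls_dump = {
    "Hardware": {
        "x3000c0s19b0n0": {"Type": "comptype_node",
                           "ExtraProperties": {"Role": "Compute"}},
        "x3000c0s1b0n0": {"Type": "comptype_node",
                          "ExtraProperties": {"Role": "Management"}},
    }
}

# Select xnames whose role is Compute.
computes = sorted(
    xname for xname, entry in sls_dump["Hardware"].items()
    if entry.get("ExtraProperties", {}).get("Role") == "Compute"
)
```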
The System Configuration Service (SCSD) allows administrators to set various BMC and controller parameters. These parameters are typically set during discovery, but this tool enables parameters to be set before or after discovery. The operations to change these parameters are available in the Cray CLI under the scsd command.
- System Configuration Service
- Configure BMC and Controller Parameters with SCSD
- Manage Parameters with the scsd Service
- Set BMC Credentials
Use the Hardware State Manager (HSM) to monitor and interrogate hardware components in the HPE Cray EX system, tracking hardware state and inventory information, and making it available via REST queries and message bus events when changes occur.
- Hardware State Manager (HSM)
- Hardware Management Services (HMS) Locking API
- Component Groups and Partitions
- Hardware State Manager (HSM) State and Flag Fields
- HSM Roles and Subroles
- Add an NCN to the HSM Database
- Add a Switch to the HSM Database
- Create a Backup of the HSM Postgres Database
- Restore HSM Postgres from a Backup
- Restore HSM Postgres without a Backup
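HSM answers REST queries with lists of components, each carrying fields such as ID, Type, State, and Role. The hypothetical response fragment below shows a typical check: finding nodes that are not in the Ready state.

```python
# Hypothetical fragment of an HSM component-state response.
components = [
    {"ID": "x3000c0s19b0n0", "Type": "Node", "State": "Ready", "Role": "Compute"},
    {"ID": "x3000c0s21b0n0", "Type": "Node", "State": "Off",   "Role": "Compute"},
]

# Collect the IDs of components that are not Ready.
not_ready = [c["ID"] for c in components if c["State"] != "Ready"]
```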
The Hardware Management (HM) Collector is used to collect telemetry and Redfish events from hardware in the system.
Monitor and manage compute nodes (CNs) and non-compute nodes (NCNs) used in the HPE Cray EX system.
- Node Management
- Node Management Workflows
- Rebuild NCNs
- Reboot NCNs
- Enable Nodes
- Disable Nodes
- Find Node Type and Manufacturer
- Add a Standard Rack Node
- Clear Space in Root File System on Worker Nodes
- Manually Wipe Boot Configuration on Nodes to be Reinstalled
- Troubleshoot Issues with Redfish Endpoint Discovery
- Check for Redfish Events from Nodes
- Reset Credentials on Redfish Devices
- Access and Update Settings for Replacement NCNs
- Change Settings for HMS Collector Polling of Air Cooled Nodes
- Use the Physical KVM
- Launch a Virtual KVM on Gigabyte Servers
- Launch a Virtual KVM on Intel Servers
- Change Java Security Settings
- Verify Accuracy of the System Clock
- Configuration of NCN Bonding
- Troubleshoot Loss of Console Connections and Logs on Gigabyte Nodes
- Check the BMC Failover Mode
- Update Compute Node Mellanox HSN NIC Firmware
- TLS Certificates for Redfish BMCs
- Dump a Non-Compute Node
- Enable Passwordless Connections to Liquid Cooled Node BMCs
- Configure NTP on NCNs
The River Endpoint Discovery Service (REDS) performs geolocation and initialization of compute nodes, based on a mapping file that is provided with each system.
- Configure a Management Switch for REDS
- Initialize and Geolocate Nodes
- Verify Node Removal
- Troubleshoot Common REDS Issues
Overview of the several different networks supported by the HPE Cray EX system.
- Network
- Access to System Management Services
- Default IP Address Ranges
- Connect to the HPE Cray EX Environment
HPE Cray EX systems can have network switches in many roles: spine switches, leaf switches, aggregation switches, and CDU switches. Newer systems have HPE Aruba switches, while older systems have Dell and Mellanox switches. Switch IP addresses are generated by Cray Site Init (CSI).
- Management Network Switch Rename
- Management Network ACL Configuration
- Management Network CAN Setup
- Management Network Flow Control Settings
- Management Network Access Port Configuration
- Update Management Network Firmware
The Customer Access Network (CAN) provides access from outside the customer network to services, NCNs, and User Access Nodes (UANs) in the system.
- Customer Access Network (CAN)
- Required Labels if CAN is Not Configured
- Externally Exposed Services
- Connect to the CAN
- CAN with Dual-Spine Configuration
- Troubleshoot CAN Issues
The DHCP service on the HPE Cray EX system uses the Internet Systems Consortium (ISC) Kea tool. Kea provides more robust management capabilities for DHCP servers.
The central DNS infrastructure provides the structural networking hierarchy and datastore for the system.
- DNS
- Manage the DNS Unbound Resolver
- Enable ncsd on UANs
- Troubleshoot Common DNS Issues
- Troubleshoot DNS Configuration Issues
External DNS, along with the Customer Access Network (CAN), Border Gateway Protocol (BGP), and MetalLB, makes it simpler to access the HPE Cray EX API and system management services. Services are accessible directly from a laptop without needing to tunnel into a non-compute node (NCN) or override /etc/hosts settings.
- External DNS
- External DNS csi config init Input Values
- Update the system-name.site-domain Value Post-Installation
- Update the can-external-dns Value Post-Installation
- Ingress Routing
- Add NCNs and UANs to External DNS
- External DNS Failing to Discover Services Workaround
- Troubleshoot Connectivity to Services with External IP addresses
- Troubleshoot DNS Configuration Issues
MetalLB is a component in Kubernetes that manages access to LoadBalancer services from outside the Kubernetes cluster. There are LoadBalancer services on the Node Management Network (NMN), Hardware Management Network (HMN), and Customer Access Network (CAN).
MetalLB can run in either Layer2-mode or BGP-mode for each address pool it manages. BGP-mode is used for the NMN, HMN, and CAN. This enables true load balancing (Layer2-mode does failover, not load balancing) and allows for a more robust layer 3 configuration for these networks.
- MetalLB in BGP-Mode
- MetalLB in BGP-Mode Configuration
- Check BGP Status and Reset Sessions
- Update BGP Neighbors
- Troubleshoot Services without an Allocated IP Address
- Troubleshoot BGP not Accepting Routes from MetalLB
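To make the BGP-mode setup concrete, the sketch below expresses a MetalLB configuration as a Python dictionary: a BGP peer plus an address pool served in BGP-mode. The peer address, ASNs, and pool range are hypothetical and must match the site's actual switch configuration.

```python
# Sketch of a MetalLB configuration in BGP-mode; the peer address, ASNs,
# and address pool are hypothetical site-specific values.
metallb_config = {
    "peers": [
        {"peer-address": "10.252.0.1",   # hypothetical spine switch address
         "peer-asn": 65533,              # switch-side ASN (placeholder)
         "my-asn": 65533},               # MetalLB-side ASN (placeholder)
    ],
    "address-pools": [
        {"name": "customer-access",
         "protocol": "bgp",              # BGP-mode: true load balancing, not failover
         "addresses": ["10.102.5.0/25"]},  # hypothetical CAN pool
    ],
}
```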
The Firmware Action Service (FAS) provides an interface for managing firmware versions of Redfish-enabled hardware in the system. FAS interacts with the Hardware State Manager (HSM), device data, and image data in order to update firmware.
See Update Firmware with FAS for a list of components that are upgradable with FAS. Refer to the HPC Firmware Pack (HFP) product stream to update firmware on other components.
- Update Firmware with FAS
- FAS CLI
- FAS Filters
- FAS Recipes
- FAS Admin Procedures
- FAS Use Cases
- Upload Olympus BMC Recovery Firmware into TFTP Server
- Install HPC Firmware Pack (HFP)
- Install HPC Firmware Pack from PIT or LiveCD
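As an illustration of how FAS combines HSM state with device and image data, the sketch below builds a dry-run action request: the filters narrow the selection to a class of devices, and the command block controls what runs. All values are illustrative; verify the action schema against the FAS API on your system.

```python
# Sketch of a FAS action request; filter and command values are illustrative.
fas_action = {
    "stateComponentFilter": {"deviceTypes": ["nodeBMC"]},  # limit by HSM device type
    "inventoryHardwareFilter": {"manufacturer": "cray"},   # limit by manufacturer
    "targetFilter": {"targets": ["BMC"]},                  # firmware target on the device
    "command": {
        "version": "latest",
        "overrideDryrun": False,   # leave False to preview without flashing firmware
        "description": "dry run of BMC firmware update",
    },
}
```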
The User Access Service (UAS) is a containerized service managed by Kubernetes that enables application developers to create and run user applications. Users launch a User Access Instance (UAI) using the cray command. Users can also transfer data between the Cray system and external systems using the UAI.
- User Access Service (UAS)
- End-User UAIs
- Special Purpose UAIs
- Elements of a UAI
- UAI Host Nodes
- UAI macvlans Network Attachments
- UAI Host Node Selection
- UAI Network Attachments
- Configure UAIs in UAS
- UAI Management
- Legacy Mode User-Driven UAI Management
- Broker Mode UAI Management
- UAI Images
- Troubleshoot UAS Issues
- Troubleshoot UAS by Viewing Log Output
- Troubleshoot UAIs by Viewing Log Output
- Troubleshoot Stale Brokered UAIs
- Troubleshoot UAI Stuck in "ContainerCreating"
- Troubleshoot Duplicate Mount Paths in a UAI
- Troubleshoot Missing or Incorrect UAI Images
- Troubleshoot UAIs with Administrative Access
- Troubleshoot Common Mistakes when Creating a Custom End-User UAI Image