The Cray System Management (CSM) operational activities are administrative procedures required to operate an HPE Cray EX system with CSM software installed.
The following administrative topics can be found in this guide:
- CSM Operational Activities
- CSM Product Management
- Image Management
- Boot Orchestration
- System Power Off Procedures
- System Power On Procedures
- Power Management
- Artifact Management
- Compute Rolling Upgrades
- Configuration Management
- Kubernetes
- Package Repository Management
- Security and Authentication
- Resiliency
- ConMan
- Utility Storage
- System Management Health
- System Layout Service (SLS)
- System Configuration Service
- Hardware State Manager (HSM)
- Hardware Management (HM) Collector
- Node Management
- River Endpoint Discovery Service (REDS)
- Network
- Update Firmware with FAS
- User Access Service (UAS)
- Validate CSM Health
- Configure Keycloak Account
- Configure the Cray Command Line Interface (cray CLI)
- Change Passwords and Credentials
- Manage a Configuration with CFS
- Access the LiveCD USB Device After Reboot
- Post-Install Customizations
- Validate Signed RPMs
Build and customize image recipes with the Image Management Service (IMS).
- Image Management
- Image Management Workflows
- Upload and Register an Image Recipe
- Build a New UAN Image Using the Default Recipe
- Build an Image Using IMS REST Service
- Customize an Image Root Using IMS
- Delete or Recover Deleted IMS Content
- Configure IMS to Validate RPMs
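To make the recipe workflow concrete, the sketch below assembles a minimal body for registering a recipe with IMS. The field names follow the IMS recipe schema, but the recipe name and distribution are hypothetical; verify the schema against the IMS API on your system, and note that the actual recipe archive is uploaded to S3 separately.

```python
import json

# Hypothetical recipe-registration body; verify field names against
# the IMS API on your system before use.
recipe = {
    "name": "my-compute-recipe",      # hypothetical recipe name
    "recipe_type": "kiwi-ng",         # IMS builds images with kiwi-ng
    "linux_distribution": "sles15",   # distribution the recipe targets
}

payload = json.dumps(recipe)
```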
Use the Boot Orchestration Service (BOS) to boot, configure, and shut down collections of nodes.
- Boot Orchestration Service (BOS)
- BOS Workflows
- BOS Session Templates
- BOS Sessions
- Manage a BOS Session
- View the Status of a BOS Session
- Limit the Scope of a BOS Session
- Configure the BOS Timeout When Booting Compute Nodes
- Kernel Boot Parameters
- Check the Progress of BOS Session Operations
- Clean Up Logs After a BOA Kubernetes Job
- Clean Up After a BOS/BOA Job is Completed or Cancelled
- Troubleshoot UAN Boot Issues
- Troubleshoot Booting Nodes with Hardware Issues
- BOS Limitations for Gigabyte BMC Hardware
- Stage Changes without BOS
- Compute Node Boot Sequence
- Healthy Compute Node Boot Process
- Node Boot Root Cause Analysis
- Compute Node Boot Issue Symptom: Duplicate Address Warnings and Declined DHCP Offers in Logs
- Compute Node Boot Issue Symptom: Node is Not Able to Download the Required Artifacts
- Compute Node Boot Issue Symptom: Message About Invalid EEPROM Checksum in Node Console or Log
- Compute Node Boot Issue Symptom: Node HSN Interface Does Not Appear or Does Not Show Links Detected
- Compute Node Boot Issue Symptom: Node Console or Logs Indicate that the Server Response has Timed Out
- Tools for Resolving Compute Node Boot Issues
- Troubleshoot Compute Node Boot Issues Related to Unified Extensible Firmware Interface (UEFI)
- Troubleshoot Compute Node Boot Issues Related to Dynamic Host Configuration Protocol (DHCP)
- Troubleshoot Compute Node Boot Issues Related to the Boot Script Service
- Troubleshoot Compute Node Boot Issues Related to Trivial File Transfer Protocol (TFTP)
- Troubleshoot Compute Node Boot Issues Using Kubernetes
- Log File Locations and Ports Used in Compute Node Boot Troubleshooting
- Troubleshoot Compute Node Boot Issues Related to Slow Boot Times
- Edit the iPXE Embedded Boot Script
- Redeploy the iPXE and TFTP Services
- Upload Node Boot Information to Boot Script Service (BSS)
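As a sketch of what a BOS session template contains, the Python below builds a minimal template body. The template name, CFS configuration name, role group, and kernel parameters are placeholders; check the BOS session template schema on your system before creating a real template.

```python
import json

# Minimal BOS session template sketch; all values are placeholders.
session_template = {
    "name": "compute-template",
    "enable_cfs": True,                           # run CFS configuration after boot
    "cfs": {"configuration": "compute-config"},   # hypothetical CFS configuration name
    "boot_sets": {
        "compute": {
            "node_roles_groups": ["Compute"],     # target nodes by HSM role
            "kernel_parameters": "console=ttyS0,115200",
        }
    },
}

body = json.dumps(session_template, indent=2)
```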
Procedures required for a full power off of an HPE Cray EX system.
- System Power Off Procedures
- Prepare the System for Power Off
- Shut Down and Power Off Compute and User Access Nodes
- Save Management Network Switch Configuration Settings
- Power Off Compute and IO Cabinets
- Shut Down and Power Off the Management Kubernetes Cluster
- Power Off the External Lustre File System
Procedures required for a full power on of an HPE Cray EX system.
- System Power On Procedures
- Power On and Start the Management Kubernetes Cluster
- Power On the External Lustre File System
- Power On Compute and IO Cabinets
- Bring Up the Slingshot Fabric
- Power On and Boot Compute and User Access Nodes
- Recover from a Liquid Cooled Cabinet EPO Event
HPE Cray System Management (CSM) software manages and controls power out-of-band through Redfish APIs.
- Power Management
- Cray Advanced Platform Monitoring and Control (CAPMC)
- Liquid Cooled Node Power Management
- Standard Rack Node Power Management
- Ignore Nodes with CAPMC
- Set the Turbo Boost Limit
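As an illustration of out-of-band power control, a CAPMC power-off request is a small JSON body naming the target components by xname. The xname below is hypothetical; verify the request schema against the CAPMC API on your system.

```python
import json

# Sketch of a request body for a CAPMC power-off call; the xname is hypothetical.
power_off = {
    "xnames": ["x1000c0s0b0n0"],       # components to power off
    "force": False,                    # graceful shutdown rather than immediate power off
    "reason": "scheduled maintenance",
}

body = json.dumps(power_off)
```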
Use the Ceph Object Gateway Simple Storage Service (S3) API to manage artifacts on the system.
- Artifact Management
- Manage Artifacts with the Cray CLI
- Use S3 Libraries and Clients
- Generate Temporary S3 Credentials
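Because the artifact repository is S3-compatible, any standard S3 client can be pointed at it once the endpoint and credentials are known. The sketch below collects those pieces into a configuration dictionary; the endpoint hostname, bucket, and credential values are all placeholders obtained from the temporary-credentials procedure. A client library such as boto3 could then be constructed from this dictionary.

```python
# Placeholder S3 client settings for the system's Ceph RGW endpoint;
# substitute the endpoint and credentials from your own system.
s3_config = {
    "endpoint_url": "https://rgw-vip.local",   # hypothetical RGW endpoint hostname
    "aws_access_key_id": "TEMP-ACCESS-KEY",    # from the temporary-credentials procedure
    "aws_secret_access_key": "TEMP-SECRET",
}

# Hypothetical artifact location within the repository.
artifact = {"bucket": "boot-images", "key": "my-image/rootfs"}
```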
Upgrade sets of compute nodes with the Compute Rolling Upgrade Service (CRUS) without requiring an entire set of nodes to be out of service at once. CRUS enables administrators to limit the impact on production caused by upgrading compute nodes by working through one step of the upgrade process at a time.
- Compute Rolling Upgrade Service (CRUS)
- CRUS Workflow
- Upgrade Compute Nodes with CRUS
- Troubleshoot Nodes Failing to Upgrade in a CRUS Session
- Troubleshoot a Failed CRUS Session Because of Unmet Conditions
- Troubleshoot a Failed CRUS Session Because of Bad Parameters
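The stepwise workflow above is driven by a CRUS session, sketched below as a Python dictionary. The label names, template ID, and step size are hypothetical; confirm the session fields against the CRUS API on your system.

```python
# Sketch of a CRUS session body; label names and template ID are hypothetical.
crus_session = {
    "upgrade_template_id": "compute-template",  # BOS session template for upgraded nodes
    "workload_manager_type": "slurm",           # CRUS drains nodes through the WLM
    "starting_label": "slurm-nodes",            # WLM group holding nodes awaiting upgrade
    "upgrading_label": "upgrading-nodes",       # nodes currently being upgraded
    "failed_label": "failed-nodes",             # nodes that failed the upgrade
    "upgrade_step_size": 50,                    # how many nodes to upgrade at a time
}
```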
The Configuration Framework Service (CFS) is available on systems for remote execution and configuration management of nodes and boot images.
- Configuration Management
- Configuration Layers
- Ansible Inventory
- Configuration Sessions
- Create a CFS Session with Dynamic Inventory
- Create an Image Customization CFS Session
- Set Limits for a Configuration Session
- Use a Specific Inventory for a Configuration Session
- Change the Ansible Verbosity Logs
- Set the ansible.cfg for a Session
- Delete CFS Sessions
- Automatic Session Deletion with sessionTTL
- Track the Status of a Session
- View Configuration Session Logs
- Troubleshoot Ansible Play Failures in CFS Sessions
- Troubleshoot CFS Session Failing to Complete
- Configuration Management with the CFS Batcher
- Configuration Management of System Components
- Ansible Execution Environments
- CFS Global Options
- Version Control Service (VCS)
- Write Ansible Code for CFS
  - Target Ansible Tasks for Image Customization
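To illustrate the dynamic-inventory workflow, the sketch below builds a CFS session-creation body in Python. The session and configuration names are hypothetical, and the field names shown are illustrative; verify them against the CFS API version on your system.

```python
# Sketch of a CFS session-creation body using dynamic inventory;
# names and field spellings are illustrative, not authoritative.
cfs_session = {
    "name": "configure-computes-1",
    "configurationName": "compute-config",   # hypothetical CFS configuration
    "target": {"definition": "dynamic"},     # CFS builds the inventory from HSM state
}
```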
The system management components are broken down into a series of microservices. Each service is independently deployable, fine-grained, and uses lightweight protocols. As a result, the system's microservices are modular and resilient, and they can be updated independently. Services within the Kubernetes architecture communicate via REST APIs.
- Kubernetes Architecture
- About kubectl
- About Kubernetes Taints and Labels
- Kubernetes Storage
- Kubernetes Networking
- Retrieve Cluster Health Information Using Kubernetes
- Pod Resource Limits
- About etcd
- Check the Health and Balance of etcd Clusters
- Rebuild Unhealthy etcd Clusters
- Backups for etcd-operator Clusters
- Create a Manual Backup of a Healthy etcd Cluster
- Restore an etcd Cluster from a Backup
- Repopulate Data in etcd Clusters When Rebuilding Them
- Restore Bare-Metal etcd Clusters from an S3 Snapshot
- Rebalance Healthy etcd Clusters
- Check for and Clear etcd Cluster Alarms
- Report the Endpoint Status for etcd Clusters
- Clear Space in an etcd Cluster Database
- About Postgres
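Because the services communicate over REST, a management API call is just an HTTPS request through the API gateway with a bearer token. The sketch below builds such a request with the Python standard library; the gateway hostname is the common CSM default, the token is a placeholder, and the path shown targets the Hardware State Manager as one example.

```python
from urllib.request import Request

# Build (but do not send) a REST request to a service behind the API gateway.
# The hostname is the CSM default; the token is a placeholder, not a real credential.
def csm_api_request(token: str, path: str) -> Request:
    return Request(
        f"https://api-gw-service-nmn.local/apis{path}",
        headers={"Authorization": f"Bearer {token}"},
    )

req = csm_api_request("PLACEHOLDER-TOKEN", "/smd/hsm/v2/State/Components")
```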
Repositories are added to systems to extend the system functionality beyond what is initially delivered. The Sonatype Nexus Repository Manager is the primary method for repository management. Nexus hosts the Yum, Docker, raw, and Helm repositories for software and firmware content.
- Package Repository Management
- Package Repository Management with Nexus
- Manage Repositories with Nexus
- Nexus Configuration
- Nexus Deployment
- Restrict Admin Privileges in Nexus
- Repair Yum Repository Metadata
Mechanisms used by the system to ensure the security and authentication of internal and external requests.
- System Security and Authentication
- Manage System Passwords
  - Update NCN Passwords
  - Change Root Passwords for Compute Nodes
  - Change NCN Image Root Password and SSH Keys
- SSH Keys
- Authenticate an Account with the Command Line
- Default Keycloak Realms, Accounts, and Clients
- Certificate Types
- Change the Keycloak Admin Password
- Create a Service Account in Keycloak
- Retrieve the Client Secret for Service Accounts
- Get a Long-Lived Token for a Service Account
- Access the Keycloak User Management UI
- Create Internal User Accounts in the Keycloak Shasta Realm
- Delete Internal User Accounts in the Keycloak Shasta Realm
- Create Internal User Groups in the Keycloak Shasta Realm
- Remove Internal Groups from the Keycloak Shasta Realm
- Remove the Email Mapper from the LDAP User Federation
- Re-Sync Keycloak Users to Compute Nodes
- Keycloak Operations
- Configure Keycloak for LDAP/AD authentication
- Configure the RSA Plugin in Keycloak
- Preserve Username Capitalization for Users Exported from Keycloak
- Change the LDAP Server IP Address for Existing LDAP Server Content
- Change the LDAP Server IP Address for New LDAP Server Content
- Remove the LDAP User Federation from Keycloak
- Add LDAP User Federation
- Public Key Infrastructure (PKI)
- Troubleshoot SPIRE Failing to Start on NCNs
- API Authorization
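As a sketch of command-line authentication, the function below assembles the form body for Keycloak's password grant. The realm name and client_id follow common CSM defaults, and the gateway hostname and credentials are placeholders; confirm all of them on your system before use.

```python
from urllib.parse import urlencode

# Build the URL and form body for Keycloak's password grant.
# Realm and client_id follow CSM defaults; values here are placeholders.
def keycloak_token_request(api_gw: str, username: str, password: str):
    url = f"{api_gw}/keycloak/realms/shasta/protocol/openid-connect/token"
    form = urlencode({
        "grant_type": "password",
        "client_id": "shasta",
        "username": username,
        "password": password,
    })
    return url, form

url, form = keycloak_token_request(
    "https://api-gw-service-nmn.local", "user", "secret"
)
```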
HPE Cray EX systems are designed so that system management services (SMS) are fully resilient and that there is no single point of failure.
- Resiliency
- Resilience of System Management Services
- Restore System Functionality if a Kubernetes Worker Node is Down
- Recreate StatefulSet Pods on Another Node
- NTP Resiliency
ConMan is a tool used for connecting to remote consoles and collecting console logs. These node logs can then be used for various administrative purposes, such as troubleshooting node boot issues.
- Access Compute Node Logs
- Access Console Log Data Via the System Monitoring Framework (SMF)
- Manage Node Consoles
- Log in to a Node Using ConMan
- Establish a Serial Connection to NCNs
- Disable ConMan After System Software Installation
- Troubleshoot ConMan Blocking Access to a Node BMC
- Troubleshoot ConMan Failing to Connect to a Console
- Troubleshoot ConMan Asking for Password on SSH Connection
Ceph is the utility storage platform that is used to enable pods to store persistent data. It is deployed to provide block, object, and file storage to the management services running on Kubernetes, as well as for telemetry data coming from the compute nodes.
- Utility Storage
- Collect Information about the Ceph Cluster
- Manage Ceph Services
- Adjust Ceph Pool Quotas
- Add Ceph OSDs
- Shrink Ceph OSDs
- Ceph Health States
- Dump Ceph Crash Data
- Identify Ceph Latency Issues
- Cephadm Reference Material
- Restore Nexus Data After Data Corruption
- Troubleshoot Failure to Get Ceph Health
- Troubleshoot a Down OSD
- Troubleshoot Ceph OSDs Reporting Full
- Troubleshoot System Clock Skew
- Troubleshoot an Unresponsive S3 Endpoint
- Troubleshoot Ceph-Mon Processes Stopping and Exceeding Max Restarts
- Troubleshoot Pods Failing to Restart on Other Worker Nodes
- Troubleshoot Large Object Map Objects in Ceph Health
- Troubleshoot Failure of RGW Health Check
Enable system administrators to assess the health of their system. Operators need to quickly and efficiently troubleshoot system issues as they occur and be confident that a lack of issues indicates the system is operating normally.
- System Management Health
- System Management Health Checks and Alerts
- Access System Management Health Services
- Configure Prometheus Email Alert Notifications
The System Layout Service (SLS) holds information about the system design, such as the physical locations of network hardware, compute nodes, and cabinets. It also stores information about the network, such as which port on which switch should be connected to each compute node.
- System Layout Service (SLS)
- Dump SLS Information
- Load SLS Database with Dump File
- Add UAN CAN IP Addresses to SLS
- Create a Backup of the SLS Postgres Database
- Restore SLS Postgres Database from Backup
- Restore SLS Postgres without an Existing Backup
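An SLS dump is a JSON document whose `Hardware` section maps xnames to their descriptions. The abbreviated, hypothetical dump below shows how that structure can be filtered, here picking out nodes with the Compute role; real dumps carry many more fields per entry.

```python
# Abbreviated, hypothetical SLS dump; real dumps have many more entries and fields.
sls_dump = {
    "Hardware": {
        "x3000c0s19b0n0": {"Type": "comptype_node",
                           "ExtraProperties": {"Role": "Compute"}},
        "x3000c0s1b0n0": {"Type": "comptype_node",
                          "ExtraProperties": {"Role": "Management"}},
    }
}

# Select xnames whose role is Compute.
computes = sorted(
    xname for xname, entry in sls_dump["Hardware"].items()
    if entry.get("ExtraProperties", {}).get("Role") == "Compute"
)
```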
The System Configuration Service (SCSD) allows administrators to set various BMC and controller parameters. These parameters are typically set during discovery, but this tool enables parameters to be set before or after discovery. The operations to change these parameters are available in the Cray CLI under the scsd command.
- System Configuration Service
- Configure BMC and Controller Parameters with SCSD
- Manage Parameters with the scsd Service
- Set BMC Credentials
Use the Hardware State Manager (HSM) to monitor and interrogate hardware components in the HPE Cray EX system, tracking hardware state and inventory information, and making it available via REST queries and message bus events when changes occur.
- Hardware State Manager (HSM)
- Hardware Management Services (HMS) Locking API
- Component Groups and Partitions
- Hardware State Manager (HSM) State and Flag Fields
- HSM Roles and Subroles
- Add an NCN to the HSM Database
- Add a Switch to the HSM Database
- Create a Backup of the HSM Postgres Database
- Restore HSM Postgres from a Backup
- Restore HSM Postgres without a Backup
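HSM answers REST queries with lists of components, each carrying fields such as ID, Type, State, and Role. The hypothetical response fragment below shows a typical check: finding nodes that are not in the Ready state.

```python
# Hypothetical fragment of an HSM component-state response.
components = [
    {"ID": "x3000c0s19b0n0", "Type": "Node", "State": "Ready", "Role": "Compute"},
    {"ID": "x3000c0s21b0n0", "Type": "Node", "State": "Off",   "Role": "Compute"},
]

# Collect the IDs of components that are not Ready.
not_ready = [c["ID"] for c in components if c["State"] != "Ready"]
```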
The Hardware Management (HM) Collector is used to collect telemetry and Redfish events from hardware in the system.
Monitor and manage compute nodes (CNs) and non-compute nodes (NCNs) used in the HPE Cray EX system.
- Node Management
- Node Management Workflows
- Rebuild NCNs
- Reboot NCNs
- Enable Nodes
- Disable Nodes
- Find Node Type and Manufacturer
- Add a Standard Rack Node
- Clear Space in Root File System on Worker Nodes
- Manually Wipe Boot Configuration on Nodes to be Reinstalled
- Troubleshoot Issues with Redfish Endpoint Discovery
- Check for Redfish Events from Nodes
- Reset Credentials on Redfish Devices
- Access and Update Settings for Replacement NCNs
- Change Settings for HMS Collector Polling of Air Cooled Nodes
- Use the Physical KVM
- Launch a Virtual KVM on Gigabyte Servers
- Launch a Virtual KVM on Intel Servers
- Change Java Security Settings
- Verify Accuracy of the System Clock
- Configuration of NCN Bonding
- Troubleshoot Loss of Console Connections and Logs on Gigabyte Nodes
- Check the BMC Failover Mode
- Update Compute Node Mellanox HSN NIC Firmware
- TLS Certificates for Redfish BMCs
- Dump a Non-Compute Node
- Enable Passwordless Connections to Liquid Cooled Node BMCs
- Configure NTP on NCNs
The River Endpoint Discovery Service (REDS) performs geolocation and initialization of compute nodes, based on a mapping file that is provided with each system.
- Configure a Management Switch for REDS
- Initialize and Geolocate Nodes
- Verify Node Removal
- Troubleshoot Common REDS Issues
Overview of the several different networks supported by the HPE Cray EX system.
- Network
- Access to System Management Services
- Default IP Address Ranges
- Connect to the HPE Cray EX Environment
HPE Cray EX systems can have network switches in many roles: spine switches, leaf switches, aggregation switches, and CDU switches. Newer systems have HPE Aruba switches, while older systems have Dell and Mellanox switches. Switch IP addresses are generated by Cray Site Init (CSI).
- Management Network Switch Rename
- Management Network ACL Configuration
- Management Network CAN Setup
- Management Network Flow Control Settings
- Management Network Access Port Configuration
- Update Management Network Firmware
The Customer Access Network (CAN) provides access from outside the customer network to services, NCNs, and User Access Nodes (UANs) in the system.
- Customer Access Network (CAN)
- Required Labels if CAN is Not Configured
- Externally Exposed Services
- Connect to the CAN
- CAN with Dual-Spine Configuration
- Troubleshoot CAN Issues
The DHCP service on the HPE Cray EX system uses the Internet Systems Consortium (ISC) Kea tool. Kea provides more robust management capabilities for DHCP servers.
The central DNS infrastructure provides the structural networking hierarchy and datastore for the system.
- DNS
- Manage the DNS Unbound Resolver
- Enable ncsd on UANs
- Troubleshoot Common DNS Issues
- Troubleshoot DNS Configuration Issues
External DNS, along with the Customer Access Network (CAN), Border Gateway Protocol (BGP), and MetalLB, makes it simpler to access the HPE Cray EX API and system management services. Services are accessible directly from a laptop without needing to tunnel into a non-compute node (NCN) or override /etc/hosts settings.
- External DNS
- External DNS csi config init Input Values
- Update the system-name.site-domain Value Post-Installation
- Update the can-external-dns Value Post-Installation
- Ingress Routing
- Add NCNs and UANs to External DNS
- External DNS Failing to Discover Services Workaround
- Troubleshoot Connectivity to Services with External IP addresses
- Troubleshoot DNS Configuration Issues
MetalLB is a component in Kubernetes that manages access to LoadBalancer services from outside the Kubernetes cluster. There are LoadBalancer services on the Node Management Network (NMN), Hardware Management Network (HMN), and Customer Access Network (CAN).
MetalLB can run in either Layer2-mode or BGP-mode for each address pool it manages. BGP-mode is used for the NMN, HMN, and CAN. This enables true load balancing (Layer2-mode does failover, not load balancing) and allows for a more robust layer 3 configuration for these networks.
- MetalLB in BGP-Mode
- MetalLB in BGP-Mode Configuration
- Check BGP Status and Reset Sessions
- Update BGP Neighbors
- Troubleshoot Services without an Allocated IP Address
- Troubleshoot BGP not Accepting Routes from MetalLB
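To make the BGP-mode setup concrete, the sketch below expresses a MetalLB configuration as a Python dictionary: a BGP peer plus an address pool served in BGP-mode. The peer address, ASNs, and pool range are hypothetical and must match the site's actual switch configuration.

```python
# Sketch of a MetalLB configuration in BGP-mode; the peer address, ASNs,
# and address pool are hypothetical site-specific values.
metallb_config = {
    "peers": [
        {"peer-address": "10.252.0.1",   # hypothetical spine switch address
         "peer-asn": 65533,              # switch-side ASN (placeholder)
         "my-asn": 65533},               # MetalLB-side ASN (placeholder)
    ],
    "address-pools": [
        {"name": "customer-access",
         "protocol": "bgp",              # BGP-mode: true load balancing, not failover
         "addresses": ["10.102.5.0/25"]},  # hypothetical CAN pool
    ],
}
```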
The Firmware Action Service (FAS) provides an interface for managing firmware versions of Redfish-enabled hardware in the system. FAS interacts with the Hardware State Manager (HSM), device data, and image data in order to update firmware.
See Update Firmware with FAS for a list of components that are upgradable with FAS. Refer to the HPC Firmware Pack (HFP) product stream to update firmware on other components.
- Update Firmware with FAS
- FAS CLI
- FAS Filters
- FAS Recipes
- FAS Admin Procedures
- FAS Use Cases
- Upload Olympus BMC Recovery Firmware into TFTP Server
- Install HPC Firmware Pack (HFP)
- Install HPC Firmware Pack from PIT or LiveCD
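As an illustration of how FAS combines HSM state with device and image data, the sketch below builds a dry-run action request: the filters narrow the selection to a class of devices, and the command block controls what runs. All values are illustrative; verify the action schema against the FAS API on your system.

```python
# Sketch of a FAS action request; filter and command values are illustrative.
fas_action = {
    "stateComponentFilter": {"deviceTypes": ["nodeBMC"]},  # limit by HSM device type
    "inventoryHardwareFilter": {"manufacturer": "cray"},   # limit by manufacturer
    "targetFilter": {"targets": ["BMC"]},                  # firmware target on the device
    "command": {
        "version": "latest",
        "overrideDryrun": False,   # leave False to preview without flashing firmware
        "description": "dry run of BMC firmware update",
    },
}
```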
The User Access Service (UAS) is a containerized service managed by Kubernetes that enables application developers to create and run user applications. Users launch a User Access Instance (UAI) using the cray command. Users can also transfer data between the Cray system and external systems using the UAI.
- User Access Service (UAS)
- End-User UAIs
- Special Purpose UAIs
- Elements of a UAI
- UAI Host Nodes
- UAI macvlans Network Attachments
- UAI Host Node Selection
- UAI Network Attachments
- Configure UAIs in UAS
- UAI Management
- Legacy Mode User-Driven UAI Management
- Broker Mode UAI Management
- UAI Images
- Troubleshoot UAS Issues
- Troubleshoot UAS by Viewing Log Output
- Troubleshoot UAIs by Viewing Log Output
- Troubleshoot Stale Brokered UAIs
- Troubleshoot UAI Stuck in "ContainerCreating"
- Troubleshoot Duplicate Mount Paths in a UAI
- Troubleshoot Missing or Incorrect UAI Images
- Troubleshoot UAIs with Administrative Access
- Troubleshoot Common Mistakes when Creating a Custom End-User UAI Image