Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog.

Unreleased

  • Update cray-dns-unbound to 0.7.18 (CASMTRIAGE-4913)
  • Release cray-istio, cray-istio-deploy, cray-istio-operator, and cray-kiali charts to support istio 1.11.8 (CASM-3619)
  • Update cray-dns-unbound to 0.7.17 (CASMNET-2048)
  • Update cf-gitea-import to 1.9.1 (CASMINST-5954)
  • Update cf-gitea-update to 1.0.4 (CASMINST-5955)(CASMINST-5956)
  • Update cray-nls helm chart to 1.4.60 (CASMINST-5988,CASMPET-6336)
  • Update cray-keycloak to 4.1.1 to fix Keycloak LDAP certificate bug (CASMPET-6305)
  • Update iuf-cli 1.0.0-1 to 1.5.0_alpha-1
  • Update cf-gitea-import to 1.9.0 (CASMINST-5843)
  • Update cf-gitea-update to 1.0.3 (CASMINST-5843)
  • Update cray-ims-load-artifacts to 2.1.0 (CASMINST-5843)
  • Update cray-nls helm chart to 1.4.54 (CASMINST-5938)
  • Update cray-velero 1.7.1 to 1.7.1-1 (CASMPET-5779)
  • Update cray-keycloak to 4.1.0 to use the upgraded postgres operator logical backup (CASMPET-6228)
  • Add cf-gitea-import 1.8.1 (CASMINST-5866)
  • Uninstall cray-crus when upgrading to CSM 1.6
  • Remove cray-crus chart (CRUS removed in CSM 1.6)
  • Release csm-testing v1.16.3, CASMINST-5850 and CASMINST-5819
  • Release cray-postgres-operator 1.8.5 minor bug fixes
  • Update craycli to 0.67.0, cray-cfs-api to 1.12.1 (CASMCMS-8380)
  • Update cray-keycloak to 4.0.0 (CASMPET-6079)
  • Add cfs-ara 1.0.0 to the manifest (CASMCMS-7690)
  • Release cray-postgres-operator 1.8.3 to pull in post-upgrade and post-install jobs
  • Release csm-testing v1.16.2, Log test duration in decimal without scientific notation (CASMINST-5793)
  • Release cray-postgres-operator 1.8.2 to pull in latest zalando postgres operator
  • Release cray-psp 0.4.2 needed along with cray-postgres-operator change
  • Release cray-nls 0.4.41 to include functionality for storage rebuild workflow (CASMINST-5745)
  • Release cray-ims-load-artifacts 2.0.1 to update image used in IUF deliver-product stage (CASM-3718)
  • Updated cfs-operator to collapse all session layers and collect logs with ARA
  • Release cray-nls 1.4.25 to update image used in iufBase.template.yaml (CASMINST-5621)
  • Built pre-install-toolkit, kubernetes, storage-ceph without changes
  • Release platform-utils v1.4.3 to edit output from ncnHealthChecks.sh (CASMTRIAGE-4539)
  • Released cray-nls v1.4.8 to add configmap for iuf-install-workflow-files (CASMINST-5530)
  • Released cray-keycloak-users-localize v1.11.3 to fix keycloak localize never ending (CASMTRIAGE-4286)
  • Updated gatekeeper-audit memory limits (CASMTRIAGE-4311)
  • Released platform-utils v1.4.2 to fix issue with fix-spire-on-storage.sh (CASMPET-6033)
  • Added kyverno images to Nexus precache (CASMPET-6035)
  • Updated gitea to 2.5.1 for security fixes
  • Release csm-testing v1.15.15, fixed BSS test for initrd= boot parameter (CASMTRIAGE-4269)
  • Release csm-testing v1.15.14, fix check_for_unused_drives.py failing (CASMTRIAGE-4185)
  • Updated cfs-api, cfs-operator, cfs-batcher and cfs-hwsync for fixes to many minor tickets
  • Released cray-nexus v0.11.1 to update nexus-setup to 0.7.1 (CASMPET-6016)
  • Released cray-kyverno 1.3.0 to enable required anti-affinity deployment (CASMPET-6008)
  • Released platform-utils v1.4.1 to fix issue with etcd_restore_rebuild.sh
  • Released cray-etcd-backup 0.4.3 to add backupPolicy.timeoutInSecond (CASMTRIAGE-4188)
  • Released cray-nls 1.4.1 to fix postgres database restore issue (CASMPET-5960)
  • Released spire 2.10.1 to fix postgres database restore issue (CASMPET-5961)
  • Released cray-keycloak 3.6.1 to fix postgres database restore issue (CASMPET-5936)
  • Update csm-config, cray-crus and console-node to use sp4 base images (CASMCMS-8076)
  • Update craycli to 0.63.0 to fix python 3.6 deprecation warning
  • Release platform-utils v1.4.0, removes duplicate copy of ceph-service-status.sh from utils
  • Released cray-nls-charts 1.4.0 to add new mount for storage workflows
  • Update cfs api, operator and trust for pod priority escalation
  • Released goss-servers/csm-testing v1.14.47 for ceph goss test failure output change
  • Released goss-servers/csm-testing v1.14.46 for goss_check_static_routes fix
  • Released spire 2.10.0 and cray-nls 1.3.8 for security fixes (CASMPET-5873)
  • Added all istio images to Nexus precache (CASMPET-5888)
  • Released cfs-operator 1.16.0 to fix issues with additional inventory
  • Released cray-keycloak-users-localize v1.11.2 to fix keycloak localize not copying all users to /etc/passwd (CASMPET-5743)
  • Released csm-utils v1.3.5 for recent ncnHealthChecks etcd fixes
  • Released goss-servers/csm-testing v1.14.36 for goss etcd fixes
  • Update update-uas to v1.7.4 - Update gateways tests to include HMN gateway (CASMNET-1741)
  • Released cray-opa 1.22.0 to whitelist keycloak for hmn (CASMPET-5860)
  • Released csm-utils v1.3.4 for recent changes
  • Released new cray-istio, cray-istio-deploy, cray-istio-operator, and cray-kiali charts to support istio 1.10.6 (CASMPET-5796)
  • Released cray-keycloak 3.6.0 to add keycloak service to hmn-gateway (CASMPET-5812)
  • Updated cfs-operator to 1.15.0 to fix kafka client initialization
  • Changed how sls search API generates SQL (CASMHMS-5488)
  • Fixed kafka errors in hms-trs-operator (CASMHMS-5525)
  • Added auth.hmnlb.SYSTEM_DOMAIN annotation for istio-ingressgateway-hmn in customizations.yaml (CASMPET-5817)
  • Added allowed-issuers for istio-ingressgateway-hmn in cray-opa section customizations.yaml (CASMPET-5817)
  • Update craycli to 0.62.0
  • Update craycli to 0.57.0
  • Added api.hmnlb.SYSTEM_DOMAIN annotation for istio-ingressgateway-hmn in customizations.yaml (CASMPET-5795)
  • Updated cray-oauth2-proxy to use the latest image for sec vulnerability (CASMPET-5697)
  • Released cray-istio-deploy 1.27.2 and cray-istio 2.6.3 to increase istiod replica count (CASMPET-5621)
  • Update craycli to 0.56.0
  • Fix for sma storage class missing image feature layering when upgraded from 1.0.x and earlier
  • Adding velero upgrade to 1.6.3 and an additional manifest to further upgrade velero to 1.7.1
  • Released cray-postgres-operator 1.0.0 to fix issue with postgres pods restarting (CASMPET-5725)
  • Released cray-opa 1.17.0 to add OPA rules for read-only monitoring role (CASMPET-5664)
  • Adding ceph versions 15.2.16 and 16.2.9
  • Adding k8s version 1.21.12, coredns v1.8.0, and pause 3.4.1
  • Released cray-keycloak 3.5.0 to add a read-only monitoring role (CASMPET-5660)
  • Released cray-node-problem-detector 1.9.0 to use a newer image (CASMPET-5555)
  • Update csm-config v1.9.31 for bifurcated CAN enablement play (CASMNET-1528)
  • Released sealed-secrets 0.3.0 to use new image location (CASMPET-5602)
  • Released spire 2.5.0 for sec vulnerability and image auto rebuild (CASMINST-4505)
  • Update update-uas to v1.6.1 - Updated test in cray-uai-gateway-test image
  • Released cray-nexus 0.10.2 to fix auto rebuild of nexus image (CASMPET-5591)
  • Released cray-postgres-operator 0.14.0 to trigger image auto rebuild (CASMPET-5567)
  • Released cray-node-discovery 1.2.4 for sec vulnerability (CASMPET-5566)
  • Released csm-utils v1.2.9 for recent changes
  • Update cray-oauth2-proxy to use CSM built container image (CASMPET-5534)
  • Released csm-utils v1.2.8 for recent changes
  • Update update-uas to v1.6.0 - Adding cray-uai-gateway-test image
  • Update cray-drydock 1.12.2 - adding kyverno namespace
  • Update trustedcerts-operator to 0.6.0 to use latest alpine 3 image (CASMPET-5485)
  • Istio is updated to 1.9.9, OPA envoy plugin to 0.26.0-envoy-6, kiali to 1.33.1 (CASMPET-5303, CASMPET-5359)
  • Released csm-testing v1.12.22 for recent test changes
  • Released csm-testing v1.13.0 for recent test changes
  • Shared Broker UAI credentials: cray-uas-mgr v1.19.1, update-uas v1.4.0, switchboard v2.1.0
  • Released cray-sysmgmt-health v0.21.7 to adjust istio alert rules (CASMPET-5374)
  • Update gitea to fix tls chart upgrade problems
  • Update csm-config v1.9.24 for CASMCMS-7890
  • Released csm-testing v1.12.9 for recent test changes
  • Update cray-oauth2-proxy for sec vulnerability. CASMINST-4080
  • Update cray-node-discovery for sec vulnerability
  • Update cray-externaldns to use updated image path (CASMINST-4085)
  • Update cray-ceph-csi-rbd to support rolling upgrade strategy. CASMINST-3857
  • Update cray-ceph-csi-cephfs to support rolling upgrade strategy. CASMINST-3857
  • Update cfs-api to 1.10.1 to add api validation and remove v1 api (CASMCMS-7806)
  • Update cray-psp to 0.3.0: remove obsolete ClusterRoleBinding from CAST-27468
  • Update cfs-api to 1.9.5 to add pod anti-affinity: CASMINST-3913
  • Update cray-uas-mgr to 1.18.0: address CAST-27468
  • Updated cfs services and cli to add Ansible passthrough parameter (CASMCMS-7784)
  • Updated cfs-operator to 1.14.11 to pull in fix for image customization teardown (CASMTRIAGE-2909)
  • Released cray-sysmgmt-health v1.2.18 to fix license headers
  • Remove pvc-migrator from docker manifest
  • Update cray-sysmgmt-health for ghostunnel sec vulnerability
  • Update Kafka strimzi operator to 0.27.0
  • Removed unused craycli docker image from docker manifest
  • Released goss-servers/csm-testing v1.8.43 for ca cert test fix
  • Released csm-testing v1.8.40 for recent test changes
  • Updated craycli to 0.45.0 to pick up support for CSM-1.2 UAS functionality
  • Update cfs-operator to 1.14.9 to pull in latest alpine/git image (CASMCMS-7725)
  • Update cfs-operator to 1.14.6 to pull in fresh aee image (CASMTRIAGE-2853)
  • Updated cray-uas-mgr to pick up the following:
    • BiCAN support for UAS/UAI
    • External DNS support for public facing UAIs
    • Multi-Replica UAI support in UAS
    • UAI Timeout configuration in UAI Classes (automatic UAI termination)
    • Various bug fixes
  • Released platform-utils-1.2.5 to fix etcd restore script
  • Released csm-testing v1.8.33 to remove preflight tests
  • Updated craycli to 0.41.11 to add support for new SCSD subcommand
  • oauth2-proxy is now used in place of keycloak-gatekeeper.
  • Updated cray-nexus to 0.8.0 to add ingress gateways
  • Released csm-testing v1.8.31 for precache image test
  • Released cray-keycloak:2.1.0 for added Bi-CAN gateways
  • Released cray-opa v1.3.0 to add BOS entries and fix xname validation
  • Released csm-testing v1.8.29 for ncn-kubernetes-checks pit fix
  • Released cray-s3:1.0.0 to move service endpoint to CMN for Bi-CAN
  • Istio is updated to version 1.8.6, Kiali to 1.28.1
  • Released csm-testing v1.8.28 for improvements to velero failed backup and monitoring tests
  • Updated cray-opa to 1.2.0 to add policy check for xforward as used by oauth2-proxy
  • Updated three new tests to only run after PIT has been rebooted
  • Updated cray-site-init to 1.9.12 to add CHN network
  • Added no_wipe test to ncn-healthcheck suites
  • Updated spire health storage suite to only run after the PIT has been rebooted
  • Automated goss test improvements and additions
  • The Jaeger service for tracing HTTP requests is no longer deployed.
  • Istio no longer deploys its own Prometheus. The cray-sysmgmt-health Prometheus does the monitoring.
  • Updated cray-uai-broker to 1.2.4 integration with improved update-uas
  • Updated cray-uai-sles15sp2 to 1.2.4 integration with improved update-uas
  • Updated update-uas to 1.2.4: broker registration + class/default image sync
  • Updated cray-uas-mgr to 1.13.3 to pick up CVE-2021-3711 fix
  • Updated cray-keycloak and cray-keycloak-users-localize to pick up security fixes (CVE-2021-3711)
  • Updated manifestgen to 1.3.3 and moved it to algol60 artifactory
  • Updated cray-drydock to 2.9.0 to increase sonar-jobs-watcher cpu resource limits
  • Updated cray-uas-mgr to 1.13.1 to pick up CVE fixes
  • Updated update-uas to 1.0.12 to pick up CVE fixes
  • Updated cray-postgres-operator to 0.11.2 to disable pod disruption policy
  • Updated cray-opa to 0.16.0 for tighter API access enforcement
  • Updated cray-istio-deploy version 1.21.0 to set pod priority class
  • Updated cray-istio-operator version 1.21.0 to set pod priority class
  • Updated cray-istio version 1.27.0 to set pod priority class
  • Updated cray-etcd-operator to 0.16.0 to pull in github built chart/images
  • Updated cray-metallb to 0.13.0 to set pod priority class
  • Updated cray-postgres-operator to 0.11.1 to pull in github built chart/images
  • Updated cray-spire to 0.14.0 to set pod priority class
  • Updated cray-nexus to 0.7.0 to set pod priority class
  • Updated cray-opa to 0.15.0 to set pod priority class
  • Updated cray-postgres-operator to 0.11.0 to set pod priority class
  • Updated cray-keycloak-gatekeeper to 0.4.0 to set pod priority class
  • Updated cray-drydock to 2.7.1 to increase sonar-jobs-watcher cpu resource limits
  • Updated cray-postgres-operator to 0.10.1 to disable pod disruption policy
  • Fixed PROXY_ADDRESS_FORWARDING missing in cray-keycloak
  • Updated cray-keycloak to set JVM option to avoid heap allocation error
  • Updated cray-dns-unbound to 0.3.0 to include liveness/readiness probe fixes
  • Updated cray-sysmgmt-health for ghostunnel security vulnerabilities
  • Updated cray-opa to move spire validation from audience to subject
  • Updated cray-spire to set kdump related workloads' TTL to 10 days
  • Updated cray-spire-intermediate for security vulnerabilities
  • Updated cray-spire to 0.12.1 to enable automatic backups
  • Updated cray-keycloak to automatically back up the database
  • Updated cray-externaldns to use the pdns provider and populate PowerDNS
  • Updated cray-dns-powerdns and cray-powerdns-manager to support CAN DNS LoadBalancer move
  • Updated cray-node-discovery for security vulnerabilities
  • Updated cray-sts for security vulnerabilities
  • Updated cray-istio charts to use distroless images by default
  • Updated cray-sysmgmt-health to 0.12.2 to add postgres alerts
  • Updated cray-keycloak to 1.11.5 and cray-keycloak-users-localize to 1.6.1 for base os change in cray-keycloak-setup image
  • Updated cray-sysmgmt-health to 0.12.1 to pick up prometheus alert fixes
  • Updated cray-shared-kafka to 0.5.0 to pick up auto cert renewal
  • Updated cray-spire to 0.10.0 to pick up postgres pvc resize
  • Updated cray-keycloak to 1.11.4 to pick up postgres pvc resize
  • Updated cray-uas-mgr to 1.12.1
  • Updated cray-postgres-operator to pull in new images for security vulnerabilities
  • Updated Istio to 1.7.8
  • Add HTTPS support to the istio-ingressgateway-hmn
  • Updated cray-keycloak with resource changes for wait-for-postgres pod
  • The sonar-job-watcher now uses crictl to stop the istio sidecar
  • Updated cache/postgres image in cray-keycloak for security vulnerabilities
  • Added prometheus alerts for monitoring replication lag across postgres clusters
  • Allow CFS/AEE to read both secrets and configmaps via an updated role
  • Updated resource limits for CAPMC service
  • Removed prometheus resource overrides from customizations.yaml
  • Added initial overrides for prometheus resources to manifest.yaml
  • Updated BOS/BOA and CFS-operator to use credentials when cloning from VCS.
  • Fixed the SLS Loader job to use a more robust method to determine the IP address of rgw-vip.nmn.
  • Updated the SLS service to have 3 replicas.
  • Fixed configuration status reporting for configuration details of components.
  • Updated FAS, RTS, and hms-discovery with security fixes.
  • Updated cray-hms-trs-operator with security fix for jwt-go vulnerability.
  • Updated MEDS to allow proper CMM xnames.
  • Updated CSI for user provided Application node prefix to HSM SubRole mappings to take precedence over the defaults within CSI.
  • Updated HMS CT test RPM to include several fixes and new tests.
  • Fixed several CFS bugs around creating and querying sessions
  • Fixed CFS-Batcher bug that was causing extra sessions to be launched
  • Updated MEDS to POST or PATCH an EthernetInterface in HSM only when there is actually something to change.
  • Fixed RTS to have the correct pod security policies for the RTS Loader Job.
  • Updated power capping control for Olympus nodes
  • Updated console services to use HSM v2 api.
  • Added anti-affinity settings to cray-console-node pods.
  • Updated HMS CT tests for Helm test and removed old RPMs.
  • Updated IMS to provide a larger default image size (CASMCMS-8015)
  • Updated console services to handle Hill nodes.
  • Updated barebones image to use slessp4.
  • Added configurable timeout to IMS service to allow handling large images.
  • Added an option to enable DKMS to IMS service.

0.9.0 - 2021-03-17

  • Fixed WAR script that adjusts partition sizes for k8s.
  • Added admonition to ensure release patch is applied.
  • Fixed typos and misspellings in install docs.
  • Added instructions to download latest updated documentation and workarounds.
  • Updated script name in known issues section of 006-CSM-PLATFORM-INSTALL.
  • Updated instructions for getting FAS status.

0.9.0-rc.4 - 2021-03-15

  • Fixed a version string mismatch for UAN iLO Firmware
  • Added reboot persistence in a WAR script for “neighbour: arp_cache: neighbor table overflow!”.
  • Changed a WAR procedure for the routing population to use only token-based authentication.
  • Fixed the documentation for HPE firmware update by adding missing steps on how to provide the firmware to ncn-m001.
  • Fixed the metal install documentation for when ncn-kubernetes-checks fails for: Worker Node sdc Drive.
  • Fixed the documentation to describe what to do when the ncnPostgresHealthChecks.sh produces several "ERROR: get_cluster" messages.
  • Added a documentation workaround to adjust tcp memory values to avoid out of memory issues.
  • Fixed several issues with CSM install documentation based on internal install testing.
  • Added documented directions for Gigabyte firmware on the installation firmware pages.
  • Changed the installer by refactoring it to address client connections to SLS requiring spire.
  • Fixed the install documentation in the reboot section to make sure the WAR for the Goss system fix is available for Goss testing.
  • Changed the WAR script to disable fstrim cron.weekly by adding pdsh usage.
  • Populated this CHANGELOG.md.
  • Updated patch instructions to include:
    • Steps to create a new release distribution for use with current installation procedures.
    • Warnings about Git version dependency.
  • Updated vendored docs to be consistent with docs-csm-install RPM at commit 75f9a03.
  • Updated resource limits for CAPMC service

0.9.0-rc.3 - 2021-03-14

  • Added several networking related documentation updates for gaps or misses.
  • Updated the reboot section in the CSM install documentation because step 5 did not work for Gigabyte nodes.
  • Added the kernel-default-debuginfo rpm to CSM.
  • Updated the Platform Installation section of the CSM Install docs to reword the instructions as clarifications for install.sh.
  • Added documentation for adjusting partition sizes on dmk8s.
  • Added documentation to add a notice about a delay on certain hardware when setting BMCs to dhcp.
  • Fixed the script for a WAR that fixes HSN NICs missing on NCNs. The pdsh tool is now used to allow the script to run on multiple NCNs simultaneously.
  • Added fixes to firmware documentation based on feedback from internal testing.
  • Updated documentation to revert a change in the wipe commands that did not work throughout the installation documentation.
  • Updated documentation to fix an ipv6 autoconfiguration issue with Aruba switches.
  • Added fix to make firmware packages available in the CSM release rather than an external link.
  • Added a fix for the command in step 3 at the end of install.sh part A, which was missing the list of NCNs.
  • Added a change to the wipe documentation because extended globbing is no longer needed.
  • Added a fix to the platform install documentation to make sure to tag the skopeo image when loading it.
  • Added a fix to the script for the WAR for “neighbour: arp_cache: neighbor table overflow!”. The pdsh tool was missing from the command.
  • Added a readme file for the WAR script for adding AcceptEnv and SendEnv config file settings for IPMITOOL_PASSWORD.

0.9.0-rc.2 - 2021-03-11

  • Added a resiliency fix for the cfs-trust pod being CPU throttled.
  • Added a fix for cfs-trust extended duration hardening.
  • Added documentation for a workaround for cray-cfs Service Account missing permissions for AEE to read secrets.
  • Removed documentation for a work-around for NCNs booted from k8s boot services not getting vlan004 and vlan007 reserved IPs. The fix was implemented.
  • Added a scalability fix to weave by changing MTU to 1376 to prevent falling back to sleeve mode.
  • Updated the NCN wipe documentation and associated clean-up.
  • Added documentation for a work-around that covers symptoms of HSN names being wrong, how to rename them immediately, and how to make the rename persistent.
  • Added documentation for a work-around to describe the process to download the required HPE node firmware for v1.4 from public HPE support pages.
  • Added documentation and a script for a WAR to tune sysctl for better ARP cache.
  • Added documentation to the install guide to reassure that it is ok for cray-crus to be in init state.
  • Added documentation and a script for a workaround to add AcceptEnv and SendEnv config file settings for IPMITOOL_PASSWORD.
  • Added documentation to CSM install to include sample output from wipefs as a reference for what to expect.
  • Added documentation to CSM install to fix incorrect wipefs commands.

0.9.0-rc.1 - 2021-03-10

  • Added clarification documentation for spanning-tree configurations on the management network.
  • Added a resiliency fix for IMS by changing its storage class to ceph-cephfs-external.
  • Added a resiliency fix for gitea-vcs by removing a pvc customization.
  • Added clarification documentation about reading a degraded system in the CSM Install document.
  • Added documentation updates based on internal installation testing.
  • Added a fix for a typo in command output from upload-sls.go
  • Added a fix to CSI configuration input to support a CDU switch in River cabinets.

0.8.22 - 2021-03-08

  • Added a documented WAR for potential issues with Windom and Grizzly Peak BIOS updates using FAS.
  • Added a change in CSM documentation regarding inconsistent use of flowcontrol for spine switches. Flowcontrol was removed from the examples.
  • Added clarification documentation to the installation guide based on internal testing.
  • Added documentation for a WAR needed to start Goss servers after a boot.
  • Added documentation to clearly state that wiping disks is required for all nodes being reinstalled.
  • Added documentation to provide guidance to wipe disks during the ncn_metadata recovery procedure.

0.8.21 - 2021-03-07

  • Fixed the ncnGetXnames NCN no-wipe check script to use curl commands and to reflect reporting only for NCN nodes.
  • Fixed an issue that was preventing the enablement of kubernetes API audit logging.
  • Fixed the Metal installation documentation for "Start Deployment" in step 7 on how to respond to a failure to obtain a proper hostname after running retry-ci.sh.
  • Fixed a failure in Goss testing by removing the goss-platform-ca-certs-exist test.
  • Fixed Goss tests that would always pass due to bad usage of grep.
  • Added documentation to add more information about REDS, MEDS, and RTS credentials secrets.
  • Fixed tests for DNS checks to not use hardcoded values.
  • Added documentation for a recovery process for an incorrect ncn_metadata.csv file.
  • Added additional fixes for initrd and kernel missing and preventing booting from disk.
  • Added a fix for NCNs other than m001 getting stuck in a PXE loop when booting from disk.
  • Added clarification changes to install documentation based on internal testing.
  • Added documentation for a manual validation for vault and spire during installation.
  • Added documentation and a supporting script regarding static root user SSH keypair and authorized keys in NCN images.
  • Added documentation for wiping before rebooting LiveCD.
  • Added documentation for a procedure to adjust boot order and steps to set and trim the UEFI menu.
  • Added a fix to run metalfs before the etcd service to prevent a race condition.
  • Added a fix to adjust kubelet and containerd partition sizes to give them more space.
  • Added a fix to support SAS disks in dracut-metal-mdsquash.
  • Updated documentation to add missing commands to re-run cloud init for m001 in the CSM Reboot installation documentation.

0.8.20 - 2021-03-04

  • Added documentation for reinstallation/rebuilding of Master NCNs.
  • Added helpful scripts to assist with etcd recovery and resource usage.
  • Added a change to the WARNING limit for the hbtd chart to match an increase in the heartbeat daemon.
  • Added documentation to help with REDS/MEDS credentials when moving from a V1.3 system.
  • Updated the documentation to make the wipe step more scalable for systems with many NCNs.
  • Added testing to check that sdc is being used correctly on k8s NCNs.
  • Added to the fix for an issue where kea pods were stuck in terminating due to /var/lib/kubelet being on overlay.
  • Added a resiliency fix for when master/worker/storage nodes are booted from k8s services (vs. from the PIT services) and the vlan004 and vlan007 interfaces get IPs from the dynamic range of the network rather than their reserved IPs.
  • Added a fix for a potential race condition where containerd started but didn't run its ExecStartPost on account of an empty multus file.
  • Updated the CSM MTL Install documentation to provide guidance on how long to wait before kubectl get nodes will work.
  • Added documentation to point the customer to the Utility Storage section of the admin guide to address any failed goss tests.
  • Added a script as a fix for adding cray-shasta-mlnx-firmware to LiveCD and make available in the web-root (symlink).
  • Added documentation to ignore sealed secret errors if they occur when running the deploydecryptionkey.sh during platform install.
  • Added a fix for the all and ceph_all inventory groups for systems with more than three ceph nodes.
  • Updated the CSM Metal installation documentation to include optional and additional validation.
  • Updated the documentation in the installation guide to specify that the check of the /etc/cray/ceph directory must be done on S001 and not any storage node.
  • Updated the CSM-USD LIVECD documentation to make it clearer that the SHASTA-CFG step must be done before proceeding with the Pre-Populate steps.
  • Fixed an issue where multiple container images were missing for nexus for the spire-0.8.11 chart.

0.8.19 - 2021-03-03

  • Added a fix for CPU throttling on cray-opa pods at scale.
  • Added a fix for CPU throttling on cray-sysmgmt-health-prometheus-node-exporter pods at scale.
  • Added a scale fix by moving compute node connections from local ingress gateway to a loadbalancer and removing image prefix.
  • Added a scale fix by removing the imagehost value from the spire chart.
  • Added a fix to add log rotation to conman.
  • Added a fix for CPU throttling on cray-cfs-api pods at scale.
  • Added a fix to add subseconds to the log timestamps.
  • Added a fix for CPU throttling on cray-cfs-operator pods at scale.
  • Added a fix for CPU throttling on cray-conman pods at scale.
  • Added a fix for CPU throttling on cray-hmnfdi pods at scale.
  • Added a fix for CPU throttling on cray-hbtd pods at scale.
  • Updated the install documentation to have the right version of iLO.
  • Added a workaround script for when an initrd and kernel are missing and preventing a boot from disk.
  • Added clarifying documentation because the instructions for verifying BIOS time do not work for HPE nodes.
  • Added a fix to prevent cloud-init from running on more than the first boot. Documentation and a script were provided.
  • Added several clean-up documentation changes based on install testing.
  • Fixed an issue where a random MAC generator was creating duplicate IP addresses.
  • Fixed an issue where kea pods were stuck in terminating due to /var/lib/kubelet being on overlay.
  • Restored a documentation workaround for a cloud-init failure/race condition that was removed from the CSM documentation.
  • Added documentation and a supporting script for a workaround where a joined BCN gets wiped and re-installed to get proper IP addresses assigned.
  • Added fixes for a previous workaround script related to compute cabinet routes on the NCNs.

0.8.18 - 2021-03-02

  • Added a fix that re-enables istio-ingressgateway access logs for better supportability.
  • Added a resiliency fix for when VCS/Gitea was taken down it was not able to migrate to a new node due to the inability to re-attach to the PVC.
  • Added a resiliency fix for when cray-cfs-api-db or cray-cfs-api was taken down it was not able to migrate to a new node due to the inability to re-attach to the PVC.
  • Added rewritten Node DNS recheck tests so they didn't use hard-coded values.
  • Added documentation to have a description of booting of NCNs with enough detail to describe the interlocking node configuration dance.
  • Added callouts to appendix material for SHASTA-CFG in the install documentation to make it easier to follow.
  • Added a debugging fix so that "Waiting for ncn-s002.nmn to be online" is not written to the serial console every 5-10 seconds, cutting down on console spam.
  • Updated the install documentation to reflect the refactoring of ceph workarounds into the WAR repo.
  • Fixed an issue where boot parameters caused disk-based grub boots to fail.

0.8.17 - 2021-03-01

  • Fixed an issue where the weave-init container was causing 'TargetDown' alerts in prometheus.
  • Fixed an issue with CPU throttling with cray-sysmgmt-health-prometheus-node-exporter.
  • Fixed an issue where, at scale, Prometheus pods were getting OOM killed, restarting many times, and failing Liveness/Readiness checks.
  • Updated the release tags for spire and spire intermediate release chart images.
  • Fixed a scale issue where spire attempts too many connections to spire-postgres.
  • Fixed a scale issue where the spire join token expires before all compute nodes are booted.
  • Fixed a scale issue where booting 1000 nodes hits the spire attestation limit.
  • Fixed a scale issue where spire-postgres cpu limits needed to be increased.
  • Fixed a scale issue for spire by adding support for two endpoints.
  • Fixed an issue in meds where the rsyslog server name needs to be configurable to support different uses of the name.
  • Fixed an issue by removing SLES15 SP2 compute node content from CSM. Only SLES15 SP1 is supported on compute nodes in Shasta V1.4.
  • Fixed issues with ceph for installation on more than 3 storage nodes.
  • Fixed documentation for a WAR for kea DHCP lease issue. The WAR was missing a step to scale down KEA before removing MACs from HSM.
  • Fixed an issue where the basecamp container isn't properly tagged.
  • Added a scripted WAR for the kea DHCP lease issue and moved the order of bss restart.
  • Added documentation for a WAR for an issue with Gigabyte nodes creating a large volume group and a single OSD from the drives which causes the ceph install to fail.
  • Added a script to automatically update nameservers on NCNs, along with instructions to verify.
  • Fixed an issue on reboot failure because Multus config is not copied and master remains "Not Ready".

0.8.16 - 2021-02-26

  • Added a fix for nodes not being able to get hostnames on a system.
  • Added a scaling fix so that every request does not fetch spire certs.
  • Added a scaling fix to adjust OPA timeouts to prevent requests from failing with 503 errors.
  • Added a scaling fix to increase resource limits for cray-opa pods.
  • Updated health check scripts to accommodate pods stuck in either Pending or Terminating state.
  • Added a scaling fix to increase the spire-postgres PVC to support a larger amount of log data.
  • Added functionality and moved the multus config file to avoid an empty multus file on reboot.
  • Reverted some changes and added new functionality to fix an issue where rebooting a master node after a full cluster up failure fails to mount an etcd volume.
  • Added a wrapper function for calls to kubectl to wait for the k8s worker function when nodes are coming up.
  • Added documentation to address instructions to power down the NCNs conflicting with another step.
  • Added a fix to point to the right release tag for the csm-image-recipe-import ImagePullBackOff fix that was in CSM 0.8.15.
  • Updated installation documentation based on testing on CSM 0.8.15.

0.8.15 - 2021-02-25

  • Fixed an issue where the Nexus Pod was not able to start on worker2 after bond0 goes down on worker1. It created a Multi-attach error on the PVC.
  • Fixed an issue where cray/cray-ims-kiwi-ng-opensuse-x86_64-builder.0.3.4 was missing causing IMS recipe building to fail.
  • Added an update to bump the version number of kiwi-ng-builder for V1.4.
  • Added documentation to the install documentation to make sure any remaining OSDs get added to the cluster.
  • Fixed documentation to add Mountain/Hill routes and added a supporting script.
  • Fixed a test issue where Goss testing had a failed test with no indication or output related to the test.
  • Fixed Goss test for CSM v0.8.13 which had issues with the repo count tests and running the CA certs test on all storage nodes.
  • Added documentation for a work-around on how to increase the SMA Pool Quota in Ceph.
  • Updated documentation for based restart by moving it to the NCN WAR section.
  • Added a check and work-around documentation for a partial ceph install where the checkpoint files do not exist.

0.8.14 - 2021-02-24

  • Added udev rules to stabilize some of the NCN network interfaces.
  • Added an automated test checking for RedfishEndpoints of type 'CabinetPDUController'.
  • Added documentation for setting the m001 root password after LiveCD ejection.
  • Fixed an issue with Podman which occurs when using m001 (not in LiveCD) to prepare a reinstall.
  • Fixed an issue where the maximum-paths setting on the spine BGP configuration would get overwritten during install.
  • Removed several unnecessary steps in the install documentation which have since been fixed or further automated.
  • Fixed the snmp-server contact value in the leaf switch configuration.
  • Fixed an issue resulting in VCS hanging after the automated test.
  • Added a small workaround for an intermittent issue with Basecamp not starting correctly on the LiveCD.
  • Added documentation describing how to install the installer documentation onto an NCN.
  • Refactored workarounds into their own RPM and repository, enabling independent delivery.
  • Fixed and removed a step in the m001 cluster join documentation.
  • Added the previous version of cray-product-catalog-update image back into our build until other product streams pull it into respective packaging.

0.8.13 - 2021-02-23

  • Added documentation to the install documents to have all the switch commands for checking BGP peering session status.
  • Added documentation to have additional information for spire troubleshooting in the CSM documentation.
  • Added documentation for the correct startup commands in the LiveCD.
  • Added documentation for setting and verifying correct time on the PIT nodes.
  • Fixed Goss tests to remove false positives in CSM v0.8.9.
  • Added functionality to add the pushing of every MAC from every NCN into HSM's EthernetInterfaces table.
  • Added documentation for setting bios time before booting NCNs
  • Added documentation for a work-around for SLS Loader when it is unable to resolve rgw-vip.nmn correctly.
  • Fixed Goss tests to remove false positives in CSM v0.8.11.
  • Fixed an issue where cloud-init on m002 fails with "Failed to start Execute cloud user/final scripts".
  • Added documentation for a work-around when storage nodes are not wiping correctly.
  • Added a fix to remove build repositories that were left on booted NCNs.

0.8.12 - 2021-02-22

  • Removed the istio samples directories. They were not needed and were considered a high vulnerability risk.
  • Added a fix for the bond configuration file for quad-management port systems.
  • Added a fix for when metal.no-wipe was set to 1 and the disks were not wiped, but the etcd volume was not mounted.
  • Fixed the NCN images to have properly installed yq.
  • Added all applicable MACs to each NCN's network interface. This allows an NCN to PXE over any interface included in bond0.
  • Fixed an issue where, after terminating all the unbound pods at once, registry.local could not be resolved to pull images, resulting in IPBO. The fix is to add registry.local to /etc/hosts.
  • Updated the CSM install document with workflow information clarifying the scenarios for CSM install.
  • Fixed the cloud-init datasource to use the URL for the NMN rather than from DHCP.
  • Fixed an issue where cloud-init configures before it configures its network files. This causes an awkward delay when the CAN is not configured or incorrect values were given.

0.8.11 - 2021-02-20

  • Added documentation for per-live CD software infrastructure dependencies and updated LDAP set-up.
  • Fixed FAS where Slingshot firmware updates failed with "failed verification; unlocking" message.
  • Fixed several Goss tests that were returning false positives.
  • Added documentation to add -nameopt RFC2253 to openssl examples to ensure DN consistency.
  • Added a documentation warning for Gigabyte system users regarding use of C20 version of the firmware.
  • Fixed an issue where Metal iPXE sets dhcp on all management NICs, causing long boots.

0.8.10 - 2021-02-19

  • Added a migration script to change ncn network.static_ips to network.netstaticips versions. The migration script will only be used on systems doing a reinstall.
  • Added documentation pointing to troubleshooting tips for pxe boot issues.
  • Added code in static reservation dupe checking logic to filter out mac/ip reservations.
  • Fixed an aruba switch test that was generating a false positive for switch versions in validate --livecd-precheck.
  • Updated documentation to remove a WAR to create static DHCP and DNS entries for UAN CAN interfaces.
  • Separated WARs in CSM into their own repo.
  • Fixed an issue where the HAProxy VIP was not being used for the kubeapi endpoint. This could cause resiliency issues.
  • Updated documentation for clarifying context for rebooting the PIT in the kubernetes cluster.
  • Updated documentation to remove confusion about the work-around directory after liveCD reboot.
  • Fixed an issue where BOS helm chart/docker image versions were not bumped in CSM.
  • Fixed NCNs so they are correctly configured for rasdaemon and rsyslog.
  • Fixed an issue where trim was no longer happening on nodes after discard was disabled.

0.8.9 - 2021-02-18

  • Updated documentation for the format of cabinets.yaml because it did not match what CSI expects.
  • Added a table of contents to the documentation for the platform install page in the install guide.
  • Fixed customization settings to properly configure istio monitoring applications.
  • Added a fix to configure the istio proxy sidecar with imagePullPolicy set to IfNotPresent.
  • Fixed a spire-server issue by increasing the postgres memory limit to 4Gi.
  • Fixed an issue with spire-agent services by increasing the RestartSec interval to prevent the request-ncn-join-token job from trying to configure spire-agent at the same time it was coming up.
  • Added documentation to describe how to avoid weave split-brain from happening.
  • Added functionality to create a BOS Session Template that is retrievable via craycli.
  • Added a BOS generic template with a new cli option and updated BOS testing.
  • Updated RPM indexes with the latest version of HMS CT tests.
  • Added functionality to install LLDP PTF RPMs. - Part 2.
  • Fixed an issue where the Kubernetes Pods test was failing.
  • Fixed an issue where the per-subnet dhcp reservation was missing in kea.
  • Fixed an issue where there was an XFS forced shutdown encountered by containerd and kubelet on W001 due to an inability to target the underlying disk.

0.8.8 - 2021-02-17

  • Fixed an issue with the PET tests where they were not checking the correct number of hosts.
  • Fixed an issue where CSM builds were not correctly incrementing the version number.
  • Fixed the documentation for the install verification guide to reorder the tests to match test dependencies.
  • Fixed the SMD CT hardware inventory test to support NVIDIA A100 GPUs and also arbitrary hardware schemas.
  • Added documentation and associated script for checking for minimum firmware levels for PCIe cards along with how to apply firmware to get to minimum levels.
  • Added documentation and a script for a BIOS upgrade to the minimum specifications for the release.
  • Fixed NCN orchestrated reboots so they don't wipe data.
  • Added functionality to install LLDP PTF RPMs. - Part 1
  • Added documentation to fix a documented work-around that did not persist Mountain routes across reboots.
  • Added documentation to explicitly state which MACs should be added to ncn_metadata.csv.
  • Added documentation to document a firmware CMOS clear procedure that is needed to allow proper booting using bonded Mellanox network cards on Gigabyte servers.
  • Added several Goss Test fixes
    • Fixed MTU checking prior to reboot and then after reboot.
    • Removed a test for Remote NCN MAC until it can be correctly accessed.
    • Moved spire tests to the run-time test suite so they can be run after the pods are running.
    • Moved the Kubernetes Velero Daily Vault Backup schedule test to run after the pods are running.
    • Moved the Sealed Secrets Key exists test to run after the key exists
    • Moved the Kubernetes Query BSS Cloud-init for spire meta data test to run after BSS is running.
    • Moved the Kubernetes StorageClasses test to run after they exist.
  • Added documentation to add a recovery step if spire agent is not running.
  • Updated /srv/cray/scripts/metal/retry-ci.sh script.
  • Reverted a change for metal-ipxe for MLAG fixes that caused CSM to have a bad version of the metal-ipxe RPM.
  • Fixed the incorrect global.appVersion override that was referencing an old build of cray-dns-unbound.
  • Fixed cray-hms-rts because it was missing the correct version of a chart which was causing a failure to deploy.
  • Fixed csm-config-import-0.8.7-5bbh4 and csm-image-recipe-import-0.8.7-kpwx so they don't get into an ILBO state.
  • Removed debug code in CSI that was accidentally left in the version in the CSM release.

0.8.7 - 2021-02-16

  • Removed etcd service monitoring for Slingshot controllers - they are no longer using etcd.
  • Fixed the CSM Manifest to have the correct version of Weave
  • Fixed the pyaml security vulnerability in several CMS repositories.
  • Fixed a redis security vulnerability in a container used by the CMS and HMS teams.
  • Fixed the gitea keycloak set-up so that gitea is using the correct version of keycloak. This was causing a gitea IPBO.
  • Removed a documented work-around for an issue where when Unbound manager restarts Unbound it takes down both instances.
  • Fixed an issue where CSI panics on bad CAN DNS input.
  • Fixed an issue where there was a duplicate static entry when loading SLS network entries.
  • Removed "FixMe" entries in the customization.yaml file
  • Reverted a change that made master and worker nodes unable to access the NMN, HMN, CAN and MTL networks.
  • Reverted istio back to version v1.18.3 from V1.18.4 because V1.18.4 has a bug causing its deployment to fail.

0.8.6 - 2021-02-15

  • Fixed a PyYaml security vulnerability in cray-sts
  • Fixed a PyYaml security vulnerability in postgres-operator and postgres-operator-ui.
  • Updated monitoring alerts for postgres to remove the PostgresqlCommitRateLow alert which was firing too many unneeded alerts.
  • Updated sonar-sync pod resource limits to reduce cpu throttling.
  • Updated Prometheus Node-exporter pod resource limits to reduce cpu throttling.
  • Updated Metallb-speaker pod resource limits to reduce cpu throttling.
  • Fixed a monitoring issue where a Prometheus rule failure was occurring with sysmgmt-health-cray-sysmgmt-health-postgresql-prometheus-alertt.rules.yaml and PostgresSQL-status.
  • Fixed an issue with kiali which was missing a secret and could not be logged into.
  • Fixed a highly vulnerable container image for boto3.
  • Fixed a highly vulnerable container image for postgresql.
  • Fixed a security issue where non-admin users can get access to web apps through gatekeeper.
  • Removed code for a work-around in a prior CSM release where ncn-m001 fails to join k8s.
  • Fixed an issue with CFS operator where the teardown container was not choosing the proper public ssh key.
  • Updated the RPM indexes with the latest versions of HSM CT tests.
  • Updated documentation and related script in the CSM Install documentation to remove a work-around for resolv.conf being set to 10.92.100.255 in unbound.
  • Updated documentation and related script to remove a work-around for the token-certs-refresh.sh script needing to include a config parameter.
  • Updated the base pg autoscaling weight for the cephFS data/meta-data pools to address scale issues.
  • Fixed an issue where cloud-init failed to switch NCN BMCs from DHCP to Static.
  • Added documentation to CSM Validation about running automated Goss tests and noting tests that have known issues.
  • Fixed set-dhcp-to-static script which was causing worker nodes to have a new IP after reboot.

0.8.5 - 2021-02-14

  • Updated the UAS Helm chart to handle multiple macvlan Network definitions - Part 2.
  • Released cray-uas-mgr v1.11.6.
  • Added documentation for a work-around for when ethernet interfaces didn't come up during boot on ncn-w001 and ncn-m002.
  • Updated documentation for network configurations for customers moving from V1.3 to V1.4.
  • Added documentation for a work-around for FIXME and local routes in the customizations.yaml macvlan setup.
  • Added documentation for storage node cabling into the CSM install documents.
  • Updated V1.4 reinstall instructions with step to pull and tag container image required by shasta-cfg. Also fixed a reference typo.
  • Updated CSM Metal documentation to update incorrect boot times.

0.8.4 - 2021-02-12

  • Updated the UAS Helm chart to handle multiple macvlan Network definitions - Part 1.
  • Added UAS documentation to CSM documents.
  • Added information to the install guide for interpreting CMS Test results.
  • Fixed UAS and WLM macvlan routes and networks.
  • Added documentation regarding CSI arguments for Mountain Hill configurations.
  • Updated documentation and the vlan1 MAGP and ACL configuration to fix a failed NCN-m991 reboot away from pit.
  • Added documentation for a Work-Around for when Unbound manager restarts Unbound it takes down both instances.
  • Removed an invalid CSI RPM from the CSM tarball.
  • Fixed an NCN automated testing error where they were querying non-existent nodes.
  • Fixed a kubernetes automated testing issue where master nodes were getting queried on the wrong ports.
  • Fixed csm-image-recipe-import-master so that it no longer has an ImagePullBackOff Error.
  • Added internal use documentation on reconfiguring the bootstrap registry.
  • Fixed UAI pod so that it no longer has an ImagePullBackOff Error.

0.8.3 - 2021-02-11

  • Added documentation to include how to use ldaps.
  • Added functionality to improve exception handling, logging, and periodic state syncing for cfs.
  • Fixed builds that were failing due to a new pip release.
  • Added documentation to cover an issue where CSI generated customizations were not being referenced.
  • Added functionality to CSI that changed the ordering of spine, leaf, and aggregation switches. The prior ordering was making upgrading from V1.3 more difficult.
  • Fixed CSI to generate the Shasta V1.3 api-gw url (.local) instead of the V1.5 (.nmn)
  • Fixed Dracut to wait for an MLAG suspension to be released.
  • Reverted a Work-around which changed internal_api to reference Shasta V1.3 DNS.
  • Fixed an issue where the zmq_curve image needs to be cached on pit.
  • Added documentation for UAN, Worker and Master cabling to the management network.
  • Fixed an issue where image customization was not using the correct key to access the image. The key for the image is now in the hosts file.
  • Fixed an issue where CSM 0.8.2 was picking up the wrong cray-site-init rpm.
  • Fixed an issue where etcd cluster restore failed. The totum/curl image is now in the manifest.
  • Added documentation for several changes that should be generated by CSI or removed.
  • Updated switch upgrade and rename documents to include aggregation switches.
  • Added documentation for forwardZone in DNS unbound.

0.8.2 - 2021-02-10

  • Fixed an issue where cfs operator, alpine/git, and cray-aee were missing in an offline install.
  • Fixed CSI so that it doesn't generate network configs with overlapping VLANs.
  • Fixed an issue in CSI where it generates an empty site-to-system-lookups in customizations.yaml.
  • Fixed an issue where cray-mes is missing from the docker list.
  • Added a change where internal_api references Shasta V1.3 DNS.

0.8.1 - 2021-02-09

  • Added documentation for a WAR for creating static DHCP and DNS entries for UAN CAN interfaces.
  • Fixed a regression from Shasta V1.3.2 where prometheus alerts were firing incorrectly for kube-controller and kube-scheduler.
  • Fixed an issue where CFS was not able to ssh to an NCN worker node during image customization.
  • Added documentation for clarifying user requirements in UAS/UAI Validation Instructions.
  • Fixed a race condition with cfs-state-reporter accessing DNS host lookup before it is available during installation.
  • Added functionality to stop prescribing IP's for MEDS.
  • Fixed an issue where HPE firmware was not present in the shasta firmware product stream.
  • Fixed an issue in CSI where it had the wrong default for the starting mountain cabinet.
  • Fixed the NCN Chrony config to allow it to accept requests from additional networks
  • Added functionality to basecamp to allow it to identify NCN's by either MAC in the management board.
  • Added documentation to update install documentation based on internal installation validation.
  • Added functionality to apply custom formatting to NCN test output.
  • Added functionality to CSI to generate static routes to the NCN's for Mountain/Hill networks.
  • Fixed vault and spire so they don't immediately hit an ImagePullBackOff error.
  • Added missing images that were causing pods to ImagePullBackOff error during installation.
  • Fixed the customization.yaml so that Velero no longer fails with an invalid template error.
  • Fixed a few references to incorrect container images at run-time.
  • Fixed an issue where the keycloak-vcs-user-1 pod was not pulling from a specific version.
  • Fixed an issue where CFS could not apply a build configuration for lack of access to the NCN worker.

0.8.0 - 2021-02-07

  • Added documentation on how to prepare a V1.3 system for a V1.4 install.
  • Added information to the CSM installation documentation to cover the csi config init --site-domain and --ntp-pool options.
  • Added information to the CSM installation documentation to cover how to configure discontiguous cabinets.
  • Added improved documentation for LiveCD SETUP Configuration Payload.
  • Added documentation for renaming and renumbering switches when migrating from V1.3 to V1.4
  • Added documentation for LiveCD installation to set a system name variable that is used by several commands.
  • Added documentation to work around an issue where NCNs are getting a DHCP lease from kea when the BMCs are set to static. This results in multiple IP addresses.
  • Added documentation on how to retrieve customizations.yaml after CSM is installed.
  • Added functionality that allows the firmware directory, and all of its subdirs, to be fetched over HTTP.
  • Added documentation to update install documentation based on internal installation validation.
  • Fixed an issue with root/bin/set-sqfs-links.sh, where it was creating broken symlinks.
  • Fixed an issue where the goss service is not running.
  • Fixed syntax errors in CSM install.sh.
  • Added documentation for rerunning install.sh if Nexus has issues during the install.
  • Fixed a typo in dhcp-helper.py that was causing a failure.

0.7.30 - 2021-02-05

  • Added documentation to provide a work-around for PXE unexpected network errors (PXE-E99).
  • Added documentation describing the CMS component validation process.
  • Fixed the HMS CT tests to pull from the hms-pytest image in the NCN's image registry rather than from DTR.
  • Added documentation on steps for rebooting m001 and having it join the cluster.
  • Fixed functionality to copy the iPXE configuration and NCN boot image into the platforms so they are available for BSS to use to boot NCN's. Added step to set TOKEN up and some minor edits.
  • Fixed an issue where nodes sometimes need to be rebooted to join the cluster.
  • Fixed customizations.yaml to use settings generated by CSI and to persist it to the site-init secret in the loftsman namespace. This allows product installers to read from it.
  • Added documentation for workarounds for m001 during installation.
  • Added documentation to update install documentation based on internal installation validation.
  • Fixed an issue with internally cached docker images to be docker images with default docker.io prefixes. This allows multus to have a proper pre-cached image on air-gapped systems.
  • Added a new tool 'bc' command which is required for the new Goss output parsing scripts.
  • Updated documentation to move install contents to 006.PLATFORM-INSTALL.md for RC1 only.
  • Fixed an issue where container images were trying to be pulled in from external sources in an air-gapped system.
  • Fixed an issue where istio-system containers were not in the CSM release.
  • Added River BIOS/BMC to the liveCD installer.

0.7.29 - 2021-02-04

  • Fixed the configuration for UAN nodes back to expecting a non-bonded interface.
  • Fixed an issue with SMA telemetry by adding a Work-Around back into the release.
  • Added dtr.dev.cray.com/cray/proxyv2:1.6.13-cray1 to the Nexus precache-images config.
  • Increased the Nexus disk size to 1TB.
  • Added documentation for HMS post-install validation.
  • Added HMS CT tests and hms-pytest image to CSM indexes.
  • Added a fix to FAS which had a locks issue when validating Olympus compute blade node BIOS.
  • Added documentation for CSI stating that the validate --services and --network options no longer exist.
  • Fixed Goss ncn-kubernetes-checks that were broken and reporting 9 failures for each worker node.
  • Added a procedure to configure bootstrap Nexus in an airgap setting.
  • Added new test output formatting to all automated test scripts.
  • Added a fix for install.sh when wait-for-unbound completed too soon.
  • Fixed an issue where rgw-vip.hmn resolves to a VIP that does not exist.
  • Added documentation for shasta.cfg after reflow/add optimization to bypass tracked secrets.
  • Fixed an issue where an incorrect istio version was included in a recent change.
  • Added documentation for adding USB to the prereqs.

0.7.28 - 2021-02-03

  • Fixed istio sidecar and other istio components to remove a sudo security vulnerability.
  • Fixed the NCN Postgres Health check to properly handle a different patroni version than the standard one used in CSM.
  • Fixed CAPMC which was failing on an unexpected datatype, PowerCapacityWatts, from Redfish.
  • Fixed an issue with kubeadm init which was logging warnings on a version validity check that is not used.
  • Fixed CSI to correctly support default subrole mapping in application_node_config.yaml.
  • Fixed Unbound Manager to support a new data format for cabinet networks provided by CSI.
  • Updated install documentation structure and flow.
  • Updated documentation on chain booting and loading cloud-init.
  • Fixed CSI to provide the correct dns-server IP in NMN.conf and HMN.conf.

0.7.27 - 2021-02-02

  • Fixed an issue where cray-import-config was using a docker container from an unstable branch as it was included as part of the product catalog. A formal release docker image is now being used.
  • Fixed an issue where BSS and Master Node 1 have mismatched MAC addresses. BSS now gets the right Ethernet interfaces.
  • Fixed an issue where CSI IP addressing for Hill/Mountain cabinets overlaps some subnets.
  • Fixed an issue where the dns-server IP in NMN.conf, HMN.conf, and MTL.conf is not correct.
  • Updated documentation for switch metadata which had several issues with commands and descriptions.
  • Fixed an issue where LiveCD was broken because MTL Network DNS was pointing to unbound.
  • Added functionality where CSI generates the updated UAI macvlan in the customizations.yaml.
  • Added functionality to update kea to load new subnet data for RVR and MTN subnets in SLS.
  • Updated documentation to include command prompt conventions for CSM documentation.
  • Fixed an issue in CSI where an IP from the can-gateway input was not being used for networks/CAN.yaml or sls_input_file.json.
  • Fixed an issue where time is inconsistent on NCNs using chronyd.
  • Fixed an issue where CSI hits a panic when mountain-cabinets = 0.
  • Fixed an issue where data generated for sls-input-file.json contained <nils>'s.
  • Fixed an issue in CSI where empty instance ID's were generated in data.json.
  • Fixed an issue in CSI where duplicate VlanIDs were generated for both river and mountain cabinets.

0.7.26 - 2021-02-01

  • Updated documentation to make sure spine, leaf, and CDU switches are using a valid NTP server for V1.4.
  • Fixed several issues with manifestgen to properly merge the customizations and manifest.
  • Fixed an error stating "no healthy upstream" when using the prometheus UI after authorizing through keycloak.
  • Fixed an error on CSM import to gitea. Readiness checking was added to make sure gitea is up and ready before starting the import.
  • Fixed REDS so that it uses the new syslog aggregator DNS name.
  • Fixed the git tag version for customer releases of CSI to remove the "dirty build" designation. That designation is reserved for internal builds.
  • Added documentation defining that peers in the mtlb.yml need to be aggregation switches for some systems.
  • Fixed an issue where the pit.release file is empty in earlier CSM releases.
  • Added checking in spire-update-bss to make sure /params/Global is available before updating bss.
  • Added documentation regarding Mellanox commands for shifting ip-helper from a leaf to a Mellanox spine.
  • Added documentation that describes how to split the shasta-cfg init and encryption steps.
  • Fixed meds/reds/rts generators to use hyphens instead of underscores.

0.7.25 - 2021-01-31

  • Added minor changes to the ncnHealthChecks scripts.
  • Added CMS CT Tests to be part of CSM Validation.
  • Fixed an issue where CFS fails to load inventory during image customization.
  • Refactored shasta-cfg to accommodate bundled CSM releases and simplify update mechanics.
  • Fixed an issue with install.sh which was failing on a check for registry.local and packages.local names.
  • Added shasta.cfg to the CSM media. A system which has never been installed before needs the files from STABLE included in the CSM media.
  • Updated the location of switch firmware on LiveCD.

0.7.24 - 2021-01-29

  • Added security vulnerability remediation for PyYAML in node-discovery.
  • Added security vulnerability remediation for PyYAML in keycloak-installer.
  • Fixed an issue where a CSM import Job was running before Gitea was fully up and running. The job now waits for Gitea to be ready.
  • Added an option to CSI to set which switches metallb should peer with.
  • Fixed an issue where Goss livecd-preflight-check was reporting errors due to the goss-k8s-get-nodes test being run before the NCNs are booted.
  • Added CAN networking support in kea to support m001 booting.
  • Fixed an issue where the kubernetes image would not boot in CSM v0.7.23.
  • Fixed an issue where dtr.drv.cray.com/zeromq/zeromq:v4.0.5 was missing in CSM v0.7.22.

0.7.23 - 2021-01-28

  • Fixed cfs-state-reporter to verify that the spire-agent daemon is healthy
  • Updated the NCN build image references to use local copies rather than external ones. The external copies are being brought in and stored locally for better version control and access.
  • Fixed kdump-early and kdump so they can start at boot. This is part one, since more work is needed to make them function properly.
  • Fixed error checking in configure-ntp.sh so that it can detect when basecamp is not returning a payload. This can be one of the contributors to time sync issues.
  • Fixed an issue where HSN nid names were not created due to a move from ansible. After the fix users can look up the nid names
  • Fixed CSI to include a range of IP addresses for Workload Managers (WLMs)
  • Added networking hand-off instructions to make sure the dnsmasq service is turned off after the CSM installer completes
  • Reverted a change to move to dnsmasq rather than resolv.conf for NCN DNS resolution. The change restores the use of resolv.conf. It is being reverted due to DNS latency.

0.7.22 - 2021-01-27

  • Updated the NCN Health Checks documentation.
  • Fixed CSI to map default Hill cabinet IDs to valid VLAN IDs.
  • Fixed an issue in a recent change that set resolv.conf to unbound 10.92.10.255 in the NCN image.
  • Updated the BGP settings script to support new versions of Aruba firmware.
  • Fixed an issue in CSI which did not define a dns-server in NMN.conf, HMN.conf, and MTL.conf.
  • Fixed an issue in CSI that was setting the wrong secret for SMA.
  • Pinned image versions for cray-etcd-back-up and cray-etcd-defrag to the CSM release.
  • Fixed an issue in dracut where storage nodes were failing to boot because /dev/disk/by-label/SQFSRAID did not exist.
  • Fixed the set-dns-config.sh script to prevent breaking DNS for NCNs.

0.7.21 - 2021-01-26

  • Added the Shasta V1.4 Management Network Installation documentation.
  • Added etcd backups for the Firmware Action Service (FAS).
  • Fixed an issue where UAI classes were not turning off compute networks when they were set to false.
  • Added a script for embedding staging configurations into an ISO to create environment specific ISO's for LiveCD virtual ISO booting.
  • Added a script to automate the process of mounting and booting a LiveCD virtual ISO on HPE ILO machines.
  • Added several Goss tests:
    • Added a test to validate the switchport MTU on pit.
    • Added a test to validate storage K8s configmaps.
    • Added a test to check that all LoadBalancer type k8s services have an external IP.
    • Added a test to validate BGP peers on switches.
    • Added a test to check that dnsmasq defined gateways are resolvable.
  • Fixed an issue where no management network interface was created for worker node 3 on a system.
  • Fixed an issue where dhcp-helper.py was not setting alias hostnames for computes and UANs. This was impacting nid DNS reservations.
  • Fixed an issue with getting weave.yml externally. It is now retrieved from a local copy.
  • Fixed an issue in SLS, where the wait-for-postgres job was getting an OOM error. Updated the base chart so that the job now has more memory.

0.7.20 - 2021-01-25

  • Added a DNS Tool script that allows management of DNS entries. This allows development teams and developers to manage DNS entries during development.
  • Added a cray-vault label to the cray-vault helm chart to allow vault backups using velero.
  • Added velero-restic-restore-helper:v1.5.4 image to enable vault restores using velero.
  • Fixed an issue where the HSM Inventory did not show processors or drives. This was blocking WLM functionality.
  • Added a fix for KEA not sending the correct default network to the NCNs.
  • Added an istio-ingress-gateway-local service and reset the externalTrafficPolicy to Cluster on the original service - to give both local and cluster access to the gateway.
  • Added pit.nmn to the cloud-init data for correct population of /etc/hosts on the NCNs. This automation removes a work-around.
  • Set resolv.conf to point to unbound (10.92.100.225) in the NCN image.
  • Added switches to the CSI-generated host records.
  • Updated documentation to remove an "error: not ready" message on packages.local.
  • Added a fix for Workers and Managers resolving packages.local to different IP addresses.
  • Updated documentation because the location of the CSI RPM changed.
  • Added a fix for an incorrect image being referenced by the Unbound cronjob.
  • Fixed an issue where cms-ipxe-1.4.6 was not being called out in the packaging manifest.
  • Fixed an issue where the hpe-csm-scripts RPM was pulling in a defunct craycli-wrapper RPM.
  • Added improvements to the aruba_set_bgp_peers.py installation script.
    • Added option to pick which switches to set up BGP on.
    • Added input error checking.
    • Added a command to write memory at the end of the script.
    • Added a comment at the end of the script to show which switches were updated.

0.7.19 - 2021-01-24

  • Trimmed documentation to remove an unnecessary copy of dnsmasq.d and ephemeral configs.
  • Removed a work-around for adding spire info to the BSS cloud-init metadata.
  • Fixed a missing HSN subnet entry in SLS.
  • Fixed IP address lookups to xnames on NCNs.
  • Fixed ipxe operations not being permitted intermittently during booting.
  • Fixed GOSS servers alternating between starting and stopping their service during booting.
  • Fixed a missing spire-0.8.6 helm chart in the CSM distribution.

0.7.18 - 2021-01-22

  • Added more clean-up from disabling benji functionality - Fixed a build issue related to pulling in upstream operator charts.
  • Fixed an issue preventing wipes of NCN disks before installation.
  • Fixed an issue where IMS jobs were verifying an SSL certificate for Rados Gateway that is not there.
  • Removed code that automatically disables Redfish Endpoints when they are not available - This was causing nodes to be designated as empty.
  • Converted the taints and labels in the platform-kubernetes ansible playbook into charts and labels to match the V1.4 environment.
  • Added documentation change to show an option for csi config init --hill cabinets.
  • Added documentation change to resolve an install issue where metalb.yml was not found using the install.sh script.
  • Added documentation changes for resetting BMC IP addresses.
  • Fixed tftp access during compute node booting by setting externalTrafficPolicy to Cluster.
  • Added documentation to add a CSM_RELEASE variable to simplify commands and fix the path to the CSM tarball.
  • Added Spire data to BSS. This was missing due to a deployment ordering issue.
  • Added registration of DVS spire data with the Spire Server.

0.7.17 - 2021-01-21

  • Added postgres monitoring for the spire service. This was an addition to the postgres monitoring for other services that was recently added.
  • Added docs and simple code fixes for switch firmware installation when building the LiveCD.
  • Pinned versions for several RPMs in the CSM distribution.
  • Added a new Goss test for bond0 members on the LiveCD.
  • Fixed LiveCD preflight test requiring a switch_mellanox_password environment variable.
  • Updated Goss Tests with standardized metadata and descriptions.
  • Fixed an issue in the ceph wait-for job that was not properly blocking services from starting before ceph was up and ready.

0.7.16 - 2021-01-20

  • Added the missing pit.nmn entry to the LiveCD cloud-init data so that it is included in /etc/hosts on the NCNs.
  • Fixed the DNS alias rsyslog_agg_service_hmn.local which is required for MEDS to exit.
  • Updated the documentation to have the HMS SHCD parser look for an absolute path to the SHCD file, so that the parser does not fail to find it.

0.7.15 - 2021-01-20

  • Firmware Action Service (FAS) was updated to use a new S3 server to support HPE firmware updates.
  • The CSI version is now pinned in LiveCD.
  • Kubernetes Stacked etcd configuration has been reverted to the Shasta V1.3 external implementation.
  • Fixed a disk space issue by direct mounting a partition for kubernetes on worker and master nodes.
  • Updated Weave to v2.8.0 and pinned it to the CSM release.
  • The CSM V0.7.14 CSI change for dnsmasq configs was missing a modification, which caused the NMN and HMN to have the wrong dnsmasq configuration. The modification was added.

0.7.14 - 2021-01-19

  • Removed a plain text password from the file keycloak-user.py. The password is now passed in as a secret through vault.
  • Fixed a crash loop backoff issue with the conman log-forwarding container.
  • Fixed an issue with address pools in the External DNS services due to a mismatched configuration. This caused the services to have ExternalIP "".
  • Added a Round-Robin DNS Entry to the RADOS gateway to allow for testing of a fix for S3 failures.
  • Added a change to ensure that all of the RPMs used to build the Kubernetes and Storage-Ceph NCN images are included with the CSM release distribution. They are not currently uploaded into a Nexus repository.
  • Updated CSI to use the MLAG VIP IP as the gateway, instead of the PIT node, in dnsmasq configs for the MTL, NMN, and HMN networks. This is needed because it helps routing while cloud-init is starting up during the squashFS boot.
  • When moving from the V1.3 to V1.4 release, switch configurations need updating. Documentation was added to define how to do the updating manually. A script is planned for the future.
  • Added a missing binary, lsscsi, to the LiveCD for SCSI device support.

0.7.13 - 2021-01-15

0.7.12 - 2021-01-14

0.7.11 - 2021-01-13

0.7.10 - 2021-01-12

0.7.9 - 2021-01-11

0.7.8 - 2021-01-08

0.7.7 - 2021-01-07

0.7.6 - 2021-01-06

0.7.5 - 2021-01-05

0.7.4 - 2021-01-03

0.7.3 - 2020-12-18

0.7.2 - 2020-12-17

0.7.1 - 2020-12-16

0.7.0 - 2020-12-15

0.6.2 - 2020-12-11