This procedure will install CSM applications and services into the CSM Kubernetes cluster.
Note: Check the information in Known Issues before starting this procedure to be warned about possible problems.
- Initialize Bootstrap Registry
- Create Site-Init Secret
- Deploy Sealed Secret Decryption Key
- Deploy CSM Applications and Services
- Setup Nexus
- Set NCNs to use Unbound
- Apply Pod Priorities
- Apply After Sysmgmt Manifest Workarounds
- Known Issues
- Next Topic
NOTE
The bootstrap registry runs in a default Nexus configuration, which is started and populated in this section. It only exists during initial CSM install on the PIT node in order to bootstrap CSM services. Once CSM install is completed and the PIT node is rebooted as an NCN, the bootstrap Nexus no longer exists.
-
Verify that Nexus is running:
pit# systemctl status nexus
-
Verify that Nexus is ready. (Any HTTP response other than 200 OK indicates Nexus is not ready.)
pit# curl -sSif http://localhost:8081/service/rest/v1/status/writable
Expected output looks similar to the following:
HTTP/1.1 200 OK
Date: Thu, 04 Feb 2021 05:27:44 GMT
Server: Nexus/3.25.0-03 (OSS)
X-Content-Type-Options: nosniff
Content-Length: 0
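During bring-up, Nexus can take a while to become writable, so the check above may need to be repeated. The retry loop below is an optional sketch: the `wait_until` helper, attempt count, and delay are illustrative choices, not part of CSM.

```shell
# wait_until ATTEMPTS DELAY CMD...: retry CMD until it succeeds (exit 0),
# sleeping DELAY seconds between attempts; fail after ATTEMPTS tries.
wait_until() {
  local attempts=$1 delay=$2 i=0
  shift 2
  while [ "$i" -lt "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Poll the writable endpoint every 10 seconds for up to 5 minutes;
# curl -f exits nonzero on any response other than 2xx:
# wait_until 30 10 curl -sSif http://localhost:8081/service/rest/v1/status/writable
```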
-
Load the skopeo image installed by the cray-nexus RPM:
pit# podman load -i /var/lib/cray/container-images/cray-nexus/skopeo-stable.tar quay.io/skopeo/stable
-
Use skopeo sync to upload container images from the CSM release:
pit# export CSM_RELEASE=csm-x.y.z
pit# podman run --rm --network host \
    -v /var/www/ephemeral/${CSM_RELEASE}/docker/dtr.dev.cray.com:/images:ro \
    quay.io/skopeo/stable sync --scoped --src dir --dest docker \
    --dest-tls-verify=false --dest-creds admin:admin123 /images localhost:5000
NOTE
As the bootstrap Nexus uses the default configuration, the above command uses the default admin credentials (user admin with password admin123) in order to upload to the bootstrap registry, which is listening on localhost:5000.
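To spot-check that the sync populated the bootstrap registry, the standard Docker Registry v2 catalog endpoint can be queried. `count_repos` below is a small helper written here for illustration; it is not a CSM tool.

```shell
# Count the repositories in a Docker Registry v2 catalog response read from
# stdin, e.g. {"repositories":["repo/a","repo/b"]}. Every quoted string
# except the leading "repositories" key is a repository name; no jq needed.
count_repos() {
  grep -o '"[^"]*"' | tail -n +2 | wc -l | tr -d ' '
}

# Against the bootstrap registry (expect a nonzero count after the sync):
# curl -s 'http://localhost:5000/v2/_catalog?n=1000' | count_repos
```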
The site-init secret in the loftsman namespace makes /var/www/ephemeral/prep/site-init/customizations.yaml available to product installers. The site-init secret should only be updated when the corresponding customizations.yaml data is changed, such as during system installation or upgrade. Create the site-init secret to contain /var/www/ephemeral/prep/site-init/customizations.yaml:
pit# kubectl create secret -n loftsman generic site-init --from-file=/var/www/ephemeral/prep/site-init/customizations.yaml
Expected output looks similar to the following:
secret/site-init created
NOTE
If the site-init secret already exists, then kubectl will error with a message similar to:
Error from server (AlreadyExists): secrets "site-init" already exists
In this case, delete the site-init secret and recreate it.
First delete it:
pit# kubectl delete secret -n loftsman site-init
Expected output looks similar to the following:
secret "site-init" deleted
Then recreate it:
pit# kubectl create secret -n loftsman generic site-init --from-file=/var/www/ephemeral/prep/site-init/customizations.yaml
Expected output looks similar to the following:
secret/site-init created
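To confirm that the stored secret actually matches the local file, the payload can be extracted and compared by checksum. This is an optional sketch: `same_content` is a local helper and the temporary path is arbitrary.

```shell
# Succeed when two files have identical SHA-256 checksums.
same_content() {
  [ "$(sha256sum < "$1" | cut -d' ' -f1)" = "$(sha256sum < "$2" | cut -d' ' -f1)" ]
}

# Extract the stored copy and compare it with the source file:
# kubectl get secret -n loftsman site-init \
#     -o jsonpath='{.data.customizations\.yaml}' | base64 -d > /tmp/from-secret.yaml
# same_content /tmp/from-secret.yaml /var/www/ephemeral/prep/site-init/customizations.yaml \
#     && echo match || echo differ
```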
WARNING
If for some reason the system customizations need to be modified to complete product installation, administrators must first update customizations.yaml in the site-init Git repository, which may no longer be mounted on any cluster node, and then delete and recreate the site-init secret as shown below.
To read customizations.yaml from the site-init secret:
ncn# kubectl get secrets -n loftsman site-init -o jsonpath='{.data.customizations\.yaml}' | base64 -d > customizations.yaml
To delete the site-init secret:
ncn# kubectl -n loftsman delete secret site-init
To recreate the site-init secret:
ncn# kubectl create secret -n loftsman generic site-init --from-file=customizations.yaml
Deploy the corresponding key necessary to decrypt sealed secrets:
pit# /var/www/ephemeral/prep/site-init/deploy/deploydecryptionkey.sh
An error similar to the following may occur when deploying the key:
Error from server (NotFound): secrets "sealed-secrets-key" not found
W0304 17:21:42.749101 29066 helpers.go:535] --dry-run is deprecated and can be replaced with --dry-run=client.
secret/sealed-secrets-key created
Restarting sealed-secrets to pick up new keys
No resources found
This is expected and can safely be ignored.
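When scripting this step, the benign messages above make the output hard to judge at a glance; one option is to capture it and look for the line that confirms the key was created. `key_created` is a helper sketched here for illustration, not part of CSM.

```shell
# Succeed when the captured deploydecryptionkey.sh output reports that the
# sealed-secrets key secret was created.
key_created() {
  printf '%s\n' "$1" | grep -q 'secret/sealed-secrets-key created'
}

# out=$(/var/www/ephemeral/prep/site-init/deploy/deploydecryptionkey.sh 2>&1)
# key_created "$out" && echo "decryption key deployed"
```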
Run install.sh to deploy CSM applications and services. This command may take 25 minutes or more to run.
NOTE
install.sh requires various system configuration files, which are expected to be found in the locations used in preceding documentation; however, it needs to know SYSTEM_NAME in order to find the metallb.yaml and sls_input_file.json configuration files. Some commands also require the CSM_RELEASE variable to be set.
Verify that the SYSTEM_NAME and CSM_RELEASE environment variables are set:
pit# echo $SYSTEM_NAME
pit# echo $CSM_RELEASE
If they are not set, perform the following:
pit# export SYSTEM_NAME=eniac
pit# export CSM_RELEASE=csm-x.y.z
pit# cd /var/www/ephemeral/$CSM_RELEASE
pit# ./install.sh
On success, install.sh will output OK to stderr and exit with status code 0, e.g.:
pit# ./install.sh
...
+ CSM applications and services deployed
install.sh: OK
In the event that install.sh
does not complete successfully, consult the
known issues below to resolve potential problems and then try
running install.sh
again.
IMPORTANT: If install.sh must be re-run to redeploy failed ceph-csi provisioners, first delete any jobs that have not completed. These are left in place to allow investigation on failure; they are automatically removed on a successful deployment.
pit# kubectl get jobs
NAME COMPLETIONS DURATION AGE
cray-ceph-csi-cephfs 0/1 3m35s
cray-ceph-csi-rbd 0/1 8m36s
If these jobs exist, delete them with kubectl delete job <jobname> before running install.sh again.
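The cleanup can be done in one pass by parsing the job table. `incomplete_jobs` below is a helper sketched for illustration; it assumes the default `kubectl get jobs` output format shown above.

```shell
# Print the NAME of every job whose COMPLETIONS column is not 1/1,
# skipping the header row of the default `kubectl get jobs` table.
incomplete_jobs() {
  awk 'NR > 1 && $2 != "1/1" { print $1 }'
}

# Delete the incomplete provisioner jobs before re-running install.sh:
# kubectl get jobs | incomplete_jobs | xargs -r -n 1 kubectl delete job
```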
Run ./lib/setup-nexus.sh
to configure Nexus and upload CSM RPM repositories,
container images, and Helm charts. This command may take 20 minutes or more to run.
pit# ./lib/setup-nexus.sh
On success, setup-nexus.sh will output OK to stderr and exit with status code 0, e.g.:
pit# ./lib/setup-nexus.sh
...
+ Nexus setup complete
setup-nexus.sh: OK
In the event of an error, consult the known issues below to resolve potential problems and then try running setup-nexus.sh again. Note that subsequent runs of setup-nexus.sh may report FAIL when uploading duplicate assets. This is OK as long as setup-nexus.sh outputs setup-nexus.sh: OK and exits with status code 0.
First, verify that SLS properly reports all management NCNs in the system:
pit# ./lib/list-ncns.sh
On success, each management NCN will be output, e.g.:
pit# ./lib/list-ncns.sh
+ Getting admin-client-auth secret
+ Obtaining access token
+ Querying SLS
ncn-m001
ncn-m002
ncn-m003
ncn-s001
ncn-s002
ncn-s003
ncn-w001
ncn-w002
ncn-w003
If any management NCNs are missing from the output, take corrective action before proceeding.
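Missing nodes can also be detected automatically by diffing the script output against the expected node list. `missing_ncns` is a helper sketched here; the expected list must be adjusted to the actual node count of the system being installed.

```shell
# Print each expected NCN name that is absent from the actual list.
# $1: newline-separated names actually reported; remaining args: expected.
missing_ncns() {
  local actual=$1 n
  shift
  for n in "$@"; do
    printf '%s\n' "$actual" | grep -qx "$n" || echo "$n"
  done
}

# expected="ncn-m001 ncn-m002 ncn-m003 ncn-s001 ncn-s002 ncn-s003 ncn-w001 ncn-w002 ncn-w003"
# missing=$(missing_ncns "$(./lib/list-ncns.sh | grep '^ncn-')" $expected)
# [ -z "$missing" ] || echo "missing from SLS: $missing"
```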
Next, run lib/set-ncns-to-unbound.sh
to SSH to each management NCN and update
/etc/resolv.conf to use Unbound as the nameserver.
pit# ./lib/set-ncns-to-unbound.sh
NOTE
If passwordless SSH is not configured, the administrator will have to enter the corresponding password as the script attempts to connect to each NCN.
On success, the nameserver configuration in /etc/resolv.conf will be printed for each management NCN, e.g.,:
pit# ./lib/set-ncns-to-unbound.sh
+ Getting admin-client-auth secret
+ Obtaining access token
+ Querying SLS
+ Updating ncn-m001
Password:
ncn-m001: nameserver 127.0.0.1
ncn-m001: nameserver 10.92.100.225
+ Updating ncn-m002
Password:
ncn-m002: nameserver 10.92.100.225
+ Updating ncn-m003
Password:
ncn-m003: nameserver 10.92.100.225
+ Updating ncn-s001
Password:
ncn-s001: nameserver 10.92.100.225
+ Updating ncn-s002
Password:
ncn-s002: nameserver 10.92.100.225
+ Updating ncn-s003
Password:
ncn-s003: nameserver 10.92.100.225
+ Updating ncn-w001
Password:
ncn-w001: nameserver 10.92.100.225
+ Updating ncn-w002
Password:
ncn-w002: nameserver 10.92.100.225
+ Updating ncn-w003
Password:
ncn-w003: nameserver 10.92.100.225
NOTE
The script connects to ncn-m001, which will be the PIT node; its password may differ from that of the other NCNs.
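After the script completes, the result can be re-verified across the NCNs. `nameservers` is a helper sketched here; the node list and address in the comments are taken from the example output above, not fixed values.

```shell
# Print the address of every nameserver line in resolv.conf content on stdin.
nameservers() {
  awk '$1 == "nameserver" { print $2 }'
}

# for ncn in ncn-m002 ncn-m003 ncn-s001 ncn-s002 ncn-s003 \
#            ncn-w001 ncn-w002 ncn-w003; do
#   echo "$ncn: $(ssh "$ncn" cat /etc/resolv.conf | nameservers | tr '\n' ' ')"
# done
```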
Run the add_pod_priority.sh
script to create and apply a pod priority class to services critical to CSM. This will give these services a higher priority than others to ensure they get scheduled by Kubernetes in the event that resources are limited on smaller deployments.
pit# /usr/share/doc/csm/upgrade/1.0/scripts/upgrade/add_pod_priority.sh
Creating csm-high-priority-service pod priority class
priorityclass.scheduling.k8s.io/csm-high-priority-service configured
Patching cray-postgres-operator deployment in services namespace
deployment.apps/cray-postgres-operator patched
Patching cray-postgres-operator-postgres-operator-ui deployment in services namespace
deployment.apps/cray-postgres-operator-postgres-operator-ui patched
Patching istio-operator deployment in istio-operator namespace
deployment.apps/istio-operator patched
Patching istio-ingressgateway deployment in istio-system namespace
deployment.apps/istio-ingressgateway patched
.
.
.
After running the add_pod_priority.sh
script, the affected pods will be restarted as the pod priority class is applied to them.
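A quick spot-check that a patched deployment actually carries the class is to query priorityClassName from its pod template. The deployment and namespace in the comments are one example pair from the script output above; `has_priority` is a local helper, not a CSM tool.

```shell
# Succeed when the queried value names the CSM priority class.
has_priority() {
  [ "$1" = "csm-high-priority-service" ]
}

# pc=$(kubectl get deployment -n services cray-postgres-operator \
#     -o jsonpath='{.spec.template.spec.priorityClassName}')
# has_priority "$pc" && echo patched || echo "not patched: '$pc'"
```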
Follow the workaround instructions for the after-sysmgmt-manifest
breakpoint.
The install.sh script changes cluster state and should not simply be rerun in the event of a failure without careful consideration of the specific error. It may be possible to resume installation from the last successful command executed by install.sh, but administrators will need to appropriately modify install.sh to pick up where the previous run left off. (Note: The install.sh script runs with set -x, so each command will be printed to stderr prefixed with the expanded value of PS4, namely +.)
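Because the trace goes to stderr, capturing a full log makes it easier to find the last command a failed run executed. This is an optional sketch: the log path is arbitrary and `last_commands` is a local helper.

```shell
# Show the last N traced commands (lines prefixed '+' by set -x) from a
# captured install log; N defaults to 5.
last_commands() {
  grep '^+' "$1" | tail -n "${2:-5}"
}

# Capture stdout and stderr while still watching the install:
# ./install.sh 2>&1 | tee /tmp/install.log
# last_commands /tmp/install.log 10
```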
Known potential issues with suggested fixes are listed in Troubleshoot Nexus.
After completing this procedure the next step is to redeploy the PIT node.