CASMMON-446

Cray-HPE · Oct 18, 2024 · 81cd2c6 · 81cd2c6
1 parent 547f9b1
commit 81cd2c6

Showing 12 changed files with 124 additions and 258 deletions.
diff --git a/.spelling b/.spelling
@@ -35,6 +35,13 @@ Vmalert
 Vminsert
 Vmselect
 Vmstorage
+vmstorage
+vmalert
+vmselect
+victoria
+kubectl
+kubelet
+
 
 1.0.x
 1.2.x

diff --git a/operations/README.md b/operations/README.md
@@ -454,7 +454,6 @@ confident that a lack of issues indicates the system is operating normally.
 - [Configure Prometheus Email Alert Notifications](system_management_health/Configure_Prometheus_Email_Alert_Notifications.md)
 - [Grafana Dashboards by Component](system_management_health/Grafana_Dashboards_by_Component.md)
     - [Troubleshoot Grafana Dashboard](system_management_health/Troubleshoot_Grafana_Dashboard.md)
-- [Grafterm](system_management_health/Grafterm.md)
 - [Remove Kiali](system_management_health/Remove_Kiali.md)
 - [`prometheus-kafka-adapter` errors during installation](system_management_health/Prometheus_Kafka_Error.md)
 - [`grok-exporter` errors during installation](system_management_health/Grok-Exporter_Error.md)

diff --git a/operations/kubernetes/Cert_Renewal_for_Kubernetes_and_Bare_Metal_EtcD.md b/operations/kubernetes/Cert_Renewal_for_Kubernetes_and_Bare_Metal_EtcD.md
@@ -565,26 +565,27 @@ Run the following steps from a master node.
    1. Restart Prometheus.
 
       ```bash
-      kubectl rollout restart -n sysmgmt-health statefulSet/prometheus-cray-sysmgmt-health-kube-p-prometheus
-      kubectl rollout status -n sysmgmt-health statefulSet/prometheus-cray-sysmgmt-health-kube-p-prometheus
+      kubectl rollout restart deployment -n sysmgmt-health vmagent-vms-0
+      kubectl rollout status -n sysmgmt-health deployment.apps/vmagent-vms-0
+      kubectl rollout restart deployment -n sysmgmt-health vmagent-vms-1
+      kubectl rollout status -n sysmgmt-health deployment.apps/vmagent-vms-1
       ```
 
       Example output:
 
       ```text
-      Waiting for 1 pods to be ready...
-      statefulset rolling update complete ...
+      deployment "vmagent-vms-0" successfully rolled out
       ```
 
    1. Check for any `tls` errors from the active Prometheus targets. No errors are expected.
 
       ```bash
-      PROM_IP=$(kubectl get services -n sysmgmt-health cray-sysmgmt-health-kube-p-prometheus -o json | jq -r '.spec.clusterIP')
-      curl -s http://${PROM_IP}:9090/api/v1/targets | jq -r '.data.activeTargets[] | select(."scrapePool" == "sysmgmt-health/cray-sysmgmt-health-kube-p-kube-etcd/0")' | grep lastError | sort -u
+      PROM_IP=$(kubectl get services -n sysmgmt-health vmagent-vms -o json | jq -r '.spec.clusterIP')
+      curl -s http://${PROM_IP}:8429/targets | grep kube-etcd | sort -u
       ```
 
       Example output:
 
       ```text
-        "lastError": "",
+        state=up, endpoint=https://10.252.1.10:2379/metrics, labels={endpoint="http-metrics",instance="10.252.1.10:2379",job="kube-etcd",namespace="kube-system",service="vms-kube-etcd"}, scrapes_total=28114, scrapes_failed=0, last_scrape=14838ms ago, scrape_duration=14ms, samples_scraped=1487, error=
       ```
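
The `tls` check above can be narrowed so that only failing targets are printed. The following is a minimal sketch, assuming the `/targets` text format shown in the example output (one target per line, with `state=` first and `error=` last). The function name `etcd_target_errors` is hypothetical and is not part of CSM.

```shell
#!/bin/sh
# Hypothetical helper (not part of CSM): read vmagent /targets text on stdin
# and print only the kube-etcd targets that are not up or that report an error.
# Assumes one target per line, as in the example output above.
etcd_target_errors() {
    awk '/job="kube-etcd"/ && (!/state=up,/ || !/error=[ \t]*$/)'
}

# Usage on a master node (PROM_IP as set in the step above):
#   curl -s "http://${PROM_IP}:8429/targets" | etcd_target_errors
# No output means every kube-etcd target is up with an empty error field.
```

Filtering on both `state` and `error` keeps the check meaningful even when a target is down without an error string.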
diff --git a/operations/network/dns/PowerDNS_Configuration.md b/operations/network/dns/PowerDNS_Configuration.md
@@ -4,7 +4,7 @@
 
 PowerDNS replaces the CoreDNS server that earlier versions of CSM used to provide External DNS services.
 
-The `cray-dns-powerdns-can-tcp` and `cray-dns-powerdns-can-udp` LoadBalancer resources are configured to service external DNS requests using the IP address specified by the CSI `--cmn-external-dns` command line argument.
+The `cray-dns-powerdns-can-tcp` and `cray-dns-powerdns-can-udp` `LoadBalancer` resources are configured to service external DNS requests using the IP address specified by the CSI `--cmn-external-dns` command line argument.
 
 The CSI `--system-name` and `--site-domain` command line arguments are combined to form the subdomain used for External DNS.
 
@@ -134,7 +134,7 @@ zone "8.101.10.in-addr.arpa" {
 
 The CSM implementation of PowerDNS supports the DNS Security Extensions (DNSSEC) and the signing of zones with a user-supplied zone signing key.
 
-If DNSSEC is to be used for zone transfer then the `dnssec` SealedSecret in `customizations.yaml` should be updated to include a base64 encoded version of the private key portion of the desired zone signing key.
+If DNSSEC is to be used for zone transfer then the `dnssec` SealedSecret in `customizations.yaml` should be updated to include a `base64` encoded version of the private key portion of the desired zone signing key.
 
 Here is an example of a zone signing key.
 
@@ -221,7 +221,7 @@ spec:
                   key:       dnFC5euKixIKXAr6sZhI7kVQbQCXoDG5R5eHSYZiBxY=
 ```
 
-> **`IMPORTANT`** The key used for TSIG **must** have `.tsig` in the name and unlike the zone signing key it should not be base64 encoded.
+> **`IMPORTANT`** The key used for TSIG **must** have `.tsig` in the name and unlike the zone signing key it should not be `base64` encoded.
 
 #### Example configuration for BIND
 

diff --git a/...ns/network/external_dns/External_DNS_Failing_to_Discover_Services_Workaround.md b/...ns/network/external_dns/External_DNS_Failing_to_Discover_Services_Workaround.md
@@ -30,7 +30,7 @@ Use this procedure to resolve any external DNS routing issues with backend servi
     services         sma-kibana                        [services-gateway]             [sma-kibana.cmn.SYSTEM_DOMAIN_NAME]                            2d16h
     sysmgmt-health   cray-sysmgmt-health-alertmanager  [services/services-gateway]    [alertmanager.cmn.SYSTEM_DOMAIN_NAME]                          2d16h
     sysmgmt-health   cray-sysmgmt-health-grafana       [services/services-gateway]    [grafana.cmn.SYSTEM_DOMAIN_NAME]                               2d16h
-    sysmgmt-health   cray-sysmgmt-health-prometheus    [services/services-gateway]    [prometheus.cmn.SYSTEM_DOMAIN_NAME]                            2d16h
+    sysmgmt-health   cray-sysmgmt-health-vm-select     [services/services-gateway]    [vmselect.cmn.SYSTEM_DOMAIN_NAME]                              2d16h
     ```
 
 1. (`ncn-mw#`) Inspect the `VirtualService` objects to learn the destination service and port.
@@ -47,37 +47,41 @@ Use this procedure to resolve any external DNS routing issues with backend servi
     apiVersion: networking.istio.io/v1beta1
     kind: VirtualService
     metadata:
-      creationTimestamp: "2020-07-09T17:49:07Z"
-      generation: 1
-      labels:
-        app: cray-sysmgmt-health-prometheus
-        app.kubernetes.io/instance: cray-sysmgmt-health
-        app.kubernetes.io/managed-by: Tiller
-        app.kubernetes.io/name: cray-sysmgmt-health
-        app.kubernetes.io/version: 8.15.4
-        helm.sh/chart: cray-sysmgmt-health-0.3.1
-      name: cray-sysmgmt-health-prometheus
-      namespace: sysmgmt-health
-      resourceVersion: "41620"
-      selfLink: /apis/networking.istio.io/v1beta1/namespaces/sysmgmt-health/virtualservices/cray-sysmgmt-health-prometheus
-      uid: d239dfcc-a827-4a51-9b73-6eccfb937088
-    spec:
-      gateways:
-      - services/services-gateway
-      hosts:
-      - prometheus.cmn.SYSTEM_DOMAIN_NAME
+      annotations:
+        meta.helm.sh/release-name: cray-sysmgmt-health
+        meta.helm.sh/release-namespace: sysmgmt-health
+      creationTimestamp: "2024-10-15T12:59:14Z"
+      generation: 1
+      labels:
+        app: cray-sysmgmt-health-vm-select
+        app.kubernetes.io/instance: cray-sysmgmt-health
+        app.kubernetes.io/managed-by: Helm
+        app.kubernetes.io/name: cray-sysmgmt-health
+        app.kubernetes.io/version: 0.17.5
+        helm.sh/chart: cray-sysmgmt-health-1.0.17-20241016103148_b40f1aa
+      name: cray-sysmgmt-health-vm-select
+      namespace: sysmgmt-health
+      resourceVersion: "149049132"
+      uid: d166065d-1b3b-4434-b25b-e95cb8940b01
+    spec:
+      gateways:
+      - services/services-gateway
+      - services/customer-admin-gateway
+      hosts:
+      - vmselect.cmn.SYSTEM_DOMAIN_NAME
       http:
       - match:
-        - authority:
-            exact: prometheus.cmn.SYSTEM_DOMAIN_NAME
-        route:
-        - destination:
-            host: cray-sysmgmt-health-kube-p-prometheus
-            port:
-              number: 9090
+        - authority:
+            exact: vmselect.cmn.SYSTEM_DOMAIN_NAME
+        route:
+        - destination:
+            host: vmselect-vms
+            port:
+              number: 8481
+
     ```
 
-    From the `VirtualService data`, it is straightforward to see how traffic will be routed. In this example, connections to `prometheus.cmn.SYSTEM_DOMAIN_NAME` will be routed to the
-    `cray-sysmgmt-health-prometheus` service in the `sysmgmt-health` namespace on port 9090.
+    From the `VirtualService` data, it is straightforward to see how traffic will be routed. In this example, connections to `vmselect.cmn.SYSTEM_DOMAIN_NAME` will be routed to the
+    `vmselect-vms` service in the `sysmgmt-health` namespace on port 8481.
 
 External DNS will now be connected to the backend service.
diff --git a/...external_dns/Troubleshoot_Systems_Not_Provisioned_with_External_IP_Addresses.md b/...external_dns/Troubleshoot_Systems_Not_Provisioned_with_External_IP_Addresses.md
@@ -33,7 +33,7 @@ The Customer Management Network \(CMN\) is not supported on the system.
     services         sma-kibana                        [services-gateway]             [sma-kibana.cmn.SYSTEM_DOMAIN_NAME]                            2d16h
     sysmgmt-health   cray-sysmgmt-health-alertmanager  [services/services-gateway]    [alertmanager.cmn.SYSTEM_DOMAIN_NAME]                          2d16h
     sysmgmt-health   cray-sysmgmt-health-grafana       [services/services-gateway]    [grafana.cmn.SYSTEM_DOMAIN_NAME]                               2d16h
-    sysmgmt-health   cray-sysmgmt-health-prometheus    [services/services-gateway]    [prometheus.cmn.SYSTEM_DOMAIN_NAME]                            2d16h
+    sysmgmt-health   cray-sysmgmt-health-vm-select     [services/services-gateway]    [vmselect.cmn.SYSTEM_DOMAIN_NAME]                              2d16h
     ```
 
 2. Lookup the cluster IP and port for service.
@@ -48,7 +48,7 @@ The Customer Management Network \(CMN\) is not supported on the system.
 
     ```console
     NAME                                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
-    cray-sysmgmt-health-kube-p-prometheus   ClusterIP   10.25.124.159   <none>        9090/TCP   23h
+    cray-sysmgmt-health-grafana             ClusterIP   10.25.124.159   <none>        9090/TCP   23h
     ```
 
 3. Setup port forwarding from a laptop or workstation to access the service.
@@ -62,3 +62,36 @@ The Customer Management Network \(CMN\) is not supported on the system.
     ```
 
 4. Visit `http://localhost:9090/` in a laptop or workstation browser.
+
+5. Access the `vmselect` service. It is a headless service, so it has no cluster IP address.
+
+    1. Look up the port for the service. The example below is for the `vmselect-vms` service.
+
+        ```bash
+        kubectl -n sysmgmt-health get service vmselect-vms
+        ```
+
+        Example output:
+
+        ```console
+        NAME           TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
+        vmselect-vms   ClusterIP   None         <none>        8481/TCP   14d
+        ```
+
+    1. Use `kubectl port-forward` on an NCN to connect to the `vmselect` server running in the Kubernetes cluster.
+
+        ```bash
+        kubectl port-forward -n sysmgmt-health service/vmselect-vms 8082:8481
+        ```
+
+    1. Set up port forwarding from a laptop or workstation to access the service.
+
+        `kubectl port-forward` listens on `localhost` of the node where it runs, so connect to that NCN and forward to `localhost` and the local port chosen in the previous step (8082 in this example).
+
+        Replace the port and system name values in the example below.
+
+        ```bash
+        ssh -L 9090:localhost:8082 root@SYSTEM_NCN_DOMAIN_NAME
+        ```
+
+    1. Visit `http://localhost:9090/` in a laptop or workstation browser.
diff --git a/operations/system_management_health/Access_System_Management_Health_Services.md b/operations/system_management_health/Access_System_Management_Health_Services.md
@@ -44,34 +44,33 @@ When accessing the URLs listed below, it will be necessary to accept one or more
 logging in. The details of the security warning will indicate that a self-signed certificate/unknown issuer is being used for the site. Support for incorporation of certificates from Trusted Certificate
 Authorities is planned for a future release.
 
-### Prometheus
+### VictoriaMetrics UI
 
-URL: `https://prometheus.cmn.SYSTEM_DOMAIN_NAME/`
+URL: `https://vmselect.cmn.SYSTEM_DOMAIN_NAME/select/0/prometheus/vmui`
 
-Central Prometheus instance scrapes metrics from Kubernetes, Ceph, and the hosts (part of `kube-prometheus-stack` Helm chart).
+The `vmagent` instance scrapes metrics from Kubernetes, Ceph, and the hosts (part of the `victoria-metrics-k8s-stack` Helm chart).
 
-Prometheus generates alerts based on metrics and reports them to the Alertmanager. The 'Alerts' link at the top of the page will show all of the inactive, pending, and firing alerts on the system.
+VictoriaMetrics generates alerts based on metrics and reports them to the Alertmanager. The 'Alerts' link at the top of the page will show all of the inactive, pending, and firing alerts on the system.
 Clicking on any of the alerts will expand them, enabling users to use the 'Labels' data to discern the details of the alert. The details will also show the state of the alert, how long it has been
 active, and the value for the alert.
 
-For more information regarding the use of the Prometheus interface, see
-[Getting Started/](https://prometheus.io/docs/prometheus/latest/getting_started/) in the Prometheus online documentation.
+For more information regarding the use of the VictoriaMetrics interface, see
+[Getting Started](https://docs.victoriametrics.com/) in the VictoriaMetrics online documentation.
 
 Some alerts may be falsely triggered. This occurs if they are alerts which will be improved in the future, or if they are alerts impacted by whether all software products have been installed yet.
 See [Troubleshoot Prometheus Alerts](Troubleshoot_Prometheus_Alerts.md).
 
-### Thanos
+### VMAlert
+
+URL: `https://vmselect.cmn.SYSTEM_DOMAIN_NAME/select/0/prometheus/vmalert/`
 
-URL: `https://thanos.cmn.SYSTEM_DOMAIN_NAME/`
+VMAlert executes a list of the given alerting or recording rules against the configured address.
+The VMAlert CRD declaratively defines a desired VMAlert setup to run in a Kubernetes cluster.
 
-Thanos is a set of components that can be composed into a highly available, multi Prometheus metric system with potentially unlimited storage capacity, if your Object Storage allows for it.
-It leverages the Prometheus 2.0 storage format to cost-efficiently store historical metric data in any object storage while retaining fast query latencies.
-Additionally, it provides a global query view across all Prometheus installations and can merge data from Prometheus HA pairs.
+Only the `datasource` and `notifier` configuration options are required. For the other configuration parameters, see the VictoriaMetrics documentation.
 
-For more information regarding the use of the Thanos interface, see
-[Getting Started/](https://thanos.io/tip/thanos/getting-started.md/) in the thanos online documentation.
+For each VMAlert resource, the Operator deploys a properly configured Deployment in the same namespace. The VMAlert Pods are configured to mount a list of `ConfigMaps` prefixed with `<VMAlert-name>-number` containing the configuration for alerting rules.
 
-### Alertmanager
 
 URL: `https://alertmanager.cmn.SYSTEM_DOMAIN_NAME/`
 

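The VictoriaMetrics UI and VMAlert URLs in this file share one prefix, `https://vmselect.cmn.SYSTEM_DOMAIN_NAME/select/0/prometheus`. The following is a small sketch of building them for a given system domain. The function name `vmselect_url` is hypothetical, and the `api/v1/query` path in the usage comment is an assumption based on the Prometheus-compatible query API that `vmselect` serves under the same prefix.

```shell
#!/bin/sh
# Hypothetical helper (not part of CSM): build a vmselect URL.
#   $1: system domain name (the SYSTEM_DOMAIN_NAME placeholder in the docs)
#   $2: path under the Prometheus-compatible prefix, for example "vmui"
vmselect_url() {
    printf 'https://vmselect.cmn.%s/select/0/prometheus/%s\n' "$1" "$2"
}

# Usage:
#   vmselect_url SYSTEM_DOMAIN_NAME vmui       # VictoriaMetrics UI
#   vmselect_url SYSTEM_DOMAIN_NAME vmalert/   # VMAlert
#   vmselect_url SYSTEM_DOMAIN_NAME 'api/v1/query?query=up'   # assumed query API path
```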
diff --git a/...ons/system_management_health/Configure_Prometheus_Alerta_Alert_Notifications.md b/...ons/system_management_health/Configure_Prometheus_Alerta_Alert_Notifications.md