Commit

rambabubolla committed Oct 18, 2024
1 parent 547f9b1 commit 81cd2c6
Showing 12 changed files with 124 additions and 258 deletions.
7 changes: 7 additions & 0 deletions .spelling
@@ -35,6 +35,13 @@ Vmalert
Vminsert
Vmselect
Vmstorage
vmstorage
vmalert
vmselect
victoria
kubectl
kubelet


1.0.x
1.2.x
1 change: 0 additions & 1 deletion operations/README.md
@@ -454,7 +454,6 @@ confident that a lack of issues indicates the system is operating normally.
- [Configure Prometheus Email Alert Notifications](system_management_health/Configure_Prometheus_Email_Alert_Notifications.md)
- [Grafana Dashboards by Component](system_management_health/Grafana_Dashboards_by_Component.md)
- [Troubleshoot Grafana Dashboard](system_management_health/Troubleshoot_Grafana_Dashboard.md)
- [Grafterm](system_management_health/Grafterm.md)
- [Remove Kiali](system_management_health/Remove_Kiali.md)
- [`prometheus-kafka-adapter` errors during installation](system_management_health/Prometheus_Kafka_Error.md)
- [`grok-exporter` errors during installation](system_management_health/Grok-Exporter_Error.md)
@@ -565,26 +565,27 @@ Run the following steps from a master node.
1. Restart Prometheus.

```bash
kubectl rollout restart -n sysmgmt-health statefulSet/prometheus-cray-sysmgmt-health-kube-p-prometheus
kubectl rollout status -n sysmgmt-health statefulSet/prometheus-cray-sysmgmt-health-kube-p-prometheus
kubectl rollout restart deployment -n sysmgmt-health vmagent-vms-0
kubectl rollout status -n sysmgmt-health deployment.apps/vmagent-vms-0
kubectl rollout restart deployment -n sysmgmt-health vmagent-vms-1
kubectl rollout status -n sysmgmt-health deployment.apps/vmagent-vms-1
```

Example output:

```text
Waiting for 1 pods to be ready...
statefulset rolling update complete ...
deployment "vmagent-vms-0" successfully rolled out
```

1. Check for any `tls` errors from the active Prometheus targets. No errors are expected.

```bash
PROM_IP=$(kubectl get services -n sysmgmt-health cray-sysmgmt-health-kube-p-prometheus -o json | jq -r '.spec.clusterIP')
curl -s http://${PROM_IP}:9090/api/v1/targets | jq -r '.data.activeTargets[] | select(."scrapePool" == "sysmgmt-health/cray-sysmgmt-health-kube-p-kube-etcd/0")' | grep lastError | sort -u
PROM_IP=$(kubectl get services -n sysmgmt-health vmagent-vms -o json | jq -r '.spec.clusterIP')
curl -s http://${PROM_IP}:8429/targets | grep kube-etcd | sort -u
```

Example output:

```text
"lastError": "",
state=up, endpoint=https://10.252.1.10:2379/metrics, labels={endpoint="http-metrics",instance="10.252.1.10:2379",job="kube-etcd",namespace="kube-system",service="vms-kube-etcd"}, scrapes_total=28114, scrapes_failed=0, last_scrape=14838ms ago, scrape_duration=14ms, samples_scraped=1487, error=
```
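
The `grep` check above relies on reading the output by eye. As a minimal sketch, assuming the `/targets` output keeps the plain-text `key=value` layout shown in the example output (the `failing_targets` name is hypothetical and is not part of CSM or VictoriaMetrics), a small filter can print only the targets that are not `up` or that report a non-empty `error=` field:

```bash
# Hypothetical helper: read vmagent /targets lines on stdin and print
# only the ones that look unhealthy.
# Assumption: each line contains "state=<word>" and ends with "error=<text>",
# as in the example output above.
failing_targets() {
    awk '
        {
            state = ""; err = ""
            # Extract the value after "state=" (lowercase word).
            if (match($0, /state=[a-z]+/)) state = substr($0, RSTART + 6, RLENGTH - 6)
            # Everything after "error=" up to the end of the line.
            if (match($0, /error=.*$/))    err = substr($0, RSTART + 6)
            gsub(/[ \t\r]+$/, "", err)
            if (state != "up" || err != "") print
        }
    '
}
```

Under these assumptions it could be chained onto the existing command, for example `curl -s http://${PROM_IP}:8429/targets | grep kube-etcd | failing_targets`. No output would then mean that no `tls` or other scrape errors were found.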
6 changes: 3 additions & 3 deletions operations/network/dns/PowerDNS_Configuration.md
@@ -4,7 +4,7 @@

PowerDNS replaces the CoreDNS server that earlier versions of CSM used to provide External DNS services.

The `cray-dns-powerdns-can-tcp` and `cray-dns-powerdns-can-udp` LoadBalancer resources are configured to service external DNS requests using the IP address specified by the CSI `--cmn-external-dns` command line argument.
The `cray-dns-powerdns-can-tcp` and `cray-dns-powerdns-can-udp` `LoadBalancer` resources are configured to service external DNS requests using the IP address specified by the CSI `--cmn-external-dns` command line argument.

The CSI `--system-name` and `--site-domain` command line arguments are combined to form the subdomain used for External DNS.

@@ -134,7 +134,7 @@ zone "8.101.10.in-addr.arpa" {

The CSM implementation of PowerDNS supports the DNS Security Extensions (DNSSEC) and the signing of zones with a user-supplied zone signing key.

If DNSSEC is to be used for zone transfer then the `dnssec` SealedSecret in `customizations.yaml` should be updated to include a base64 encoded version of the private key portion of the desired zone signing key.
If DNSSEC is to be used for zone transfer then the `dnssec` SealedSecret in `customizations.yaml` should be updated to include a `base64` encoded version of the private key portion of the desired zone signing key.

Here is an example of a zone signing key.

@@ -221,7 +221,7 @@ spec:
key: dnFC5euKixIKXAr6sZhI7kVQbQCXoDG5R5eHSYZiBxY=
```

> **`IMPORTANT`** The key used for TSIG **must** have `.tsig` in the name and unlike the zone signing key it should not be base64 encoded.
> **`IMPORTANT`** The key used for TSIG **must** have `.tsig` in the name and unlike the zone signing key it should not be `base64` encoded.
#### Example configuration for BIND

@@ -30,7 +30,7 @@ Use this procedure to resolve any external DNS routing issues with backend servi
services sma-kibana [services-gateway] [sma-kibana.cmn.SYSTEM_DOMAIN_NAME] 2d16h
sysmgmt-health cray-sysmgmt-health-alertmanager [services/services-gateway] [alertmanager.cmn.SYSTEM_DOMAIN_NAME] 2d16h
sysmgmt-health cray-sysmgmt-health-grafana [services/services-gateway] [grafana.cmn.SYSTEM_DOMAIN_NAME] 2d16h
sysmgmt-health cray-sysmgmt-health-prometheus [services/services-gateway] [prometheus.cmn.SYSTEM_DOMAIN_NAME] 2d16h
sysmgmt-health cray-sysmgmt-health-vm-select [services/services-gateway] [vmselect.cmn.SYSTEM_DOMAIN_NAME] 2d16h
```

1. (`ncn-mw#`) Inspect the `VirtualService` objects to learn the destination service and port.
@@ -47,37 +47,41 @@ Use this procedure to resolve any external DNS routing issues with backend servi
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
creationTimestamp: "2020-07-09T17:49:07Z"
generation: 1
labels:
app: cray-sysmgmt-health-prometheus
app.kubernetes.io/instance: cray-sysmgmt-health
app.kubernetes.io/managed-by: Tiller
app.kubernetes.io/name: cray-sysmgmt-health
app.kubernetes.io/version: 8.15.4
helm.sh/chart: cray-sysmgmt-health-0.3.1
name: cray-sysmgmt-health-prometheus
namespace: sysmgmt-health
resourceVersion: "41620"
selfLink: /apis/networking.istio.io/v1beta1/namespaces/sysmgmt-health/virtualservices/cray-sysmgmt-health-prometheus
uid: d239dfcc-a827-4a51-9b73-6eccfb937088
spec:
gateways:
- services/services-gateway
hosts:
- prometheus.cmn.SYSTEM_DOMAIN_NAME
annotations:
meta.helm.sh/release-name: cray-sysmgmt-health
meta.helm.sh/release-namespace: sysmgmt-health
creationTimestamp: "2024-10-15T12:59:14Z"
generation: 1
labels:
app: cray-sysmgmt-health-vm-select
app.kubernetes.io/instance: cray-sysmgmt-health
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: cray-sysmgmt-health
app.kubernetes.io/version: 0.17.5
helm.sh/chart: cray-sysmgmt-health-1.0.17-20241016103148_b40f1aa
name: cray-sysmgmt-health-vm-select
namespace: sysmgmt-health
resourceVersion: "149049132"
uid: d166065d-1b3b-4434-b25b-e95cb8940b01
spec:
gateways:
- services/services-gateway
- services/customer-admin-gateway
hosts:
- vmselect.cmn.SYSTEM_DOMAIN_NAME
http:
- match:
- authority:
exact: prometheus.cmn.SYSTEM_DOMAIN_NAME
route:
- destination:
host: cray-sysmgmt-health-kube-p-prometheus
port:
number: 9090
- authority:
exact: vmselect.cmn.SYSTEM_DOMAIN_NAME
route:
- destination:
host: vmselect-vms
port:
number: 8481
```

From the `VirtualService data`, it is straightforward to see how traffic will be routed. In this example, connections to `prometheus.cmn.SYSTEM_DOMAIN_NAME` will be routed to the
`cray-sysmgmt-health-prometheus` service in the `sysmgmt-health` namespace on port 9090.
From the `VirtualService` data, it is straightforward to see how traffic will be routed. In this example, connections to `vmselect.cmn.SYSTEM_DOMAIN_NAME` will be routed to the
`vmselect-vms` service in the `sysmgmt-health` namespace on port 8481.

External DNS will now be connected to the backend service.
@@ -33,7 +33,7 @@ The Customer Management Network \(CMN\) is not supported on the system.
services sma-kibana [services-gateway] [sma-kibana.cmn.SYSTEM_DOMAIN_NAME] 2d16h
sysmgmt-health cray-sysmgmt-health-alertmanager [services/services-gateway] [alertmanager.cmn.SYSTEM_DOMAIN_NAME] 2d16h
sysmgmt-health cray-sysmgmt-health-grafana [services/services-gateway] [grafana.cmn.SYSTEM_DOMAIN_NAME] 2d16h
sysmgmt-health cray-sysmgmt-health-prometheus [services/services-gateway] [prometheus.cmn.SYSTEM_DOMAIN_NAME] 2d16h
sysmgmt-health cray-sysmgmt-health-vm-select [services/services-gateway] [vmselect.cmn.SYSTEM_DOMAIN_NAME] 2d16h
```

2. Lookup the cluster IP and port for service.
@@ -48,7 +48,7 @@ The Customer Management Network \(CMN\) is not supported on the system.

```console
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
cray-sysmgmt-health-kube-p-prometheus ClusterIP 10.25.124.159 <none> 9090/TCP 23h
cray-sysmgmt-health-grafana ClusterIP 10.25.124.159 <none> 9090/TCP 23h
```

3. Setup port forwarding from a laptop or workstation to access the service.
@@ -62,3 +62,36 @@ The Customer Management Network \(CMN\) is not supported on the system.
```

4. Visit `http://localhost:9090/` in a laptop or workstation browser.

5. The `vmselect-vms` service is a headless service, so it has no cluster IP. Use the following steps to access it.

a. Look up the service and port for `vmselect`.

The example below is for the `vmselect-vms` service.

```bash
kubectl -n sysmgmt-health get service vmselect-vms
```

Example output:

```console
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
vmselect-vms ClusterIP None <none> 8481/TCP 14d
```

b. On the NCN, use `kubectl port-forward` to forward a local port to the `vmselect` service running in the Kubernetes cluster.

```bash
kubectl port-forward -n sysmgmt-health service/vmselect-vms 8082:8481
```

c. Set up port forwarding from a laptop or workstation to access the service.

Because the service has no cluster IP, forward to the local port opened by `kubectl port-forward` in the previous step (8082 in this example). If the port is unprivileged, the same port number can be used on the local side.

Replace the port and system name values in the example below.

```bash
ssh -L 9090:localhost:8082 root@SYSTEM_NCN_DOMAIN_NAME
```

d. Visit `http://localhost:9090/` in a laptop or workstation browser.
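
Whether a service is headless can also be checked mechanically rather than by reading the table. The following is a minimal sketch, assuming the default column layout of `kubectl get service` shown in the example output; the `is_headless` function name is hypothetical.

```bash
# Hypothetical helper: read `kubectl get service` table output on stdin and
# succeed (exit 0) if any listed service has "None" in the CLUSTER-IP column.
# Assumption: the first line is the header row containing "CLUSTER-IP".
is_headless() {
    awk '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "CLUSTER-IP") col = i; next }
        col && $col == "None" { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}
```

For example, `kubectl -n sysmgmt-health get service vmselect-vms | is_headless && echo "headless: use kubectl port-forward"`. An alternative that avoids parsing the table is `kubectl -n sysmgmt-health get service vmselect-vms -o jsonpath='{.spec.clusterIP}'`, which prints `None` for a headless service.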
@@ -44,34 +44,33 @@ When accessing the URLs listed below, it will be necessary to accept one or more
logging in. The details of the security warning will indicate that a self-signed certificate/unknown issuer is being used for the site. Support for incorporation of certificates from Trusted Certificate
Authorities is planned for a future release.

### Prometheus
### VictoriaMetrics UI

URL: `https://prometheus.cmn.SYSTEM_DOMAIN_NAME/`
URL: `https://vmselect.cmn.SYSTEM_DOMAIN_NAME/select/0/prometheus/vmui`

Central Prometheus instance scrapes metrics from Kubernetes, Ceph, and the hosts (part of `kube-prometheus-stack` Helm chart).
The `vmagent` instance scrapes metrics from Kubernetes, Ceph, and the hosts (part of the `victoria-metrics-k8s-stack` Helm chart).

Prometheus generates alerts based on metrics and reports them to the Alertmanager. The 'Alerts' link at the top of the page will show all of the inactive, pending, and firing alerts on the system.
VictoriaMetrics generates alerts based on metrics and reports them to the Alertmanager. The 'Alerts' link at the top of the page will show all of the inactive, pending, and firing alerts on the system.
Clicking on any of the alerts will expand them, enabling users to use the 'Labels' data to discern the details of the alert. The details will also show the state of the alert, how long it has been
active, and the value for the alert.

For more information regarding the use of the Prometheus interface, see
[Getting Started/](https://prometheus.io/docs/prometheus/latest/getting_started/) in the Prometheus online documentation.
For more information regarding the use of the VictoriaMetrics interface, see
[Getting Started](https://docs.victoriametrics.com/) in the VictoriaMetrics online documentation.

Some alerts may be falsely triggered. This occurs if they are alerts which will be improved in the future, or if they are alerts impacted by whether all software products have been installed yet.
See [Troubleshoot Prometheus Alerts](Troubleshoot_Prometheus_Alerts.md).

### Thanos
### VMAlert

URL: `https://vmselect.cmn.SYSTEM_DOMAIN_NAME/select/0/prometheus/vmalert/`

VMAlert executes a list of given alerting or recording rules against the configured address.

URL: `https://thanos.cmn.SYSTEM_DOMAIN_NAME/`
The VMAlert CRD declaratively defines a desired VMAlert setup to run in a Kubernetes cluster.

Thanos is a set of components that can be composed into a highly available, multi Prometheus metric system with potentially unlimited storage capacity, if your Object Storage allows for it.
It leverages the Prometheus 2.0 storage format to cost-efficiently store historical metric data in any object storage while retaining fast query latencies.
Additionally, it provides a global query view across all Prometheus installations and can merge data from Prometheus HA pairs.
It has few required configuration options: `datasource` and `notifier` are required. For other configuration parameters, see the VMAlert documentation.

For more information regarding the use of the Thanos interface, see
[Getting Started/](https://thanos.io/tip/thanos/getting-started.md/) in the thanos online documentation.
For each VMAlert resource, the Operator deploys a properly configured Deployment in the same namespace. The VMAlert Pods are configured to mount a list of `ConfigMaps` prefixed with `<VMAlert-name>-number` containing the configuration for alerting rules.

### Alertmanager

URL: `https://alertmanager.cmn.SYSTEM_DOMAIN_NAME/`


This file was deleted.
