Bug fixes, task tweaks for certmanager and add genrules (#238)
* Add loki, kubeprom genrules

* Migrate genrules for certmanager and roll in expiration check to sli

* Add accept statuscode 5 to jq parse
jon-funk authored Nov 6, 2023
1 parent 03bf417 commit 3ca79cd
Showing 12 changed files with 350 additions and 36 deletions.
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
apiVersion: runwhen.com/v1
kind: GenerationRules
spec:
generationRules:
- resourceTypes:
- certificates.cert-manager.io
- certificaterequests.cert-manager.io
matchRules:
- type: and
matches:
- type: pattern
pattern: ".+"
properties: [name]
mode: substring
slxs:
- baseName: cert-health
qualifiers: ["resource", "namespace", "cluster"]
baseTemplateName: k8s-certmanager-certificate-health
levelOfDetail: basic
outputItems:
- type: slx
- type: sli
- type: runbook
templateName: k8s-certmanager-certificate-health-taskset.yaml
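The match rules above can be read as a small predicate over resource properties. A minimal Python sketch, assuming `mode: substring` means "the pattern may match anywhere in the property" and that an `and` rule requires every nested match to succeed; `matches_rule` and the dictionary shapes are hypothetical and not part of the RunWhen tooling:

```python
import re


def matches_rule(resource: dict, rule: dict) -> bool:
    """Evaluate a hypothetical match rule against a flat dict of resource properties."""
    kind = rule["type"]
    if kind == "and":
        # Assumed semantics: every nested rule must match.
        return all(matches_rule(resource, sub) for sub in rule["matches"])
    if kind == "pattern":
        regex = re.compile(rule["pattern"])
        # Assumed semantics: "substring" searches anywhere, otherwise require a full match.
        test = regex.search if rule.get("mode") == "substring" else regex.fullmatch
        return any(test(str(resource.get(prop, ""))) for prop in rule["properties"])
    raise ValueError(f"unsupported rule type: {kind}")


# The cert-manager rule from this file: any non-empty name matches.
cert_rule = {
    "type": "and",
    "matches": [
        {"type": "pattern", "pattern": ".+", "properties": ["name"], "mode": "substring"},
    ],
}
# The loki rule added later in this commit: the name must contain "loki".
loki_rule = {"type": "pattern", "pattern": "loki", "properties": ["name"], "mode": "substring"}
```

Under these assumptions, `".+"` effectively selects every named certificate resource, while `"loki"` narrows the match to statefulsets whose name contains that string.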
@@ -0,0 +1,43 @@
apiVersion: runwhen.com/v1
kind: ServiceLevelIndicator
metadata:
name: {{slx_name}}
labels:
{% include "common-labels.yaml" %}
annotations:
{% include "common-annotations.yaml" %}
spec:
displayUnitsLong: Number
displayUnitsShort: '#'
locations:
- {{default_location}}
description: Measures the health of cert-manager workloads and whether any certificates have expired.
codeBundle:
{% if repo_url %}
repoUrl: {{repo_url}}
{% else %}
repoUrl: https://github.com/runwhen-contrib/rw-cli-codecollection.git
{% endif %}
{% if ref %}
ref: {{ref}}
{% else %}
ref: main
{% endif %}
pathToRobot: codebundles/k8s-certmanager-healthcheck/sli.robot
intervalStrategy: intermezzo
intervalSeconds: 30
configProvided:
- name: DISTRIBUTION
value: {{custom.kubernetes_distribution}}
- name: NAMESPACE
value: '{{match_resource.resource.metadata.namespace}}'
- name: CONTEXT
value: {{context}}
- name: KUBERNETES_DISTRIBUTION_BINARY
value: {{custom.kubernetes_distribution_binary}}
secretsProvided:
- name: kubeconfig
workspaceKey: {{custom.kubeconfig_secret_name}}
servicesProvided:
- name: kubectl
locationServiceName: kubectl-service.shared
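The `{% if repo_url %}` / `{% if ref %}` branches in this template fall back to the public code collection and the `main` branch when no override is supplied. A rough Python equivalent of that fallback, assuming the template context is a plain dict; `resolve_code_bundle` is a hypothetical helper used only for illustration:

```python
DEFAULT_REPO_URL = "https://github.com/runwhen-contrib/rw-cli-codecollection.git"
DEFAULT_REF = "main"


def resolve_code_bundle(context: dict, path_to_robot: str) -> dict:
    """Mirror the template's fallbacks: a missing or empty value selects the default."""
    return {
        # Jinja's `{% if x %}` is a truthiness test, so `or` gives the same behavior.
        "repoUrl": context.get("repo_url") or DEFAULT_REPO_URL,
        "ref": context.get("ref") or DEFAULT_REF,
        "pathToRobot": path_to_robot,
    }
```

For example, `resolve_code_bundle({"ref": "my-branch"}, "codebundles/k8s-certmanager-healthcheck/sli.robot")` keeps the default repository but pins the override branch.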
@@ -0,0 +1,23 @@
apiVersion: runwhen.com/v1
kind: ServiceLevelX
metadata:
name: {{slx_name}}
labels:
{% include "common-labels.yaml" %}
annotations:
{% include "common-annotations.yaml" %}
spec:
imageURL: https://storage.googleapis.com/runwhen-nonprod-shared-images/icons/cert-manager.svg
alias: {{namespace.name}} SSL Certificate Health
asMeasuredBy: Certificates in an unready state
configProvided:
- name: OBJECT_NAME
value: {{match_resource.resource.metadata.name}}
owners:
- {{workspace.owner_email}}
statement: All certificates should be in a Ready state 99.5% of the time.
additionalContext:
namespace: "{{match_resource.resource.metadata.namespace}}"
labelMap: "{{match_resource.resource.metadata.labels}}"
cluster: "{{ cluster.name }}"
context: "{{ cluster.context }}"
@@ -0,0 +1,29 @@
apiVersion: runwhen.com/v1
kind: Runbook
metadata:
name: {{slx_name}}
labels:
{% include "common-labels.yaml" %}
annotations:
{% include "common-annotations.yaml" %}
spec:
location: {{default_location}}
codeBundle:
repoUrl: https://github.com/runwhen-contrib/rw-cli-codecollection
ref: main
pathToRobot: codebundles/k8s-certmanager-healthcheck/runbook.robot
configProvided:
- name: DISTRIBUTION
value: {{custom.kubernetes_distribution}}
- name: NAMESPACE
value: '{{match_resource.resource.metadata.namespace}}'
- name: CONTEXT
value: {{context}}
- name: KUBERNETES_DISTRIBUTION_BINARY
value: {{custom.kubernetes_distribution_binary}}
secretsProvided:
- name: kubeconfig
workspaceKey: {{custom.kubeconfig_secret_name}}
servicesProvided:
- name: {{custom.kubernetes_distribution_binary}}
locationServiceName: {{custom.kubernetes_distribution_binary}}-service.shared
108 changes: 72 additions & 36 deletions codebundles/k8s-certmanager-healthcheck/sli.robot
@@ -1,18 +1,60 @@
*** Settings ***
Metadata Author jon-funk
Documentation Check the health of pods deployed by cert-manager.
Metadata Display Name Kubernetes CertManager Healthcheck
Metadata Supports Kubernetes,AKS,EKS,GKE,OpenShift
Suite Setup Suite Initialization
Library BuiltIn
Library RW.Core
Library RW.CLI
Library RW.platform
Library OperatingSystem
Documentation Check the health of pods deployed by cert-manager.
Metadata Author jon-funk
Metadata Display Name Kubernetes CertManager Healthcheck
Metadata Supports Kubernetes,AKS,EKS,GKE,OpenShift

Library BuiltIn
Library RW.Core
Library RW.CLI
Library RW.platform
Library OperatingSystem

Suite Setup Suite Initialization


*** Tasks ***
Get Health Score of CertManager Workloads
[Documentation] Returns a score from 0 to 1: half for healthy cert-manager pods and half for unexpired certificates.
[Tags] pods containers running status count health certmanager cert
# count expired certs
${expired_certs}= RW.CLI.Run Cli
... cmd=${KUBERNETES_DISTRIBUTION_BINARY} get --context=${CONTEXT} --all-namespaces certificates -ojson | jq '[.items[] | select(.status.notAfter != null) | {notAfter: .status.notAfter} | .notAfter |= (gsub("[-:TZ]"; "") | strptime("%Y%m%d%H%M%S") | mktime) | select(.notAfter < now)] | length'
... env=${env}
... secret_file__kubeconfig=${kubeconfig}
${ec_count}= Set Variable ${expired_certs.stdout}
${ec_health_score}= Set Variable 1
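# stdout is a string, so parse it before comparing; isinstance(..., int) would never be true here
IF $ec_count.strip().isdigit() and int($ec_count) > 0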
${ec_health_score}= Evaluate 0.5 / ${ec_count}
ELSE
${ec_health_score}= Set Variable 0.5
END
# check certmanager workloads healthy
${cm_pods}= RW.CLI.Run Cli
... cmd=${KUBERNETES_DISTRIBUTION_BINARY} get pods --context=${CONTEXT} -n ${NAMESPACE} -ojson
... env=${env}
... secret_file__kubeconfig=${kubeconfig}
${not_ready_count}= RW.CLI.Parse Cli Json Output
... rsp=${cm_pods}
... extract_path_to_var__cm_stats=items[].{name:metadata.name, containers_ready:status.containerStatuses[].ready, containers_started:status.containerStatuses[].started}
... from_var_with_path__cm_stats__to__not_ready_containers=length([].containers_ready[?@ == `false`][])
... assign_stdout_from_var=not_ready_containers
${not_started_count}= RW.CLI.Parse Cli Json Output
... rsp=${cm_pods}
... extract_path_to_var__cm_stats=items[].{name:metadata.name, containers_ready:status.containerStatuses[].ready, containers_started:status.containerStatuses[].started}
... from_var_with_path__cm_stats__to__not_started_containers=length([].containers_started[?@ == `false`][])
... assign_stdout_from_var=not_started_containers
${not_ready_count}= Convert to Number ${not_ready_count.stdout}
${not_started_count}= Convert to Number ${not_started_count.stdout}
${cm_health_score}= Evaluate 0.5 if ${not_ready_count} == 0 and ${not_started_count} == 0 else 0
${metric}= Evaluate ${cm_health_score} + ${ec_health_score}
RW.Core.Push Metric ${metric}


*** Keywords ***
Suite Initialization
${kubeconfig}= RW.Core.Import Secret kubeconfig
${kubeconfig}= RW.Core.Import Secret
... kubeconfig
... type=string
... description=The kubernetes kubeconfig yaml containing connection configuration used to connect to cluster(s).
... pattern=\w*
@@ -21,9 +63,10 @@ Suite Initialization
... description=The location service used to interpret shell commands.
... default=kubectl-service.shared
... example=kubectl-service.shared
${NAMESPACE}= RW.Core.Import User Variable NAMESPACE
${NAMESPACE}= RW.Core.Import User Variable
... NAMESPACE
... type=string
... description=The name of the Kubernetes namespace to scope actions and searching to. Supports csv list of namespaces.
... description=The name of the Kubernetes namespace to scope actions and searching to. Supports csv list of namespaces.
... pattern=\w*
... default=cert-manager
... example=cert-manager
@@ -32,31 +75,24 @@ Suite Initialization
... description=Which Kubernetes context to operate within.
... pattern=\w*
... example=my-main-cluster

${DISTRIBUTION}= RW.Core.Import User Variable DISTRIBUTION
... type=string
... description=Which distribution of Kubernetes to use for operations, such as: Kubernetes, OpenShift, etc.
... pattern=\w*
... enum=[Kubernetes,GKE,OpenShift]
... example=Kubernetes
... default=Kubernetes

${KUBERNETES_DISTRIBUTION_BINARY}= RW.Core.Import User Variable KUBERNETES_DISTRIBUTION_BINARY
... type=string
... description=Which binary to use for Kubernetes CLI commands.
... enum=[kubectl,oc]
... example=kubectl
... default=kubectl
Set Suite Variable ${KUBERNETES_DISTRIBUTION_BINARY} ${KUBERNETES_DISTRIBUTION_BINARY}
Set Suite Variable ${kubeconfig} ${kubeconfig}
Set Suite Variable ${kubectl} ${kubectl}
Set Suite Variable ${CONTEXT} ${CONTEXT}
Set Suite Variable ${NAMESPACE} ${NAMESPACE}
Set Suite Variable ${env} {"KUBECONFIG":"./${kubeconfig.key}"}

*** Tasks ***
Get Health Score of CertManager Workloads
[Documentation] Returns a score of 1 when all cert-manager pods are healthy, or 0 otherwise.
[Tags] Pods Containers Running Status Count Health CertManager Cert
${cm_pods}= RW.CLI.Run Cli
... cmd=kubectl get pods --context=${CONTEXT} -n ${NAMESPACE} -ojson
... env=${env}
... secret_file__kubeconfig=${kubeconfig}
${not_ready_count}= RW.CLI.Parse Cli Json Output
... rsp=${cm_pods}
... extract_path_to_var__cm_stats=items[].{name:metadata.name, containers_ready:status.containerStatuses[].ready, containers_started:status.containerStatuses[].started}
... from_var_with_path__cm_stats__to__not_ready_containers=length([].containers_ready[?@ == `false`][])
... assign_stdout_from_var=not_ready_containers
${not_started_count}= RW.CLI.Parse Cli Json Output
... rsp=${cm_pods}
... extract_path_to_var__cm_stats=items[].{name:metadata.name, containers_ready:status.containerStatuses[].ready, containers_started:status.containerStatuses[].started}
... from_var_with_path__cm_stats__to__not_started_containers=length([].containers_started[?@ == `false`][])
... assign_stdout_from_var=not_started_containers
${not_ready_count}= Convert to Number ${not_ready_count.stdout}
${not_started_count}= Convert to Number ${not_started_count.stdout}
${metric}= Evaluate 1 if ${not_ready_count} == 0 and ${not_started_count} == 0 else 0
RW.Core.Push Metric ${metric}
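The new SLI combines two half-scores: one from expired certificates (the `jq` filter on `.status.notAfter`) and one from the ready/started flags of the cert-manager pods. The arithmetic can be sketched in plain Python as below. This is an illustration of the logic in the task, not the code the platform runs; the function names are hypothetical and timestamps are assumed to use the `YYYY-MM-DDTHH:MM:SSZ` form.

```python
from datetime import datetime, timezone


def count_expired(certificates: dict, now: datetime) -> int:
    """Count certificates whose status.notAfter lies before `now`."""
    expired = 0
    for item in certificates.get("items", []):
        not_after = (item.get("status") or {}).get("notAfter")
        if not_after is None:
            continue  # same as the jq `select(.status.notAfter != null)`
        expiry = datetime.strptime(not_after, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        if expiry < now:
            expired += 1
    return expired


def count_false(pods: dict, field: str) -> int:
    """Count container statuses across all pods where `field` is exactly False."""
    total = 0
    for pod in pods.get("items", []):
        statuses = (pod.get("status") or {}).get("containerStatuses") or []
        total += sum(1 for status in statuses if status.get(field) is False)
    return total


def health_score(expired: int, not_ready: int, not_started: int) -> float:
    """Combine the two half-scores the same way the task's Evaluate lines do."""
    cert_score = 0.5 / expired if expired > 0 else 0.5
    workload_score = 0.5 if not_ready == 0 and not_started == 0 else 0.0
    return cert_score + workload_score
```

One consequence of the formula as written: a single expired certificate gives `0.5 / 1`, so the certificate half only drops below 0.5 once two or more certificates have expired.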
@@ -0,0 +1,20 @@
apiVersion: runwhen.com/v1
kind: GenerationRules
spec:
generationRules:
- resourceTypes:
- statefulset
matchRules:
- type: pattern
pattern: "loki"
properties: [name]
mode: substring
slxs:
- baseName: loki-hlthck
qualifiers: ["resource", "namespace", "cluster"]
baseTemplateName: k8s-loki-healthcheck
levelOfDetail: detailed
outputItems:
- type: slx
- type: runbook
templateName: k8s-loki-healthcheck-taskset.yaml
@@ -0,0 +1,23 @@
apiVersion: runwhen.com/v1
kind: ServiceLevelX
metadata:
name: {{slx_name}}
labels:
{% include "common-labels.yaml" %}
annotations:
{% include "common-annotations.yaml" %}
spec:
imageURL: https://storage.googleapis.com/runwhen-nonprod-shared-images/icons/grafana-loki.svg
alias: Loki Stack Health
asMeasuredBy: The Loki stack is up, and healthy.
configProvided:
- name: OBJECT_NAME
value: {{match_resource.resource.metadata.name}}
owners:
- {{workspace.owner_email}}
statement: Loki's stack should be up and healthy, with an up-to-date hash ring, in the {{namespace.name}} namespace.
additionalContext:
namespace: "{{match_resource.resource.metadata.namespace}}"
labelMap: "{{match_resource.resource.metadata.labels}}"
cluster: "{{ cluster.name }}"
context: "{{ cluster.context }}"
@@ -0,0 +1,35 @@
apiVersion: runwhen.com/v1
kind: Runbook
metadata:
name: {{slx_name}}
labels:
{% include "common-labels.yaml" %}
annotations:
{% include "common-annotations.yaml" %}
spec:
location: {{default_location}}
codeBundle:
{% if repo_url %}
repoUrl: {{repo_url}}
{% else %}
repoUrl: https://github.com/runwhen-contrib/rw-cli-codecollection.git
{% endif %}
{% if ref %}
ref: {{ref}}
{% else %}
ref: main
{% endif %}
pathToRobot: codebundles/k8s-loki-healthcheck/runbook.robot
configProvided:
- name: NAMESPACE
value: '{{match_resource.resource.metadata.namespace}}'
- name: CONTEXT
value: {{context}}
- name: KUBERNETES_DISTRIBUTION_BINARY
value: {{custom.kubernetes_distribution_binary}}
secretsProvided:
- name: kubeconfig
workspaceKey: {{custom.kubeconfig_secret_name}}
servicesProvided:
- name: {{custom.kubernetes_distribution_binary}}
locationServiceName: {{custom.kubernetes_distribution_binary}}-service.shared
1 change: 1 addition & 0 deletions codebundles/k8s-namespace-healthcheck/runbook.robot
@@ -272,6 +272,7 @@ Check For Namespace Event Anomalies
... deployment_name=${pod_name.stdout}
RW.CLI.Parse Cli Output By Line
... rsp=${recent_anomalies}
... expected_rsp_returncodes=[0,5]
... set_severity_level=2
... set_issue_expected=No unusual recent anomaly events with high counts in the namespace ${NAMESPACE}
... set_issue_actual=We detected events in the namespace ${NAMESPACE} which are considered anomalies
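The added `expected_rsp_returncodes=[0,5]` line widens the set of exit codes the parse step tolerates, presumably so that a non-zero exit from the `jq` pipeline is not treated as a failure. The idea, sketched in Python with `subprocess`; `run_accepting` is a hypothetical helper and says nothing about how `RW.CLI` implements the check internally:

```python
import subprocess
import sys
from typing import Sequence


def run_accepting(cmd: Sequence[str], accepted: Sequence[int] = (0,)) -> subprocess.CompletedProcess:
    """Run a command and raise only if its exit code is outside `accepted`."""
    result = subprocess.run(list(cmd), capture_output=True, text=True)
    if result.returncode not in accepted:
        raise RuntimeError(
            f"unexpected return code {result.returncode}: {result.stderr.strip()}"
        )
    return result


# Stand-in for a pipeline that exits with status 5.
exits_with_five = [sys.executable, "-c", "import sys; sys.exit(5)"]
```

With `accepted=(0, 5)` the call above returns normally; with the default `(0,)` it raises.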
@@ -0,0 +1,20 @@
apiVersion: runwhen.com/v1
kind: GenerationRules
spec:
generationRules:
- resourceTypes:
- prometheuses.monitoring.coreos.com
matchRules:
- type: pattern
pattern: ".+"
properties: [name]
mode: substring
slxs:
- baseName: kubeprom-hlthck
qualifiers: ["resource", "namespace", "cluster"]
baseTemplateName: k8s-prometheus-healthcheck
levelOfDetail: detailed
outputItems:
- type: slx
- type: runbook
templateName: k8s-prometheus-healthcheck-taskset.yaml
@@ -0,0 +1,23 @@
apiVersion: runwhen.com/v1
kind: ServiceLevelX
metadata:
name: {{slx_name}}
labels:
{% include "common-labels.yaml" %}
annotations:
{% include "common-annotations.yaml" %}
spec:
imageURL: https://storage.googleapis.com/runwhen-nonprod-shared-images/icons/prometheus_color.svg
alias: Kubeprometheus Operator Health
asMeasuredBy: The Kubeprometheus operator is healthy and its ServiceMonitors are functional.
configProvided:
- name: OBJECT_NAME
value: {{match_resource.resource.metadata.name}}
owners:
- {{workspace.owner_email}}
statement: The Kubeprometheus operator should be healthy in the {{namespace.name}} namespace and its ServiceMonitors should be functional.
additionalContext:
namespace: "{{match_resource.resource.metadata.namespace}}"
labelMap: "{{match_resource.resource.metadata.labels}}"
cluster: "{{ cluster.name }}"
context: "{{ cluster.context }}"