K8s/patroni health (#401)
* update match rules for postgres triage

* minor updates

* update cli

* update cmd

* template updates

* add supports

* renames and adds sli

* template name adjustments

* add next step

* wip

* wip

* x

* fix path

* postgres updates

* add bash to runbook

* add default db container

* fix resource labels

* fix sli setup

* add escapes

* update titles

* update var

* x

* fix typo

* try postgresqls

* add database container config

* change preempt default age

* update workload name

* postgres touchups

* debug full match details

* x

* x

* fix match

* x

* x

* add dynamic backup age and add backup age to SLI

* update check from crunchy cr

* add next step

* add pgbouncer config checks and update tags
stewartshea authored Jul 17, 2024
1 parent 5ab13a5 commit a8d7570
Showing 22 changed files with 1,042 additions and 257 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -6,4 +6,5 @@ report.html
dist
build
*egg-info
- .kube
+ .kube
+ *.out
@@ -29,6 +29,8 @@ spec:
configProvided:
- name: GCP_PROJECT_ID
value: {{match_resource.resource.project_id}}
+ - name: AGE
+ value: '30'
secretsProvided:
- name: gcp_credentials_json
workspaceKey: {{custom.gcp_ops_suite_sa}}
@@ -26,6 +26,8 @@ spec:
configProvided:
- name: GCP_PROJECT_ID
value: {{match_resource.resource.project_id}}
+ - name: AGE
+ value: '30'
secretsProvided:
- name: gcp_credentials_json
workspaceKey: {{custom.gcp_ops_suite_sa}}
4 changes: 2 additions & 2 deletions codebundles/gcloud-node-preempt/runbook.robot
@@ -53,8 +53,8 @@ Suite Initialization
... type=string
... description=The age, in minutes, since the preempt event.
... pattern=\d+
- ... default=15
- ... example=15
+ ... default=30
+ ... example=30
${OS_PATH}= Get Environment Variable PATH
Set Suite Variable ${GCP_PROJECT_ID} ${GCP_PROJECT_ID}
Set Suite Variable ${gcp_credentials_json} ${gcp_credentials_json}
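The `AGE` variable in this hunk is a string constrained by `pattern=\d+` and interpreted as minutes, with the default raised from 15 to 30. The following is a minimal sketch of how such a value might be validated and converted into a cutoff time. The helper name `preempt_cutoff` and the whole-value interpretation of the pattern are assumptions for illustration; they are not taken from the codebundle.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Same pattern as the AGE user variable in runbook.robot and sli.robot.
AGE_PATTERN = re.compile(r"\d+")


def preempt_cutoff(age: str = "30", now: Optional[datetime] = None) -> datetime:
    """Hypothetical helper: earliest event time still inside the AGE window."""
    # Assumption: the pattern has to match the whole value, not just a prefix.
    if not AGE_PATTERN.fullmatch(age):
        raise ValueError(f"AGE must be a whole number of minutes, got {age!r}")
    now = now or datetime.now(timezone.utc)
    return now - timedelta(minutes=int(age))
```

Under these assumptions, preempt events older than the returned timestamp would fall outside the window that the task inspects.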
4 changes: 2 additions & 2 deletions codebundles/gcloud-node-preempt/sli.robot
@@ -26,8 +26,8 @@ Suite Initialization
... type=string
... description=The age, in minutes, since the preempt event.
... pattern=\d+
- ... default=15
- ... example=15
+ ... default=30
+ ... example=30
${OS_PATH}= Get Environment Variable PATH
Set Suite Variable ${GCP_PROJECT_ID} ${GCP_PROJECT_ID}
Set Suite Variable ${gcp_credentials_json} ${gcp_credentials_json}
@@ -12,7 +12,7 @@ spec:
asMeasuredBy: Node cpu and memory utilization.
configProvided:
- name: OBJECT_NAME
- value: {{cluster.namee}}
+ value: {{cluster.name}}
owners:
- {{workspace.owner_email}}
statement: Cluster resources for {{cluster.context}} should be less than 90% utilization.
2 changes: 1 addition & 1 deletion codebundles/k8s-namespace-healthcheck/runbook.robot
@@ -132,7 +132,7 @@ Inspect Container Restarts In Namespace `${NAMESPACE}`

Inspect Pending Pods In Namespace `${NAMESPACE}`
[Documentation] Fetches pods that are pending and provides details.
- [Tags] namespace pods status pending ${namespace}
+ [Tags] namespace pods status pending ${NAMESPACE}
${pending_pods}= RW.CLI.Run Cli
... cmd=${KUBERNETES_DISTRIBUTION_BINARY} get pods --context=${CONTEXT} -n ${NAMESPACE} --field-selector=status.phase=Pending --no-headers -o json | jq -r '[.items[] | {pod_name: .metadata.name, status: (.status.phase // "N/A"), message: (.status.conditions[0].message // "N/A"), reason: (.status.conditions[0].reason // "N/A"), containerStatus: (.status.containerStatuses[0].state // "N/A"), containerMessage: (.status.containerStatuses[0].state.waiting?.message // "N/A"), containerReason: (.status.containerStatuses[0].state.waiting?.reason // "N/A")}]'
... env=${env}
@@ -0,0 +1,21 @@
apiVersion: runwhen.com/v1
kind: GenerationRules
spec:
generationRules:
- resourceTypes:
- postgresclusters.postgres-operator.crunchydata.com
matchRules:
- type: pattern
pattern: ".+"
properties: [name]
mode: substring
slxs:
- baseName: postgres-health
qualifiers: ["resource", "namespace", "cluster"]
baseTemplateName: k8s-postgres-healthcheck-crunchy
levelOfDetail: detailed
outputItems:
- type: slx
- type: sli
- type: runbook
templateName: k8s-postgres-healthcheck-crunchy-taskset.yaml
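The match rule above pairs `pattern: ".+"` with `properties: [name]` and `mode: substring`. A rough sketch of one plausible reading follows: the pattern is treated as a regular expression that may match anywhere in the property value, so `.+` accepts every resource with a non-empty name. The function `matches_rule`, the flat resource dictionary, and the non-substring branch are illustrative assumptions, not the actual RunWhen matching code.

```python
import re
from typing import Iterable, Mapping


def matches_rule(resource: Mapping[str, str], pattern: str,
                 properties: Iterable[str], mode: str = "substring") -> bool:
    """Hypothetical matcher: True if any listed property satisfies the pattern."""
    regex = re.compile(pattern)
    for prop in properties:
        value = str(resource.get(prop, ""))
        # Assumption: "substring" means the regex may match anywhere in the value;
        # any other mode is treated here as a whole-value match.
        hit = regex.search(value) if mode == "substring" else regex.fullmatch(value)
        if hit:
            return True
    return False


# Illustrative resource; the name "hippo" is made up for this sketch.
print(matches_rule({"name": "hippo"}, ".+", ["name"]))
```

Read this way, the rule acts as a catch-all: every `postgresclusters.postgres-operator.crunchydata.com` resource with a name would receive a `postgres-health` SLX.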
@@ -3,18 +3,19 @@ kind: GenerationRules
spec:
generationRules:
- resourceTypes:
- - postgresql.acid.zalan.do
+ - postgresqls.acid.zalan.do
matchRules:
- type: pattern
pattern: ".+"
properties: [name]
mode: substring
slxs:
- baseName: postgres-health
- qualifiers: ["namespace", "cluster"]
- baseTemplateName: k8s-postgres-triage
+ qualifiers: ["resource", "namespace", "cluster"]
+ baseTemplateName: k8s-postgres-healthcheck-zalando
levelOfDetail: detailed
outputItems:
- type: slx
+ - type: sli
- type: runbook
- templateName: k8s-postgres-triage-taskset.yaml
+ templateName: k8s-postgres-healthcheck-zalando-taskset.yaml
@@ -0,0 +1,52 @@
apiVersion: runwhen.com/v1
kind: ServiceLevelIndicator
metadata:
name: {{slx_name}}
labels:
{% include "common-labels.yaml" %}
annotations:
{% include "common-annotations.yaml" %}
spec:
displayUnitsLong: OK
displayUnitsShort: ok
locations:
- {{default_location}}
description: Measures the health of a postgres cluster by scoring database lag, backups, and member readiness.
codeBundle:
{% if repo_url %}
repoUrl: {{repo_url}}
{% else %}
repoUrl: https://github.com/runwhen-contrib/rw-cli-codecollection.git
{% endif %}
{% if ref %}
ref: {{ref}}
{% else %}
ref: main
{% endif %}
pathToRobot: codebundles/k8s-postgres-healthcheck/sli.robot
intervalStrategy: intermezzo
intervalSeconds: 60
configProvided:
- name: DISTRIBUTION
value: {{custom.kubernetes_distribution}}
- name: NAMESPACE
value: '{{match_resource.resource.metadata.namespace}}'
- name: CONTEXT
value: {{context}}
- name: KUBERNETES_DISTRIBUTION_BINARY
value: {{custom.kubernetes_distribution_binary}}
- name: WORKLOAD_NAME
value: '-l postgres-operator.crunchydata.com/cluster={{match_resource.resource.metadata.name}},postgres-operator.crunchydata.com/role=master'
- name: RESOURCE_LABELS
value: postgres-operator.crunchydata.com/cluster={{match_resource.resource.metadata.name}}
- name: OBJECT_API_VERSION
value: '{{match_resource.resource.apiVersion}}'
- name: OBJECT_NAME
value: '{{match_resource.resource.metadata.name}}'
- name: DATABASE_CONTAINER
value: 'database'
- name: OBJECT_KIND
value: {{match_resource.resource.kind}}.{{match_resource.resource.apiVersion.split('/')[0]}}
secretsProvided:
- name: kubeconfig
workspaceKey: {{custom.kubeconfig_secret_name}}
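The `OBJECT_KIND` and `WORKLOAD_NAME` values in this template are built from the matched resource. As a small sketch of what those expressions evaluate to, the plain-Python functions below mirror the template's `apiVersion.split('/')[0]` call and the Crunchy label selector. The sample resource (name `hippo`, namespace `postgres`) and the function names are made up for illustration.

```python
def object_kind(resource: dict) -> str:
    """Mirror of {{kind}}.{{apiVersion.split('/')[0]}} from the template."""
    group = resource["apiVersion"].split("/")[0]
    return f"{resource['kind']}.{group}"


def crunchy_workload_selector(resource: dict) -> str:
    """Mirror of the WORKLOAD_NAME value in the Crunchy templates."""
    name = resource["metadata"]["name"]
    return (
        f"-l postgres-operator.crunchydata.com/cluster={name},"
        "postgres-operator.crunchydata.com/role=master"
    )


# Hypothetical matched resource, shaped like a Kubernetes object.
example = {
    "apiVersion": "postgres-operator.crunchydata.com/v1beta1",
    "kind": "PostgresCluster",
    "metadata": {"name": "hippo", "namespace": "postgres"},
}

print(object_kind(example))
print(crunchy_workload_selector(example))
```

One caveat of the `split('/')[0]` approach: for a core-group resource whose `apiVersion` is just `v1`, the expression yields the version rather than an API group. That does not affect the custom resources targeted here, which always carry a group prefix.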
@@ -0,0 +1,23 @@
apiVersion: runwhen.com/v1
kind: ServiceLevelX
metadata:
name: {{slx_name}}
labels:
{% include "common-labels.yaml" %}
annotations:
{% include "common-annotations.yaml" %}
spec:
imageURL: https://storage.googleapis.com/runwhen-nonprod-shared-images/icons/CrunchyDataPrimaryIcon.png
alias: {{match_resource.resource.metadata.name}} Postgres Health
asMeasuredBy: Database is up and accepting connections.
configProvided:
- name: OBJECT_NAME
value: {{match_resource.resource.metadata.name}}
owners:
- {{workspace.owner_email}}
statement: Database should be available and accept connections 99.5% of the time.
additionalContext:
namespace: "{{match_resource.resource.metadata.namespace}}"
labelMap: "{{match_resource.resource.metadata.labels}}"
cluster: "{{ cluster.name }}"
context: "{{ cluster.context }}"
@@ -0,0 +1,46 @@
apiVersion: runwhen.com/v1
kind: Runbook
metadata:
name: {{slx_name}}
labels:
{% include "common-labels.yaml" %}
annotations:
{% include "common-annotations.yaml" %}
spec:
location: {{default_location}}
codeBundle:
{% if repo_url %}
repoUrl: {{repo_url}}
{% else %}
repoUrl: https://github.com/runwhen-contrib/rw-cli-codecollection.git
{% endif %}
{% if ref %}
ref: {{ref}}
{% else %}
ref: main
{% endif %}
pathToRobot: codebundles/k8s-postgres-healthcheck/runbook.robot
configProvided:
- name: DISTRIBUTION
value: {{custom.kubernetes_distribution}}
- name: NAMESPACE
value: '{{match_resource.resource.metadata.namespace}}'
- name: CONTEXT
value: {{context}}
- name: KUBERNETES_DISTRIBUTION_BINARY
value: {{custom.kubernetes_distribution_binary}}
- name: WORKLOAD_NAME
value: '-l postgres-operator.crunchydata.com/cluster={{match_resource.resource.metadata.name}},postgres-operator.crunchydata.com/role=master'
- name: RESOURCE_LABELS
value: postgres-operator.crunchydata.com/cluster={{match_resource.resource.metadata.name}}
- name: OBJECT_API_VERSION
value: '{{match_resource.resource.apiVersion}}'
- name: OBJECT_NAME
value: '{{match_resource.resource.metadata.name}}'
- name: DATABASE_CONTAINER
value: 'database'
- name: OBJECT_KIND
value: {{match_resource.resource.kind}}.{{match_resource.resource.apiVersion.split('/')[0]}}
secretsProvided:
- name: kubeconfig
workspaceKey: {{custom.kubeconfig_secret_name}}
@@ -0,0 +1,52 @@
apiVersion: runwhen.com/v1
kind: ServiceLevelIndicator
metadata:
name: {{slx_name}}
labels:
{% include "common-labels.yaml" %}
annotations:
{% include "common-annotations.yaml" %}
spec:
displayUnitsLong: OK
displayUnitsShort: ok
locations:
- {{default_location}}
description: Measures the health of a postgres cluster by scoring database lag, backups, and member readiness.
codeBundle:
{% if repo_url %}
repoUrl: {{repo_url}}
{% else %}
repoUrl: https://github.com/runwhen-contrib/rw-cli-codecollection.git
{% endif %}
{% if ref %}
ref: {{ref}}
{% else %}
ref: main
{% endif %}
pathToRobot: codebundles/k8s-postgres-healthcheck/sli.robot
intervalStrategy: intermezzo
intervalSeconds: 60
configProvided:
- name: DISTRIBUTION
value: {{custom.kubernetes_distribution}}
- name: NAMESPACE
value: '{{match_resource.resource.metadata.namespace}}'
- name: CONTEXT
value: {{context}}
- name: KUBERNETES_DISTRIBUTION_BINARY
value: {{custom.kubernetes_distribution_binary}}
- name: WORKLOAD_NAME
value: '-l application=spilo,cluster-name={{match_resource.resource.metadata.name}}'
- name: RESOURCE_LABELS
value: 'cluster-name={{match_resource.resource.metadata.name}}'
- name: OBJECT_API_VERSION
value: '{{match_resource.resource.apiVersion}}'
- name: OBJECT_NAME
value: '{{match_resource.resource.metadata.name}}'
- name: DATABASE_CONTAINER
value: 'postgres'
- name: OBJECT_KIND
value: {{match_resource.resource.kind}}.{{match_resource.resource.apiVersion.split('/')[0]}}
secretsProvided:
- name: kubeconfig
workspaceKey: {{custom.kubeconfig_secret_name}}
@@ -19,7 +19,7 @@ spec:
{% else %}
ref: main
{% endif %}
- pathToRobot: codebundles/k8s-postgres-triage/runbook.robot
+ pathToRobot: codebundles/k8s-postgres-healthcheck/runbook.robot
configProvided:
- name: DISTRIBUTION
value: {{custom.kubernetes_distribution}}
@@ -30,9 +30,17 @@ spec:
- name: KUBERNETES_DISTRIBUTION_BINARY
value: {{custom.kubernetes_distribution_binary}}
- name: WORKLOAD_NAME
- value: 'statefulset.apps/{{match_resource.resource.metadata.name}}'
+ value: '-l application=spilo,cluster-name={{match_resource.resource.metadata.name}}'
- name: RESOURCE_LABELS
value: 'cluster-name={{match_resource.resource.metadata.name}}'
- name: OBJECT_API_VERSION
value: '{{match_resource.resource.apiVersion}}'
- name: OBJECT_NAME
value: '{{match_resource.resource.metadata.name}}'
- name: DATABASE_CONTAINER
value: 'postgres'
- name: OBJECT_KIND
value: {{match_resource.resource.kind}}.{{match_resource.resource.apiVersion.split('/')[0]}}
secretsProvided:
- name: kubeconfig
workspaceKey: {{custom.kubeconfig_secret_name}}