
Commit

Update pvc issues and statefulset titles
jon-funk committed Nov 20, 2023
1 parent eb77ba6 commit f2ede77
Showing 2 changed files with 66 additions and 22 deletions.
80 changes: 62 additions & 18 deletions codebundles/k8s-pvc-healthcheck/runbook.robot
@@ -17,9 +17,16 @@ Suite Setup Suite Initialization


*** Tasks ***
Fetch Events for Unhealthy Kubernetes PersistentVolumeClaims
Fetch Events for Unhealthy Kubernetes PersistentVolumeClaims In Namespace `${NAMESPACE}`
[Documentation] Lists events related to PersistentVolumeClaims within the namespace that are not bound to PersistentVolumes.
[Tags] pvc list kubernetes storage persistentvolumeclaim persistentvolumeclaims events check event output and related nodes, PersistentVolumes, PersistentVolumeClaims, image registry authentication, or fluxcd or argocd logs.
[Tags]
... pvc
... list
... kubernetes
... storage
... persistentvolumeclaim
... persistentvolumeclaims events
... check event output and related nodes, persistentvolumes, persistentvolumeclaims, image registry authentication, or fluxcd or argocd logs.
${unbound_pvc_events}= RW.CLI.Run Cli
... cmd=for pvc in $(${KUBERNETES_DISTRIBUTION_BINARY} get pvc -n ${NAMESPACE} --context ${CONTEXT} -o json | jq -r '.items[] | select(.status.phase != "Bound") | .metadata.name'); do ${KUBERNETES_DISTRIBUTION_BINARY} get events -n ${NAMESPACE} --context ${CONTEXT} --field-selector involvedObject.name=$pvc -o json | jq '.items[]| "Last Timestamp: " + .lastTimestamp + " Name: " + .involvedObject.name + " Message: " + .message'; done
... env=${env}
@@ -34,16 +41,17 @@ Fetch Events for Unhealthy Kubernetes PersistentVolumeClaims
... set_issue_expected=PVCs should be bound
... set_issue_actual=PVCs found pending with the following events
... set_issue_title=PVC Errors & Events In Namespace ${NAMESPACE}
... set_issue_details=We found "$line" in the namespace ${NAMESPACE}\nReview list of unbound PersistentVolumeClaims - check node events, application configurations, StorageClasses and CSI drivers.
... set_issue_details=We found "$line" in the namespace ${NAMESPACE}
... set_issue_next_steps=Review list of unbound `PersistentVolumeClaims` in namespace `${NAMESPACE}`\nCheck `Node` `Events`, `StorageClasses` and `CSI drivers`\nReview your application configurations
... line__raise_issue_if_contains=Name
${history}= RW.CLI.Pop Shell History
RW.Core.Add Pre To Report Summary of events for unbound PVCs in ${NAMESPACE}:
RW.Core.Add Pre To Report ${unbound_pvc_events.stdout}
RW.Core.Add Pre To Report Commands Used:\n${history}
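
The jq filter in this task keeps only PVCs whose `status.phase` is not `Bound` before looking up their events. As a minimal sketch (not part of the runbook), the same selection logic can be expressed in Python against the JSON that `kubectl get pvc -o json` returns; the sample payload below is a hypothetical, trimmed-down example of that shape:

```python
import json

def unbound_pvc_names(pvc_list_json: str) -> list[str]:
    """Return names of PVCs whose status.phase is not 'Bound',
    mirroring the jq filter used in the task's cmd."""
    doc = json.loads(pvc_list_json)
    return [
        item["metadata"]["name"]
        for item in doc.get("items", [])
        if item.get("status", {}).get("phase") != "Bound"
    ]

# Hypothetical payload shaped like `kubectl get pvc -n <ns> -o json` output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "data-0"}, "status": {"phase": "Bound"}},
        {"metadata": {"name": "data-1"}, "status": {"phase": "Pending"}},
    ]
})
print(unbound_pvc_names(sample))  # ['data-1']
```

Each name this yields is what the task then feeds into `kubectl get events --field-selector involvedObject.name=$pvc`.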

List PersistentVolumeClaims in Terminating State
List PersistentVolumeClaims in Terminating State In Namespace `${NAMESPACE}`
[Documentation] Lists persistentvolumeclaims in a Terminating state.
[Tags] pvc list kubernetes storage persistentvolumeclaim terminating check PersistentVolumes
[Tags] pvc list kubernetes storage persistentvolumeclaim terminating check persistentvolumes
${terminating_pvcs}= RW.CLI.Run Cli
... cmd=namespace=${NAMESPACE}; context=${CONTEXT}; ${KUBERNETES_DISTRIBUTION_BINARY} get pvc -n $namespace --context=$context -o json | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | .metadata.name as $name | .metadata.deletionTimestamp as $deletion_time | .metadata.finalizers as $finalizers | "\\($name) is in Terminating state (Deletion started at: \\($deletion_time)). Finalizers: \\($finalizers)"'
... env=${env}
@@ -54,10 +62,17 @@ List PersistentVolumeClaims in Terminating State
RW.Core.Add Pre To Report ${terminating_pvcs.stdout}
RW.Core.Add Pre To Report Commands Used:\n${history}
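
A PVC counts as "Terminating" here when `metadata.deletionTimestamp` is set; lingering finalizers are what usually hold it there. A rough Python equivalent of the task's jq expression, using an illustrative payload, looks like this:

```python
import json

def terminating_pvcs(pvc_list_json: str) -> list[str]:
    """Format one line per PVC with a deletionTimestamp set,
    approximating the jq string built in the task's cmd."""
    out = []
    for item in json.loads(pvc_list_json).get("items", []):
        meta = item["metadata"]
        ts = meta.get("deletionTimestamp")
        if ts is None:
            continue  # PVC is not being deleted
        out.append(
            f"{meta['name']} is in Terminating state "
            f"(Deletion started at: {ts}). Finalizers: {meta.get('finalizers')}"
        )
    return out

# Hypothetical `kubectl get pvc -o json` excerpt.
sample = json.dumps({"items": [
    {"metadata": {"name": "cache-0",
                  "deletionTimestamp": "2023-11-20T00:00:00Z",
                  "finalizers": ["kubernetes.io/pvc-protection"]}},
    {"metadata": {"name": "data-0"}},
]})
print("\n".join(terminating_pvcs(sample)))
```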


List PersistentVolumes in Terminating State
List PersistentVolumes in Terminating State In Namespace `${NAMESPACE}`
[Documentation] Lists events related to persistent volumes in Terminating state.
[Tags] pv list kubernetes storage persistentvolume terminating events check event output and related nodes, PersistentVolumes, PersistentVolumeClaims, image registry authentication, or fluxcd or argocd logs.
[Tags]
... pv
... list
... kubernetes
... storage
... persistentvolume
... terminating
... events
... check event output and related nodes, persistentvolumes, persistentvolumeclaims, image registry authentication, or fluxcd or argocd logs.
${dangline_pvcs}= RW.CLI.Run Cli
... cmd=for pv in $(${KUBERNETES_DISTRIBUTION_BINARY} get pv --context ${CONTEXT} -o json | jq -r '.items[] | select(.status.phase == "Terminating") | .metadata.name'); do ${KUBERNETES_DISTRIBUTION_BINARY} get events --all-namespaces --field-selector involvedObject.name=$pv --context ${CONTEXT} -o json | jq '.items[]| "Last Timestamp: " + .lastTimestamp + " Name: " + .involvedObject.name + " Message: " + .message'; done
... env=${env}
@@ -72,16 +87,25 @@ List PersistentVolumes in Terminating State
... set_issue_expected=PV should not be stuck terminating.
... set_issue_actual=PV is in a terminating state.
... set_issue_title=PV Events While Terminating In Namespace ${NAMESPACE}
... set_issue_details=We found "$_line" in the namespace ${NAMESPACE}\nCheck the status of terminating PersistentVolumeClaims over the next few minutes, they should disappear. If not, check that deployments or statefulsets attached to the PersistentVolumeClaims are scaled down and pods attached to the PersistentVolumeClaims are not running.
... set_issue_details=We found "$_line" in the namespace ${NAMESPACE}
... set_issue_next_steps=Review `PersistentVolumeClaims` in ${NAMESPACE} after waiting a couple minutes to see if they resolve\nCheck Health of `Deployments` and `StatefulSets` mounting the volumes in `${NAMESPACE}`\nEnsure no `Pods` attached to the `PersistentVolumeClaims` are status=`Running` in namespace `${NAMESPACE}` as this can prevent them from terminating
... _line__raise_issue_if_contains=Name
${history}= RW.CLI.Pop Shell History
RW.Core.Add Pre To Report Summary of events for dangling persistent volumes:
RW.Core.Add Pre To Report ${dangline_pvcs.stdout}
RW.Core.Add Pre To Report Commands Used:\n${history}
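
The inner jq in this task (and in the unbound-PVC task above) concatenates each event into a single "Last Timestamp / Name / Message" line. As a sketch only, the same formatting in Python over a hypothetical `kubectl get events -o json` excerpt:

```python
import json

def format_object_events(events_json: str) -> list[str]:
    """Render events as 'Last Timestamp: ... Name: ... Message: ...' lines,
    matching the jq string concatenation in the task's cmd."""
    return [
        "Last Timestamp: " + (ev.get("lastTimestamp") or "")
        + " Name: " + ev["involvedObject"]["name"]
        + " Message: " + ev.get("message", "")
        for ev in json.loads(events_json).get("items", [])
    ]

# Hypothetical events payload for a PV stuck terminating.
sample = json.dumps({"items": [{
    "lastTimestamp": "2023-11-20T01:02:03Z",
    "involvedObject": {"name": "pv-123"},
    "message": "VolumeFailedDelete",
}]})
print(format_object_events(sample)[0])
```

Note that jq's `+` would fail on a null `lastTimestamp`; the sketch substitutes an empty string instead.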

List Pods with Attached Volumes and Related PersistentVolume Details
List Pods with Attached Volumes and Related PersistentVolume Details In Namespace `${NAMESPACE}`
[Documentation] For each pod in a namespace, collect details on configured PersistentVolumeClaim, PersistentVolume, and node.
[Tags] pod storage pvc pv status csi storagereport check event output and related nodes, PersistentVolumes, PersistentVolumeClaims, image registry authentication, or fluxcd or argocd logs.
[Tags]
... pod
... storage
... pvc
... pv
... status
... csi
... storagereport
... check event output and related nodes, persistentvolumes, persistentvolumeclaims, image registry authentication, or fluxcd or argocd logs.
${pod_storage_report}= RW.CLI.Run Cli
... cmd=for pod in $(${KUBERNETES_DISTRIBUTION_BINARY} get pods -n ${NAMESPACE} --field-selector=status.phase=Running --context ${CONTEXT} -o jsonpath='{range .items[*]}{.metadata.name}{"\\n"}{end}'); do for pvc in $(${KUBERNETES_DISTRIBUTION_BINARY} get pods $pod -n ${NAMESPACE} --context ${CONTEXT} -o jsonpath='{range .spec.volumes[*]}{.persistentVolumeClaim.claimName}{"\\n"}{end}'); do pv=$(${KUBERNETES_DISTRIBUTION_BINARY} get pvc $pvc -n ${NAMESPACE} --context ${CONTEXT} -o jsonpath='{.spec.volumeName}') && status=$(${KUBERNETES_DISTRIBUTION_BINARY} get pv $pv --context ${CONTEXT} -o jsonpath='{.status.phase}') && node=$(${KUBERNETES_DISTRIBUTION_BINARY} get pod $pod -n ${NAMESPACE} --context ${CONTEXT} -o jsonpath='{.spec.nodeName}') && zone=$(${KUBERNETES_DISTRIBUTION_BINARY} get nodes $node --context ${CONTEXT} -o jsonpath='{.metadata.labels.topology\\.kubernetes\\.io/zone}') && ingressclass=$(${KUBERNETES_DISTRIBUTION_BINARY} get pvc $pvc -n ${NAMESPACE} --context ${CONTEXT} -o jsonpath='{.spec.storageClassName}') && accessmode=$(${KUBERNETES_DISTRIBUTION_BINARY} get pvc $pvc -n ${NAMESPACE} --context ${CONTEXT} -o jsonpath='{.status.accessModes[0]}') && reclaimpolicy=$(${KUBERNETES_DISTRIBUTION_BINARY} get pv $pv --context ${CONTEXT} -o jsonpath='{.spec.persistentVolumeReclaimPolicy}') && csidriver=$(${KUBERNETES_DISTRIBUTION_BINARY} get pv $pv --context ${CONTEXT} -o jsonpath='{.spec.csi.driver}')&& echo -e "\\n---\\nPod: $pod\\nPVC: $pvc\\nPV: $pv\\nStatus: $status\\nNode: $node\\nZone: $zone\\nIngressClass: $ingressclass\\nAccessModes: $accessmode\\nReclaimPolicy: $reclaimpolicy\\nCSIDriver: $csidriver\\n"; done; done
... env=${env}
@@ -92,9 +116,18 @@ List Pods with Attached Volumes and Related PersistentVolume Details
RW.Core.Add Pre To Report ${pod_storage_report.stdout}
RW.Core.Add Pre To Report Commands Used:\n${history}
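
The pivot this report relies on is reading `spec.volumes[*].persistentVolumeClaim.claimName` out of each pod, then chasing PVC to PV to node. A minimal Python sketch of the first hop, assuming an illustrative pod manifest:

```python
import json

def pvc_claims_for_pod(pod_json: str) -> list[str]:
    """List PVC claim names referenced by a pod's spec.volumes,
    the same path the task reads via jsonpath."""
    vols = json.loads(pod_json).get("spec", {}).get("volumes", [])
    return [
        v["persistentVolumeClaim"]["claimName"]
        for v in vols
        if "persistentVolumeClaim" in v  # skip emptyDir, configMap, etc.
    ]

# Hypothetical pod spec excerpt.
pod = json.dumps({"spec": {"volumes": [
    {"name": "data", "persistentVolumeClaim": {"claimName": "data-pvc"}},
    {"name": "tmp", "emptyDir": {}},
]}})
print(pvc_claims_for_pod(pod))  # ['data-pvc']
```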

Fetch the Storage Utilization for PVC Mounts
Fetch the Storage Utilization for PVC Mounts In Namespace `${NAMESPACE}`
[Documentation] For each pod in a namespace, fetch the utilization of any PersistentVolumeClaims mounted using the Linux df command. Requires kubectl exec permissions.
[Tags] pod storage pvc utilization capacity persistentvolumeclaims persistentvolumeclaim check pvc check event output and related nodes, PersistentVolumes, PersistentVolumeClaims, image registry authentication, or fluxcd or argocd logs.
[Tags]
... pod
... storage
... pvc
... utilization
... capacity
... persistentvolumeclaims
... persistentvolumeclaim
... check pvc
... check event output and related nodes, persistentvolumes, persistentvolumeclaims, image registry authentication, or fluxcd or argocd logs.
${pod_pvc_utilization}= RW.CLI.Run Cli
... cmd=for pod in $(${KUBERNETES_DISTRIBUTION_BINARY} get pods -n ${NAMESPACE} --field-selector=status.phase=Running --context ${CONTEXT} -o jsonpath='{range .items[*]}{.metadata.name}{"\\n"}{end}'); do for pvc in $(${KUBERNETES_DISTRIBUTION_BINARY} get pods $pod -n ${NAMESPACE} --context ${CONTEXT} -o jsonpath='{range .spec.volumes[*]}{.persistentVolumeClaim.claimName}{"\\n"}{end}'); do for volumeName in $(${KUBERNETES_DISTRIBUTION_BINARY} get pod $pod -n ${NAMESPACE} --context ${CONTEXT} -o json | jq -r '.spec.volumes[] | select(has("persistentVolumeClaim")) | .name'); do mountPath=$(${KUBERNETES_DISTRIBUTION_BINARY} get pod $pod -n ${NAMESPACE} --context ${CONTEXT} -o json | jq -r --arg vol "$volumeName" '.spec.containers[].volumeMounts[] | select(.name == $vol) | .mountPath'); containerName=$(${KUBERNETES_DISTRIBUTION_BINARY} get pod $pod -n ${NAMESPACE} --context ${CONTEXT} -o json | jq -r --arg vol "$volumeName" '.spec.containers[] | select(.volumeMounts[].name == $vol) | .name'); echo -e "\\n---\\nPod: $pod, PVC: $pvc, volumeName: $volumeName, containerName: $containerName, mountPath: $mountPath"; ${KUBERNETES_DISTRIBUTION_BINARY} exec $pod -n ${NAMESPACE} --context ${CONTEXT} -c $containerName -- df -h $mountPath; done; done; done;
... env=${env}
@@ -116,15 +149,24 @@ Fetch the Storage Utilization for PVC Mounts
... set_issue_title=PVC Storage Utilization As Reported by Pod
... set_issue_details=Found excessive PVC Utilization for: \n${unhealthy_volume_capacity.stdout}
... _line__raise_issue_if_contains=Pod
... set_issue_next_steps=Clean up or expand Persistent Volume Claims for: \n ${unhealthy_volume_list.stdout}
... set_issue_next_steps=Clean up or expand `PersistentVolumeClaims` in namespace `${NAMESPACE}` for: \n ${unhealthy_volume_list.stdout}
${history}= RW.CLI.Pop Shell History
RW.Core.Add Pre To Report Summary of PVC storage mount utilization in ${NAMESPACE}:
RW.Core.Add Pre To Report ${pod_pvc_utilization.stdout}
RW.Core.Add Pre To Report Commands Used:\n${history}
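
The utilization check ultimately reads the `Use%` column of `df -h` run inside each container. A small Python sketch of that parsing step, with a made-up threshold and sample output (the default df column layout is assumed):

```python
def flag_full_mounts(df_output: str, threshold: int = 95) -> list[str]:
    """Parse `df -h` output and return mount paths whose Use% is at or
    above the threshold. Assumes the standard six-column layout."""
    flagged = []
    for line in df_output.splitlines()[1:]:  # skip the header row
        cols = line.split()
        if len(cols) < 6:
            continue
        use_pct = int(cols[4].rstrip("%"))
        if use_pct >= threshold:
            flagged.append(cols[5])
    return flagged

# Hypothetical df -h output from a pod exec.
sample = (
    "Filesystem      Size  Used Avail Use% Mounted on\n"
    "/dev/sdb        9.8G  9.4G  0.4G  96% /data\n"
    "/dev/sdc        9.8G  1.2G  8.6G  13% /cache\n"
)
print(flag_full_mounts(sample))  # ['/data']
```

Mounts that show up here are the ones the task's issue suggests cleaning up or expanding.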

Check for RWO Persistent Volume Node Attachment Issues
[Documentation] For each pod in a namespace, check if it has an RWO persistent volume claim and if so, validate that the pod and the pv are on the same node.
[Tags] pod storage pvc readwriteonce node persistentvolumeclaims persistentvolumeclaim scheduled attachment
Check for RWO Persistent Volume Node Attachment Issues In Namespace `${NAMESPACE}`
[Documentation] For each pod in a namespace, check if it has an RWO persistent volume claim and if so, validate that the pod and the pv are on the same node.
[Tags]
... pod
... storage
... pvc
... readwriteonce
... node
... persistentvolumeclaims
... persistentvolumeclaim
... scheduled
... attachment
${pod_rwo_node_and_pod_attachment}= RW.CLI.Run Cli
... cmd=NAMESPACE="${NAMESPACE}"; CONTEXT="${CONTEXT}"; PODS=$(kubectl get pods -n $NAMESPACE --context=$CONTEXT -o json); for pod in $(jq -r '.items[] | @base64' <<< "$PODS"); do _jq() { jq -r \${1} <<< "$(base64 --decode <<< \${pod})"; }; POD_NAME=$(_jq '.metadata.name'); POD_NODE_NAME=$(kubectl get pod $POD_NAME -n $NAMESPACE --context=$CONTEXT -o custom-columns=:.spec.nodeName --no-headers); PVC_NAMES=$(kubectl get pod $POD_NAME -n $NAMESPACE --context=$CONTEXT -o jsonpath='{.spec.volumes[*].persistentVolumeClaim.claimName}'); for pvc_name in $PVC_NAMES; do PVC=$(kubectl get pvc $pvc_name -n $NAMESPACE --context=$CONTEXT -o json); ACCESS_MODE=$(jq -r '.spec.accessModes[0]' <<< "$PVC"); if [[ "$ACCESS_MODE" == "ReadWriteOnce" ]]; then PV_NAME=$(jq -r '.spec.volumeName' <<< "$PVC"); STORAGE_NODE_NAME=$(jq -r --arg pv "$PV_NAME" '.items[] | select(.status.volumesAttached != null) | select(.status.volumesInUse[] | contains($pv)) | .metadata.name' <<< "$(kubectl get nodes --context=$CONTEXT -o json)"); echo "-----"; if [[ "$POD_NODE_NAME" == "$STORAGE_NODE_NAME" ]]; then echo "OK: Pod and Storage Node Matched"; else echo "Error: Pod and Storage Node Mismatched - If the issue persists, the node requires attention."; fi; echo "Pod: $POD_NAME"; echo "PVC: $pvc_name"; echo "PV: $PV_NAME"; echo "Node with Pod: $POD_NODE_NAME"; echo "Node with Storage: $STORAGE_NODE_NAME"; echo; fi; done; done
... env=${env}
@@ -137,9 +179,11 @@ Check for RWO Persistent Volume Node Attachment Issues
... set_issue_actual=Pods with RWO found on a different node than their RWO storage: ${NAMESPACE}
... set_issue_title=Pods with RWO storage may have storage scheduling issues in namespace: ${NAMESPACE}
... set_issue_details=All Pods and RWO their storage details are:\n\n$_stdout\n\n
... set_issue_next_steps=List `Pods` in namespace `${NAMESPACE}` and review the `Nodes` they're scheduled on\nReview Kubernetes `Scheduler` logs\nCheck `Node Affinity` and `Taints/Tolerations`
... _line__raise_issue_if_contains=Error
${history}= RW.CLI.Pop Shell History
RW.Core.Add Pre To Report Summary of Pods with RWO storage and their node scheduling details for namespace: ${NAMESPACE}:
RW.Core.Add Pre To Report
... Summary of Pods with RWO storage and their node scheduling details for namespace: ${NAMESPACE}:
RW.Core.Add Pre To Report ${pod_rwo_node_and_pod_attachment.stdout}
RW.Core.Add Pre To Report Commands Used:\n${history}
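
The core comparison in this task is: for each ReadWriteOnce PV, find the node whose `status.volumesInUse` references that PV and check it matches the node the pod is scheduled on. A Python sketch of that lookup, over an illustrative `kubectl get nodes -o json` excerpt:

```python
import json

def storage_node_for_pv(nodes_json: str, pv_name: str):
    """Return the name of the node whose status.volumesInUse references
    the given PV, or None. Mirrors the jq selection in the task's cmd."""
    for node in json.loads(nodes_json).get("items", []):
        status = node.get("status", {})
        if not status.get("volumesAttached"):
            continue  # node has no attached volumes at all
        if any(pv_name in vol for vol in status.get("volumesInUse", [])):
            return node["metadata"]["name"]
    return None

# Hypothetical nodes payload; the CSI volume handle format is illustrative.
nodes = json.dumps({"items": [
    {"metadata": {"name": "node-a"},
     "status": {"volumesAttached": [{"name": "attachment"}],
                "volumesInUse": ["kubernetes.io/csi/driver^pv-123"]}},
    {"metadata": {"name": "node-b"}, "status": {}},
]})
pod_node = "node-b"
storage_node = storage_node_for_pv(nodes, "pv-123")
print("mismatch" if pod_node != storage_node else "ok")  # mismatch
```

A mismatch here is what makes the task emit its "Pod and Storage Node Mismatched" error line.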

8 changes: 4 additions & 4 deletions codebundles/k8s-statefulset-healthcheck/runbook.robot
@@ -14,7 +14,7 @@ Suite Setup Suite Initialization


*** Tasks ***
Fetch StatefulSet Logs
Fetch StatefulSet `${STATEFULSET_NAME}` Logs
[Documentation] Fetches the last 100 lines of logs for the given statefulset in the namespace.
[Tags] fetch log pod container errors inspect trace info statefulset
${logs}= RW.CLI.Run Cli
@@ -26,7 +26,7 @@ Fetch StatefulSet Logs
RW.Core.Add Pre To Report ${logs.stdout}
RW.Core.Add Pre To Report Commands Used: ${history}

Get Related StatefulSet Events
Get Related StatefulSet `${STATEFULSET_NAME}` Events
[Documentation] Fetches events related to the StatefulSet workload in the namespace.
[Tags] events workloads errors warnings get statefulset
${events}= RW.CLI.Run Cli
@@ -38,7 +38,7 @@ Get Related StatefulSet Events
RW.Core.Add Pre To Report ${events.stdout}
RW.Core.Add Pre To Report Commands Used: ${history}

Fetch StatefulSet Manifest Details
Fetch StatefulSet `${STATEFULSET_NAME}` Manifest Details
[Documentation] Fetches the current state of the statefulset manifest for inspection.
[Tags] statefulset details manifest info
${statefulset}= RW.CLI.Run Cli
@@ -50,7 +50,7 @@ Fetch StatefulSet Manifest Details
RW.Core.Add Pre To Report ${statefulset.stdout}
RW.Core.Add Pre To Report Commands Used: ${history}

List StatefulSets with Unhealthy Replica Counts
List StatefulSets with Unhealthy Replica Counts In Namespace `${NAMESPACE}`
[Documentation] Pulls the replica information for a given StatefulSet and checks if it's highly available
... , if the replica counts are the expected / healthy values, and if not, what they should be.
[Tags]
