diff --git a/codebundles/k8s-pvc-healthcheck/runbook.robot b/codebundles/k8s-pvc-healthcheck/runbook.robot
index 6a341140..a0729a65 100644
--- a/codebundles/k8s-pvc-healthcheck/runbook.robot
+++ b/codebundles/k8s-pvc-healthcheck/runbook.robot
@@ -17,9 +17,16 @@ Suite Setup    Suite Initialization
 
 *** Tasks ***
-Fetch Events for Unhealthy Kubernetes PersistentVolumeClaims
+Fetch Events for Unhealthy Kubernetes PersistentVolumeClaims In Namespace `${NAMESPACE}`
     [Documentation]    Lists events related to PersistentVolumeClaims within the namespace that are not bound to PersistentVolumes.
-    [Tags]    pvc    list    kubernetes    storage    persistentvolumeclaim    persistentvolumeclaims events    check event output and related nodes, PersistentVolumes, PersistentVolumeClaims, image registry authenticaiton, or fluxcd or argocd logs.
+    [Tags]
+    ...    pvc
+    ...    list
+    ...    kubernetes
+    ...    storage
+    ...    persistentvolumeclaim
+    ...    persistentvolumeclaims events
+    ...    check event output and related nodes, persistentvolumes, persistentvolumeclaims, image registry authentication, or fluxcd or argocd logs.
     ${unbound_pvc_events}=    RW.CLI.Run Cli
     ...    cmd=for pvc in $(${KUBERNETES_DISTRIBUTION_BINARY} get pvc -n ${NAMESPACE} --context ${CONTEXT} -o json | jq -r '.items[] | select(.status.phase != "Bound") | .metadata.name'); do ${KUBERNETES_DISTRIBUTION_BINARY} get events -n ${NAMESPACE} --context ${CONTEXT} --field-selector involvedObject.name=$pvc -o json | jq '.items[]| "Last Timestamp: " + .lastTimestamp + " Name: " + .involvedObject.name + " Message: " + .message'; done
     ...    env=${env}
@@ -34,16 +41,17 @@ Fetch Events for Unhealthy Kubernetes PersistentVolumeClaims
     ...    set_issue_expected=PVCs should be bound
     ...    set_issue_actual=PVCs found pending with the following events
     ...    set_issue_title=PVC Errors & Events In Namespace ${NAMESPACE}
-    ...    set_issue_details=We found "$line" in the namespace ${NAMESPACE}\nReview list of unbound PersistentVolumeClaims - check node events, application configurations, StorageClasses and CSI drivers.
+    ...    set_issue_details=We found "$line" in the namespace ${NAMESPACE}
+    ...    set_issue_next_steps=Review list of unbound `PersistentVolumeClaims` in namespace `${NAMESPACE}`\nCheck `Node` `Events`, `StorageClasses` and `CSI drivers`\nReview your application configurations
     ...    line__raise_issue_if_contains=Name
     ${history}=    RW.CLI.Pop Shell History
     RW.Core.Add Pre To Report    Summary of events for unbound pvc in ${NAMESPACE}:
     RW.Core.Add Pre To Report    ${unbound_pvc_events.stdout}
     RW.Core.Add Pre To Report    Commands Used:\n${history}
 
-List PersistentVolumeClaims in Terminating State
+List PersistentVolumeClaims in Terminating State In Namespace `${NAMESPACE}`
     [Documentation]    Lists persistentvolumeclaims in a Terminating state.
-    [Tags]    pvc    list    kubernetes    storage    persistentvolumeclaim    terminating    check PersistentVolumes
+    [Tags]    pvc    list    kubernetes    storage    persistentvolumeclaim    terminating    check persistentvolumes
     ${terminating_pvcs}=    RW.CLI.Run Cli
     ...    cmd=namespace=${NAMESPACE}; context=${CONTEXT}; ${KUBERNETES_DISTRIBUTION_BINARY} get pvc -n $namespace --context=$context -o json | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | .metadata.name as $name | .metadata.deletionTimestamp as $deletion_time | .metadata.finalizers as $finalizers | "\\($name) is in Terminating state (Deletion started at: \\($deletion_time)). Finalizers: \\($finalizers)"'
     ...    env=${env}
@@ -54,10 +62,17 @@ List PersistentVolumeClaims in Terminating State
     RW.Core.Add Pre To Report    ${terminating_pvcs.stdout}
     RW.Core.Add Pre To Report    Commands Used:\n${history}
 
-
-List PersistentVolumes in Terminating State
+List PersistentVolumes in Terminating State In Namespace `${NAMESPACE}`
     [Documentation]    Lists events related to persistent volumes in Terminating state.
-    [Tags]    pv    list    kubernetes    storage    persistentvolume    terminating    events    check event output and related nodes, PersistentVolumes, PersistentVolumeClaims, image registry authenticaiton, or fluxcd or argocd logs.
+    [Tags]
+    ...    pv
+    ...    list
+    ...    kubernetes
+    ...    storage
+    ...    persistentvolume
+    ...    terminating
+    ...    events
+    ...    check event output and related nodes, persistentvolumes, persistentvolumeclaims, image registry authentication, or fluxcd or argocd logs.
     ${dangline_pvcs}=    RW.CLI.Run Cli
     ...    cmd=for pv in $(${KUBERNETES_DISTRIBUTION_BINARY} get pv --context ${CONTEXT} -o json | jq -r '.items[] | select(.status.phase == "Terminating") | .metadata.name'); do ${KUBERNETES_DISTRIBUTION_BINARY} get events --all-namespaces --field-selector involvedObject.name=$pv --context ${CONTEXT} -o json | jq '.items[]| "Last Timestamp: " + .lastTimestamp + " Name: " + .involvedObject.name + " Message: " + .message'; done
     ...    env=${env}
@@ -72,16 +87,25 @@ List PersistentVolumes in Terminating State
     ...    set_issue_expected=PV should not be stuck terminating.
     ...    set_issue_actual=PV is in a terminating state.
     ...    set_issue_title=PV Events While Terminating In Namespace ${NAMESPACE}
-    ...    set_issue_details=We found "$_line" in the namespace ${NAMESPACE}\nCheck the status of terminating PersistentVolumeClaims over the next few minutes, they should disappear. If not, check that deployments or statefulsets attached to the PersistentVolumeClaims are scaled down and pods attached to the PersistentVolumeClaims are not running.
+    ...    set_issue_details=We found "$_line" in the namespace ${NAMESPACE}
+    ...    set_issue_next_steps=Review `PersistentVolumeClaims` in `${NAMESPACE}` after waiting a couple of minutes to see if they resolve\nCheck the health of `Deployments` and `StatefulSets` mounting the volumes in `${NAMESPACE}`\nEnsure no `Pods` attached to the `PersistentVolumeClaims` are status=`Running` in namespace `${NAMESPACE}`, as this can prevent them from terminating
     ...    _line__raise_issue_if_contains=Name
     ${history}=    RW.CLI.Pop Shell History
     RW.Core.Add Pre To Report    Summary of events for dangling persistent volumes:
     RW.Core.Add Pre To Report    ${dangline_pvcs.stdout}
     RW.Core.Add Pre To Report    Commands Used:\n${history}
 
-List Pods with Attached Volumes and Related PersistentVolume Details
+List Pods with Attached Volumes and Related PersistentVolume Details In Namespace `${NAMESPACE}`
     [Documentation]    For each pod in a namespace, collect details on configured PersistentVolumeClaim, PersistentVolume, and node.
-    [Tags]    pod    storage    pvc    pv    status    csi    storagereport    check event output and related nodes, PersistentVolumes, PersistentVolumeClaims, image registry authenticaiton, or fluxcd or argocd logs.
+    [Tags]
+    ...    pod
+    ...    storage
+    ...    pvc
+    ...    pv
+    ...    status
+    ...    csi
+    ...    storagereport
+    ...    check event output and related nodes, persistentvolumes, persistentvolumeclaims, image registry authentication, or fluxcd or argocd logs.
     ${pod_storage_report}=    RW.CLI.Run Cli
     ...    cmd=for pod in $(${KUBERNETES_DISTRIBUTION_BINARY} get pods -n ${NAMESPACE} --field-selector=status.phase=Running --context ${CONTEXT} -o jsonpath='{range .items[*]}{.metadata.name}{"\\n"}{end}'); do for pvc in $(${KUBERNETES_DISTRIBUTION_BINARY} get pods $pod -n ${NAMESPACE} --context ${CONTEXT} -o jsonpath='{range .spec.volumes[*]}{.persistentVolumeClaim.claimName}{"\\n"}{end}'); do pv=$(${KUBERNETES_DISTRIBUTION_BINARY} get pvc $pvc -n ${NAMESPACE} --context ${CONTEXT} -o jsonpath='{.spec.volumeName}') && status=$(${KUBERNETES_DISTRIBUTION_BINARY} get pv $pv --context ${CONTEXT} -o jsonpath='{.status.phase}') && node=$(${KUBERNETES_DISTRIBUTION_BINARY} get pod $pod -n ${NAMESPACE} --context ${CONTEXT} -o jsonpath='{.spec.nodeName}') && zone=$(${KUBERNETES_DISTRIBUTION_BINARY} get nodes $node --context ${CONTEXT} -o jsonpath='{.metadata.labels.topology\\.kubernetes\\.io/zone}') && ingressclass=$(${KUBERNETES_DISTRIBUTION_BINARY} get pvc $pvc -n ${NAMESPACE} --context ${CONTEXT} -o jsonpath='{.spec.storageClassName}') && accessmode=$(${KUBERNETES_DISTRIBUTION_BINARY} get pvc $pvc -n ${NAMESPACE} --context ${CONTEXT} -o jsonpath='{.status.accessModes[0]}') && reclaimpolicy=$(${KUBERNETES_DISTRIBUTION_BINARY} get pv $pv --context ${CONTEXT} -o jsonpath='{.spec.persistentVolumeReclaimPolicy}') && csidriver=$(${KUBERNETES_DISTRIBUTION_BINARY} get pv $pv --context ${CONTEXT} -o jsonpath='{.spec.csi.driver}')&& echo -e "\\n---\\nPod: $pod\\nPVC: $pvc\\nPV: $pv\\nStatus: $status\\nNode: $node\\nZone: $zone\\nIngressClass: $ingressclass\\nAccessModes: $accessmode\\nReclaimPolicy: $reclaimpolicy\\nCSIDriver: $csidriver\\n"; done; done
     ...    env=${env}
@@ -92,9 +116,18 @@ List Pods with Attached Volumes and Related PersistentVolume Details
     RW.Core.Add Pre To Report    ${pod_storage_report.stdout}
     RW.Core.Add Pre To Report    Commands Used:\n${history}
 
-Fetch the Storage Utilization for PVC Mounts
+Fetch the Storage Utilization for PVC Mounts In Namespace `${NAMESPACE}`
     [Documentation]    For each pod in a namespace, fetch the utilization of any PersistentVolumeClaims mounted using the linux df command. Requires kubectl exec permissions.
-    [Tags]    pod    storage    pvc    utilization    capacity    persistentvolumeclaims    persistentvolumeclaim    check pvc    check event output and related nodes, PersistentVolumes, PersistentVolumeClaims, image registry authenticaiton, or fluxcd or argocd logs.
+    [Tags]
+    ...    pod
+    ...    storage
+    ...    pvc
+    ...    utilization
+    ...    capacity
+    ...    persistentvolumeclaims
+    ...    persistentvolumeclaim
+    ...    check pvc
+    ...    check event output and related nodes, persistentvolumes, persistentvolumeclaims, image registry authentication, or fluxcd or argocd logs.
     ${pod_pvc_utilization}=    RW.CLI.Run Cli
     ...    cmd=for pod in $(${KUBERNETES_DISTRIBUTION_BINARY} get pods -n ${NAMESPACE} --field-selector=status.phase=Running --context ${CONTEXT} -o jsonpath='{range .items[*]}{.metadata.name}{"\\n"}{end}'); do for pvc in $(${KUBERNETES_DISTRIBUTION_BINARY} get pods $pod -n ${NAMESPACE} --context ${CONTEXT} -o jsonpath='{range .spec.volumes[*]}{.persistentVolumeClaim.claimName}{"\\n"}{end}'); do for volumeName in $(${KUBERNETES_DISTRIBUTION_BINARY} get pod $pod -n ${NAMESPACE} --context ${CONTEXT} -o json | jq -r '.spec.volumes[] | select(has("persistentVolumeClaim")) | .name'); do mountPath=$(${KUBERNETES_DISTRIBUTION_BINARY} get pod $pod -n ${NAMESPACE} --context ${CONTEXT} -o json | jq -r --arg vol "$volumeName" '.spec.containers[].volumeMounts[] | select(.name == $vol) | .mountPath'); containerName=$(${KUBERNETES_DISTRIBUTION_BINARY} get pod $pod -n ${NAMESPACE} --context ${CONTEXT} -o json | jq -r --arg vol "$volumeName" '.spec.containers[] | select(.volumeMounts[].name == $vol) | .name'); echo -e "\\n---\\nPod: $pod, PVC: $pvc, volumeName: $volumeName, containerName: $containerName, mountPath: $mountPath"; ${KUBERNETES_DISTRIBUTION_BINARY} exec $pod -n ${NAMESPACE} --context ${CONTEXT} -c $containerName -- df -h $mountPath; done; done; done;
     ...    env=${env}
@@ -116,15 +149,24 @@ Fetch the Storage Utilization for PVC Mounts
     ...    set_issue_title=PVC Storage Utilization As Report by Pod
     ...    set_issue_details=Found excessive PVC Utilization for: \n${unhealthy_volume_capacity.stdout}
     ...    _line__raise_issue_if_contains=Pod
-    ...    set_issue_next_steps=Clean up or expand Persistent Volume Claims for: \n ${unhealthy_volume_list.stdout}
+    ...    set_issue_next_steps=Clean up or expand `PersistentVolumeClaims` in namespace `${NAMESPACE}` for: \n ${unhealthy_volume_list.stdout}
     ${history}=    RW.CLI.Pop Shell History
     RW.Core.Add Pre To Report    Summary of PVC storage mount utilization in ${NAMESPACE}:
     RW.Core.Add Pre To Report    ${pod_pvc_utilization.stdout}
     RW.Core.Add Pre To Report    Commands Used:\n${history}
 
-Check for RWO Persistent Volume Node Attachment Issues
-    [Documentation]    For each pod in a namespace, check if it has an RWO persistent volume claim and if so, validate that the pod and the pv are on the same node.
-    [Tags]    pod    storage    pvc    readwriteonce    node    persistentvolumeclaims    persistentvolumeclaim    scheduled    attachment
+Check for RWO Persistent Volume Node Attachment Issues In Namespace `${NAMESPACE}`
+    [Documentation]    For each pod in a namespace, check if it has an RWO persistent volume claim and if so, validate that the pod and the pv are on the same node.
+    [Tags]
+    ...    pod
+    ...    storage
+    ...    pvc
+    ...    readwriteonce
+    ...    node
+    ...    persistentvolumeclaims
+    ...    persistentvolumeclaim
+    ...    scheduled
+    ...    attachment
     ${pod_rwo_node_and_pod_attachment}=    RW.CLI.Run Cli
     ...    cmd=NAMESPACE="${NAMESPACE}"; CONTEXT="${CONTEXT}"; PODS=$(kubectl get pods -n $NAMESPACE --context=$CONTEXT -o json); for pod in $(jq -r '.items[] | @base64' <<< "$PODS"); do _jq() { jq -r \${1} <<< "$(base64 --decode <<< \${pod})"; }; POD_NAME=$(_jq '.metadata.name'); POD_NODE_NAME=$(kubectl get pod $POD_NAME -n $NAMESPACE --context=$CONTEXT -o custom-columns=:.spec.nodeName --no-headers); PVC_NAMES=$(kubectl get pod $POD_NAME -n $NAMESPACE --context=$CONTEXT -o jsonpath='{.spec.volumes[*].persistentVolumeClaim.claimName}'); for pvc_name in $PVC_NAMES; do PVC=$(kubectl get pvc $pvc_name -n $NAMESPACE --context=$CONTEXT -o json); ACCESS_MODE=$(jq -r '.spec.accessModes[0]' <<< "$PVC"); if [[ "$ACCESS_MODE" == "ReadWriteOnce" ]]; then PV_NAME=$(jq -r '.spec.volumeName' <<< "$PVC"); STORAGE_NODE_NAME=$(jq -r --arg pv "$PV_NAME" '.items[] | select(.status.volumesAttached != null) | select(.status.volumesInUse[] | contains($pv)) | .metadata.name' <<< "$(kubectl get nodes --context=$CONTEXT -o json)"); echo "-----"; if [[ "$POD_NODE_NAME" == "$STORAGE_NODE_NAME" ]]; then echo "OK: Pod and Storage Node Matched"; else echo "Error: Pod and Storage Node Mismatched - If the issue persists, the node requires attention."; fi; echo "Pod: $POD_NAME"; echo "PVC: $pvc_name"; echo "PV: $PV_NAME"; echo "Node with Pod: $POD_NODE_NAME"; echo "Node with Storage: $STORAGE_NODE_NAME"; echo; fi; done; done
     ...    env=${env}
@@ -137,9 +179,11 @@ Check for RWO Persistent Volume Node Attachment Issues
     ...    set_issue_actual=Pods with RWO found on a different node than their RWO storage: ${NAMESPACE}
     ...    set_issue_title=Pods with RWO storage might not have storage scheduling issues for namespace: ${NAMESPACE}
     ...    set_issue_details=All Pods and RWO their storage details are:\n\n$_stdout\n\n
+    ...    set_issue_next_steps=List `Pods` in namespace `${NAMESPACE}` and review the `Nodes` they're scheduled on\nReview Kubernetes `Scheduler` logs\nCheck `Node Affinity` and `Taints/Tolerations`
     ...    _line__raise_issue_if_contains=Error
     ${history}=    RW.CLI.Pop Shell History
-    RW.Core.Add Pre To Report    Summary of Pods with RWO storage and the nodes their scheduling details for namespace: ${NAMESPACE}:
+    RW.Core.Add Pre To Report
+    ...    Summary of Pods with RWO storage and their node scheduling details for namespace: ${NAMESPACE}:
     RW.Core.Add Pre To Report    ${pod_rwo_node_and_pod_attachment.stdout}
     RW.Core.Add Pre To Report    Commands Used:\n${history}
diff --git a/codebundles/k8s-statefulset-healthcheck/runbook.robot b/codebundles/k8s-statefulset-healthcheck/runbook.robot
index d2d70d5a..4821543b 100644
--- a/codebundles/k8s-statefulset-healthcheck/runbook.robot
+++ b/codebundles/k8s-statefulset-healthcheck/runbook.robot
@@ -14,7 +14,7 @@ Suite Setup    Suite Initialization
 
 *** Tasks ***
-Fetch StatefulSet Logs
+Fetch StatefulSet `${STATEFULSET_NAME}` Logs
     [Documentation]    Fetches the last 100 lines of logs for the given statefulset in the namespace.
     [Tags]    fetch    log    pod    container    errors    inspect    trace    info    statefulset
     ${logs}=    RW.CLI.Run Cli
@@ -26,7 +26,7 @@ Fetch StatefulSet Logs
     RW.Core.Add Pre To Report    ${logs.stdout}
     RW.Core.Add Pre To Report    Commands Used: ${history}
 
-Get Related StatefulSet Events
+Get Related StatefulSet `${STATEFULSET_NAME}` Events
     [Documentation]    Fetches events related to the StatefulSet workload in the namespace.
     [Tags]    events    workloads    errors    warnings    get    statefulset
     ${events}=    RW.CLI.Run Cli
@@ -38,7 +38,7 @@ Get Related StatefulSet Events
     RW.Core.Add Pre To Report    ${events.stdout}
     RW.Core.Add Pre To Report    Commands Used: ${history}
 
-Fetch StatefulSet Manifest Details
+Fetch StatefulSet `${STATEFULSET_NAME}` Manifest Details
     [Documentation]    Fetches the current state of the statefulset manifest for inspection.
     [Tags]    statefulset    details    manifest    info
     ${statefulset}=    RW.CLI.Run Cli
@@ -50,7 +50,7 @@ Fetch StatefulSet Manifest Details
     RW.Core.Add Pre To Report    ${statefulset.stdout}
     RW.Core.Add Pre To Report    Commands Used: ${history}
 
-List StatefulSets with Unhealthy Replica Counts
+List StatefulSets with Unhealthy Replica Counts In Namespace `${NAMESPACE}`
     [Documentation]    Pulls the replica information for a given StatefulSet and checks if it's highly available
     ...    , if the replica counts are the expected / healthy values, and if not, what they should be.
     [Tags]
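Reviewer note: the first task's behavior hinges on one jq filter that selects PVCs whose phase is anything other than `Bound`. A minimal sketch of that filter, run against canned PVC JSON (hypothetical claim names, no live cluster or `${KUBERNETES_DISTRIBUTION_BINARY}` required), assuming `jq` is installed:

```shell
# Canned stand-in for `kubectl get pvc -n $NAMESPACE -o json` (hypothetical data).
pvcs='{"items":[{"metadata":{"name":"data-web-0"},"status":{"phase":"Bound"}},{"metadata":{"name":"data-web-1"},"status":{"phase":"Pending"}}]}'

# Same selection the task feeds into its per-PVC events loop:
# keep only claims whose phase is not Bound.
printf '%s' "$pvcs" | jq -r '.items[] | select(.status.phase != "Bound") | .metadata.name'
# prints: data-web-1
```

Each name this emits becomes a `--field-selector involvedObject.name=$pvc` events query in the runbook's loop.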
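Similarly, the Terminating-state detection in `List PersistentVolumeClaims in Terminating State` reduces to a jq `deletionTimestamp` check; note the `\\(...)` in the robot file is Robot Framework escaping for jq's `\(...)` string interpolation. A sketch against canned data (hypothetical PVC, assuming `jq` is available):

```shell
# Hypothetical PVC stuck terminating: deletionTimestamp set, finalizer still present.
pvcs='{"items":[{"metadata":{"name":"data-old-0","deletionTimestamp":"2024-01-01T00:00:00Z","finalizers":["kubernetes.io/pvc-protection"]},"status":{"phase":"Bound"}}]}'

# Select items with a deletionTimestamp and interpolate name, time, and finalizers.
printf '%s' "$pvcs" | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | .metadata.name as $name | .metadata.deletionTimestamp as $deletion_time | .metadata.finalizers as $finalizers | "\($name) is in Terminating state (Deletion started at: \($deletion_time)). Finalizers: \($finalizers)"'
```

The lingering finalizer in the output is usually the lead to chase: something still holds the claim open.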
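The RWO attachment check iterates pods via `jq '.items[] | @base64'`, which is worth calling out: base64-encoding each item makes the word-splitting `for` loop safe against whitespace inside field values. A standalone sketch of that pattern with canned pod JSON (hypothetical pod names; assumes `jq` and `base64` are on PATH):

```shell
# Canned stand-in for `kubectl get pods -o json` (hypothetical data).
PODS='{"items":[{"metadata":{"name":"web-0"}},{"metadata":{"name":"web-1"}}]}'

# Each list item becomes one opaque base64 token, so the shell loop
# cannot mis-split it; decode inside the body to query fields.
for pod in $(printf '%s' "$PODS" | jq -r '.items[] | @base64'); do
  name=$(printf '%s' "$pod" | base64 --decode | jq -r '.metadata.name')
  echo "Pod: $name"
done
```

The runbook's `_jq` helper is the same idea wrapped in a function; from there it compares the pod's node against the node reporting the PV in `status.volumesInUse`.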