After determining that a pod is being CPU throttled or OOMKilled, increase the appropriate resource limits for it. This procedure returns Kubernetes pods to a healthy state with resources available.
**Prerequisites**

- `kubectl` is installed.
- The names of the pods hitting their resource limits are known. See Determine if Pods are Hitting Resource Limits.
1. Determine the current limits of a pod.

   In the example below, `cray-hbtd-etcd-8r2scmpb58` is the `POD_ID` being used.

   ```
   ncn-w001# kubectl get po -n services POD_ID -o yaml
   ```

   Look for the `resources` section in the output:

   ```
   resources:
     limits:
       cpu: "2"
       memory: 2Gi
     requests:
       cpu: 10m
       memory: 64Mi
   ```
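As a shortcut, the `resources` stanza alone can be printed with a JSONPath query instead of scanning the full YAML. This is a sketch that assumes a single-container pod; the command is printed rather than executed here, since it requires access to a live cluster.

```shell
# Sketch: print just the resources of a pod's first container via JSONPath.
# POD_ID below is the example pod name from this procedure.
POD_ID="cray-hbtd-etcd-8r2scmpb58"
CMD="kubectl get po -n services ${POD_ID} -o jsonpath='{.spec.containers[0].resources}'"
# Printed rather than run, since it needs cluster access.
echo "${CMD}"
```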
2. Determine which Kubernetes entity (`etcdcluster`, `deployment`, `statefulset`) is creating the pod.

   The Kubernetes entity can be found with either of the following options:

   - List the common entity types and `grep` for the pod in question. Replace `hbtd-etcd` with the name of the pod being used.

     ```
     ncn-w001# kubectl get deployment,statefulset,etcdcluster,postgresql,daemonsets \
                   -A | grep hbtd-etcd
     services   etcdcluster.etcd.database.coreos.com/cray-hbtd-etcd   32d
     ```

   - Describe the pod and look in the `Labels` section, which is helpful for tracking down the entity creating the pod.

     ```
     ncn-w001# kubectl describe pod -n services POD_ID
     ...
     Labels:  app=etcd
              etcd_cluster=cray-hbtd-etcd
              etcd_node=cray-hbtd-etcd-8r2scmpb58
     ...
     ```
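A third option, not shown above, is to read the pod's `ownerReferences`, which names the owning object directly. This assumes the entity sets an owner reference on its pods, which most controllers do. Again a sketch, with the command printed rather than executed because it needs cluster access:

```shell
# Sketch: print the kind/name of whatever owns the pod.
# POD_ID below is the example pod name from this procedure.
POD_ID="cray-hbtd-etcd-8r2scmpb58"
CMD="kubectl get po -n services ${POD_ID} -o jsonpath='{.metadata.ownerReferences[0].kind}/{.metadata.ownerReferences[0].name}'"
# Printed rather than run, since it needs cluster access.
echo "${CMD}"
```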
3. Edit the entity.

   In the example below, the `ENTITY` is `etcdcluster` and the `CLUSTER_NAME` is `cray-hbtd-etcd`.

   ```
   ncn-w001# kubectl edit ENTITY -n services CLUSTER_NAME
   ```
4. Increase the resource limits for the pod.

   Find the following line in the entity's spec:

   ```
   resources: {}
   ```

   Replace it with a section like the following, increasing the `limits` value(s):

   ```
   resources:
     limits:
       cpu: "4"
       memory: 8Gi
     requests:
       cpu: 10m
       memory: 64Mi
   ```
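For a scriptable alternative to the interactive `kubectl edit`, the same change can be applied with `kubectl patch`. This is a sketch under the assumption that the `etcdcluster` resource exposes pod resources at `.spec.pod.resources`; for a deployment or statefulset the path would instead be `.spec.template.spec.containers[0].resources`. The command is printed rather than executed, since it needs cluster access:

```shell
# Sketch: non-interactive edit via a merge patch.
# Assumption: the etcdcluster spec carries resources under .spec.pod.resources.
PATCH='{"spec":{"pod":{"resources":{"limits":{"cpu":"4","memory":"8Gi"},"requests":{"cpu":"10m","memory":"64Mi"}}}}}'
CMD="kubectl patch etcdcluster -n services cray-hbtd-etcd --type merge -p '${PATCH}'"
# Printed rather than run, since it needs cluster access.
echo "${CMD}"
```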
5. Run a rolling restart of the pods.

   List the pods belonging to the entity:

   ```
   ncn-w001# kubectl get po -n services | grep CLUSTER_NAME
   cray-hbtd-etcd-8r2scmpb58   1/1   Running   0   5d11h
   cray-hbtd-etcd-qvz4zzjzw2   1/1   Running   0   5d11h
   cray-hbtd-etcd-vzjzmbn6nr   1/1   Running   0   5d11h
   ```
6. Delete the pods one at a time.

   ```
   ncn-w001# kubectl -n services delete pod POD_ID
   ```
7. Wait for a replacement pod to come up and be in a `Running` state before proceeding to the next pod.

   When finished, all pods should be `Running` with a more recent age:

   ```
   ncn-w001# kubectl get po -n services | grep CLUSTER_NAME
   cray-hbtd-etcd-8r2scmpb58   1/1   Running   0   12s
   cray-hbtd-etcd-qvz4zzjzw2   1/1   Running   0   32s
   cray-hbtd-etcd-vzjzmbn6nr   1/1   Running   0   98s
   ```
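The delete-and-wait cycle above can be sketched as a loop. `DRY_RUN` is set here so the loop only prints the commands it would run; unset it (and adjust the pod list) on a system with cluster access. The pod names are the examples from this procedure, and the `etcd_cluster` label selector is taken from the `Labels` section shown earlier.

```shell
# Sketch of the one-at-a-time rolling restart.
DRY_RUN=1
PODS="cray-hbtd-etcd-8r2scmpb58 cray-hbtd-etcd-qvz4zzjzw2 cray-hbtd-etcd-vzjzmbn6nr"

run() {
    if [ -n "${DRY_RUN}" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

rolling_restart() {
    for pod in ${PODS}; do
        run kubectl -n services delete pod "${pod}"
        # The replacement comes up with a new name, so wait on the
        # cluster label rather than the old pod name.
        run kubectl -n services wait --for=condition=Ready \
            pod -l etcd_cluster=cray-hbtd-etcd --timeout=300s
    done
}

rolling_restart
```

Waiting on the label selector keeps the loop simple, at the cost of also matching the two pods not yet deleted; since those are already Ready, the wait still blocks only on the replacement.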