[BUG]Redis keeps crashing and cannot recover from memory/io fault #5095

Closed
ahjing99 opened this issue Sep 12, 2023 · 4 comments
Labels: bug, kind/bug, Stale

Comments

@ahjing99
Collaborator

➜ ~ kbcli version
Kubernetes: v1.27.3-gke.100
KubeBlocks: 0.7.0-alpha.8
kbcli: 0.7.0-alpha.8


`kbcli cluster create cluster-hihvoo --termination-policy=WipeOut --monitoring-interval=0 --enable-all-logs=false --cluster-definition=redis --set type=redis,cpu=m,memory=Gi,replicas=2,storage=5Gi --set type=redis-sentinel,cpu=m,memory=Gi,storage=5Gi,replicas=3 --namespace default`

Info: --cluster-version is not specified, ClusterVersion redis-7.0.6 is applied by default
Cluster cluster-hihvoo created


`kbcli fault stress --cpu-worker=5 --memory-worker=5 --memory-size=20Gi cluster-hihvoo-redis-0 --ns-fault=default --duration=2m`

StressChaos stress-chaos-64gxw created

➜  ~ k describe cluster cluster-hihvoo
Name:         cluster-hihvoo
Namespace:    default
Labels:       clusterdefinition.kubeblocks.io/name=redis
              clusterversion.kubeblocks.io/name=redis-7.0.6
Annotations:  kubeblocks.io/reconcile: 2023-09-12T07:07:20.998507843Z
API Version:  apps.kubeblocks.io/v1alpha1
Kind:         Cluster
Metadata:
  Creation Timestamp:  2023-09-12T06:47:17Z
  Finalizers:
    cluster.kubeblocks.io/finalizer
  Generation:  1
  Managed Fields:
    API Version:  apps.kubeblocks.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:affinity:
          .:
          f:nodeLabels:
          f:podAntiAffinity:
          f:tenancy:
          f:topologyKeys:
        f:backup:
          .:
          f:enabled:
          f:method:
          f:pitrEnabled:
          f:retentionPeriod:
        f:clusterDefinitionRef:
        f:clusterVersionRef:
        f:componentSpecs:
          .:
          k:{"name":"redis"}:
            .:
            f:componentDefRef:
            f:monitor:
            f:name:
            f:noCreatePDB:
            f:replicas:
            f:resources:
              .:
              f:limits:
                .:
                f:cpu:
                f:memory:
              f:requests:
                .:
                f:cpu:
                f:memory:
            f:serviceAccountName:
            f:switchPolicy:
              .:
              f:type:
            f:volumeClaimTemplates:
          k:{"name":"redis-sentinel"}:
            .:
            f:componentDefRef:
            f:monitor:
            f:name:
            f:noCreatePDB:
            f:replicas:
            f:resources:
              .:
              f:limits:
                .:
                f:cpu:
                f:memory:
              f:requests:
                .:
                f:cpu:
                f:memory:
            f:serviceAccountName:
            f:volumeClaimTemplates:
        f:terminationPolicy:
        f:tolerations:
    Manager:      kbcli
    Operation:    Update
    Time:         2023-09-12T06:47:17Z
    API Version:  apps.kubeblocks.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:clusterDefGeneration:
        f:components:
          .:
          f:redis:
            .:
            f:phase:
            f:podsReady:
            f:replicationSetStatus:
              .:
              f:primary:
                .:
                f:pod:
              f:secondaries:
          f:redis-sentinel:
            .:
            f:phase:
            f:podsReady:
            f:podsReadyTime:
        f:conditions:
        f:observedGeneration:
        f:phase:
    Manager:      manager
    Operation:    Update
    Subresource:  status
    Time:         2023-09-12T06:58:41Z
    API Version:  apps.kubeblocks.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubeblocks.io/reconcile:
        f:finalizers:
          .:
          v:"cluster.kubeblocks.io/finalizer":
        f:labels:
          .:
          f:clusterdefinition.kubeblocks.io/name:
          f:clusterversion.kubeblocks.io/name:
    Manager:         manager
    Operation:       Update
    Time:            2023-09-12T07:07:21Z
  Resource Version:  1022895
  UID:               a3d3aa7f-de79-44ba-9024-f849403cc985
Spec:
  Affinity:
    Node Labels:
    Pod Anti Affinity:  Preferred
    Tenancy:            SharedNode
    Topology Keys:
  Backup:
    Enabled:               false
    Method:                snapshot
    Pitr Enabled:          false
    Retention Period:      1d
  Cluster Definition Ref:  redis
  Cluster Version Ref:     redis-7.0.6
  Component Specs:
    Component Def Ref:  redis
    Monitor:            false
    Name:               redis
    No Create PDB:      false
    Replicas:           2
    Resources:
      Limits:
        Cpu:     0
        Memory:  0
      Requests:
        Cpu:               0
        Memory:            0
    Service Account Name:  kb-cluster-hihvoo
    Switch Policy:
      Type:  Noop
    Volume Claim Templates:
      Name:  data
      Spec:
        Access Modes:
          ReadWriteOnce
        Resources:
          Requests:
            Storage:    5Gi
    Component Def Ref:  redis-sentinel
    Monitor:            false
    Name:               redis-sentinel
    No Create PDB:      false
    Replicas:           3
    Resources:
      Limits:
        Cpu:     0
        Memory:  0
      Requests:
        Cpu:               0
        Memory:            0
    Service Account Name:  kb-cluster-hihvoo
    Volume Claim Templates:
      Name:  data
      Spec:
        Access Modes:
          ReadWriteOnce
        Resources:
          Requests:
            Storage:   5Gi
  Termination Policy:  WipeOut
  Tolerations:
Status:
  Cluster Def Generation:  2
  Components:
    Redis:
      Phase:       Abnormal
      Pods Ready:  false
      Replication Set Status:
        Primary:
          Pod:  cluster-hihvoo-redis-1
        Secondaries:
          Pod:  cluster-hihvoo-redis-0
    Redis - Sentinel:
      Phase:            Running
      Pods Ready:       true
      Pods Ready Time:  2023-09-12T06:48:42Z
  Conditions:
    Last Transition Time:  2023-09-12T06:47:17Z
    Message:               The operator has started the provisioning of Cluster: cluster-hihvoo
    Observed Generation:   1
    Reason:                PreCheckSucceed
    Status:                True
    Type:                  ProvisioningStarted
    Last Transition Time:  2023-09-12T06:58:41Z
    Message:               Successfully applied for resources
    Observed Generation:   1
    Reason:                ApplyResourcesSucceed
    Status:                True
    Type:                  ApplyResources
    Last Transition Time:  2023-09-12T06:58:41Z
    Message:               pods are not ready in Components: [redis], refer to related component message in Cluster.status.components
    Reason:                ReplicasNotReady
    Status:                False
    Type:                  ReplicasReady
    Last Transition Time:  2023-09-12T06:58:41Z
    Message:               pods are unavailable in Components: [redis], refer to related component message in Cluster.status.components
    Reason:                ComponentsNotReady
    Status:                False
    Type:                  Ready
  Observed Generation:     1
  Phase:                   Abnormal
Events:
  Type     Reason                    Age                    From                       Message
  ----     ------                    ----                   ----                       -------
  Normal   PreCheckSucceed           21m                    cluster-controller         The operator has started the provisioning of Cluster: cluster-hihvoo
  Normal   ApplyResourcesSucceed     21m                    cluster-controller         Successfully applied for resources
  Normal   ComponentPhaseTransition  21m (x2 over 21m)      cluster-controller         Create a new component
  Warning  ApplyResourcesFailed      21m (x10 over 21m)     cluster-controller         the number of primary pod is not equal to 1, primary pods: [cluster-hihvoo-redis-1 cluster-hihvoo-redis-0], emptyRole pods: []
  Warning  ApplyResourcesFailed      21m (x11 over 21m)     cluster-controller         the number of primary pod is not equal to 1, primary pods: [cluster-hihvoo-redis-0 cluster-hihvoo-redis-1], emptyRole pods: []
  Normal   SysAcctCreate             19m                    system-account-controller  Created accounts for cluster: cluster-hihvoo, component: redis, accounts: kbadmin
  Normal   SysAcctCreate             19m                    system-account-controller  Created accounts for cluster: cluster-hihvoo, component: redis, accounts: kbmonitoring
  Normal   SysAcctCreate             19m                    system-account-controller  Created accounts for cluster: cluster-hihvoo, component: redis, accounts: kbdataprotection
  Normal   SysAcctCreate             19m                    system-account-controller  Created accounts for cluster: cluster-hihvoo, component: redis, accounts: kbprobe
  Normal   WaitingForProbeSuccess    13m                    cluster-controller         Waiting for probe success
  Warning  ApplyResourcesFailed      10m                    cluster-controller         the number of primary pod is not equal to 1, primary pods: [], emptyRole pods: []
  Warning  BackOff                   8m17s                  event-controller           Pod cluster-hihvoo-redis-0: Back-off restarting failed container redis in pod cluster-hihvoo-redis-0_default(6d317472-73b3-4f43-b9cb-c5d3b2289edd)
  Warning  Unhealthy                 3m29s (x2 over 3m29s)  event-controller           Pod cluster-hihvoo-redis-0: Readiness probe failed: error: health rpc failed: rpc error: code = Unknown desc = {"event":"Success","originalRole":"secondary","role":"secondary"}
  Warning  Unhealthy                 84s (x12 over 17m)     event-controller           Pod cluster-hihvoo-redis-1: Readiness probe failed: error: health rpc failed: rpc error: code = Unknown desc = {"event":"Success","originalRole":"primary","role":"primary"}
➜  ~
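
The `ApplyResourcesFailed` events above report that the number of primary pods is not equal to 1 (first two primaries, later none). A minimal sketch for checking this by hand, assuming kubectl access to the cluster and that the role is exposed through the `kubeblocks.io/role` pod label, as it is on the pod described in this report. The helper name `count_primaries` is hypothetical.

```shell
# count_primaries (hypothetical helper): reads the output of
# `kubectl get pods -L kubeblocks.io/role --no-headers` on stdin and prints
# how many pods carry the role "primary".
# -L appends the label value as the last column, so match on $NF.
count_primaries() {
  awk '$NF == "primary" { n++ } END { print n + 0 }'
}

# Against a live cluster (not run here; selector taken from the pod labels in this report):
# kubectl get pods -n default \
#   -l app.kubernetes.io/instance=cluster-hihvoo,apps.kubeblocks.io/component-name=redis \
#   -L kubeblocks.io/role --no-headers | count_primaries
```

A result other than 1 corresponds to the condition the cluster-controller is complaining about.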

➜  ~ k describe StressChaos stress-chaos-64gxw
Name:         stress-chaos-64gxw
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  chaos-mesh.org/v1alpha1
Kind:         StressChaos
Metadata:
  Creation Timestamp:  2023-09-12T06:58:22Z
  Finalizers:
    chaos-mesh/records
  Generate Name:  stress-chaos-
  Generation:     8
  Managed Fields:
    API Version:  chaos-mesh.org/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:generateName:
      f:spec:
        .:
        f:duration:
        f:mode:
        f:selector:
          .:
          f:namespaces:
          f:pods:
            .:
            f:default:
        f:stressors:
          .:
          f:cpu:
            .:
            f:load:
            f:workers:
          f:memory:
            .:
            f:oomScoreAdj:
            f:size:
            f:workers:
      f:status:
        .:
        f:experiment:
    Manager:      kbcli
    Operation:    Update
    Time:         2023-09-12T06:58:22Z
    API Version:  chaos-mesh.org/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"chaos-mesh/records":
      f:status:
        f:conditions:
        f:experiment:
          f:containerRecords:
          f:desiredPhase:
    Manager:         chaos-controller-manager
    Operation:       Update
    Time:            2023-09-12T07:00:22Z
  Resource Version:  1018488
  UID:               7c8c9af1-ca16-4b3e-8165-e31ffeab0205
Spec:
  Duration:  2m
  Mode:      all
  Selector:
    Namespaces:
      default
    Pods:
      Default:
        cluster-hihvoo-redis-0
  Stressors:
    Cpu:
      Load:     0
      Workers:  5
    Memory:
      Oom Score Adj:  0
      Size:           20Gi
      Workers:        5
Status:
  Conditions:
    Status:  True
    Type:    Selected
    Status:  False
    Type:    AllInjected
    Status:  True
    Type:    AllRecovered
    Status:  False
    Type:    Paused
  Experiment:
    Container Records:
      Events:
        Operation:      Apply
        Timestamp:      2023-09-12T06:58:22Z
        Type:           Succeeded
        Operation:      Recover
        Timestamp:      2023-09-12T07:00:22Z
        Type:           Succeeded
      Id:               default/cluster-hihvoo-redis-0/redis
      Injected Count:   1
      Phase:            Not Injected
      Recovered Count:  1
      Selector Key:     .
    Desired Phase:      Stop
Events:
  Type    Reason           Age   From            Message
  ----    ------           ----  ----            -------
  Normal  FinalizerInited  11m   initFinalizers  Finalizer has been inited
  Normal  Updated          11m   initFinalizers  Successfully update finalizer of resource
  Normal  Started          11m   desiredphase    Experiment has started
  Normal  Updated          11m   desiredphase    Successfully update desiredPhase of resource
  Normal  Applied          11m   records         Successfully apply chaos for default/cluster-hihvoo-redis-0/redis
  Normal  Updated          11m   records         Successfully update records of resource
  Normal  TimeUp           9m3s  desiredphase    Time up according to the duration
  Normal  Updated          9m3s  desiredphase    Successfully update desiredPhase of resource
  Normal  Recovered        9m3s  records         Successfully recover chaos for default/cluster-hihvoo-redis-0/redis
  Normal  Updated          9m3s  records         Successfully update records of resource
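
The conditions above show `AllRecovered` as `True`, with the `Recover` event at 07:00:22Z, two minutes after `Apply`, so the stress itself ended on schedule while the redis container kept restarting. A small sketch for reading one condition out of `kubectl describe` text. `chaos_condition` is a hypothetical helper, and it assumes the describe layout shown here, where each `Status:` line precedes its `Type:` line.

```shell
# chaos_condition (hypothetical helper): reads `kubectl describe stresschaos ...`
# text on stdin and prints the Status of the condition whose Type equals $1.
chaos_condition() {
  awk -v want="$1" '
    $1 == "Status:" { status = $2 }                      # remember the most recent Status value
    $1 == "Type:" && $2 == want { print status; exit }   # emit it once the wanted Type appears
  '
}

# Against a live cluster (not run here):
# kubectl describe stresschaos stress-chaos-64gxw -n default | chaos_condition AllRecovered
```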

➜  ~ k get pod | grep redis
cluster-hihvoo-redis-0                          4/5     CrashLoopBackOff   7 (20s ago)   14m
cluster-hihvoo-redis-1                          5/5     Running            0             14m
cluster-hihvoo-redis-sentinel-0                 1/1     Running            0             22m
cluster-hihvoo-redis-sentinel-1                 1/1     Running            0             21m
cluster-hihvoo-redis-sentinel-2                 1/1     Running            0             21m
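
To see why the `redis` container exits, one could filter the listing for `CrashLoopBackOff` and ask kubectl for the previous container's logs. A sketch, assuming the default `kubectl get pod` column layout (STATUS is the third column). `crashlooping_pods` is a hypothetical helper name.

```shell
# crashlooping_pods (hypothetical helper): reads `kubectl get pod` output on
# stdin and prints the names of pods whose STATUS column is CrashLoopBackOff.
crashlooping_pods() {
  awk '$3 == "CrashLoopBackOff" { print $1 }'
}

# Against a live cluster (not run here; container name "redis" as in this report):
# kubectl get pod -n default --no-headers | crashlooping_pods |
#   while read -r pod; do
#     kubectl logs -n default "$pod" -c redis --previous --tail=50
#   done
```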


➜  ~ k describe pod cluster-hihvoo-redis-0
Name:         cluster-hihvoo-redis-0
Namespace:    default
Priority:     0
Node:         gke-yjtest-default-pool-47e27321-h6tl/10.128.15.203
Start Time:   Tue, 12 Sep 2023 14:54:50 +0800
Labels:       app.kubernetes.io/component=redis
              app.kubernetes.io/instance=cluster-hihvoo
              app.kubernetes.io/managed-by=kubeblocks
              app.kubernetes.io/name=redis
              app.kubernetes.io/version=redis-7.0.6
              apps.kubeblocks.io/component-name=redis
              apps.kubeblocks.io/workload-type=Replication
              controller-revision-hash=cluster-hihvoo-redis-55fc6b6d47
              kubeblocks.io/role=secondary
              rsm.workloads.kubeblocks.io/access-mode=Readonly
              statefulset.kubernetes.io/pod-name=cluster-hihvoo-redis-0
Annotations:  apps.kubeblocks.io/component-replicas: 2
              apps.kubeblocks.io/last-role-changed-event-timestamp: 2023-09-12T06:56:16Z
              rs.apps.kubeblocks.io/primary: cluster-hihvoo-redis-1
Status:       Running
IP:           10.104.2.28
IPs:
  IP:           10.104.2.28
Controlled By:  StatefulSet/cluster-hihvoo-redis
Init Containers:
  role-agent-installer:
    Container ID:  containerd://d08c24df3201fc219e55b66f888967bfdda7431ef1c861e92aa4532ad7af17c1
    Image:         msoap/shell2http:1.16.0
    Image ID:      docker.io/msoap/shell2http@sha256:a20bdde2f679de2cba6bf3d9f470489c7836d4d0d28232a2b295450809cd43ef
    Port:          <none>
    Host Port:     <none>
    Command:
      cp
      /app/shell2http
      /role-probe/agent
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 12 Sep 2023 14:55:14 +0800
      Finished:     Tue, 12 Sep 2023 14:55:14 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /role-probe from role-agent (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v7j4m (ro)
Containers:
  redis:
    Container ID:  containerd://a3f9618fe808a2f0ac962ee85201fa3e9b3792bad01a4728a71bbd2b5d2bda2d
    Image:         registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server:7.0.6-RC8
    Image ID:      registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server@sha256:511808b267ab8d800283604ef5c01f4fe94792bfb746bb6dba236cc29ff5495b
    Port:          6379/TCP
    Host Port:     0/TCP
    Command:
      /scripts/redis-start.sh
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 12 Sep 2023 15:09:28 +0800
      Finished:     Tue, 12 Sep 2023 15:09:28 +0800
    Ready:          False
    Restart Count:  7
    Limits:
      cpu:     0
      memory:  0
    Requests:
      cpu:      0
      memory:   0
    Readiness:  exec [sh -c /scripts/redis-ping.sh 1] delay=10s timeout=1s period=5s #success=1 #failure=5
    Environment Variables from:
      cluster-hihvoo-redis-env      ConfigMap  Optional: false
      cluster-hihvoo-redis-rsm-env  ConfigMap  Optional: false
    Environment:
      KB_POD_NAME:               cluster-hihvoo-redis-0 (v1:metadata.name)
      KB_POD_UID:                 (v1:metadata.uid)
      KB_NAMESPACE:              default (v1:metadata.namespace)
      KB_SA_NAME:                 (v1:spec.serviceAccountName)
      KB_NODENAME:                (v1:spec.nodeName)
      KB_HOST_IP:                 (v1:status.hostIP)
      KB_POD_IP:                  (v1:status.podIP)
      KB_POD_IPS:                 (v1:status.podIPs)
      KB_HOSTIP:                  (v1:status.hostIP)
      KB_PODIP:                   (v1:status.podIP)
      KB_PODIPS:                  (v1:status.podIPs)
      KB_CLUSTER_NAME:           cluster-hihvoo
      KB_COMP_NAME:              redis
      KB_CLUSTER_COMP_NAME:      cluster-hihvoo-redis
      KB_CLUSTER_UID_POSTFIX_8:  403cc985
      KB_POD_FQDN:               $(KB_POD_NAME).$(KB_CLUSTER_COMP_NAME)-headless.$(KB_NAMESPACE).svc
      REDIS_REPL_USER:           kbreplicator
      REDIS_REPL_PASSWORD:       <set to the key 'password' in secret 'cluster-hihvoo-conn-credential'>  Optional: false
      REDIS_DEFAULT_PASSWORD:    <set to the key 'password' in secret 'cluster-hihvoo-conn-credential'>  Optional: false
      REDIS_SENTINEL_USER:       $(REDIS_REPL_USER)-sentinel
      REDIS_SENTINEL_PASSWORD:   <set to the key 'password' in secret 'cluster-hihvoo-conn-credential'>  Optional: false
      REDIS_ARGS:                --requirepass $(REDIS_PASSWORD)
    Mounts:
      /data from data (rw)
      /etc/conf from redis-config (rw)
      /etc/redis from redis-conf (rw)
      /kb-podinfo from pod-info (rw)
      /scripts from scripts (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v7j4m (ro)
  metrics:
    Container ID:  containerd://884406684194cda30cd5e2fc07ad61cc1b32da54ae1f3b50b9cab9296d042249
    Image:         registry.cn-hangzhou.aliyuncs.com/apecloud/agamotto:0.1.2-beta.1
    Image ID:      registry.cn-hangzhou.aliyuncs.com/apecloud/agamotto@sha256:cbab349b90490807a8d5039bf01bc7e37334f20c98c7dd75bc7fc4cf9e5b10ee
    Port:          9121/TCP
    Host Port:     0/TCP
    Command:
      /bin/agamotto
      --config=/opt/conf/metrics-config.yaml
    State:          Running
      Started:      Tue, 12 Sep 2023 14:55:16 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     0
      memory:  0
    Requests:
      cpu:     0
      memory:  0
    Environment Variables from:
      cluster-hihvoo-redis-env      ConfigMap  Optional: false
      cluster-hihvoo-redis-rsm-env  ConfigMap  Optional: false
    Environment:
      KB_POD_NAME:               cluster-hihvoo-redis-0 (v1:metadata.name)
      KB_POD_UID:                 (v1:metadata.uid)
      KB_NAMESPACE:              default (v1:metadata.namespace)
      KB_SA_NAME:                 (v1:spec.serviceAccountName)
      KB_NODENAME:                (v1:spec.nodeName)
      KB_HOST_IP:                 (v1:status.hostIP)
      KB_POD_IP:                  (v1:status.podIP)
      KB_POD_IPS:                 (v1:status.podIPs)
      KB_HOSTIP:                  (v1:status.hostIP)
      KB_PODIP:                   (v1:status.podIP)
      KB_PODIPS:                  (v1:status.podIPs)
      KB_CLUSTER_NAME:           cluster-hihvoo
      KB_COMP_NAME:              redis
      KB_CLUSTER_COMP_NAME:      cluster-hihvoo-redis
      KB_CLUSTER_UID_POSTFIX_8:  403cc985
      KB_POD_FQDN:               $(KB_POD_NAME).$(KB_CLUSTER_COMP_NAME)-headless.$(KB_NAMESPACE).svc
      ENDPOINT:                  localhost:6379
      REDIS_USER:                <set to the key 'username' in secret 'cluster-hihvoo-conn-credential'>  Optional: false
      REDIS_PASSWORD:            <set to the key 'password' in secret 'cluster-hihvoo-conn-credential'>  Optional: false
    Mounts:
      /opt/conf from redis-metrics-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v7j4m (ro)
  kb-checkrole:
    Container ID:  containerd://384607d659bcc676dd30116b7b5a65601be965e75744077edcd26d248415f1ce
    Image:         registry.cn-hangzhou.aliyuncs.com/apecloud/kubeblocks-tools:0.7.0-alpha.8
    Image ID:      registry.cn-hangzhou.aliyuncs.com/apecloud/kubeblocks-tools@sha256:70fc1072a6bfd03a31e0fe83377487271e7443dc7f52d5bca1af0a203ba3b96e
    Ports:         3501/TCP, 50001/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      lorry
      --app-id
      batch-sdk
      --dapr-http-port
      3501
      --dapr-grpc-port
      50001
      --log-level
      info
      --config
      /config/lorry/config.yaml
      --components-path
      /config/lorry/components
    State:          Running
      Started:      Tue, 12 Sep 2023 14:55:16 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     0
      memory:  0
    Requests:
      cpu:      0
      memory:   0
    Readiness:  http-get http://:3501/v1.0/bindings/redis%3Foperation=checkRole&workloadType=Replication delay=0s timeout=1s period=2s #success=1 #failure=2
    Startup:    tcp-socket :3501 delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment Variables from:
      cluster-hihvoo-redis-env      ConfigMap  Optional: false
      cluster-hihvoo-redis-rsm-env  ConfigMap  Optional: false
    Environment:
      KB_POD_NAME:                cluster-hihvoo-redis-0 (v1:metadata.name)
      KB_POD_UID:                  (v1:metadata.uid)
      KB_NAMESPACE:               default (v1:metadata.namespace)
      KB_SA_NAME:                  (v1:spec.serviceAccountName)
      KB_NODENAME:                 (v1:spec.nodeName)
      KB_HOST_IP:                  (v1:status.hostIP)
      KB_POD_IP:                   (v1:status.podIP)
      KB_POD_IPS:                  (v1:status.podIPs)
      KB_HOSTIP:                   (v1:status.hostIP)
      KB_PODIP:                    (v1:status.podIP)
      KB_PODIPS:                   (v1:status.podIPs)
      KB_CLUSTER_NAME:            cluster-hihvoo
      KB_COMP_NAME:               redis
      KB_CLUSTER_COMP_NAME:       cluster-hihvoo-redis
      KB_CLUSTER_UID_POSTFIX_8:   403cc985
      KB_POD_FQDN:                $(KB_POD_NAME).$(KB_CLUSTER_COMP_NAME)-headless.$(KB_NAMESPACE).svc
      KB_SERVICE_USER:            <set to the key 'username' in secret 'cluster-hihvoo-conn-credential'>  Optional: false
      KB_SERVICE_PASSWORD:        <set to the key 'password' in secret 'cluster-hihvoo-conn-credential'>  Optional: false
      KB_SERVICE_PORT:            6379
      KB_SERVICE_ROLES:           {}
      KB_SERVICE_CHARACTER_TYPE:  redis
      KB_WORKLOAD_TYPE:           Replication
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v7j4m (ro)
  action-0:
    Container ID:  containerd://c704acecc87fd52805fa8e8e98051c63058a609e0a91adfc099d98a7dca4cdd1
    Image:         registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server:7.0.6-RC8
    Image ID:      registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server@sha256:511808b267ab8d800283604ef5c01f4fe94792bfb746bb6dba236cc29ff5495b
    Port:          <none>
    Host Port:     <none>
    Command:
      /role-probe/agent
      -port
      36501
      -export-all-vars
      -form
      /role
      Role=$(redis-cli --user $KB_RSM_USERNAME --pass $KB_RSM_PASSWORD --no-auth-warning info | grep role | awk -F ':' '{print $2}' | tr '[:upper:]' '[:lower:]' | tr -d '\r' | tr -d '\n') && if [ "master" = "$Role" ]; then echo -n "primary"; else echo -n "secondary"; fi
    State:          Running
      Started:      Tue, 12 Sep 2023 14:55:16 +0800
    Ready:          True
    Restart Count:  0
    Environment:
      KB_RSM_USERNAME:  <set to the key 'username' in secret 'cluster-hihvoo-conn-credential'>  Optional: false
      KB_RSM_PASSWORD:  <set to the key 'password' in secret 'cluster-hihvoo-conn-credential'>  Optional: false
    Mounts:
      /role-probe from role-agent (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v7j4m (ro)
  role-observe:
    Container ID:  containerd://ddaf42c9fc47398a81959a8390d99d67d53043d28962d777baed3396af94da14
    Image:         apecloud/kubeblocks-role-agent:latest
    Image ID:      docker.io/apecloud/kubeblocks-role-agent@sha256:094c90431b37fbdae13a85b491628fb05394f00de423a5686141ec63867181c2
    Port:          7373/TCP
    Host Port:     0/TCP
    Command:
      role-agent
      --port
      7373
    State:          Running
      Started:      Tue, 12 Sep 2023 14:55:16 +0800
    Ready:          True
    Restart Count:  0
    Readiness:      exec [/bin/grpc_health_probe -addr=localhost:7373] delay=0s timeout=1s period=2s #success=1 #failure=2
    Environment:
      KB_RSM_USERNAME:         <set to the key 'username' in secret 'cluster-hihvoo-conn-credential'>  Optional: false
      KB_RSM_PASSWORD:         <set to the key 'password' in secret 'cluster-hihvoo-conn-credential'>  Optional: false
      KB_RSM_ACTION_SVC_LIST:  [36501]
      KB_SERVICE_USER:         <set to the key 'username' in secret 'cluster-hihvoo-conn-credential'>  Optional: false
      KB_SERVICE_PASSWORD:     <set to the key 'password' in secret 'cluster-hihvoo-conn-credential'>  Optional: false
      KB_RSM_SERVICE_PORT:     6379
      KB_SERVICE_PORT:         6379
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v7j4m (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-cluster-hihvoo-redis-0
    ReadOnly:   false
  pod-info:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels['kubeblocks.io/role'] -> pod-role
      metadata.annotations['rs.apps.kubeblocks.io/primary'] -> primary-pod
      metadata.annotations['apps.kubeblocks.io/component-replicas'] -> component-replicas
  redis-metrics-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cluster-hihvoo-redis-redis-metrics-config
    Optional:  false
  redis-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cluster-hihvoo-redis-redis-replication-config
    Optional:  false
  scripts:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cluster-hihvoo-redis-redis-scripts
    Optional:  false
  redis-conf:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  role-agent:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-v7j4m:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 kb-data=true:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                   From                     Message
  ----     ------                  ----                  ----                     -------
  Normal   Scheduled               15m                   default-scheduler        Successfully assigned default/cluster-hihvoo-redis-0 to gke-yjtest-default-pool-47e27321-h6tl
  Warning  FailedAttachVolume      15m                   attachdetach-controller  Multi-Attach error for volume "pvc-cdb7cbf2-ed2e-4efe-ae54-0ec0cca290d0" Volume is already exclusively attached to one node and can't be attached to another
  Normal   SuccessfulAttachVolume  14m                   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-cdb7cbf2-ed2e-4efe-ae54-0ec0cca290d0"
  Normal   Pulled                  14m                   kubelet                  Container image "msoap/shell2http:1.16.0" already present on machine
  Normal   Created                 14m                   kubelet                  Created container role-agent-installer
  Normal   Started                 14m                   kubelet                  Started container role-agent-installer
  Normal   Pulled                  14m                   kubelet                  Container image "registry.cn-hangzhou.aliyuncs.com/apecloud/agamotto:0.1.2-beta.1" already present on machine
  Normal   Created                 14m                   kubelet                  Created container redis
  Normal   Started                 14m                   kubelet                  Started container redis
  Normal   Started                 14m                   kubelet                  Started container role-observe
  Normal   Created                 14m                   kubelet                  Created container metrics
  Normal   Started                 14m                   kubelet                  Started container metrics
  Normal   Pulled                  14m                   kubelet                  Container image "registry.cn-hangzhou.aliyuncs.com/apecloud/kubeblocks-tools:0.7.0-alpha.8" already present on machine
  Normal   Created                 14m                   kubelet                  Created container kb-checkrole
  Normal   Started                 14m                   kubelet                  Started container kb-checkrole
  Warning  Unhealthy               14m                   kubelet                  Readiness probe failed: error: health rpc failed: rpc error: code = Unknown desc = {"event":"Success","originalRole":"","role":"primary"}
  Normal   Created                 14m                   kubelet                  Created container action-0
  Normal   Started                 14m                   kubelet                  Started container action-0
  Normal   Pulled                  14m                   kubelet                  Container image "apecloud/kubeblocks-role-agent:latest" already present on machine
  Normal   Created                 14m                   kubelet                  Created container role-observe
  Normal   Pulled                  14m                   kubelet                  Container image "registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server:7.0.6-RC8" already present on machine
  Normal   checkRole               14m                   sqlchannel               {"event":"Failed","message":"role check delay","operation":"checkRole","originalRole":""}
  Normal   checkRole               13m                   sqlchannel               {"event":"Success","operation":"checkRole","originalRole":"","role":"primary"}
  Warning  Unhealthy               12m                   kubelet                  Readiness probe failed: error: health rpc failed: rpc error: code = Unknown desc = {"event":"Success","originalRole":"primary","role":"primary"}
  Normal   Pulled                  11m (x2 over 14m)     kubelet                  Container image "registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server:7.0.6-RC8" already present on machine
  Warning  Unhealthy               11m (x2 over 11m)     kubelet                  Readiness probe failed: Get "http://10.104.2.28:3501/v1.0/bindings/redis?operation=checkRole&workloadType=Replication": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy               11m                   kubelet                  Readiness probe failed: error: health rpc failed: rpc error: code = Unknown desc = {"event":"Success","originalRole":"primary","role":"secondary"}
  Normal   checkRole               11m                   sqlchannel               {"event":"Failed","message":"dial tcp 127.0.0.1:6379: connect: connection refused","operation":"checkRole","originalRole":"primary"}
  Warning  Unhealthy               11m                   kubelet                  Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "3ef7d75853f53f379f4ff009119491adef3b5bad25dd19a62dd0960450bcd158": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown
  Warning  BackOff                 9m34s (x15 over 11m)  kubelet                  Back-off restarting failed container redis in pod cluster-hihvoo-redis-0_default(6d317472-73b3-4f43-b9cb-c5d3b2289edd)
  Warning  Unhealthy               4m46s (x4 over 10m)   kubelet                  Readiness probe failed: error: health rpc failed: rpc error: code = Unknown desc = {"event":"Success","originalRole":"secondary","role":"secondary"}


➜  ~ k logs cluster-hihvoo-redis-0
Defaulted container "redis" out of: redis, metrics, kb-checkrole, action-0, role-observe, role-agent-installer (init)
+ echo include /etc/conf/redis.conf
+ echo replica-announce-ip cluster-hihvoo-redis-0.cluster-hihvoo-redis-headless.default.svc
+ [ -f /data/users.acl ]
+ sed -i /user default on/d /data/users.acl
+ sed -i /user kbreplicator on/d /data/users.acl
+ sed -i /user kbreplicator-sentinel on/d /data/users.acl
+ [ ! -z gx4kfkcq ]
+ echo masteruser kbreplicator
+ echo masterauth gx4kfkcq
+ echo user kbreplicator on +psync +replconf +ping >gx4kfkcq
+ [ ! -z gx4kfkcq ]
+ echo user kbreplicator-sentinel on allchannels +multi +slaveof +ping +exec +subscribe +config|rewrite +role +publish +info +client|setname +client|kill +script|kill >gx4kfkcq
+ [ ! -z gx4kfkcq ]
+ echo protected-mode yes
+ echo user default on allcommands allkeys >gx4kfkcq
+ echo aclfile /data/users.acl
+ start_redis_server
+ exec redis-server /etc/redis/redis.conf --loadmodule /opt/redis-stack/lib/redisearch.so --loadmodule /opt/redis-stack/lib/redisgraph.so --loadmodule /opt/redis-stack/lib/redistimeseries.so --loadmodule /opt/redis-stack/lib/rejson.so --loadmodule /opt/redis-stack/lib/redisbloom.so
+ create_replication
+ [ ! -z gx4kfkcq ]
+ retry redis-cli -h 127.0.0.1 -p 6379 -a gx4kfkcq ping
+ local max_attempts=20
+ local attempt=1
+ redis-cli -h 127.0.0.1 -p 6379 -a gx4kfkcq ping
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
Could not connect to Redis at 127.0.0.1:6379: Connection refused
+ [ 1 -eq 20 ]
+ echo Command 'redis-cli -h 127.0.0.1 -p 6379 -a gx4kfkcq ping' failed. Attempt 1 of 20. Retrying in 5 seconds...
Command 'redis-cli -h 127.0.0.1 -p 6379 -a gx4kfkcq ping' failed. Attempt 1 of 20. Retrying in 5 seconds...
+ attempt=2
+ sleep 3
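
Note on the trace above: the retry message announces "Retrying in 5 seconds..." while the script then runs `sleep 3`. Below is a minimal sketch of a retry helper in which the printed delay and the actual delay come from the same variable. It is not the actual `/scripts/redis-start.sh` source; the variable names `RETRY_DELAY` and `RETRY_MAX_ATTEMPTS` are assumptions made for illustration, and the defaults (20 attempts, 3 seconds) are taken from the trace.

```shell
# Hypothetical retry helper (not the real redis-start.sh implementation).
# RETRY_MAX_ATTEMPTS and RETRY_DELAY are assumed names; defaults mirror the trace.
retry() {
  local max_attempts="${RETRY_MAX_ATTEMPTS:-20}"
  local delay="${RETRY_DELAY:-3}"
  local attempt=1
  # Run the given command until it succeeds or the attempts are used up.
  until "$@"; do
    if [ "$attempt" -eq "$max_attempts" ]; then
      echo "Command '$*' failed after $max_attempts attempts." >&2
      return 1
    fi
    # The message uses the same variable that is passed to sleep below.
    echo "Command '$*' failed. Attempt $attempt of $max_attempts. Retrying in $delay seconds..."
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

With this shape, a call such as `retry redis-cli -h 127.0.0.1 -p 6379 ping` would report the delay it really waits.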


@ahjing99 ahjing99 added the kind/bug Something isn't working label Sep 12, 2023
@ahjing99 ahjing99 added this to the Release 0.7.0 milestone Sep 12, 2023
@ahjing99
Collaborator Author

The pod also cannot recover after an I/O fault is injected:

`kbcli fault io errno cluster-oqroov-redis-0 --ns-fault=default --volume-path=/data  --errno=28 --duration=2m`

IOChaos io-chaos-jlk9n created

➜  ~ k describe pod cluster-oqroov-redis-0
Name:         cluster-oqroov-redis-0
Namespace:    default
Priority:     0
Node:         gke-yjtest-default-pool-47e27321-h6tl/10.128.15.203
Start Time:   Wed, 13 Sep 2023 09:49:31 +0800
Labels:       app.kubernetes.io/component=redis
              app.kubernetes.io/instance=cluster-oqroov
              app.kubernetes.io/managed-by=kubeblocks
              app.kubernetes.io/name=redis
              app.kubernetes.io/version=redis-7.0.6
              apps.kubeblocks.io/component-name=redis
              apps.kubeblocks.io/workload-type=Replication
              controller-revision-hash=cluster-oqroov-redis-5b74946954
              kubeblocks.io/role=secondary
              rsm.workloads.kubeblocks.io/access-mode=Readonly
              statefulset.kubernetes.io/pod-name=cluster-oqroov-redis-0
Annotations:  apps.kubeblocks.io/component-replicas: 2
              apps.kubeblocks.io/last-role-changed-event-timestamp: 2023-09-13T01:52:16Z
              rs.apps.kubeblocks.io/primary: cluster-oqroov-redis-1
Status:       Running
IP:           10.104.2.112
IPs:
  IP:           10.104.2.112
Controlled By:  StatefulSet/cluster-oqroov-redis
Init Containers:
  role-agent-installer:
    Container ID:  containerd://84064adfa884ccf1e192ed9455fc691776d8982239605f52481b9f3f4369a73a
    Image:         msoap/shell2http:1.16.0
    Image ID:      docker.io/msoap/shell2http@sha256:a20bdde2f679de2cba6bf3d9f470489c7836d4d0d28232a2b295450809cd43ef
    Port:          <none>
    Host Port:     <none>
    Command:
      cp
      /app/shell2http
      /role-probe/agent
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 13 Sep 2023 09:49:40 +0800
      Finished:     Wed, 13 Sep 2023 09:49:40 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /role-probe from role-agent (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8lmrv (ro)
Containers:
  redis:
    Container ID:  containerd://5952fc658ed832810e3d1aa0c1c20dd803c58b116e37d6ccbd259bf1e8a718e2
    Image:         registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server:7.0.6-RC8
    Image ID:      registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server@sha256:511808b267ab8d800283604ef5c01f4fe94792bfb746bb6dba236cc29ff5495b
    Port:          6379/TCP
    Host Port:     0/TCP
    Command:
      /scripts/redis-start.sh
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 13 Sep 2023 10:12:44 +0800
      Finished:     Wed, 13 Sep 2023 10:12:44 +0800
    Ready:          False
    Restart Count:  8
    Limits:
      cpu:     500m
      memory:  1Gi
    Requests:
      cpu:      500m
      memory:   1Gi
    Readiness:  exec [sh -c /scripts/redis-ping.sh 1] delay=10s timeout=1s period=5s #success=1 #failure=5
    Environment Variables from:
      cluster-oqroov-redis-env      ConfigMap  Optional: false
      cluster-oqroov-redis-rsm-env  ConfigMap  Optional: false
    Environment:
      KB_POD_NAME:               cluster-oqroov-redis-0 (v1:metadata.name)
      KB_POD_UID:                 (v1:metadata.uid)
      KB_NAMESPACE:              default (v1:metadata.namespace)
      KB_SA_NAME:                 (v1:spec.serviceAccountName)
      KB_NODENAME:                (v1:spec.nodeName)
      KB_HOST_IP:                 (v1:status.hostIP)
      KB_POD_IP:                  (v1:status.podIP)
      KB_POD_IPS:                 (v1:status.podIPs)
      KB_HOSTIP:                  (v1:status.hostIP)
      KB_PODIP:                   (v1:status.podIP)
      KB_PODIPS:                  (v1:status.podIPs)
      KB_CLUSTER_NAME:           cluster-oqroov
      KB_COMP_NAME:              redis
      KB_CLUSTER_COMP_NAME:      cluster-oqroov-redis
      KB_CLUSTER_UID_POSTFIX_8:  01b62888
      KB_POD_FQDN:               $(KB_POD_NAME).$(KB_CLUSTER_COMP_NAME)-headless.$(KB_NAMESPACE).svc
      REDIS_REPL_USER:           kbreplicator
      REDIS_REPL_PASSWORD:       <set to the key 'password' in secret 'cluster-oqroov-conn-credential'>  Optional: false
      REDIS_DEFAULT_PASSWORD:    <set to the key 'password' in secret 'cluster-oqroov-conn-credential'>  Optional: false
      REDIS_SENTINEL_USER:       $(REDIS_REPL_USER)-sentinel
      REDIS_SENTINEL_PASSWORD:   <set to the key 'password' in secret 'cluster-oqroov-conn-credential'>  Optional: false
      REDIS_ARGS:                --requirepass $(REDIS_PASSWORD)
    Mounts:
      /data from data (rw)
      /etc/conf from redis-config (rw)
      /etc/redis from redis-conf (rw)
      /kb-podinfo from pod-info (rw)
      /scripts from scripts (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8lmrv (ro)
  metrics:
    Container ID:  containerd://14f7936053e3031c1968f5e6d000145f5019fa2f8f19c5689f7c7b2682a01f95
    Image:         registry.cn-hangzhou.aliyuncs.com/apecloud/agamotto:0.1.2-beta.1
    Image ID:      registry.cn-hangzhou.aliyuncs.com/apecloud/agamotto@sha256:cbab349b90490807a8d5039bf01bc7e37334f20c98c7dd75bc7fc4cf9e5b10ee
    Port:          9121/TCP
    Host Port:     0/TCP
    Command:
      /bin/agamotto
      --config=/opt/conf/metrics-config.yaml
    State:          Running
      Started:      Wed, 13 Sep 2023 09:49:41 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     0
      memory:  0
    Requests:
      cpu:     0
      memory:  0
    Environment Variables from:
      cluster-oqroov-redis-env      ConfigMap  Optional: false
      cluster-oqroov-redis-rsm-env  ConfigMap  Optional: false
    Environment:
      KB_POD_NAME:               cluster-oqroov-redis-0 (v1:metadata.name)
      KB_POD_UID:                 (v1:metadata.uid)
      KB_NAMESPACE:              default (v1:metadata.namespace)
      KB_SA_NAME:                 (v1:spec.serviceAccountName)
      KB_NODENAME:                (v1:spec.nodeName)
      KB_HOST_IP:                 (v1:status.hostIP)
      KB_POD_IP:                  (v1:status.podIP)
      KB_POD_IPS:                 (v1:status.podIPs)
      KB_HOSTIP:                  (v1:status.hostIP)
      KB_PODIP:                   (v1:status.podIP)
      KB_PODIPS:                  (v1:status.podIPs)
      KB_CLUSTER_NAME:           cluster-oqroov
      KB_COMP_NAME:              redis
      KB_CLUSTER_COMP_NAME:      cluster-oqroov-redis
      KB_CLUSTER_UID_POSTFIX_8:  01b62888
      KB_POD_FQDN:               $(KB_POD_NAME).$(KB_CLUSTER_COMP_NAME)-headless.$(KB_NAMESPACE).svc
      ENDPOINT:                  localhost:6379
      REDIS_USER:                <set to the key 'username' in secret 'cluster-oqroov-conn-credential'>  Optional: false
      REDIS_PASSWORD:            <set to the key 'password' in secret 'cluster-oqroov-conn-credential'>  Optional: false
    Mounts:
      /opt/conf from redis-metrics-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8lmrv (ro)
  kb-checkrole:
    Container ID:  containerd://80244c44cdfb15e6d5976aeceefc4b63c63c352cebc4a05335cbda301b2c99b0
    Image:         registry.cn-hangzhou.aliyuncs.com/apecloud/kubeblocks-tools:0.7.0-alpha.8
    Image ID:      registry.cn-hangzhou.aliyuncs.com/apecloud/kubeblocks-tools@sha256:70fc1072a6bfd03a31e0fe83377487271e7443dc7f52d5bca1af0a203ba3b96e
    Ports:         3501/TCP, 50001/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      lorry
      --app-id
      batch-sdk
      --dapr-http-port
      3501
      --dapr-grpc-port
      50001
      --log-level
      info
      --config
      /config/lorry/config.yaml
      --components-path
      /config/lorry/components
    State:          Running
      Started:      Wed, 13 Sep 2023 09:49:42 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     0
      memory:  0
    Requests:
      cpu:      0
      memory:   0
    Readiness:  http-get http://:3501/v1.0/bindings/redis%3Foperation=checkRole&workloadType=Replication delay=0s timeout=1s period=2s #success=1 #failure=2
    Startup:    tcp-socket :3501 delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment Variables from:
      cluster-oqroov-redis-env      ConfigMap  Optional: false
      cluster-oqroov-redis-rsm-env  ConfigMap  Optional: false
    Environment:
      KB_POD_NAME:                cluster-oqroov-redis-0 (v1:metadata.name)
      KB_POD_UID:                  (v1:metadata.uid)
      KB_NAMESPACE:               default (v1:metadata.namespace)
      KB_SA_NAME:                  (v1:spec.serviceAccountName)
      KB_NODENAME:                 (v1:spec.nodeName)
      KB_HOST_IP:                  (v1:status.hostIP)
      KB_POD_IP:                   (v1:status.podIP)
      KB_POD_IPS:                  (v1:status.podIPs)
      KB_HOSTIP:                   (v1:status.hostIP)
      KB_PODIP:                    (v1:status.podIP)
      KB_PODIPS:                   (v1:status.podIPs)
      KB_CLUSTER_NAME:            cluster-oqroov
      KB_COMP_NAME:               redis
      KB_CLUSTER_COMP_NAME:       cluster-oqroov-redis
      KB_CLUSTER_UID_POSTFIX_8:   01b62888
      KB_POD_FQDN:                $(KB_POD_NAME).$(KB_CLUSTER_COMP_NAME)-headless.$(KB_NAMESPACE).svc
      KB_SERVICE_USER:            <set to the key 'username' in secret 'cluster-oqroov-conn-credential'>  Optional: false
      KB_SERVICE_PASSWORD:        <set to the key 'password' in secret 'cluster-oqroov-conn-credential'>  Optional: false
      KB_SERVICE_PORT:            6379
      KB_SERVICE_ROLES:           {}
      KB_SERVICE_CHARACTER_TYPE:  redis
      KB_WORKLOAD_TYPE:           Replication
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8lmrv (ro)
  action-0:
    Container ID:  containerd://307c8718dd83f39548ae51dca6b4e9ad09b02369b2861eeb0b54ba113a0311d8
    Image:         registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server:7.0.6-RC8
    Image ID:      registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server@sha256:511808b267ab8d800283604ef5c01f4fe94792bfb746bb6dba236cc29ff5495b
    Port:          <none>
    Host Port:     <none>
    Command:
      /role-probe/agent
      -port
      36501
      -export-all-vars
      -form
      /role
      Role=$(redis-cli --user $KB_RSM_USERNAME --pass $KB_RSM_PASSWORD --no-auth-warning info | grep role | awk -F ':' '{print $2}' | tr '[:upper:]' '[:lower:]' | tr -d '\r' | tr -d '\n') && if [ "master" = "$Role" ]; then echo -n "primary"; else echo -n "secondary"; fi
    State:          Running
      Started:      Wed, 13 Sep 2023 09:49:42 +0800
    Ready:          True
    Restart Count:  0
    Environment:
      KB_RSM_USERNAME:  <set to the key 'username' in secret 'cluster-oqroov-conn-credential'>  Optional: false
      KB_RSM_PASSWORD:  <set to the key 'password' in secret 'cluster-oqroov-conn-credential'>  Optional: false
    Mounts:
      /role-probe from role-agent (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8lmrv (ro)
  role-observe:
    Container ID:  containerd://05fd2f2619f96d6de1dd775ecb9841a11f17362666de541032e23a11ed169947
    Image:         apecloud/kubeblocks-role-agent:latest
    Image ID:      docker.io/apecloud/kubeblocks-role-agent@sha256:094c90431b37fbdae13a85b491628fb05394f00de423a5686141ec63867181c2
    Port:          7373/TCP
    Host Port:     0/TCP
    Command:
      role-agent
      --port
      7373
    State:          Running
      Started:      Wed, 13 Sep 2023 09:49:42 +0800
    Ready:          True
    Restart Count:  0
    Readiness:      exec [/bin/grpc_health_probe -addr=localhost:7373] delay=0s timeout=1s period=2s #success=1 #failure=2
    Environment:
      KB_RSM_USERNAME:         <set to the key 'username' in secret 'cluster-oqroov-conn-credential'>  Optional: false
      KB_RSM_PASSWORD:         <set to the key 'password' in secret 'cluster-oqroov-conn-credential'>  Optional: false
      KB_RSM_ACTION_SVC_LIST:  [36501]
      KB_SERVICE_USER:         <set to the key 'username' in secret 'cluster-oqroov-conn-credential'>  Optional: false
      KB_SERVICE_PASSWORD:     <set to the key 'password' in secret 'cluster-oqroov-conn-credential'>  Optional: false
      KB_RSM_SERVICE_PORT:     6379
      KB_SERVICE_PORT:         6379
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8lmrv (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-cluster-oqroov-redis-0
    ReadOnly:   false
  pod-info:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels['kubeblocks.io/role'] -> pod-role
      metadata.annotations['rs.apps.kubeblocks.io/primary'] -> primary-pod
      metadata.annotations['apps.kubeblocks.io/component-replicas'] -> component-replicas
  redis-metrics-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cluster-oqroov-redis-redis-metrics-config
    Optional:  false
  redis-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cluster-oqroov-redis-redis-replication-config
    Optional:  false
  scripts:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cluster-oqroov-redis-redis-scripts
    Optional:  false
  redis-conf:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  role-agent:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-8lmrv:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 kb-data=true:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                From                     Message
  ----     ------                  ----               ----                     -------
  Normal   Scheduled               25m                default-scheduler        Successfully assigned default/cluster-oqroov-redis-0 to gke-yjtest-default-pool-47e27321-h6tl
  Normal   SuccessfulAttachVolume  25m                attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-2782c97b-3626-4080-a161-f01910b4ce7c"
  Normal   Pulled                  25m                kubelet                  Container image "msoap/shell2http:1.16.0" already present on machine
  Normal   Created                 25m                kubelet                  Created container role-agent-installer
  Normal   Started                 25m                kubelet                  Started container role-agent-installer
  Normal   Created                 25m                kubelet                  Created container redis
  Normal   Pulled                  25m                kubelet                  Container image "registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server:7.0.6-RC8" already present on machine
  Normal   Started                 25m                kubelet                  Started container redis
  Normal   Pulled                  25m                kubelet                  Container image "registry.cn-hangzhou.aliyuncs.com/apecloud/agamotto:0.1.2-beta.1" already present on machine
  Normal   Created                 25m                kubelet                  Created container metrics
  Normal   Started                 25m                kubelet                  Started container metrics
  Normal   Pulled                  25m                kubelet                  Container image "registry.cn-hangzhou.aliyuncs.com/apecloud/kubeblocks-tools:0.7.0-alpha.8" already present on machine
  Normal   Created                 25m                kubelet                  Created container kb-checkrole
  Warning  Unhealthy               25m                kubelet                  Readiness probe failed: error: health rpc failed: rpc error: code = Unknown desc = {"event":"Success","originalRole":"","role":"primary"}
  Normal   Pulled                  25m                kubelet                  Container image "registry.cn-hangzhou.aliyuncs.com/apecloud/redis-stack-server:7.0.6-RC8" already present on machine
  Normal   Created                 25m                kubelet                  Created container action-0
  Normal   Started                 25m                kubelet                  Started container action-0
  Normal   Pulled                  25m                kubelet                  Container image "apecloud/kubeblocks-role-agent:latest" already present on machine
  Normal   Started                 25m                kubelet                  Started container kb-checkrole
  Normal   Started                 25m                kubelet                  Started container role-observe
  Normal   Created                 25m                kubelet                  Created container role-observe
  Normal   checkRole               24m                sqlchannel               {"event":"Failed","message":"role check delay","operation":"checkRole","originalRole":""}
  Normal   checkRole               24m                sqlchannel               {"event":"Success","operation":"checkRole","originalRole":"","role":"primary"}
  Warning  Unhealthy               23m                kubelet                  Readiness probe failed: error: health rpc failed: rpc error: code = Unknown desc = {"event":"Success","originalRole":"primary","role":"primary"}
  Warning  Unhealthy               22m (x5 over 23m)  kubelet                  Readiness probe failed: MISCONF Errors writing to the AOF file: No space left on device
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
  Normal   checkRole  22m                 sqlchannel  {"event":"Success","operation":"checkRole","originalRole":"primary","role":"secondary"}
  Normal   checkRole  22m                 sqlchannel  {"event":"Failed","message":"context deadline exceeded","operation":"checkRole","originalRole":"secondary"}
  Normal   checkRole  22m                 sqlchannel  {"event":"Failed","message":"context deadline exceeded","operation":"checkRole","originalRole":"secondary"}
  Normal   checkRole  22m                 sqlchannel  {"event":"Failed","message":"context deadline exceeded","operation":"checkRole","originalRole":"secondary"}
  Normal   checkRole  21m                 sqlchannel  {"event":"Failed","message":"context deadline exceeded","operation":"checkRole","originalRole":"secondary"}
  Normal   checkRole  21m                 sqlchannel  {"event":"Failed","message":"context deadline exceeded","operation":"checkRole","originalRole":"secondary"}
  Normal   checkRole  21m                 sqlchannel  {"event":"Failed","message":"context deadline exceeded","operation":"checkRole","originalRole":"secondary"}
  Normal   checkRole  21m                 sqlchannel  {"event":"Failed","message":"context deadline exceeded","operation":"checkRole","originalRole":"secondary"}
  Warning  Unhealthy  18m (x2 over 20m)   kubelet     Readiness probe failed: error: health rpc failed: rpc error: code = Unknown desc = {"event":"Success","originalRole":"secondary","role":"secondary"}
  Normal   checkRole  18m                 sqlchannel  {"event":"Failed","message":"context deadline exceeded","operation":"checkRole","originalRole":"secondary"}
  Warning  BackOff    10s (x91 over 18m)  kubelet     Back-off restarting failed container redis in pod cluster-oqroov-redis-0_default(d2c9746c-1aac-4301-99f8-2dc5682260c1)

➜  ~ k logs kubeblocks-645cc6c9bd-ppvbm >kblog.txt
Defaulted container "manager" out of: manager, tools (init), datascript (init)
kblog.txt

@ahjing99 ahjing99 changed the title [BUG]Redis keeps crashing and cannot recover from memory fault [BUG]Redis keeps crashing and cannot recover from memory/io fault Sep 13, 2023
@github-actions

This issue has been marked as stale because it has been open for 30 days with no activity

@nayutah
Collaborator

nayutah commented Apr 24, 2024

This issue seems to be caused by a lack of disk space: "Warning Unhealthy 22m (x5 over 23m) kubelet Readiness probe failed: MISCONF Errors writing to the AOF file: No space left on device". The default disk size of 5G is too small.
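
To check whether another pod shows the same symptom, one option is to scan its `k describe pod` output for the "No space left on device" message quoted above. The helper below is only a sketch under that assumption: the function name `count_enospc_events` is hypothetical, and it matches on the message text alone, so it cannot tell a genuinely full volume from an injected error.

```shell
# Hypothetical helper: reads `kubectl describe pod` output on stdin and prints
# the number of lines that contain the "No space left on device" message.
count_enospc_events() {
  # grep -c exits non-zero when there are no matches, so "|| true" keeps the
  # function usable under `set -e` while still printing 0.
  grep -c 'No space left on device' || true
}
```

Possible usage, with the pod name from this issue: `kubectl describe pod cluster-oqroov-redis-0 | count_enospc_events`. A non-zero count would suggest looking at the PVC next, for example with `kubectl get pvc data-cluster-oqroov-redis-0`.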

@ahjing99
Collaborator Author

Closing
