
Deployment Exhausting Ephemeral Storage Causing Pod Sprawl and Inflated CPU/MEM Reports on Provider Status Endpoint #275

Open
andy108369 opened this issue Jan 16, 2025 · 0 comments

andy108369 (Contributor) commented Jan 16, 2025

provider 0.6.5-rc6
node 0.36.0

A deployment is repeatedly restarting because it keeps running out of ephemeral storage, leaving thousands of pods accumulated in the Error or ContainerStatusUnknown state. This pod sprawl causes the provider's status endpoint to report abnormally high CPU and memory values.

A quick workaround is to routinely delete the Failed pods (e.g. via a crontab job):

kubectl delete pods -A -l akash.network=true --field-selector status.phase=Failed
  • a crontab job that cleans up the Failed Akash pods every 10 minutes:
root@control-01:~# cat /etc/cron.d/akash-delete-failed-pods
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
SHELL=/bin/bash
KUBECONFIG=/root/.kube/config

*/10 * * * * root kubectl delete pods -A -l akash.network=true --field-selector status.phase=Failed
root@control-01:~# 
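The same cleanup could also run from a small script instead of a raw crontab one-liner; a minimal sketch (assuming `kubectl` is on `PATH` and `KUBECONFIG` points at the cluster — neither is guaranteed outside this setup):

```python
import shlex

def build_cleanup_cmd(label="akash.network=true", phase="Failed"):
    # Builds the same kubectl invocation as the crontab entry above.
    return [
        "kubectl", "delete", "pods", "-A",
        "-l", label,
        "--field-selector", f"status.phase={phase}",
    ]

print(shlex.join(build_cleanup_cmd()))
# To actually execute it:
#   import subprocess
#   subprocess.run(build_cleanup_cmd(), check=False)
```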

Once the Failed pods are deleted, the provider's status endpoint returns to normal reporting.

Technical data

  • excessive CPU and MEM values
$ provider_info2.sh provider.hurricane.akash.pub
PROVIDER INFO
BALANCE: 690.683737
"hostname"                      "address"
"provider.hurricane.akash.pub"  "akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk"

Total/Available/Used (t/a/u) per node:
"name"                   "cpu(t/a/u)"                                "gpu(t/a/u)"  "mem(t/a/u GiB)"                        "ephemeral(t/a/u GiB)"
"control-01.hurricane2"  "2/1.1/0.9"                                 "0/0/0"       "7.65/7.4/0.25"                         "25.54/25.54/0"
"worker-01.hurricane2"   "102/18446744073709464/-18446744073709360"  "1/1/0"       "180.7/17179868714.76/-17179868534.06"  "1808.76/579.58/1229.18"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
76.7          0      116.25      207.18            0             0             416

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          163.25

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
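The worker-01 row above looks like unsigned 64-bit wraparound: if "available" is computed as total − used with uint64 arithmetic, a used value that exceeds the total (e.g. because evicted pods are still counted) wraps to just below 2^64. A sketch of the arithmetic — the 189,700 millicore "used" figure is hypothetical, chosen only to reproduce the reported number:

```python
MASK = (1 << 64) - 1  # unsigned 64-bit wraparound

def uint64_sub(a, b):
    # a - b under uint64 semantics: a negative result wraps below 2^64
    return (a - b) & MASK

total_millicores = 102_000  # 102 cores, as reported for worker-01
used_millicores = 189_700   # hypothetical value exceeding the total

avail = uint64_sub(total_millicores, used_millicores)
print(round(avail / 1000))  # 18446744073709464 cores, matching the output above
```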
  • failed pods
root@control-01:~# ns=orlnimr0es3uj3kl7h7ha8ibup2lsdkjqmi1sjbvq1n7m
root@control-01:~# kubectl -n $ns get pods
NAME                         READY   STATUS                   RESTARTS   AGE
service-1-5f4dfbc6bf-22qk2   0/1     Error                    0          31h
service-1-5f4dfbc6bf-245w7   0/1     Error                    0          7h20m
service-1-5f4dfbc6bf-24ck4   0/1     Error                    0          27h
service-1-5f4dfbc6bf-24jql   0/1     Error                    0          5h42m
service-1-5f4dfbc6bf-24r8n   0/1     Error                    0          7h6m
service-1-5f4dfbc6bf-27969   0/1     Error                    0          21h
service-1-5f4dfbc6bf-27w9j   0/1     Error                    0          9h
service-1-5f4dfbc6bf-27wvw   0/1     Error                    0          20h
service-1-5f4dfbc6bf-284dz   0/1     Error                    0          6h33m
service-1-5f4dfbc6bf-28lxv   0/1     Error                    0          28h
service-1-5f4dfbc6bf-28xh8   0/1     ContainerStatusUnknown   1          4h4m
service-1-5f4dfbc6bf-29qg8   0/1     Error                    0          21m
service-1-5f4dfbc6bf-29zcr   0/1     ContainerStatusUnknown   1          128m
service-1-5f4dfbc6bf-2cgw6   0/1     Error                    0          30h
service-1-5f4dfbc6bf-2dj4p   0/1     Error                    0          23h
service-1-5f4dfbc6bf-2dvvc   0/1     Error                    0          18h
service-1-5f4dfbc6bf-2hlf7   0/1     ContainerStatusUnknown   1          19h
service-1-5f4dfbc6bf-2htj7   0/1     Error                    0          7h56m
service-1-5f4dfbc6bf-2jbr9   0/1     Error                    0          6h50m
service-1-5f4dfbc6bf-2jgkc   0/1     ContainerStatusUnknown   1          28h
service-1-5f4dfbc6bf-2jrhw   0/1     Error                    0          23h
service-1-5f4dfbc6bf-2knn8   0/1     Error                    0          16h
service-1-5f4dfbc6bf-2mbt8   0/1     ContainerStatusUnknown   1          15h
...
...
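Both the Error and ContainerStatusUnknown pods above carry `status.phase=Failed`, which is presumably why the workaround's field selector catches them all. A sketch for tallying pods by phase from `kubectl get pods -o json` output (the sample below is fabricated for illustration, not real cluster output):

```python
import json
from collections import Counter

def count_phases(pod_list_json):
    # Tally pods by status.phase from a PodList JSON document.
    pods = json.loads(pod_list_json)["items"]
    return Counter(p["status"]["phase"] for p in pods)

# Fabricated minimal sample:
sample = json.dumps({"items": [
    {"status": {"phase": "Failed"}},
    {"status": {"phase": "Failed"}},
    {"status": {"phase": "Running"}},
]})
print(count_phases(sample))  # Counter({'Failed': 2, 'Running': 1})
```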
  • kubectl describe pod
Events:
  Type     Reason               Age   From               Message
  ----     ------               ----  ----               -------
  Normal   Scheduled            2m5s  default-scheduler  Successfully assigned orlnimr0es3uj3kl7h7ha8ibup2lsdkjqmi1sjbvq1n7m/service-1-5f4dfbc6bf-x9z78 to worker-01.hurricane2
  Normal   Pulled               2m4s  kubelet            Container image "andrey01/falcon7b:0.4" already present on machine
  Normal   Created              2m4s  kubelet            Created container service-1
  Normal   Started              2m4s  kubelet            Started container service-1
  Warning  Evicted              52s   kubelet            Pod ephemeral local storage usage exceeds the total limit of containers 1073741824.
  Normal   Killing              52s   kubelet            Stopping container service-1
  Warning  ExceededGracePeriod  42s   kubelet            Container runtime did not kill the pod within specified grace period.
  • events per namespace
root@control-01:~# kubectl get events -n $ns --sort-by='{.metadata.creationTimestamp}'
LAST SEEN   TYPE      REASON                OBJECT                            MESSAGE
33s         Normal    SuccessfulCreate      replicaset/service-1-5f4dfbc6bf   (combined from similar events): Created pod: service-1-5f4dfbc6bf-kwjqv
60m         Warning   Evicted               pod/service-1-5f4dfbc6bf-crrk4    Pod ephemeral local storage usage exceeds the total limit of containers 1073741824.
60m         Normal    Killing               pod/service-1-5f4dfbc6bf-crrk4    Stopping container service-1
60m         Warning   ExceededGracePeriod   pod/service-1-5f4dfbc6bf-crrk4    Container runtime did not kill the pod within specified grace period.
60m         Normal    Scheduled             pod/service-1-5f4dfbc6bf-xccpg    Successfully assigned orlnimr0es3uj3kl7h7ha8ibup2lsdkjqmi1sjbvq1n7m/service-1-5f4dfbc6bf-xccpg to worker-01.hurricane2
60m         Normal    Pulled                pod/service-1-5f4dfbc6bf-xccpg    Container image "andrey01/falcon7b:0.4" already present on machine
60m         Normal    Started               pod/service-1-5f4dfbc6bf-xccpg    Started container service-1
60m         Normal    Created               pod/service-1-5f4dfbc6bf-xccpg    Created container service-1
58m         Normal    Killing               pod/service-1-5f4dfbc6bf-xccpg    Stopping container service-1
58m         Warning   Evicted               pod/service-1-5f4dfbc6bf-xccpg    Pod ephemeral local storage usage exceeds the total limit of containers 1073741824.
58m         Warning   ExceededGracePeriod   pod/service-1-5f4dfbc6bf-xccpg    Container runtime did not kill the pod within specified grace period.
58m         Normal    Scheduled             pod/service-1-5f4dfbc6bf-pxdq8    Successfully assigned orlnimr0es3uj3kl7h7ha8ibup2lsdkjqmi1sjbvq1n7m/service-1-5f4dfbc6bf-pxdq8 to worker-01.hurricane2
58m         Normal    Started               pod/service-1-5f4dfbc6bf-pxdq8    Started container service-1
58m         Normal    Pulled                pod/service-1-5f4dfbc6bf-pxdq8    Container image "andrey01/falcon7b:0.4" already present on machine
58m         Normal    Created               pod/service-1-5f4dfbc6bf-pxdq8    Created container service-1
57m         Warning   Evicted               pod/service-1-5f4dfbc6bf-pxdq8    Pod ephemeral local storage usage exceeds the total limit of containers 1073741824.
57m         Normal    Killing               pod/service-1-5f4dfbc6bf-pxdq8    Stopping container service-1
57m         Warning   ExceededGracePeriod   pod/service-1-5f4dfbc6bf-pxdq8    Container runtime did not kill the pod within specified grace period.
56m         Normal    Created               pod/service-1-5f4dfbc6bf-fmwdg    Created container service-1
56m         Normal    Scheduled             pod/service-1-5f4dfbc6bf-fmwdg    Successfully assigned orlnimr0es3uj3kl7h7ha8ibup2lsdkjqmi1sjbvq1n7m/service-1-5f4dfbc6bf-fmwdg to worker-01.hurricane2
56m         Normal    Pulled                pod/service-1-5f4dfbc6bf-fmwdg    Container image "andrey01/falcon7b:0.4" already present on machine
56m         Normal    Started               pod/service-1-5f4dfbc6bf-fmwdg    Started container service-1
55m         Normal    Killing               pod/service-1-5f4dfbc6bf-fmwdg    Stopping container service-1
55m         Warning   Evicted               pod/service-1-5f4dfbc6bf-fmwdg    Pod ephemeral local storage usage exceeds the total limit of containers 1073741824.
55m         Warning   ExceededGracePeriod   pod/service-1-5f4dfbc6bf-fmwdg    Container runtime did not kill the pod within specified grace period.
54m         Normal    Scheduled             pod/service-1-5f4dfbc6bf-4pnk6    Successfully assigned orlnimr0es3uj3kl7h7ha8ibup2lsdkjqmi1sjbvq1n7m/service-1-5f4dfbc6bf-4pnk6 to worker-01.hurricane2
...
...
  • manifest
root@control-01:~# kubectl -n lease get manifest $ns -o yaml
...
spec:
  group:
    name: dcloud
    services:
    - count: 1
      expose:
      - endpoint_sequence_number: 0
        external_port: 80
        global: true
        http_options:
          max_body_size: 1048576
          next_cases:
          - error
          - timeout
          next_tries: 3
          read_timeout: 60000
          send_timeout: 60000
        port: 80
        proto: TCP
      image: andrey01/falcon7b:0.4
      name: service-1
      resources:
        cpu:
          units: 100
        gpu:
          units: 0
        id: 1
        memory:
          size: "536870912"
        storage:
        - name: default
          size: "1073741824"
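The sizes in the manifest are raw byte strings; converting them shows why the eviction is inevitable. The 1073741824 figure is the same number that appears in the eviction events, and the ~14 GB model size is a rough estimate of the falcon-7b-instruct weights, not a value from this issue:

```python
def to_gib(n_bytes):
    # Convert raw byte counts to GiB.
    return n_bytes / 2**30

memory = 536870912           # manifest memory.size   -> 0.5 GiB
ephemeral = 1073741824       # manifest storage size  -> 1.0 GiB
model_estimate = 14 * 10**9  # rough falcon-7b weights size (assumption)

print(to_gib(memory), to_gib(ephemeral))           # 0.5 1.0
print(to_gib(model_estimate) > to_gib(ephemeral))  # True: the download must exceed the limit
```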
  • the deployment allocates only 1 GiB of ephemeral storage, which is not enough to accommodate the data it downloads from Hugging Face on startup:
root@control-01:~# kubectl -n $ns get pods -o wide
NAME                         READY   STATUS    RESTARTS   AGE   IP              NODE                   NOMINATED NODE   READINESS GATES
service-1-5f4dfbc6bf-x9z78   1/1     Running   0          68s   10.233.73.172   worker-01.hurricane2   <none>           <none>

root@control-01:~# kubectl -n $ns logs deploy/service-1  -f
A new version of the following files was downloaded from https://huggingface.co/tiiuae/falcon-7b-instruct:
- configuration_falcon.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.

WARNING: You are currently loading Falcon using legacy code contained in the model repository. Falcon has now been fully ported into the Hugging Face transformers library. For the most up-to-date and high-performance version of the Falcon model code, please update to the latest version of transformers and then load the model without the trust_remote_code=True argument.

A new version of the following files was downloaded from https://huggingface.co/tiiuae/falcon-7b-instruct:
- modeling_falcon.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

The pod then gets evicted and the Deployment spawns a replacement:

root@control-01:~# kubectl -n $ns get pods -o wide
NAME                         READY   STATUS                   RESTARTS   AGE     IP              NODE                   NOMINATED NODE   READINESS GATES
service-1-5f4dfbc6bf-4lt5r   0/1     ContainerStatusUnknown   1          10m     10.233.73.142   worker-01.hurricane2   <none>           <none>
service-1-5f4dfbc6bf-bjbgk   1/1     Running                  0          92s     10.233.73.172   worker-01.hurricane2   <none>           <none>
service-1-5f4dfbc6bf-kwjqv   0/1     Error                    0          6m37s   10.233.73.172   worker-01.hurricane2   <none>           <none>
service-1-5f4dfbc6bf-l57gp   0/1     Error                    0          8m21s   10.233.73.178   worker-01.hurricane2   <none>           <none>
service-1-5f4dfbc6bf-mfbzm   0/1     ContainerStatusUnknown   1          3m18s   10.233.73.178   worker-01.hurricane2   <none>           <none>
service-1-5f4dfbc6bf-t4vzb   0/1     Error                    0          4m52s   10.233.73.142   worker-01.hurricane2   <none>           <none>
service-1-5f4dfbc6bf-x9z78   0/1     Error                    0          11m     10.233.73.172   worker-01.hurricane2   <none>           <none>
root@control-01:~# 
andy108369 changed the title on Jan 17, 2025: "Deployment Exhausting Ephemeral Storage Causing Pod Sprawl and Inflated CPU Reports on Provider Status Endpoint" → "Deployment Exhausting Ephemeral Storage Causing Pod Sprawl and Inflated CPU/MEM Reports on Provider Status Endpoint"