A deployment is repeatedly restarting due to running out of ephemeral storage, resulting in thousands of pods accumulating in the Error or ContainerStatusUnknown state. This excessive pod sprawl causes the provider's status endpoint to report abnormally high cpu and mem values.
A quick workaround is to routinely delete the Failed pods (e.g. via a cron job):
kubectl delete pods -A -l akash.network=true --field-selector status.phase=Failed
A crontab job that cleans up Failed Akash pods every 10 minutes:
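The issue text does not include the crontab entry itself; a minimal sketch in /etc/cron.d format, assuming kubectl lives at /usr/local/bin/kubectl and /etc/kubernetes/admin.conf is a valid admin kubeconfig (both paths are assumptions, adjust for your provider host):

```
# /etc/cron.d/akash-failed-pod-cleanup (illustrative; paths are assumptions)
# Every 10 minutes, delete Failed pods carrying the akash.network=true label.
*/10 * * * * root KUBECONFIG=/etc/kubernetes/admin.conf /usr/local/bin/kubectl delete pods -A -l akash.network=true --field-selector status.phase=Failed >/dev/null 2>&1
```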
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m5s default-scheduler Successfully assigned orlnimr0es3uj3kl7h7ha8ibup2lsdkjqmi1sjbvq1n7m/service-1-5f4dfbc6bf-x9z78 to worker-01.hurricane2
Normal Pulled 2m4s kubelet Container image "andrey01/falcon7b:0.4" already present on machine
Normal Created 2m4s kubelet Created container service-1
Normal Started 2m4s kubelet Started container service-1
Warning Evicted 52s kubelet Pod ephemeral local storage usage exceeds the total limit of containers 1073741824.
Normal Killing 52s kubelet Stopping container service-1
Warning ExceededGracePeriod 42s kubelet Container runtime did not kill the pod within specified grace period.
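Note that the kubelet reports the eviction limit in bytes: 1073741824 bytes is exactly 1 GiB, which matches the deployment's ephemeral-storage allocation described below. A quick sanity check:

```shell
# 1073741824 bytes / 1024^3 = 1 GiB
echo "$((1073741824 / 1024 / 1024 / 1024)) GiB"
# prints: 1 GiB
```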
Events per namespace:
root@control-01:~# kubectl get events -n $ns --sort-by='{.metadata.creationTimestamp}'
LAST SEEN TYPE REASON OBJECT MESSAGE
33s Normal SuccessfulCreate replicaset/service-1-5f4dfbc6bf (combined from similar events): Created pod: service-1-5f4dfbc6bf-kwjqv
60m Warning Evicted pod/service-1-5f4dfbc6bf-crrk4 Pod ephemeral local storage usage exceeds the total limit of containers 1073741824.
60m Normal Killing pod/service-1-5f4dfbc6bf-crrk4 Stopping container service-1
60m Warning ExceededGracePeriod pod/service-1-5f4dfbc6bf-crrk4 Container runtime did not kill the pod within specified grace period.
60m Normal Scheduled pod/service-1-5f4dfbc6bf-xccpg Successfully assigned orlnimr0es3uj3kl7h7ha8ibup2lsdkjqmi1sjbvq1n7m/service-1-5f4dfbc6bf-xccpg to worker-01.hurricane2
60m Normal Pulled pod/service-1-5f4dfbc6bf-xccpg Container image "andrey01/falcon7b:0.4" already present on machine
60m Normal Started pod/service-1-5f4dfbc6bf-xccpg Started container service-1
60m Normal Created pod/service-1-5f4dfbc6bf-xccpg Created container service-1
58m Normal Killing pod/service-1-5f4dfbc6bf-xccpg Stopping container service-1
58m Warning Evicted pod/service-1-5f4dfbc6bf-xccpg Pod ephemeral local storage usage exceeds the total limit of containers 1073741824.
58m Warning ExceededGracePeriod pod/service-1-5f4dfbc6bf-xccpg Container runtime did not kill the pod within specified grace period.
58m Normal Scheduled pod/service-1-5f4dfbc6bf-pxdq8 Successfully assigned orlnimr0es3uj3kl7h7ha8ibup2lsdkjqmi1sjbvq1n7m/service-1-5f4dfbc6bf-pxdq8 to worker-01.hurricane2
58m Normal Started pod/service-1-5f4dfbc6bf-pxdq8 Started container service-1
58m Normal Pulled pod/service-1-5f4dfbc6bf-pxdq8 Container image "andrey01/falcon7b:0.4" already present on machine
58m Normal Created pod/service-1-5f4dfbc6bf-pxdq8 Created container service-1
57m Warning Evicted pod/service-1-5f4dfbc6bf-pxdq8 Pod ephemeral local storage usage exceeds the total limit of containers 1073741824.
57m Normal Killing pod/service-1-5f4dfbc6bf-pxdq8 Stopping container service-1
57m Warning ExceededGracePeriod pod/service-1-5f4dfbc6bf-pxdq8 Container runtime did not kill the pod within specified grace period.
56m Normal Created pod/service-1-5f4dfbc6bf-fmwdg Created container service-1
56m Normal Scheduled pod/service-1-5f4dfbc6bf-fmwdg Successfully assigned orlnimr0es3uj3kl7h7ha8ibup2lsdkjqmi1sjbvq1n7m/service-1-5f4dfbc6bf-fmwdg to worker-01.hurricane2
56m Normal Pulled pod/service-1-5f4dfbc6bf-fmwdg Container image "andrey01/falcon7b:0.4" already present on machine
56m Normal Started pod/service-1-5f4dfbc6bf-fmwdg Started container service-1
55m Normal Killing pod/service-1-5f4dfbc6bf-fmwdg Stopping container service-1
55m Warning Evicted pod/service-1-5f4dfbc6bf-fmwdg Pod ephemeral local storage usage exceeds the total limit of containers 1073741824.
55m Warning ExceededGracePeriod pod/service-1-5f4dfbc6bf-fmwdg Container runtime did not kill the pod within specified grace period.
54m Normal Scheduled pod/service-1-5f4dfbc6bf-4pnk6 Successfully assigned orlnimr0es3uj3kl7h7ha8ibup2lsdkjqmi1sjbvq1n7m/service-1-5f4dfbc6bf-4pnk6 to worker-01.hurricane2
...
...
The deployment allocates only 1GiB of ephemeral storage, which is not enough to accommodate the data it downloads from Hugging Face on start:
root@control-01:~# kubectl -n $ns get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
service-1-5f4dfbc6bf-x9z78 1/1 Running 0 68s 10.233.73.172 worker-01.hurricane2 <none> <none>
root@control-01:~# kubectl -n $ns logs deploy/service-1 -f
A new version of the following files was downloaded from https://huggingface.co/tiiuae/falcon-7b-instruct:
- configuration_falcon.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
WARNING: You are currently loading Falcon using legacy code contained in the model repository. Falcon has now been fully ported into the Hugging Face transformers library. For the most up-to-date and high-performance version of the Falcon model code, please update to the latest version of transformers and then load the model without the trust_remote_code=True argument.
A new version of the following files was downloaded from https://huggingface.co/tiiuae/falcon-7b-instruct:
- modeling_falcon.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Downloading shards: 0%| | 0/2 [00:00<?, ?it/s]
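The tenant-side fix is to raise the ephemeral storage allocation in the deployment's SDL so the downloaded model shards fit (the falcon-7b-instruct weights alone are on the order of 14 GB). A sketch of the relevant compute profile section; the service name, cpu, memory, and storage sizes here are illustrative assumptions, not values taken from the issue:

```yaml
# Illustrative Akash SDL (v2) fragment -- sizes are assumptions, not from the issue.
profiles:
  compute:
    service-1:
      resources:
        cpu:
          units: 4
        memory:
          size: 16Gi
        storage:
          size: 32Gi   # ephemeral storage; must fit the Hugging Face download
```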
andy108369 changed the title from "Deployment Exhausting Ephemeral Storage Causing Pod Sprawl and Inflated CPU Reports on Provider Status Endpoint" to "Deployment Exhausting Ephemeral Storage Causing Pod Sprawl and Inflated CPU/MEM Reports on Provider Status Endpoint" on Jan 17, 2025.
provider: 0.6.5-rc6
node: 0.36.0
Once the Failed pods are deleted, the provider's status endpoint returns to normal reporting.
Tech. data
The pod gets evicted and re-spawns again: