There was a storage issue when creating an InferenceService (through the UI). This resulted in status.modelStatus.states.activeModelState: FailedToLoad. It never reconciled to a good state, even after the storage error was fixed: the InferenceService did not retry the download and reconcile. To fix it, I had to update the InferenceService, which triggered a reload of the model.
Perhaps it would be useful to periodically try to reload InferenceServices that are in the FailedToLoad state? Or perhaps the UI could try and refresh/reload?
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: >
      {"apiVersion":"serving.kserve.io/v1beta1","kind":"InferenceService","metadata":{"annotations":{"openshift.io/display-name":"stocks","serving.kserve.io/deploymentMode":"ModelMesh"},"labels":{"name":"stocks","opendatahub.io/dashboard":"true"},"name":"stocks","namespace":"pipelines-tutorial"},"spec":{"predictor":{"model":{"modelFormat":{"name":"onnx","version":"1"},"runtime":"stocks","storage":{"key":"minio-connection","path":"stocks.onnx"}}}}}
    openshift.io/display-name: stocks
    serving.kserve.io/deploymentMode: ModelMesh
  name: stocks
  namespace: pipelines-tutorial
  labels:
    name: stocks
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    model:
      modelFormat:
        name: onnx
        version: '1'
      runtime: stocks
      storage:
        key: minio-connection
        path: stocks.onnx
status:
  conditions:
    - lastTransitionTime: '2023-09-20T15:15:46Z'
      status: 'False'
      type: PredictorReady
    - lastTransitionTime: '2023-09-20T15:15:46Z'
      status: 'False'
      type: Ready
  modelStatus:
    copies:
      failedCopies: 1
      totalCopies: 1
    lastFailureInfo:
      location: 94c77f-9djbq
      message: "Failed to pull model from storage due to error: unable to list objects in bucket 'models': NoSuchBucket: The specified bucket does not exist\n\tstatus code: 404, request id: 1786A448FC13E392, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8"
      modelRevisionName: stocks__isvc-58f42146d7
      reason: ModelLoadFailed
      time: '2023-09-20T15:15:43Z'
    states:
      activeModelState: FailedToLoad
      targetModelState: ''
    transitionStatus: UpToDate
Although self-healing sounds right, the current behavior seems to be on the safe side.
Retries are, perhaps, something that IMO should be turned off by default, but that the user should be able to enable if desired (even per ISVC via annotations/fields, if needed).
We don't want to retry indefinitely to the point that the Cloud bill "scales" accordingly.
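To make the idea concrete, here is a minimal sketch in Python of what an opt-in, bounded retry decision could look like. It is not how KServe or ModelMesh behaves today. The annotation names `example.com/retry-failed-load` and `example.com/retry-max-attempts`, the default of 5 attempts, and the 30-second base delay are all hypothetical and chosen only for illustration. The status fields it reads are the ones shown in the InferenceService above.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical annotation names: NOT part of KServe or ModelMesh.
RETRY_ENABLED = "example.com/retry-failed-load"
RETRY_MAX = "example.com/retry-max-attempts"


def backoff_delay(attempt: int, base: float = 30.0, cap: float = 3600.0) -> float:
    """Seconds to wait before retry number `attempt` (1-based), doubling up to `cap`."""
    return min(cap, base * 2 ** (attempt - 1))


def should_retry(isvc: dict, attempts: int, now: datetime) -> bool:
    """Decide whether a failed model load is due for another attempt.

    `isvc` is the InferenceService as a parsed dict; `attempts` is the number
    of retries already made (tracked by the caller).
    """
    annotations = isvc.get("metadata", {}).get("annotations", {})
    if annotations.get(RETRY_ENABLED) != "true":
        return False  # retries stay off unless the user opts in

    model_status = isvc.get("status", {}).get("modelStatus", {})
    if model_status.get("states", {}).get("activeModelState") != "FailedToLoad":
        return False  # nothing to retry

    # Assumed default of 5 attempts so retries can never run indefinitely.
    if attempts >= int(annotations.get(RETRY_MAX, "5")):
        return False

    failure_time = model_status.get("lastFailureInfo", {}).get("time")
    if failure_time is None:
        return True
    failed_at = datetime.fromisoformat(failure_time.replace("Z", "+00:00"))
    return now >= failed_at + timedelta(seconds=backoff_delay(attempts + 1))
```

With a policy like this, the default stays exactly as it is now (no retries), and an ISVC that opts in retries with growing delays until the attempt cap is reached, which keeps the cost bounded.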