InferenceService should recover from FailedToLoad #208

cfchase · 2023-09-20T17:57:56Z

There was a storage issue when creating an InferenceService (through the UI). This resulted in status.modelStatus.states.activeModelState: FailedToLoad. The InferenceService never reconciled to a good state, even after the storage error was fixed: it did not retry the model download. To fix it, I had to update the InferenceService, which triggered a reload of the model.

Perhaps it would be useful to periodically try to reload InferenceServices that are in the FailedToLoad state? Or perhaps the UI could try and refresh/reload?

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: >
      {"apiVersion":"serving.kserve.io/v1beta1","kind":"InferenceService","metadata":{"annotations":{"openshift.io/display-name":"stocks","serving.kserve.io/deploymentMode":"ModelMesh"},"labels":{"name":"stocks","opendatahub.io/dashboard":"true"},"name":"stocks","namespace":"pipelines-tutorial"},"spec":{"predictor":{"model":{"modelFormat":{"name":"onnx","version":"1"},"runtime":"stocks","storage":{"key":"minio-connection","path":"stocks.onnx"}}}}}
    openshift.io/display-name: stocks
    serving.kserve.io/deploymentMode: ModelMesh
  name: stocks
  namespace: pipelines-tutorial
  labels:
    name: stocks
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    model:
      modelFormat:
        name: onnx
        version: '1'
      runtime: stocks
      storage:
        key: minio-connection
        path: stocks.onnx
status:
  conditions:
    - lastTransitionTime: '2023-09-20T15:15:46Z'
      status: 'False'
      type: PredictorReady
    - lastTransitionTime: '2023-09-20T15:15:46Z'
      status: 'False'
      type: Ready
  modelStatus:
    copies:
      failedCopies: 1
      totalCopies: 1
    lastFailureInfo:
      location: 94c77f-9djbq
      message: "Failed to pull model from storage due to error: unable to list objects in bucket 'models': NoSuchBucket: The specified bucket does not exist\n\tstatus code: 404, request id: 1786A448FC13E392, host id: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8"
      modelRevisionName: stocks__isvc-58f42146d7
      reason: ModelLoadFailed
      time: '2023-09-20T15:15:43Z'
    states:
      activeModelState: FailedToLoad
      targetModelState: ''
    transitionStatus: UpToDate
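
For illustration, the stuck state in the status above could be recognized with a check along these lines. This is a minimal sketch that assumes the InferenceService has already been parsed into a plain dict (for example from the YAML above); the function name `is_stuck_failed_to_load` is hypothetical and not part of KServe or ModelMesh.

```python
def is_stuck_failed_to_load(isvc: dict) -> bool:
    """Return True when the active model failed to load and no new load is pending.

    Assumption: `isvc` is the InferenceService as a dict, using the field
    names from the status block in this issue.
    """
    model_status = (isvc.get("status") or {}).get("modelStatus") or {}
    states = model_status.get("states") or {}
    return (
        states.get("activeModelState") == "FailedToLoad"
        # An empty targetModelState means nothing else is being loaded.
        and not states.get("targetModelState")
        # UpToDate means the controller considers reconciliation finished.
        and model_status.get("transitionStatus") == "UpToDate"
    )


# Status fields copied from the report above.
example = {
    "status": {
        "modelStatus": {
            "states": {"activeModelState": "FailedToLoad", "targetModelState": ""},
            "transitionStatus": "UpToDate",
        }
    }
}
print(is_stuck_failed_to_load(example))
```

A periodic job or the UI could use such a check to decide which InferenceServices are candidates for a reload.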


israel-hdez · 2023-09-21T19:57:46Z

Although self-healing sounds right, current behavior seems to be on the safe side.

Retries are, perhaps, something that IMO should be turned off by default, but the user should be able to enable them if desired (even per ISVC via annotations/fields, if needed).

We don't want to retry indefinitely to the point that the Cloud bill "scales" accordingly.
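
One way the opt-in could look, as a rough sketch only: retries stay off unless an annotation on the ISVC asks for them, and the requested count is clamped so it cannot grow without bound. The annotation key `example.com/model-load-max-retries`, the function name, and the ceiling of 10 are all hypothetical; nothing like this exists in KServe or ModelMesh today.

```python
# Hypothetical annotation key; it is not defined by KServe or ModelMesh.
RETRY_ANNOTATION = "example.com/model-load-max-retries"


def max_load_retries(annotations: dict, default: int = 0, ceiling: int = 10) -> int:
    """Read the per-ISVC retry limit; fall back to `default` (off) when unset or invalid."""
    raw = annotations.get(RETRY_ANNOTATION)
    if raw is None:
        return default
    try:
        value = int(raw)
    except (TypeError, ValueError):
        # A malformed value is treated as "not configured" rather than an error.
        return default
    # Clamp to [0, ceiling] so a typo cannot cause unbounded retries.
    return max(0, min(value, ceiling))


print(max_load_retries({}))
print(max_load_retries({RETRY_ANNOTATION: "3"}))
```

With the default of 0, existing InferenceServices keep the current behavior unless someone explicitly opts in.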

heyselbi · 2023-12-05T19:15:10Z

@cfchase thoughts?

cfchase · 2023-12-05T20:25:12Z

So, a number of retries would probably fill the need, with a sane default. There probably still needs to be a way to trigger a reload through the UI.
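
As a sketch of what a bounded number of retries might mean in practice, the helper below computes the wait before each attempt using exponential backoff with a cap. The function name, the 30-second base, and the one-hour cap are assumptions chosen for illustration, not values taken from KServe or ModelMesh.

```python
from typing import List


def retry_delays(
    max_retries: int, base_seconds: float = 30.0, cap_seconds: float = 3600.0
) -> List[float]:
    """Return the delay before each retry: the base doubles per attempt, up to the cap."""
    return [
        min(base_seconds * 2 ** attempt, cap_seconds)
        for attempt in range(max_retries)
    ]


# Five retries wait 30 s, 60 s, 120 s, 240 s, and 480 s, then give up.
print(retry_delays(5))
```

Once the list is exhausted, the model would stay in FailedToLoad until a manual reload is triggered, which keeps the total number of storage requests finite.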
