From 53fc8bd792af7399925f82d25a949658d0e8dc86 Mon Sep 17 00:00:00 2001 From: Sergey Kanzhelev Date: Fri, 31 May 2024 18:58:34 +0000 Subject: [PATCH 01/19] first draft of a KEP --- keps/prod-readiness/sig-node/4680.yaml | 3 + .../README.md | 762 ++++++++++++++++++ .../kep.yaml | 42 + 3 files changed, 807 insertions(+) create mode 100644 keps/prod-readiness/sig-node/4680.yaml create mode 100644 keps/sig-node/4680-add-resource-health-to-pod-status/README.md create mode 100644 keps/sig-node/4680-add-resource-health-to-pod-status/kep.yaml diff --git a/keps/prod-readiness/sig-node/4680.yaml b/keps/prod-readiness/sig-node/4680.yaml new file mode 100644 index 00000000000..88de2f21eb9 --- /dev/null +++ b/keps/prod-readiness/sig-node/4680.yaml @@ -0,0 +1,3 @@ +kep-number: 4680 +alpha: + approver: "@wojtek-t" diff --git a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md new file mode 100644 index 00000000000..8f6db4c6b9e --- /dev/null +++ b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md @@ -0,0 +1,762 @@ +# KEP-4680: Add Resource Health Status to the Pod Status for Device Plugin and DRA + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [PodStatus.AllocatedResourcesStatus](#podstatusallocatedresourcesstatus) + - [Device Plugin implementation details](#device-plugin-implementation-details) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + +Items marked with (R) are required *prior to targeting to a milestone / release*. 
+ +- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +Today it is hard to impossible to know when the Pod is using a device that has failed or is temporarily unhealthy. This makes troubleshooting of Pod crashes hard or impossible. This KEP aims to fix this by exposing device health via Pod Status. The KEP is intentionally scoped small, but can be extended later to expose more device information to troubleshoot Pod devices placement issues (for example, validating that related Pods are allocated on connected devices). + +## Motivation + +Device Plugin and DRA do not have a good failure handling strategy defined. With proliferation of workloads using devices (like GPU), variable quality of devices, and overcommitting of data centers on power, there are cases when devices can fail temporarily or permanently and k8s need to handle this natively. + +Today, the typical design is for jobs consuming a failing device to fail itself with the specific error code whenever possible. For the inference of long running workloads, k8s will keep restarting the workload without reallocating it on a different device. So container will be in crash loop backoff with limited information on why it is crashing. + +People develop strategies to deal with such situations. Exposing unhealthy devices in Pod Status will provide a generic way to understand that the failure is related to the unhealthy device and be able to respond to this properly. + +### Goals + +- Expose device health information (served by Device Plugin or DRA) in Pod Status and events. + +### Non-Goals + +- Expose any other device information beyond the health. +- Expose CPU assignment of the pod by CPU manager or any other resources assignment by other managers. 
+ +## Proposal + +### PodStatus.AllocatedResourcesStatus + +As part of the InPlacePodVerticalScaling KEP, the two new fields were introduced in Pod Status to reflect the currently allocated resources for the Pod: + +``` + // AllocatedResources represents the compute resources allocated for this container by the + // node. Kubelet sets this value to Container.Resources.Requests upon successful pod admission + // and after successfully admitting desired pod resize. + // +featureGate=InPlacePodVerticalScaling + // +optional + AllocatedResources ResourceList `json:"allocatedResources,omitempty" protobuf:"bytes,10,rep,name=allocatedResources,casttype=ResourceList,castkey=ResourceName"` + + // Resources represents the compute resource requests and limits that have been successfully + // enacted on the running container after it has been started or has been successfully resized. + // +featureGate=InPlacePodVerticalScaling + // +optional + Resources *ResourceRequirements `json:"resources,omitempty" protobuf:"bytes,11,opt,name=resources"` +``` + +One field reflects the resource requests and limits and the other actual allocated resources. + +This structure will contain standard resources as well as extended resources. As noted in the comment: https://github.com/kubernetes/kubernetes/pull/124227#issuecomment-2130503713, it is only logical to also include the status of those allocated resources. + +The proposal is to keep this structure as-is to simplify parsing of well-known ResourceList data type by various consumers. Typical scenario would be to compare if the AllocatedResources match the desired state. + +The proposal is to introduce an additional field: + +``` +// AllocatedResourcesStatus represents the status of various resources +// allocated for this Pod. +AllocatedResourcesStatus ResourcesStatus + +type ResourcesStatus map[ResourceName]ResourceStatus + +type ResourceStatus struct { + // map of unique Resource ID to its status + Resources map[ResourceID] ResourceStatus + + // allow to extend this struct in future with the overall health fields or things like Device Plugin version +} + +type ResourceID string + +type ResourceStatus struct { + // can be one of: + // - Healthy: operates as normal + // - Unhealthy: reported unhealthy. We consider this a temporary health issue since we do not mechanism today to distinguish temporary and permanent issues. + // - Unknown: The status cannot be determined. For example, Device Plugin got unregistered and hasn't been re-registered since. + Status string +} +``` + +***Is there any guarantee that the AllocatedResourcesStatus will be updated before Container crashed and unscheduled?*** + +No, there is no guarantee that the Device Plugin/DRA will detect device going unhealthy earlier than the Pod. Once device got unhealthy, container may crash and being marked as Failed already. + +The proposal is to update the Pod Status with the device status even if the pod has been marked as Failed already, but still known to the kubelet. + + +***Do we need the CheckDeviceHealth call introduced to the Device Plugin to work around the limitation above?*** + +We may consider this as a future improvement. + + +***Should we introduce a permanent failure status?*** + +We may consider this as a future improvement. + +### Device Plugin implementation details + +Kubelet already keeps track of healthy and unhealthy devices as well as the mapping of those devices to Pods. 
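
To make this concrete, populating the new field is essentially a join of two pieces of state the device manager already holds: the per-pod device assignments and the set of devices currently reported unhealthy. A minimal sketch (the inputs are stand-ins for device manager state, not its actual internals):

```
package devicehealth

import v1 "k8s.io/api/core/v1"

// Sketch: derive per-resource device health for one pod by joining the pod's
// device assignments with the set of devices the plugin reported unhealthy.
// Both inputs are illustrative stand-ins for state kubelet already tracks.
func resourceHealth(podDevices map[v1.ResourceName][]string, unhealthy map[string]bool) map[v1.ResourceName]map[string]string {
	result := map[v1.ResourceName]map[string]string{}
	for resourceName, deviceIDs := range podDevices {
		statuses := map[string]string{}
		for _, id := range deviceIDs {
			if unhealthy[id] {
				statuses[id] = "Unhealthy" // explicitly reported unhealthy
			} else {
				statuses[id] = "Healthy"
			}
		}
		result[resourceName] = statuses
	}
	return result
}
```

An `Unknown` entry would come from the unregistration case discussed next.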
+ +One improvement will be needed is to distinguish unhealthy devices (marked unhealthy explicitly) and when device plugin was unregistered. + +NVIDIA device plugin has the checkHealth implementation: https://github.com/NVIDIA/k8s-device-plugin/blob/eb3a709b1dd82280d5acfb85e1e942024ddfcdc6/internal/rm/health.go#L39 that has more information than simple “Unhealthy”. + +We should consider introducing another field to the Status that will be a free form error information as a future improvement. + + +### User Stories (Optional) + +#### Story 1 + +- User scheduled a Pod using the GPU device +- When GPU device fails, user sees the Pod is in crash loop backoff +- User checks the Pod Status using `kubectl describe pod` +- User sees the pod status indicating that the GPU device is not healthy +- User or some (custom for now) controller deletes the Pod and replicaset reschedules it on another available GPU + +### Notes/Constraints/Caveats (Optional) + + + +### Risks and Mitigations + + + +## Design Details + + + +### Test Plan + + + +[ ] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + + + +##### Unit tests + + + + + +- ``: `` - `` + +##### Integration tests + + + + + +- : + +##### e2e tests + + + +- : + +### Graduation Criteria + + + +### Upgrade / Downgrade Strategy + + + +### Version Skew Strategy + + + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + + + +- [ ] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: + - Components depending on the feature gate: +- [ ] Other + - Describe the mechanism: + - Will enabling / disabling the feature require downtime of the control + plane? + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? + +###### Does enabling the feature change any default behavior? + + + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + +###### What happens if we reenable the feature if it was previously rolled back? + +###### Are there any tests for feature enablement/disablement? + + + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? 
+ + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + + +###### Will enabling / using this feature result in introducing new API types? + + + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + + + +## Drawbacks + + + +## Alternatives + +There are a few alternatives to this proposal. + +**First**, an API similar to Pod Resources API can be exposed by kubelet to query via kubectl or directly thru some node exposed port. The problem with this approach is: +- it opens up a new API surface +- It will be impossible to get status for Pods that have completed already + +**Second**, exposing the status for DRA via claims - this approach leads to a debate on how to ensure security so kubelet is limited to which statuses it can set. With this approach, there are mechanisms in place to ensure that kubelet updates status for Pods scheduled on that node. + +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-node/4680-add-resource-health-to-pod-status/kep.yaml b/keps/sig-node/4680-add-resource-health-to-pod-status/kep.yaml new file mode 100644 index 00000000000..f1a57c32e6d --- /dev/null +++ b/keps/sig-node/4680-add-resource-health-to-pod-status/kep.yaml @@ -0,0 +1,42 @@ +title: Add Resource Health Status to the Pod Status for Device Plugin and DRA +kep-number: 4680 +authors: + - "@SergeyKanzhelev" +owning-sig: sig-node +participating-sigs: + - sig-node +status: provisional #|implementable|implemented|deferred|rejected|withdrawn|replaced +creation-date: 2024-05-31 +reviewers: + - "@ffromani" +approvers: + - "@mrunalp" + +see-also: + - "/keps/sig-node/1287-in-place-update-pod-resources" # for AllocatedResources status field + - "/keps/sig-storage/1790-recover-resize-failure" # PVC status + - "/keps/sig-node/3063-dynamic-resource-allocation" # DRA + - "/keps/sig-node/3573-device-plugin" # Device Plugin + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha #|beta|stable + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.31" + +# The milestone at which this feature was, or is targeted to be, at each stage. 
+milestone: + alpha: "v1.31" + beta: "v1.32" + stable: "v1.34" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: ResourceHealthStatus + components: + - kubelet + - kube-apiserver +disable-supported: true From 4cd53125e37c506b302962234fcb9058a1fe30b0 Mon Sep 17 00:00:00 2001 From: Sergey Kanzhelev Date: Fri, 31 May 2024 21:50:43 +0000 Subject: [PATCH 02/19] PRR filled up --- .../README.md | 512 +++--------------- 1 file changed, 79 insertions(+), 433 deletions(-) diff --git a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md index 8f6db4c6b9e..c8bd0f26e1e 100644 --- a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md +++ b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md @@ -8,18 +8,22 @@ - [Non-Goals](#non-goals) - [Proposal](#proposal) - [PodStatus.AllocatedResourcesStatus](#podstatusallocatedresourcesstatus) - - [Device Plugin implementation details](#device-plugin-implementation-details) - [User Stories (Optional)](#user-stories-optional) - [Story 1](#story-1) - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) - [Risks and Mitigations](#risks-and-mitigations) - [Design Details](#design-details) + - [Device Plugin implementation details](#device-plugin-implementation-details) + - [DRA implementation details](#dra-implementation-details) - [Test Plan](#test-plan) - [Prerequisite testing updates](#prerequisite-testing-updates) - [Unit tests](#unit-tests) - [Integration tests](#integration-tests) - [e2e tests](#e2e-tests) - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [GA](#ga) - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) - [Version Skew Strategy](#version-skew-strategy) - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) @@ -154,17 +158,6 @@ We may consider this as a future improvement. We may consider this as a future improvement. -### Device Plugin implementation details - -Kubelet already keeps track of healthy and unhealthy devices as well as the mapping of those devices to Pods. - -One improvement will be needed is to distinguish unhealthy devices (marked unhealthy explicitly) and when device plugin was unregistered. - -NVIDIA device plugin has the checkHealth implementation: https://github.com/NVIDIA/k8s-device-plugin/blob/eb3a709b1dd82280d5acfb85e1e942024ddfcdc6/internal/rm/health.go#L39 that has more information than simple “Unhealthy”. - -We should consider introducing another field to the Status that will be a free form error information as a future improvement. - - ### User Stories (Optional) #### Story 1 @@ -186,562 +179,219 @@ This might be a good place to talk about core concepts and how they relate. ### Risks and Mitigations - +Kubelet already keeps track of healthy and unhealthy devices as well as the mapping of those devices to Pods. -## Design Details +One improvement will be needed is to distinguish unhealthy devices (marked unhealthy explicitly) and when device plugin was unregistered. - +NVIDIA device plugin has the checkHealth implementation: https://github.com/NVIDIA/k8s-device-plugin/blob/eb3a709b1dd82280d5acfb85e1e942024ddfcdc6/internal/rm/health.go#L39 that has more information than simple “Unhealthy”. 
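
For reference, today's plugins surface health through the `ListAndWatch` stream; a simplified sketch of such a loop (not the NVIDIA implementation; `probe` stands in for a vendor-specific check):

```
package main

import (
	"time"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// Simplified sketch of a device plugin health loop: re-send the device list
// whenever any device's health flips. probe is a hypothetical vendor check.
func healthLoop(stream pluginapi.DevicePlugin_ListAndWatchServer, devices []*pluginapi.Device, probe func(id string) bool) error {
	for {
		changed := false
		for _, d := range devices {
			health := pluginapi.Healthy
			if !probe(d.ID) {
				health = pluginapi.Unhealthy
			}
			if d.Health != health {
				d.Health = health
				changed = true
			}
		}
		if changed {
			if err := stream.Send(&pluginapi.ListAndWatchResponse{Devices: devices}); err != nil {
				return err
			}
		}
		time.Sleep(5 * time.Second)
	}
}
```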
-### Test Plan +We should consider introducing another field to the Status that will be a free form error information as a future improvement. +### DRA implementation details - +Kubelet will react on this field the same way as we propose to do it for the Device Plugin. -[ ] I/we understand the owners of the involved components may require updates to +### Test Plan + +[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement. ##### Prerequisite testing updates - +Device Plugin and DRA are relatively new features and have a reasonable test coverage. ##### Unit tests - - - - -- ``: `` - `` +- `k8s.io/kubernetes/pkg/kubelet/cm/devicemanager`: `5/31/2024` - `84.1` +- `k8s.io/kubernetes/pkg/kubelet/cm/dra`: `5/31/2024` - `59.2` +- `k8s.io/kubernetes/pkg/kubelet/cm/dra/plugin`: `5/31/2024` - `34` +- `k8s.io/kubernetes/pkg/kubelet/cm/dra/state`: `5/31/2024` - `98` ##### Integration tests - - - - -- : +N/A ##### e2e tests - +Test coverage will be listed once tests are implemented. - : ### Graduation Criteria - +- Feedback is collected on usability of the field +- Example of real-world usage with one of the device plugin. For example, NVIDIA Device Plugin ### Upgrade / Downgrade Strategy - +The feature exposes a new field based on information the Device Plugin already exposes. There will be no dependency on upgrade/downgrade, feature will either work or not. + +DRA implementation requires DRA interfaces change. DRA is in alpha and in active development. The feature will follow the DRA ugrade/downgrade strategy. ### Version Skew Strategy - +There is no issue with the version skew. Kubelet that will expose this flag will +always be the same version of behind the API, which introduced this new field. ## Production Readiness Review Questionnaire - - ### Feature Enablement and Rollback - +Simple change of a feature gate will either enable or disable this feature. ###### How can this feature be enabled / disabled in a live cluster? - -- [ ] Feature gate (also fill in values in `kep.yaml`) - - Feature gate name: - - Components depending on the feature gate: -- [ ] Other - - Describe the mechanism: - - Will enabling / disabling the feature require downtime of the control - plane? - - Will enabling / disabling the feature require downtime or reprovisioning - of a node? +- [X] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: `ResourceHealthStatus` + - Components depending on the feature gate: `kubelet` and `kube-apiserver` ###### Does enabling the feature change any default behavior? - +No ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? - +Yes, with no side effect except of missing the new field in pod status. ###### What happens if we reenable the feature if it was previously rolled back? +The pod status will be updated again. + ###### Are there any tests for feature enablement/disablement? - +Nothing is planned. ### Rollout, Upgrade and Rollback Planning - - ###### How can a rollout or rollback fail? Can it impact already running workloads? - +No ###### What specific metrics should inform a rollback? - +N/A ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? - +Will be tested, but we do not expect any issues. ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? 
- +No ### Monitoring Requirements - - ###### How can an operator determine if the feature is in use by workloads? - +Check the Pod Status. ###### How can someone using this feature know that it is working for their instance? - - -- [ ] Events - - Event Reason: -- [ ] API .status - - Condition name: - - Other field: -- [ ] Other (treat as last resort) - - Details: +- [X] API pod.status ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? - - ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? - - -- [ ] Metrics - - Metric name: - - [Optional] Aggregation method: - - Components exposing the metric: -- [ ] Other (treat as last resort) - - Details: +N/A ###### Are there any missing metrics that would be useful to have to improve observability of this feature? - +N/A ### Dependencies - +DRA implementation. ###### Does this feature depend on any specific services running in the cluster? - +No ### Scalability - - ###### Will enabling / using this feature result in any new API calls? - +Pod Status size will increase insignificantly. ###### Will enabling / using this feature result in introducing new API types? - +New field on Pod Status. ###### Will enabling / using this feature result in any new calls to the cloud provider? - +No ###### Will enabling / using this feature result in increasing size or count of the existing API objects? - +Pod Status size will increase insignificantly. ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? - +No ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? - +Not significantly. We already keep all the collection in memory, just need to connect dots. ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? - +No ### Troubleshooting - - ###### How does this feature react if the API server and/or etcd is unavailable? +N/A + ###### What are other known failure modes? - +Not applicable. ###### What steps should be taken if SLOs are not being met to determine the problem? ## Implementation History - +- `v1.31`: KEP is in alpha ## Drawbacks - +Not that we can think of. ## Alternatives @@ -755,8 +405,4 @@ There are a few alternatives to this proposal. ## Infrastructure Needed (Optional) - +We may need to update sample device plugin. No special infra is needed as emulating real GPU failures or failures in other devices is not practical. 
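
One low-effort option is to teach the sample device plugin to emulate a failure on demand, so e2e tests can exercise the new status without real hardware. A sketch (the trigger mechanism is illustrative):

```
package main

import pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"

// Sketch: let the sample device plugin flip a named device to Unhealthy on
// demand. The channel-based trigger is illustrative; a file or an environment
// flag would work just as well.
func emulateFailure(devices []*pluginapi.Device, failures <-chan string, updates chan<- []*pluginapi.Device) {
	for id := range failures {
		for _, d := range devices {
			if d.ID == id {
				d.Health = pluginapi.Unhealthy
			}
		}
		updates <- devices // the plugin re-sends this list via ListAndWatch
	}
}
```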
From 42dc7d69eba9584f740a4ea0d8795a185b63dd7f Mon Sep 17 00:00:00 2001 From: Sergey Kanzhelev Date: Sat, 8 Jun 2024 06:52:44 +0000 Subject: [PATCH 03/19] addressed some comments --- .../README.md | 23 +++++++++++++++---- 1 file changed, 18 insertions(+), 5 deletions(-) diff --git a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md index c8bd0f26e1e..4da0d456c5d 100644 --- a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md +++ b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md @@ -125,7 +125,11 @@ AllocatedResourcesStatus ResourcesStatus type ResourcesStatus map[ResourceName]ResourceStatus type ResourceStatus struct { - // map of unique Resource ID to its status + // map of unique Resource ID to its status with the following restrictions: + // - ResourceID must uniquely identify the Resource allocated to the Pod on the Node for the lifetime of a Pod. + // - ResourceID may not make sense outside the Pod. Often it will identify the resource on the Node, but not guaranteed. + // - ResourceID of one Pod must not be compared with the ResourceID of other Pod. + // In case of Device Plugin ResourceID maps to DeviceID. Resources map[ResourceID] ResourceStatus // allow to extend this struct in future with the overall health fields or things like Device Plugin version @@ -136,18 +140,25 @@ type ResourceID string type ResourceStatus struct { // can be one of: // - Healthy: operates as normal - // - Unhealthy: reported unhealthy. We consider this a temporary health issue since we do not mechanism today to distinguish temporary and permanent issues. - // - Unknown: The status cannot be determined. For example, Device Plugin got unregistered and hasn't been re-registered since. + // - Unhealthy: reported unhealthy. We consider this a temporary health issue + // since we do not mechanism today to distinguish + // temporary and permanent issues. + // - Unknown: The status cannot be determined. + // For example, Device Plugin got unregistered and hasn't been re-registered since. + // + // In future we may want to introduce the PermanentlyUnhealthy Status. Status string } ``` ***Is there any guarantee that the AllocatedResourcesStatus will be updated before Container crashed and unscheduled?*** -No, there is no guarantee that the Device Plugin/DRA will detect device going unhealthy earlier than the Pod. Once device got unhealthy, container may crash and being marked as Failed already. +No, there is no guarantee that the Device Plugin/DRA will detect device going unhealthy earlier than the Pod. Once device got unhealthy, container may crash and being marked as Failed already (if `restartPolicy=Never`, in other cases Pod will enter crash loop backoff). The proposal is to update the Pod Status with the device status even if the pod has been marked as Failed already, but still known to the kubelet. +This use case is important to explore so this status may inform the retry policy introduced by the KEP: [Retriable and non-retriable Pod failures for Jobs](https://github.com/kubernetes/enhancements/issues/3329). + ***Do we need the CheckDeviceHealth call introduced to the Device Plugin to work around the limitation above?*** @@ -200,10 +211,12 @@ We should consider introducing another field to the Status that will be a free f Today DRA does not return the health of the device back to kubelet. 
The proposal is to extend the type `NamedResourcesInstance` (from [pkg/apis/resource/namedresources.go](https://github.com/kubernetes/kubernetes/blob/790dfdbe386e4a115f41d38058c127d2dd0e6f44/pkg/apis/resource/namedresources.go#L29-L37)) to include the Health field the same way it is done in -the Device Plugin. +the Device Plugin as well as a device ID. Kubelet will react on this field the same way as we propose to do it for the Device Plugin. +Specific implementation details will be added for the beta. + ### Test Plan [X] I/we understand the owners of the involved components may require updates to From 04188cdcbe3458b6e60fedf69f25f60707925674 Mon Sep 17 00:00:00 2001 From: Sergey Kanzhelev Date: Mon, 10 Jun 2024 16:32:23 -0700 Subject: [PATCH 04/19] Update keps/sig-node/4680-add-resource-health-to-pod-status/README.md Co-authored-by: John Belamaric --- keps/sig-node/4680-add-resource-health-to-pod-status/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md index 4da0d456c5d..d04e2edb8a0 100644 --- a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md +++ b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md @@ -69,7 +69,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release* ## Summary -Today it is hard to impossible to know when the Pod is using a device that has failed or is temporarily unhealthy. This makes troubleshooting of Pod crashes hard or impossible. This KEP aims to fix this by exposing device health via Pod Status. The KEP is intentionally scoped small, but can be extended later to expose more device information to troubleshoot Pod devices placement issues (for example, validating that related Pods are allocated on connected devices). +Today it is difficult to know when a Pod is using a device that has failed or is temporarily unhealthy. This makes troubleshooting of Pod crashes hard or impossible. This KEP will fix this by exposing device health via Pod Status. This KEP is intentionally scoped small, but can be extended later to expose more device information to troubleshoot Pod devices placement issues (for example, validating that related Pods are allocated on connected devices). ## Motivation From b10035fcde402bde7a09126b37bc72e2f8c8ff5d Mon Sep 17 00:00:00 2001 From: Sergey Kanzhelev Date: Mon, 10 Jun 2024 16:32:30 -0700 Subject: [PATCH 05/19] Update keps/sig-node/4680-add-resource-health-to-pod-status/README.md Co-authored-by: John Belamaric --- keps/sig-node/4680-add-resource-health-to-pod-status/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md index d04e2edb8a0..536dacce04d 100644 --- a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md +++ b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md @@ -75,7 +75,7 @@ Today it is difficult to know when a Pod is using a device that has failed or is Device Plugin and DRA do not have a good failure handling strategy defined. With proliferation of workloads using devices (like GPU), variable quality of devices, and overcommitting of data centers on power, there are cases when devices can fail temporarily or permanently and k8s need to handle this natively. 
-Today, the typical design is for jobs consuming a failing device to fail itself with the specific error code whenever possible. For the inference of long running workloads, k8s will keep restarting the workload without reallocating it on a different device. So container will be in crash loop backoff with limited information on why it is crashing. +Today, the typical design is for jobs consuming a failing device to fail with a specific error code whenever possible. For long running workloads, K8s will keep restarting the workload without reallocating it on a different device. So the container will be in crash loop backoff with limited information on why it is crashing. People develop strategies to deal with such situations. Exposing unhealthy devices in Pod Status will provide a generic way to understand that the failure is related to the unhealthy device and be able to respond to this properly. From f3a88dbab9f5c2090e713994d5073d7f9b7fe058 Mon Sep 17 00:00:00 2001 From: Sergey Kanzhelev Date: Mon, 10 Jun 2024 16:32:46 -0700 Subject: [PATCH 06/19] Update keps/sig-node/4680-add-resource-health-to-pod-status/README.md Co-authored-by: John Belamaric --- keps/sig-node/4680-add-resource-health-to-pod-status/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md index 536dacce04d..d4c98d53fc0 100644 --- a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md +++ b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md @@ -77,7 +77,7 @@ Device Plugin and DRA do not have a good failure handling strategy defined. With Today, the typical design is for jobs consuming a failing device to fail with a specific error code whenever possible. For long running workloads, K8s will keep restarting the workload without reallocating it on a different device. So the container will be in crash loop backoff with limited information on why it is crashing. -People develop strategies to deal with such situations. Exposing unhealthy devices in Pod Status will provide a generic way to understand that the failure is related to the unhealthy device and be able to respond to this properly. +Exposing unhealthy devices in Pod Status will provide a generic way to understand that the failure is related to the unhealthy device, and be able to respond to this properly. ### Goals From a803b0b2b21c9b38e3cc2b8d507ee37a47ce256a Mon Sep 17 00:00:00 2001 From: Sergey Kanzhelev Date: Wed, 12 Jun 2024 17:35:11 +0000 Subject: [PATCH 07/19] addressed some comments --- .../README.md | 35 ++++++++++++++----- 1 file changed, 26 insertions(+), 9 deletions(-) diff --git a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md index d4c98d53fc0..cd711361999 100644 --- a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md +++ b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md @@ -113,7 +113,7 @@ One field reflects the resource requests and limits and the other actual allocat This structure will contain standard resources as well as extended resources. As noted in the comment: https://github.com/kubernetes/kubernetes/pull/124227#issuecomment-2130503713, it is only logical to also include the status of those allocated resources. -The proposal is to keep this structure as-is to simplify parsing of well-known ResourceList data type by various consumers. 
Typical scenario would be to compare if the AllocatedResources match the desired state. +The proposal is to keep this structure as-is to simplify parsing of well-known ResourceList data type by various consumers. Typical scenario would be to compare if the `AllocatedResources` match the desired state. The proposal is to introduce an additional field: @@ -125,19 +125,23 @@ AllocatedResourcesStatus ResourcesStatus type ResourcesStatus map[ResourceName]ResourceStatus type ResourceStatus struct { - // map of unique Resource ID to its status with the following restrictions: - // - ResourceID must uniquely identify the Resource allocated to the Pod on the Node for the lifetime of a Pod. - // - ResourceID may not make sense outside the Pod. Often it will identify the resource on the Node, but not guaranteed. - // - ResourceID of one Pod must not be compared with the ResourceID of other Pod. - // In case of Device Plugin ResourceID maps to DeviceID. - Resources map[ResourceID] ResourceStatus + // map of unique Resource ID to its health. + // At a minimum, ResourceID must uniquely identify the Resource + // allocated to the Pod on the Node for the lifetime of a Pod. + // See ResourceID type for it's definition. + Resources map[ResourceID] ResourceHealth // allow to extend this struct in future with the overall health fields or things like Device Plugin version } type ResourceID string -type ResourceStatus struct { +type ResourceHealth struct { + // List of conditions with the transition times + Conditions []ResourceHealthCondition +} +// This condition type is replicating other condition types exposed by various status APIs +type ResourceHealthCondition struct { // can be one of: // - Healthy: operates as normal // - Unhealthy: reported unhealthy. We consider this a temporary health issue @@ -147,7 +151,19 @@ type ResourceStatus struct { // For example, Device Plugin got unregistered and hasn't been re-registered since. // // In future we may want to introduce the PermanentlyUnhealthy Status. - Status string + Type string + + // Status of the condition, one of True, False, Unknown. + Status ConditionStatus + // The last time the condition transitioned from one status to another. + // +optional + LastTransitionTime metav1.Time + // The reason for the condition's last transition. + // +optional + Reason string + // A human readable message indicating details about the transition. + // +optional + Message string } ``` @@ -206,6 +222,7 @@ One improvement will be needed is to distinguish unhealthy devices (marked unhea NVIDIA device plugin has the checkHealth implementation: https://github.com/NVIDIA/k8s-device-plugin/blob/eb3a709b1dd82280d5acfb85e1e942024ddfcdc6/internal/rm/health.go#L39 that has more information than simple “Unhealthy”. We should consider introducing another field to the Status that will be a free form error information as a future improvement. + ### DRA implementation details Today DRA does not return the health of the device back to kubelet. 
The proposal is to extend the From 9acdc746029ab9369aa2d0f2959692edf19dabfd Mon Sep 17 00:00:00 2001 From: Sergey Kanzhelev Date: Wed, 12 Jun 2024 17:41:56 +0000 Subject: [PATCH 08/19] added device ID clarification --- .../README.md | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md index cd711361999..c023dc8ffce 100644 --- a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md +++ b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md @@ -69,15 +69,15 @@ Items marked with (R) are required *prior to targeting to a milestone / release* ## Summary -Today it is difficult to know when a Pod is using a device that has failed or is temporarily unhealthy. This makes troubleshooting of Pod crashes hard or impossible. This KEP will fix this by exposing device health via Pod Status. This KEP is intentionally scoped small, but can be extended later to expose more device information to troubleshoot Pod devices placement issues (for example, validating that related Pods are allocated on connected devices). +Today it is hard to impossible to know when the Pod is using a device that has failed or is temporarily unhealthy. This makes troubleshooting of Pod crashes hard or impossible. This KEP aims to fix this by exposing device health via Pod Status. The KEP is intentionally scoped small, but can be extended later to expose more device information to troubleshoot Pod devices placement issues (for example, validating that related Pods are allocated on connected devices). ## Motivation Device Plugin and DRA do not have a good failure handling strategy defined. With proliferation of workloads using devices (like GPU), variable quality of devices, and overcommitting of data centers on power, there are cases when devices can fail temporarily or permanently and k8s need to handle this natively. -Today, the typical design is for jobs consuming a failing device to fail with a specific error code whenever possible. For long running workloads, K8s will keep restarting the workload without reallocating it on a different device. So the container will be in crash loop backoff with limited information on why it is crashing. +Today, the typical design is for jobs consuming a failing device to fail itself with the specific error code whenever possible. For the inference of long running workloads, k8s will keep restarting the workload without reallocating it on a different device. So container will be in crash loop backoff with limited information on why it is crashing. -Exposing unhealthy devices in Pod Status will provide a generic way to understand that the failure is related to the unhealthy device, and be able to respond to this properly. +People develop strategies to deal with such situations. Exposing unhealthy devices in Pod Status will provide a generic way to understand that the failure is related to the unhealthy device and be able to respond to this properly. ### Goals @@ -134,12 +134,19 @@ type ResourceStatus struct { // allow to extend this struct in future with the overall health fields or things like Device Plugin version } +// ResourceID is calculated based on source of this resource health information. 
+// For DevicePlugin: +// deviceplugin:Device.ID, where Device.ID is from the Device structure of DevicePlugin's ListAndWatchResponse type: https://github.com/kubernetes/kubernetes/blob/eda1c780543a27c078450e2f17d674471e00f494/staging/src/k8s.io/kubelet/pkg/apis/deviceplugin/v1alpha/api.proto#L61-L73 +// DevicePlugin ID is usually a constant for the lifetime of a Node and typically can be used to uniquely identify the device on the node. +// For DRA: +// dra:[/]/: such a device can be looked up in the information published by that DRA driver to learn more about it. It is designed to be globally unique in a cluster. type ResourceID string type ResourceHealth struct { // List of conditions with the transition times Conditions []ResourceHealthCondition } + // This condition type is replicating other condition types exposed by various status APIs type ResourceHealthCondition struct { // can be one of: From 00fc98423e24459a8c5038f4e8c00ca1a4fb7d7c Mon Sep 17 00:00:00 2001 From: Sergey Kanzhelev Date: Wed, 12 Jun 2024 17:43:55 +0000 Subject: [PATCH 09/19] more reviewers --- keps/sig-node/4680-add-resource-health-to-pod-status/kep.yaml | 3 +++ 1 file changed, 3 insertions(+) diff --git a/keps/sig-node/4680-add-resource-health-to-pod-status/kep.yaml b/keps/sig-node/4680-add-resource-health-to-pod-status/kep.yaml index f1a57c32e6d..45840e715a9 100644 --- a/keps/sig-node/4680-add-resource-health-to-pod-status/kep.yaml +++ b/keps/sig-node/4680-add-resource-health-to-pod-status/kep.yaml @@ -9,6 +9,9 @@ status: provisional #|implementable|implemented|deferred|rejected|withdrawn|repl creation-date: 2024-05-31 reviewers: - "@ffromani" + - "@klueska" + - "@kad" + - "@pohly" approvers: - "@mrunalp" From d28bffe7d5fc4277497ef6e2ba2e1cd6f0f4bae3 Mon Sep 17 00:00:00 2001 From: Sergey Kanzhelev Date: Wed, 12 Jun 2024 17:55:56 +0000 Subject: [PATCH 10/19] added DRA imlpementation overview --- .../4680-add-resource-health-to-pod-status/README.md | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md index c023dc8ffce..7fd5cb62851 100644 --- a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md +++ b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md @@ -69,15 +69,15 @@ Items marked with (R) are required *prior to targeting to a milestone / release* ## Summary -Today it is hard to impossible to know when the Pod is using a device that has failed or is temporarily unhealthy. This makes troubleshooting of Pod crashes hard or impossible. This KEP aims to fix this by exposing device health via Pod Status. The KEP is intentionally scoped small, but can be extended later to expose more device information to troubleshoot Pod devices placement issues (for example, validating that related Pods are allocated on connected devices). +Today it is difficult to know when a Pod is using a device that has failed or is temporarily unhealthy. This makes troubleshooting of Pod crashes hard or impossible. This KEP will fix this by exposing device health via Pod Status. This KEP is intentionally scoped small, but can be extended later to expose more device information to troubleshoot Pod devices placement issues (for example, validating that related Pods are allocated on connected devices). ## Motivation Device Plugin and DRA do not have a good failure handling strategy defined. 
With proliferation of workloads using devices (like GPU), variable quality of devices, and overcommitting of data centers on power, there are cases when devices can fail temporarily or permanently and k8s need to handle this natively. -Today, the typical design is for jobs consuming a failing device to fail itself with the specific error code whenever possible. For the inference of long running workloads, k8s will keep restarting the workload without reallocating it on a different device. So container will be in crash loop backoff with limited information on why it is crashing. +Today, the typical design is for jobs consuming a failing device to fail with a specific error code whenever possible. For long running workloads, K8s will keep restarting the workload without reallocating it on a different device. So the container will be in crash loop backoff with limited information on why it is crashing. -People develop strategies to deal with such situations. Exposing unhealthy devices in Pod Status will provide a generic way to understand that the failure is related to the unhealthy device and be able to respond to this properly. +Exposing unhealthy devices in Pod Status will provide a generic way to understand that the failure is related to the unhealthy device, and be able to respond to this properly. ### Goals @@ -237,6 +237,12 @@ Today DRA does not return the health of the device back to kubelet. The proposal type `NamedResourcesInstance` (from [pkg/apis/resource/namedresources.go](https://github.com/kubernetes/kubernetes/blob/790dfdbe386e4a115f41d38058c127d2dd0e6f44/pkg/apis/resource/namedresources.go#L29-L37)) to include the Health field the same way it is done in the Device Plugin as well as a device ID. +In `1.30` we had a similar `ListAndWatch()` API as in DevicePlugin, from which we could have inferred something very analogous to the above. However, we are removing this in `1.31`, so will need to provide something different. + +An optional gRPC interface will be created, so DRA drivers can opt into this by implementing it. The interface will allow a plugin to stream health status information in as form of deviceIDs (of the form `//`) along with extra metadata indicating its health status. Just as before, a device completely disappearing would still need to trigger some state change, but now more informative information could be attached in the form of metadata when a device isn't necessarily gone, but also isn't operating as it should be. + +The API will be limited to "prepared" devices and include the claim `name/namespace/UID`. That should be enough information for kubelet to correlate with the pods for which the claim was prepared and then post that information for those pods. + Kubelet will react on this field the same way as we propose to do it for the Device Plugin. Specific implementation details will be added for the beta. 
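
To make the optional interface concrete, a rough Go sketch of the shape such a service could take (entirely illustrative; the actual gRPC API will be defined during implementation):

```
package main

import "context"

// DeviceHealthUpdate is an illustrative message a DRA driver could stream to
// kubelet for a prepared device. All field names here are placeholders, not
// a real kubelet API.
type DeviceHealthUpdate struct {
	DeviceID       string // "<driver name>/<pool name>/<device name>"
	ClaimNamespace string
	ClaimName      string
	ClaimUID       string
	Health         string // Healthy | Unhealthy | Unknown
	Message        string // optional free-form detail
}

// HealthWatcher is the optional capability a driver could implement; kubelet
// would consume the stream and update the status of affected pods.
type HealthWatcher interface {
	WatchHealth(ctx context.Context, updates chan<- DeviceHealthUpdate) error
}
```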
From bc1a642b1f2a98aea265ce86ab31ac01f9918ef6 Mon Sep 17 00:00:00 2001 From: Sergey Kanzhelev Date: Wed, 12 Jun 2024 21:35:08 +0000 Subject: [PATCH 11/19] marked as implementable --- .../sig-node/4680-add-resource-health-to-pod-status/README.md | 4 ++-- keps/sig-node/4680-add-resource-health-to-pod-status/kep.yaml | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md index 7fd5cb62851..1cde12a5241 100644 --- a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md +++ b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md @@ -134,9 +134,9 @@ type ResourceStatus struct { // allow to extend this struct in future with the overall health fields or things like Device Plugin version } -// ResourceID is calculated based on source of this resource health information. +// ResourceID is calculated based on the source of this resource health information. // For DevicePlugin: -// deviceplugin:Device.ID, where Device.ID is from the Device structure of DevicePlugin's ListAndWatchResponse type: https://github.com/kubernetes/kubernetes/blob/eda1c780543a27c078450e2f17d674471e00f494/staging/src/k8s.io/kubelet/pkg/apis/deviceplugin/v1alpha/api.proto#L61-L73 +// deviceplugin:DeviceID, where DeviceID is from the Device structure of DevicePlugin's ListAndWatchResponse type: https://github.com/kubernetes/kubernetes/blob/eda1c780543a27c078450e2f17d674471e00f494/staging/src/k8s.io/kubelet/pkg/apis/deviceplugin/v1alpha/api.proto#L61-L73 // DevicePlugin ID is usually a constant for the lifetime of a Node and typically can be used to uniquely identify the device on the node. // For DRA: // dra:[/]/: such a device can be looked up in the information published by that DRA driver to learn more about it. It is designed to be globally unique in a cluster. 
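// Illustrative (hypothetical) examples of both forms, not real device names:
//   deviceplugin:GPU-6e9b0716
//   dra:gpu.example.com/pool-a/dev-0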
diff --git a/keps/sig-node/4680-add-resource-health-to-pod-status/kep.yaml b/keps/sig-node/4680-add-resource-health-to-pod-status/kep.yaml index 45840e715a9..921d15effde 100644 --- a/keps/sig-node/4680-add-resource-health-to-pod-status/kep.yaml +++ b/keps/sig-node/4680-add-resource-health-to-pod-status/kep.yaml @@ -5,7 +5,7 @@ authors: owning-sig: sig-node participating-sigs: - sig-node -status: provisional #|implementable|implemented|deferred|rejected|withdrawn|replaced +status: implementable #provisional|X|implemented|deferred|rejected|withdrawn|replaced creation-date: 2024-05-31 reviewers: - "@ffromani" From c9dde72547713dad24254578a16732342adf9ac9 Mon Sep 17 00:00:00 2001 From: Sergey Kanzhelev Date: Wed, 12 Jun 2024 21:47:30 +0000 Subject: [PATCH 12/19] switch the PRR reviewer --- keps/prod-readiness/sig-node/4680.yaml | 2 +- keps/sig-node/4680-add-resource-health-to-pod-status/kep.yaml | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/keps/prod-readiness/sig-node/4680.yaml b/keps/prod-readiness/sig-node/4680.yaml index 88de2f21eb9..005c2c973a6 100644 --- a/keps/prod-readiness/sig-node/4680.yaml +++ b/keps/prod-readiness/sig-node/4680.yaml @@ -1,3 +1,3 @@ kep-number: 4680 alpha: - approver: "@wojtek-t" + approver: "@johnbelamaric" diff --git a/keps/sig-node/4680-add-resource-health-to-pod-status/kep.yaml b/keps/sig-node/4680-add-resource-health-to-pod-status/kep.yaml index 921d15effde..e0f1700fbd8 100644 --- a/keps/sig-node/4680-add-resource-health-to-pod-status/kep.yaml +++ b/keps/sig-node/4680-add-resource-health-to-pod-status/kep.yaml @@ -12,6 +12,7 @@ reviewers: - "@klueska" - "@kad" - "@pohly" + - "@johnbelamaric" approvers: - "@mrunalp" From bac08eaddaa1ff5a861c8165b11ee2b2a7ae0ba1 Mon Sep 17 00:00:00 2001 From: Sergey Kanzhelev Date: Wed, 12 Jun 2024 22:04:53 +0000 Subject: [PATCH 13/19] changed PRR reviewer to jpbetz --- keps/prod-readiness/sig-node/4680.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/prod-readiness/sig-node/4680.yaml b/keps/prod-readiness/sig-node/4680.yaml index 005c2c973a6..56e30ca35ee 100644 --- a/keps/prod-readiness/sig-node/4680.yaml +++ b/keps/prod-readiness/sig-node/4680.yaml @@ -1,3 +1,3 @@ kep-number: 4680 alpha: - approver: "@johnbelamaric" + approver: "@jpbetz" From 8a058090e26bb0d08a23c521f04cbdee88807a11 Mon Sep 17 00:00:00 2001 From: Sergey Kanzhelev Date: Wed, 12 Jun 2024 16:51:39 -0700 Subject: [PATCH 14/19] Update keps/sig-node/4680-add-resource-health-to-pod-status/README.md Co-authored-by: Richa Banker --- keps/sig-node/4680-add-resource-health-to-pod-status/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md index 1cde12a5241..b73b31af6fc 100644 --- a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md +++ b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md @@ -152,7 +152,7 @@ type ResourceHealthCondition struct { // can be one of: // - Healthy: operates as normal // - Unhealthy: reported unhealthy. We consider this a temporary health issue - // since we do not mechanism today to distinguish + // since we do not have a mechanism today to distinguish // temporary and permanent issues. // - Unknown: The status cannot be determined. // For example, Device Plugin got unregistered and hasn't been re-registered since. 
From 3b29c84ab3ab2511f1e9d61176f198a6143df32c Mon Sep 17 00:00:00 2001 From: Sergey Kanzhelev Date: Wed, 12 Jun 2024 16:52:47 -0700 Subject: [PATCH 15/19] Update keps/sig-node/4680-add-resource-health-to-pod-status/README.md Co-authored-by: Richa Banker --- keps/sig-node/4680-add-resource-health-to-pod-status/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md index b73b31af6fc..5ba70038b6f 100644 --- a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md +++ b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md @@ -214,7 +214,7 @@ This might be a good place to talk about core concepts and how they relate. ### Risks and Mitigations There is not many risks of this KEP. The biggest risk is that Device Plugins will not be -able to detect device health reliable and fast enough to assign this status to the +able to detect device health reliably and fast enough to assign this status to the Pods, marked as `restartPolicy=Never`. End users will expect this field and the missing health information will confuse them. From 26e60df1f3d99b0bb303087f9f47918d721ba636 Mon Sep 17 00:00:00 2001 From: Sergey Kanzhelev Date: Wed, 12 Jun 2024 16:52:54 -0700 Subject: [PATCH 16/19] Update keps/sig-node/4680-add-resource-health-to-pod-status/README.md Co-authored-by: Richa Banker --- keps/sig-node/4680-add-resource-health-to-pod-status/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md index 5ba70038b6f..83781176306 100644 --- a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md +++ b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md @@ -239,7 +239,7 @@ the Device Plugin as well as a device ID. In `1.30` we had a similar `ListAndWatch()` API as in DevicePlugin, from which we could have inferred something very analogous to the above. However, we are removing this in `1.31`, so will need to provide something different. -An optional gRPC interface will be created, so DRA drivers can opt into this by implementing it. The interface will allow a plugin to stream health status information in as form of deviceIDs (of the form `//`) along with extra metadata indicating its health status. Just as before, a device completely disappearing would still need to trigger some state change, but now more informative information could be attached in the form of metadata when a device isn't necessarily gone, but also isn't operating as it should be. +An optional gRPC interface will be created, so DRA drivers can opt into this by implementing it. The interface will allow a plugin to stream health status information in the form of deviceIDs (of the form `//`) along with extra metadata indicating its health status. Just as before, a device completely disappearing would still need to trigger some state change, but now more detailed information could be attached in the form of metadata when a device isn't necessarily gone, but also isn't operating as it should be. The API will be limited to "prepared" devices and include the claim `name/namespace/UID`. That should be enough information for kubelet to correlate with the pods for which the claim was prepared and then post that information for those pods. 
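
A sketch of that correlation step on the kubelet side (the index and lookup are illustrative stand-ins for kubelet bookkeeping):

```
package main

import "k8s.io/apimachinery/pkg/types"

// claimKey identifies a prepared claim by namespace/name/UID, matching the
// information the health stream would carry.
type claimKey struct {
	Namespace, Name, UID string
}

// Sketch: kubelet can maintain claimToPods while handling
// NodePrepareResources, then map each incoming health update back to the
// pods whose status needs to be patched. All names here are illustrative.
func podsToUpdate(claimToPods map[claimKey][]types.UID, ns, name, uid string) []types.UID {
	return claimToPods[claimKey{Namespace: ns, Name: name, UID: uid}]
}
```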
From b684161e44c243b6eb1e581dcfa8e87a477ea349 Mon Sep 17 00:00:00 2001 From: Sergey Kanzhelev Date: Wed, 12 Jun 2024 23:55:25 +0000 Subject: [PATCH 17/19] added the note that SLO is not applicable --- keps/sig-node/4680-add-resource-health-to-pod-status/README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md index 83781176306..04d0663a2c6 100644 --- a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md +++ b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md @@ -370,6 +370,8 @@ Check the Pod Status. ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? +N/A + ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? N/A From 2f5c7a9d996ac07294b40b93c4c5b3d35bb35068 Mon Sep 17 00:00:00 2001 From: Sergey Kanzhelev Date: Thu, 13 Jun 2024 10:01:52 -0700 Subject: [PATCH 18/19] Update keps/sig-node/4680-add-resource-health-to-pod-status/README.md Co-authored-by: Patrick Ohly --- keps/sig-node/4680-add-resource-health-to-pod-status/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md index 04d0663a2c6..5a423fd548e 100644 --- a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md +++ b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md @@ -139,7 +139,7 @@ type ResourceStatus struct { // deviceplugin:DeviceID, where DeviceID is from the Device structure of DevicePlugin's ListAndWatchResponse type: https://github.com/kubernetes/kubernetes/blob/eda1c780543a27c078450e2f17d674471e00f494/staging/src/k8s.io/kubelet/pkg/apis/deviceplugin/v1alpha/api.proto#L61-L73 // DevicePlugin ID is usually a constant for the lifetime of a Node and typically can be used to uniquely identify the device on the node. // For DRA: -// dra:[/]/: such a device can be looked up in the information published by that DRA driver to learn more about it. It is designed to be globally unique in a cluster. +// dra://: such a device can be looked up in the information published by that DRA driver to learn more about it. It is designed to be globally unique in a cluster. type ResourceID string type ResourceHealth struct { From 126a79da0e00aa45f883026a0254fc236832e66a Mon Sep 17 00:00:00 2001 From: Sergey Kanzhelev Date: Thu, 13 Jun 2024 17:26:02 +0000 Subject: [PATCH 19/19] added metrics to PRR --- .../README.md | 31 ++++++++++++++++--- 1 file changed, 26 insertions(+), 5 deletions(-) diff --git a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md index 5a423fd548e..1e2dc2df7a7 100644 --- a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md +++ b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md @@ -278,6 +278,12 @@ Planned tests: - Pod failed due to unhealthy device, earlier than device plugin detected it. Pod status is still updated. - Pod is in crash loop backoff due to unhealthy device - pod status is updated to unhealthy +For alpha rollout and rollback: + +- Fields dropped on update when feature gate is disabled +- Field is not populated after the feature gate is disabled +- Field is populated again when the feature gate is enabled + Test coverage will be listed once tests are implemented. 
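
To make the enablement and disablement checks concrete, the core assertion could look like this (pod name, namespace, and flow are illustrative; `AllocatedResourcesStatus` is the proposed field and does not exist in the client types yet):

```
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clientset "k8s.io/client-go/kubernetes"
)

// Sketch: with the ResourceHealthStatus gate enabled and a device pod
// running, the new field should be populated; after disabling the gate,
// subsequent status updates should drop it.
func verifyFieldPopulated(ctx context.Context, c clientset.Interface) error {
	pod, err := c.CoreV1().Pods("default").Get(ctx, "device-pod", metav1.GetOptions{})
	if err != nil {
		return err
	}
	if len(pod.Status.AllocatedResourcesStatus) == 0 {
		return fmt.Errorf("expected AllocatedResourcesStatus to be set while the feature gate is enabled")
	}
	return nil
}
```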
- <test>: <link to test coverage>

### Graduation Criteria

@@ -330,15 +336,20 @@ No

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

-Yes, with no side effect except of missing the new field in pod status.
+Yes, with no side effect other than the new field missing from pod status. Values written
+while the feature was enabled will remain on existing objects, but may be wiped on the next
+update request. They may also be ignored on reads.
+Re-enabling the feature is not guaranteed to preserve values written before the
+feature was disabled.

###### What happens if we reenable the feature if it was previously rolled back?

-The pod status will be updated again.
+The pod status will be updated again. Consistency is not guaranteed for fields written
+before the last enablement.

###### Are there any tests for feature enablement/disablement?

-Nothing is planned.
+Yes, see the e2e tests section.

### Rollout, Upgrade and Rollback Planning

@@ -348,7 +359,10 @@ No

###### What specific metrics should inform a rollback?

-N/A
+An increase in the API server error rate: `apiserver_request_total` filtered by `code` for non-`2xx` responses.
+An API validation error is the most likely indication of a problem.
+
+Potential errors on kubelet would likely surface as error logs and events on Pods.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

@@ -378,7 +392,14 @@ N/A

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

-N/A
+There are a few error modes for this feature:
+1. API issues accepting the new field, for example kubelet writing the field in a format the API server does not accept
+2. kubelet failing while populating the field
+
+The first error mode can be observed with the metric `apiserver_request_total` filtered by `code` for non-`2xx` responses.
+
+There is no good metric for the second error mode because it is not clear in advance which part of processing may fail.
+The most likely indication of an error would be an increased number of error events on the Pod.

### Dependencies