From 37d390ab805043c4ab85257d9550e3c4c9f3fd69 Mon Sep 17 00:00:00 2001
From: Sergey Kanzhelev
Date: Fri, 31 May 2024 21:50:43 +0000
Subject: [PATCH] PRR filled up

---
 .../README.md | 512 +++---------------
 1 file changed, 79 insertions(+), 433 deletions(-)

diff --git a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md
index 8f6db4c6b9ec..c8bd0f26e1ee 100644
--- a/keps/sig-node/4680-add-resource-health-to-pod-status/README.md
+++ b/keps/sig-node/4680-add-resource-health-to-pod-status/README.md
@@ -8,18 +8,22 @@
   - [Non-Goals](#non-goals)
 - [Proposal](#proposal)
   - [PodStatus.AllocatedResourcesStatus](#podstatusallocatedresourcesstatus)
-  - [Device Plugin implementation details](#device-plugin-implementation-details)
   - [User Stories (Optional)](#user-stories-optional)
     - [Story 1](#story-1)
   - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
   - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
+  - [Device Plugin implementation details](#device-plugin-implementation-details)
+  - [DRA implementation details](#dra-implementation-details)
   - [Test Plan](#test-plan)
       - [Prerequisite testing updates](#prerequisite-testing-updates)
       - [Unit tests](#unit-tests)
       - [Integration tests](#integration-tests)
       - [e2e tests](#e2e-tests)
   - [Graduation Criteria](#graduation-criteria)
+    - [Alpha](#alpha)
+    - [Beta](#beta)
+    - [GA](#ga)
   - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
   - [Version Skew Strategy](#version-skew-strategy)
 - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
@@ -154,17 +158,6 @@ We may consider this as a future improvement.
 We may consider this as a future improvement.
-### Device Plugin implementation details
-
-Kubelet already keeps track of healthy and unhealthy devices as well as the mapping of those devices to Pods.
-
-One improvement will be needed is to distinguish unhealthy devices (marked unhealthy explicitly) and when device plugin was unregistered.
-
-NVIDIA device plugin has the checkHealth implementation: https://github.com/NVIDIA/k8s-device-plugin/blob/eb3a709b1dd82280d5acfb85e1e942024ddfcdc6/internal/rm/health.go#L39 that has more information than simple “Unhealthy”.
-
-We should consider introducing another field to the Status that will be a free form error information as a future improvement.
-
-
 ### User Stories (Optional)
 #### Story 1
@@ -186,562 +179,219 @@ This might be a good place to talk about core concepts and how they relate.
 ### Risks and Mitigations
-
+Kubelet already keeps track of healthy and unhealthy devices as well as the mapping of those devices to Pods.
-## Design Details
+One needed improvement is to distinguish devices explicitly marked unhealthy from devices whose device plugin was unregistered.
-
+The NVIDIA device plugin has a checkHealth implementation (https://github.com/NVIDIA/k8s-device-plugin/blob/eb3a709b1dd82280d5acfb85e1e942024ddfcdc6/internal/rm/health.go#L39) that reports more information than a simple “Unhealthy”.
-### Test Plan
+As a future improvement, we should consider adding another field to the Status that carries free-form error information.
+### DRA implementation details
-
+Kubelet will react to this field the same way we propose for the Device Plugin.
-[ ] I/we understand the owners of the involved components may require updates to
+### Test Plan
+
+[X] I/we understand the owners of the involved components may require updates to
 existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
 ##### Prerequisite testing updates
-
+Device Plugin and DRA are relatively new features and have reasonable test coverage.
 ##### Unit tests
-
-
-
-
-- ``: `` - ``
+- `k8s.io/kubernetes/pkg/kubelet/cm/devicemanager`: `5/31/2024` - `84.1`
+- `k8s.io/kubernetes/pkg/kubelet/cm/dra`: `5/31/2024` - `59.2`
+- `k8s.io/kubernetes/pkg/kubelet/cm/dra/plugin`: `5/31/2024` - `34`
+- `k8s.io/kubernetes/pkg/kubelet/cm/dra/state`: `5/31/2024` - `98`
 ##### Integration tests
-
-
-
-
-- :
+N/A
 ##### e2e tests
-
+Test coverage will be listed once tests are implemented.
 - :
 ### Graduation Criteria
-
+- Feedback is collected on usability of the field
+- Example of real-world usage with one of the device plugins, for example the NVIDIA Device Plugin
 ### Upgrade / Downgrade Strategy
-
+The feature exposes a new field based on information the Device Plugin already exposes. There is no dependency on upgrade/downgrade; the feature will either work or not.
+
+The DRA implementation requires a change to the DRA interfaces. DRA is in alpha and under active development. The feature will follow the DRA upgrade/downgrade strategy.
 ### Version Skew Strategy
-
+There is no issue with version skew. The kubelet that exposes this field will
+always be the same version as or behind the API server, which introduces this new field.
 ## Production Readiness Review Questionnaire
-
-
 ### Feature Enablement and Rollback
-
+Changing the feature gate will either enable or disable this feature.
 ###### How can this feature be enabled / disabled in a live cluster?
-
-- [ ] Feature gate (also fill in values in `kep.yaml`)
-  - Feature gate name:
-  - Components depending on the feature gate:
-- [ ] Other
-  - Describe the mechanism:
-  - Will enabling / disabling the feature require downtime of the control
-    plane?
-  - Will enabling / disabling the feature require downtime or reprovisioning
-    of a node?
+- [X] Feature gate (also fill in values in `kep.yaml`)
+  - Feature gate name: `ResourceHealthStatus`
+  - Components depending on the feature gate: `kubelet` and `kube-apiserver`
 ###### Does enabling the feature change any default behavior?
-
+No
 ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
-
+Yes, with no side effects other than the new field missing from the pod status.
 ###### What happens if we reenable the feature if it was previously rolled back?
+The pod status will be updated again.
+
 ###### Are there any tests for feature enablement/disablement?
-
+Nothing is planned.
 ### Rollout, Upgrade and Rollback Planning
-
-
 ###### How can a rollout or rollback fail? Can it impact already running workloads?
-
+No
 ###### What specific metrics should inform a rollback?
-
+N/A
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
-
+Will be tested, but we do not expect any issues.
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
-
+No
 ### Monitoring Requirements
-
-
 ###### How can an operator determine if the feature is in use by workloads?
-
+Check the Pod Status.
 ###### How can someone using this feature know that it is working for their instance?
-
-
-- [ ] Events
-  - Event Reason:
-- [ ] API .status
-  - Condition name:
-  - Other field:
-- [ ] Other (treat as last resort)
-  - Details:
+- [X] API pod.status
 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
-
-
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
-
-
-- [ ] Metrics
-  - Metric name:
-  - [Optional] Aggregation method:
-  - Components exposing the metric:
-- [ ] Other (treat as last resort)
-  - Details:
+N/A
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
-
+N/A
 ### Dependencies
-
+DRA implementation.
 ###### Does this feature depend on any specific services running in the cluster?
-
+No
 ### Scalability
-
-
 ###### Will enabling / using this feature result in any new API calls?
-
+Pod Status size will increase insignificantly.
 ###### Will enabling / using this feature result in introducing new API types?
-
+New field on Pod Status.
 ###### Will enabling / using this feature result in any new calls to the cloud provider?
-
+No
 ###### Will enabling / using this feature result in increasing size or count of the existing API objects?
-
+Pod Status size will increase insignificantly.
 ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
-
+No
 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
-
+Not significantly. We already keep all the collections in memory and only need to connect the dots.
 ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
-
+No
 ### Troubleshooting
-
-
 ###### How does this feature react if the API server and/or etcd is unavailable?
+N/A
+
 ###### What are other known failure modes?
-
+Not applicable.
 ###### What steps should be taken if SLOs are not being met to determine the problem?
 ## Implementation History
-
+- `v1.31`: KEP is in alpha
 ## Drawbacks
-
+Not that we can think of.
 ## Alternatives
@@ -755,8 +405,4 @@ There are a few alternatives to this proposal.
 ## Infrastructure Needed (Optional)
-
+We may need to update the sample device plugin. No special infrastructure is needed, as emulating real GPU failures or failures in other devices is not practical.