Adding new TSG for troubleshooting AKS node auto-repair errors #1629
base: main
Changes from 6 commits
@@ -0,0 +1,40 @@
---
title: Troubleshoot common node auto-repair errors
description: Troubleshoot scenarios where node auto-repair returns an error code when trying to repair your NotReady node.
ms.date: 10/01/2024
ms.reviewer:
ms.service: azure-kubernetes-service
#Customer intent: As an Azure Kubernetes user, I want to make sure the automatic repair actions from AKS node auto-repair don't impact my applications or cluster health.
ms.custom: sap:Node/node pool availability and performance
---

# Troubleshoot common node auto-repair errors

When AKS detects a node that reports the NotReady status for more than 5 minutes, it attempts to automatically repair the node. To learn more about the node auto-repair process, see [Node auto-repair](https://learn.microsoft.com/en-us/azure/aks/node-auto-repair).

During this process, AKS initiates reboot, reimage, and redeploy actions on the unhealthy node. These repair actions can fail because of an underlying cause, resulting in an error code. This article discusses common errors, their potential causes, and next steps, as well as best practices for monitoring node auto-repair.

## Prerequisites

To determine which type of node auto-repair error occurred, look for the following Kubernetes event:

"Node auto-repair [reboot/reimage/redeploy] action failed due to an operation failure: [error code]."

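A minimal sketch of surfacing this event from the command line, assuming you have kubectl access to the cluster; the filter is a plain case-insensitive match on the event text described above, since the exact event reason field isn't specified in this article:

```shell
# List recent cluster events, newest last, and keep only node
# auto-repair entries (matches the event message format above).
kubectl get events --all-namespaces --sort-by=.lastTimestamp \
  | grep -i "node auto-repair"
```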
## Common error codes

The following table lists the most common node auto-repair errors.

| Error code | Potential causes | Next steps |
|---|---|---|
| ClientSecretCredential authentication failed | | |
| ARM ErrorCode: VMExtensionProvisioningError | | |
> **Review comment:** Probably relevant sometimes: https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/availability-performance/node-not-ready-custom-script-extension-errors. Sometimes this, or the other specific CSE exit code TSGs: https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/create-upgrade-delete/error-code-outboundconnfailvmextensionerror
| ARM ErrorCode: InvalidParameter | | |
> **Review comment:** I think this is mostly an issue with spot VM node objects still existing for a little while after the VMs are preempted.
| scaleSetNameAndInstanceIDFromProviderID failed | | |
> **Review comment:** Seems to be uninitialized nodes.
| ManagedIdentityCredential authentication failed | | |
> **Review comment:** Looks good, as long as we have some common errors and next steps. This is good to go.
||
| ARM ErrorCode: VMRedeploymentFailed | | | | ||
| ARM ErrorCode: TooManyVMRedeploymentRequests | | | | ||
| ARM ErrorCode: OutboundConnectivityNotEnabledOnVMSS | | | | ||
| ARM ErrorCode: NotFound | | | | ||
|
||
|
||
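Whichever error you hit, a reasonable first step is to look at the unhealthy node itself. A sketch, assuming kubectl access; `<node-name>` is a placeholder for the NotReady node's name:

```shell
# List nodes and their readiness status to find the NotReady node.
kubectl get nodes

# Inspect the node's conditions and recent events for the underlying cause
# (<node-name> is a placeholder).
kubectl describe node <node-name>
```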
## Best practices for monitoring

- By default, AKS stores Kubernetes events from the past hour. We recommend enabling Container Insights to store events for up to 90 days. Enabling Container Insights also lets you query events and configure alerts to quickly detect node auto-repair errors.
- Configure alerts to quickly detect when node auto-repair errors occur. To configure an alert on a specific event, see the instructions [here](LINK).
- Node auto-repair is a best-effort service and doesn't guarantee that your node will be restored to the Ready status. We highly recommend that you actively monitor and alert on node NotReady issues, and conduct your own troubleshooting and resolution of these issues. For more information, see [basic troubleshooting of node NotReady issues](LINK).
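With Container Insights enabled, auto-repair events land in the `KubeEvents` table, so they can be queried from the Azure CLI. A sketch under those assumptions; `<workspace-guid>` is a placeholder for your Log Analytics workspace ID, and the message filter reuses the event text described earlier:

```shell
# Query Container Insights for node auto-repair events.
# Requires Container Insights; <workspace-guid> is a placeholder.
az monitor log-analytics query \
  --workspace <workspace-guid> \
  --analytics-query 'KubeEvents
    | where Message contains "Node auto-repair"
    | project TimeGenerated, Name, Reason, Message
    | order by TimeGenerated desc'
```

A query like this can also back an Azure Monitor log alert rule, which covers the alerting recommendation above.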
> **Review comment:** As mentioned on Teams, I think we can remove this one and add some other top errors, since this is an issue with misclassifying some MSI clusters, for which there's a fix rolling out.