Adding new TSG for troubleshooting AKS node auto-repair errors #1629

Open · wants to merge 14 commits into base: main
@@ -0,0 +1,40 @@
---
title: Troubleshoot common node auto-repair errors
description: Troubleshoot scenarios where node auto-repair returns an error code when trying to repair your NotReady node.
ms.date: 10/01/2024
ms.reviewer:
ms.service: azure-kubernetes-service
#Customer intent: As an Azure Kubernetes user, I want to make sure the automatic repair actions from AKS node auto-repair do not cause any impacts on my applications or cluster health.
ms.custom: sap:Node/node pool availability and performance
---
# Troubleshoot common node auto-repair errors

When AKS detects a node that reports the NotReady status for more than 5 minutes, it attempts to repair the node automatically. For more information about this process, see [Node auto-repair](https://learn.microsoft.com/en-us/azure/aks/node-auto-repair).

During this process, AKS initiates reboot, reimage, and redeploy actions on the unhealthy node. These repair actions can fail because of an underlying cause, which results in an error code. This article discusses common errors, their potential causes and next steps, and best practices for monitoring node auto-repair.

## Prerequisites
To determine which type of node auto-repair error occurred, look for the following Kubernetes event:

`Node auto-repair [reboot/reimage/redeploy] action failed due to an operation failure: [error code].`
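As an illustrative sketch (not part of the article's guidance), the repair action and error code can be pulled out of such an event message with standard shell tools. The `msg` variable below is sample text standing in for output from `kubectl get events`:

```shell
# Sample event message in the format described above (illustrative, not live output)
msg='Node auto-repair reimage action failed due to an operation failure: VMExtensionProvisioningError'

# Extract the repair action (reboot/reimage/redeploy) and the error code
action=$(echo "$msg" | sed -n 's/^Node auto-repair \([a-z]*\) action failed.*/\1/p')
code=$(echo "$msg" | sed -n 's/.*operation failure: \(.*\)$/\1/p')

echo "action=$action code=$code"
```

In practice, you would pipe real event messages from `kubectl get events --all-namespaces` into the same extraction.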

## Common error codes
The following table lists the most common node auto-repair errors.

| Error code | Potential causes | Next steps |
|---|---|---|
| ClientSecretCredential authentication failed | | |
**@shanalily** commented on Oct 2, 2024:

> As mentioned on Teams, I think we can remove this one and add some other top errors, since this is an issue with misclassifying some MSI clusters and a fix is rolling out for it.

| ARM ErrorCode: VMExtensionProvisioningError | | |

| ARM ErrorCode: InvalidParameter | | |

> I think this is mostly an issue with spot VM node objects still existing for a little while after the VMs are preempted.

| scaleSetNameAndInstanceIDFromProviderID failed | | |

> Seems to be uninitialized nodes.

| ManagedIdentityCredential authentication failed | | |
> Looks good, as long as we have some common errors and next steps. This is good to go.

| ARM ErrorCode: VMRedeploymentFailed | | |
| ARM ErrorCode: TooManyVMRedeploymentRequests | | |
| ARM ErrorCode: OutboundConnectivityNotEnabledOnVMSS | | |
| ARM ErrorCode: NotFound | | |


## Best practices for monitoring

- By default, AKS stores Kubernetes events from the past hour only. We recommend enabling Container Insights to store events for up to 90 days. Container Insights also lets you query events and configure alerts so that you can quickly detect node auto-repair errors.
- Configure alerts so that you can quickly detect node auto-repair errors. To configure an alert on a specific event, see the instructions [here](LINK).
- Node auto-repair is a best-effort service and doesn't guarantee that your node is restored to the Ready status. We highly recommend that you actively monitor and alert on node NotReady issues, and that you conduct your own troubleshooting and resolution of these issues. For more information, see [basic troubleshooting of node NotReady issues](LINK).
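The monitoring guidance above can be sketched as a simple filter over event output. The sample lines below are illustrative only (the `NodeAutoRepair` reason and node names are made up); in practice you would pipe output from `kubectl get events --all-namespaces` or a Container Insights query into the same filter:

```shell
# Illustrative event lines shaped like `kubectl get events -A` output (not real data;
# the NodeAutoRepair reason and node names are hypothetical)
events='default  5m  Warning  NodeAutoRepair  node/aks-nodepool1-0  Node auto-repair reboot action failed due to an operation failure: NotFound
default  2m  Normal   NodeReady       node/aks-nodepool1-1  Node status is now: NodeReady'

# Keep only failed node auto-repair events; anything that survives the filter
# is a candidate for alerting
echo "$events" | grep 'auto-repair' | grep 'failed'
```

A scheduled job running this kind of filter against live event output is one lightweight way to surface failures until a proper alert rule is in place.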
2 changes: 2 additions & 0 deletions support/azure/azure-kubernetes/toc.yml
@@ -185,6 +185,8 @@
href: availability-performance/node-not-ready-after-being-healthy.md
- name: Node not ready but then recovers
href: availability-performance/node-not-ready-then-recovers.md
- name: Troubleshoot node auto-repair errors
href: availability-performance/node-auto-repair-errors.md
- name: Connectivity
items:
- name: Cannot connect to application hosted on AKS cluster
4 changes: 3 additions & 1 deletion support/azure/azure-kubernetes/welcome-azure-kubernetes.yml
@@ -124,6 +124,8 @@ landingContent:
url: ./node-not-ready-after-being-healthy.md
- text: Node not ready but then recovers
url: ./node-not-ready-then-recovers.md
- text: Troubleshoot node auto-repair errors
url: ./node-auto-repair-errors.md

# Card
- title: Cannot connect to application hosted on AKS cluster
@@ -189,4 +191,4 @@ landingContent:
- linkListType: how-to-guide
links:
- text: Troubleshoot common issues with Azure Linux container hosts on AKS
url: ./troubleshoot-common-azure-linux-aks.md
url: ./troubleshoot-common-azure-linux-aks.md