Commit
Add application and Azure troubleshooting codebundles (#225)
* Add app troubleshoot codebundle basis
* Refine env check output
* Add azure loadbalancer triage
* Add azure monitor codebundles
Showing 10 changed files with 621 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
# Azure LoadBalancer Triage | ||
|
||
Queries the activity logs of internal load balancer (AKS ingress) objects in Azure and optionally inspects internal AKS ingress objects if available. | ||
|
||
## Tasks | ||
`Health Check Internal Azure Load Balancer` | ||
|
||
## Configuration | ||
The TaskSet requires initialization to import necessary secrets, services, and user variables. The following variables should be set: | ||
|
||
- `AZ_USERNAME`: Azure service account username secret used to authenticate. | ||
- `AZ_CLIENT_SECRET`: Azure service account client secret used to authenticate. | ||
- `AZ_TENANT`: Azure tenant ID to authenticate against. | ||
- `AZ_HISTORY_RANGE`: The history range to inspect for incidents in the activity log, in hours. Defaults to 24 hours. | ||
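The history range above is turned into a start and end timestamp before the activity log is queried. The following is a minimal sketch of that step, not the codebundle's own code: the function name `activity_log_window` is hypothetical, GNU `date` is assumed for the `-d` option, and `-u` is added here so that the trailing `Z` really denotes UTC.

```shell
# Hypothetical helper: print the start and end of the activity log window.
# The argument is a number of hours; 24 matches the documented default.
activity_log_window() {
    local hours="${1:-24}"
    # Reject anything that is not a whole number before handing it to date.
    case "$hours" in
        ''|*[!0-9]*)
            echo "history range must be a whole number of hours, got: $hours" >&2
            return 1
            ;;
    esac
    local start end
    # -u keeps both timestamps in UTC so the trailing Z is accurate (GNU date assumed).
    start=$(date -u -d "${hours} hours ago" '+%Y-%m-%dT%H:%M:%SZ') || return 1
    end=$(date -u '+%Y-%m-%dT%H:%M:%SZ')
    printf '%s %s\n' "$start" "$end"
}

# Prints two ISO 8601 timestamps separated by a space.
activity_log_window "${AZ_HISTORY_RANGE:-24}"
```

The two values could then be passed as `--start-time` and `--end-time` to `az monitor activity-log list`, as the task in this commit does.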
|
||
## Requirements | ||
- A kubeconfig with appropriate RBAC permissions to perform the desired command. | ||
|
||
## TODO | ||
- [ ] Refine issues raised | ||
- [ ] Array support for issues | ||
- [ ] Look at cross az/kubectl for better triage | ||
- [ ] Add additional documentation. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,87 @@ | ||
*** Settings *** | ||
Documentation Triages issues related to Azure load balancers and their activity logs. | ||
Metadata Author jon-funk | ||
Metadata Display Name Azure Internal LoadBalancer Triage | ||
Metadata Supports Kubernetes,AKS,Azure | ||
|
||
Library BuiltIn | ||
Library RW.Core | ||
Library RW.CLI | ||
Library RW.platform | ||
Library OperatingSystem | ||
|
||
Suite Setup Suite Initialization | ||
|
||
|
||
*** Tasks *** | ||
Health Check Internal Azure Load Balancer | ||
[Documentation] Queries an Azure load balancer's activity logs to determine if it's in a healthy state. | ||
[Tags] load balancer azure | ||
${lb_id}= RW.CLI.Run Cli | ||
... cmd=az login --service-principal -u $${AZ_USERNAME.key} -p $${AZ_CLIENT_SECRET.key} --tenant $${AZ_TENANT.key} > /dev/null 2>&1 && az network lb list --query "[?name=='${AZ_LB_NAME}']" | jq -r '.[0].id' | ||
... secret__az_username=${AZ_USERNAME} | ||
... secret__az_client_secret=${AZ_CLIENT_SECRET} | ||
... secret__az_tenant=${AZ_TENANT} | ||
${activity_logs}= RW.CLI.Run Cli | ||
... cmd=START_TIME=$(date -d "${AZ_HISTORY_RANGE} hours ago" '+%Y-%m-%dT%H:%M:%SZ') && END_TIME=$(date '+%Y-%m-%dT%H:%M:%SZ') && az login --service-principal -u $${AZ_USERNAME.key} -p $${AZ_CLIENT_SECRET.key} --tenant $${AZ_TENANT.key} > /dev/null 2>&1 && az monitor activity-log list --start-time $START_TIME --end-time $END_TIME --query "[?resourceType.value=='MICROSOFT.NETWORK/loadbalancers' && resourceId=='${lb_id.stdout}']" | jq -r '.[] | [(.eventTimestamp // "N/A"), (.status.localizedValue // "N/A"), (.subStatus.localizedValue // "N/A"), (.properties.details // "N/A")] | @tsv' | while IFS=$'\t' read -r timestamp status substatus details; do printf "%-30s | %-30s | %-60s | %s\n" "$timestamp" "$status" "$substatus" "$details"; done | ||
... secret__az_username=${AZ_USERNAME} | ||
... secret__az_client_secret=${AZ_CLIENT_SECRET} | ||
... secret__az_tenant=${AZ_TENANT} | ||
${activity_logs_report}= Set Variable "Azure Load Balancer Health Report:" | ||
IF """${activity_logs.stdout}""" == "" | ||
${activity_logs_report}= Set Variable | ||
... "${activity_logs_report}\n\nNo activity log events could be pulled for this resource. If there are events, consider checking the configured time range." | ||
ELSE | ||
${activity_logs_report}= Set Variable | ||
... "${activity_logs_report}\ntimestamp status substatus details\n${activity_logs.stdout}" | ||
END | ||
RW.CLI.Parse Cli Output By Line | ||
... rsp=${activity_logs} | ||
... set_severity_level=2 | ||
... set_issue_expected=No activity logs indicating failures for the resource. | ||
... set_issue_actual=Found activity logs indicating the resource has recently experienced an error. | ||
... set_issue_title=Load Balancer Activity Log Indicates Recent Errors | ||
... set_issue_details=Activity Log History\n\n${activity_logs.stdout} | ||
... set_issue_next_steps=Run 'az aks get-credentials' and, with the credentials/context provided, use 'kubectl describe service -l service.beta.kubernetes.io/azure-load-balancer-internal=true' to get a list of services and inspect their selectors. If the selectors are correct, begin troubleshooting the resource the selectors point to. | ||
... _line__raise_issue_if_contains=Critical | ||
${history}= RW.CLI.Pop Shell History | ||
RW.Core.Add Pre To Report ${activity_logs_report} | ||
RW.Core.Add Pre To Report Commands Used: ${history} | ||
|
||
|
||
*** Keywords *** | ||
Suite Initialization | ||
${AZ_USERNAME}= RW.Core.Import Secret | ||
... AZ_USERNAME | ||
... type=string | ||
... description=The azure service principal user ID. | ||
... pattern=\w* | ||
${AZ_CLIENT_SECRET}= RW.Core.Import Secret | ||
... AZ_CLIENT_SECRET | ||
... type=string | ||
... description=The service principal client secret used to authenticate with azure. | ||
... pattern=\w* | ||
${AZ_TENANT}= RW.Core.Import Secret | ||
... AZ_TENANT | ||
... type=string | ||
... description=The azure tenant ID used by the service principal to authenticate with azure. | ||
... pattern=\w* | ||
${AZ_HISTORY_RANGE}= RW.Core.Import User Variable | ||
... AZ_HISTORY_RANGE | ||
... type=string | ||
... description=The range of history to check for incidents in the activity log, in hours. | ||
... pattern=\w* | ||
... default=24 | ||
... example=24 | ||
${AZ_LB_NAME}= RW.Core.Import User Variable | ||
... AZ_LB_NAME | ||
... type=string | ||
... description=The name of the Azure loadbalancer resource, used to map to activity log events. | ||
... pattern=\w* | ||
... example=kubernetes-internal | ||
Set Suite Variable ${AZ_USERNAME} ${AZ_USERNAME} | ||
Set Suite Variable ${AZ_CLIENT_SECRET} ${AZ_CLIENT_SECRET} | ||
Set Suite Variable ${AZ_TENANT} ${AZ_TENANT} | ||
Set Suite Variable ${AZ_HISTORY_RANGE} ${AZ_HISTORY_RANGE} | ||
Set Suite Variable ${AZ_LB_NAME} ${AZ_LB_NAME} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
# Azure Monitor Event Triage | ||
|
||
This codebundle queries for general activity log issues and raises them in a tabular report. | ||
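The tabular report mentioned above comes from piping tab-separated rows through a `printf` loop. The sketch below isolates that formatting step so it can be tried without Azure access. The function name `format_activity_rows` and the sample row are made up for illustration; the column widths copy the `printf` format used in this commit.

```shell
# Format tab-separated activity log rows as a fixed-width table.
# Column widths (30/30/60) mirror the printf format used in the commit's command.
format_activity_rows() {
    local tab
    tab=$(printf '\t')
    # Split on tabs only, so values containing spaces stay in one column.
    while IFS="$tab" read -r timestamp status substatus details; do
        printf '%-30s | %-30s | %-60s | %s\n' "$timestamp" "$status" "$substatus" "$details"
    done
}

# Made-up sample row for illustration only; real rows would come from jq's @tsv output.
printf '2024-01-01T00:00:00Z\tFailed\tConflict\tquota exceeded for resource\n' | format_activity_rows
```

Because a tab counts as whitespace in `IFS`, consecutive tabs collapse into one delimiter and an empty field would shift the later columns left. That is a likely reason the commit's `jq` filter substitutes `"N/A"` for missing values.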
|
||
## Tasks | ||
`Run Azure Monitor Activity Log Triage` | ||
|
||
## Configuration | ||
The TaskSet requires initialization to import necessary secrets, services, and user variables. The following variables should be set: | ||
|
||
- `AZ_USERNAME`: Azure service account username secret used to authenticate. | ||
- `AZ_CLIENT_SECRET`: Azure service account client secret used to authenticate. | ||
- `AZ_TENANT`: Azure tenant ID to authenticate against. | ||
- `AZ_HISTORY_RANGE`: The history range to inspect for incidents in the activity log, in hours. Defaults to 24 hours. | ||
|
||
## Requirements | ||
- The Azure service principal should have access to the Azure Monitor API. | ||
|
||
## TODO | ||
- [ ] Additional tasks | ||
- [ ] Refine next steps | ||
- [ ] Array support for issues | ||
- [ ] Add additional documentation. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,74 @@ | ||
*** Settings *** | ||
Documentation Triages Azure Monitor activity log events and raises them in a tabular report. | ||
Metadata Author jon-funk | ||
Metadata Display Name Azure Monitor Event Triage | ||
Metadata Supports Kubernetes,AKS,Azure | ||
|
||
Library BuiltIn | ||
Library RW.Core | ||
Library RW.CLI | ||
Library RW.platform | ||
Library OperatingSystem | ||
|
||
Suite Setup Suite Initialization | ||
|
||
|
||
*** Tasks *** | ||
Run Azure Monitor Activity Log Triage | ||
[Documentation] Queries the Azure Monitor activity log and raises an issue when critical events are found. | ||
[Tags] load balancer azure | ||
${activity_logs}= RW.CLI.Run Cli | ||
... cmd=START_TIME=$(date -d "${AZ_HISTORY_RANGE} hours ago" '+%Y-%m-%dT%H:%M:%SZ') && END_TIME=$(date '+%Y-%m-%dT%H:%M:%SZ') && az login --service-principal -u $${AZ_USERNAME.key} -p $${AZ_CLIENT_SECRET.key} --tenant $${AZ_TENANT.key} > /dev/null 2>&1 && az monitor activity-log list --start-time $START_TIME --end-time $END_TIME | jq -r '.[] | [(.eventTimestamp // "N/A"), (.status.localizedValue // "N/A"), (.subStatus.localizedValue // "N/A"), (.properties.details // "N/A")] | @tsv' | while IFS=$'\t' read -r timestamp status substatus details; do printf "%-30s | %-30s | %-60s | %s\n" "$timestamp" "$status" "$substatus" "$details"; done | ||
... secret__az_username=${AZ_USERNAME} | ||
... secret__az_client_secret=${AZ_CLIENT_SECRET} | ||
... secret__az_tenant=${AZ_TENANT} | ||
${activity_logs_report}= Set Variable "Azure Monitor Activity Log Report:" | ||
IF """${activity_logs.stdout}""" == "" | ||
${activity_logs_report}= Set Variable | ||
... "${activity_logs_report}\n\nNo activity log events could be pulled in the Azure Tenancy." | ||
ELSE | ||
${activity_logs_report}= Set Variable | ||
... "${activity_logs_report}\ntimestamp status substatus details\n${activity_logs.stdout}" | ||
END | ||
RW.CLI.Parse Cli Output By Line | ||
... rsp=${activity_logs} | ||
... set_severity_level=2 | ||
... set_issue_expected=No activity logs indicating failures for the resource. | ||
... set_issue_actual=Found activity logs indicating the resource has recently experienced an error. | ||
... set_issue_title=Azure Monitor Activity Log Indicates Recent Errors | ||
... set_issue_details=Activity Log History\n\n${activity_logs.stdout} | ||
... set_issue_next_steps=Inspect the status, substatus, and details of the activity log report for more details. | ||
... _line__raise_issue_if_contains=Critical | ||
${history}= RW.CLI.Pop Shell History | ||
RW.Core.Add Pre To Report ${activity_logs_report} | ||
RW.Core.Add Pre To Report Commands Used: ${history} | ||
|
||
|
||
*** Keywords *** | ||
Suite Initialization | ||
${AZ_USERNAME}= RW.Core.Import Secret | ||
... AZ_USERNAME | ||
... type=string | ||
... description=The azure service principal user ID. | ||
... pattern=\w* | ||
${AZ_CLIENT_SECRET}= RW.Core.Import Secret | ||
... AZ_CLIENT_SECRET | ||
... type=string | ||
... description=The service principal client secret used to authenticate with azure. | ||
... pattern=\w* | ||
${AZ_TENANT}= RW.Core.Import Secret | ||
... AZ_TENANT | ||
... type=string | ||
... description=The azure tenant ID used by the service principal to authenticate with azure. | ||
... pattern=\w* | ||
${AZ_HISTORY_RANGE}= RW.Core.Import User Variable | ||
... AZ_HISTORY_RANGE | ||
... type=string | ||
... description=The range of history to check for incidents in the activity log, in hours. | ||
... pattern=\w* | ||
... default=24 | ||
... example=24 | ||
Set Suite Variable ${AZ_USERNAME} ${AZ_USERNAME} | ||
Set Suite Variable ${AZ_CLIENT_SECRET} ${AZ_CLIENT_SECRET} | ||
Set Suite Variable ${AZ_TENANT} ${AZ_TENANT} | ||
Set Suite Variable ${AZ_HISTORY_RANGE} ${AZ_HISTORY_RANGE} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,57 @@ | ||
*** Settings *** | ||
Documentation Measures the count of error activity log entries as an SLI metric for the Azure tenancy. | ||
Metadata Author jon-funk | ||
Metadata Display Name Azure Monitor Activity Log SLI | ||
Metadata Supports Kubernetes,AKS,Azure | ||
|
||
Library BuiltIn | ||
Library RW.Core | ||
Library RW.CLI | ||
Library RW.platform | ||
Library OperatingSystem | ||
|
||
Suite Setup Suite Initialization | ||
|
||
|
||
*** Tasks *** | ||
Run Azure Monitor Activity Log Triage | ||
[Documentation] Counts activity log entries with a failed, error, critical, or in-progress status and pushes the count as the SLI metric. | ||
[Tags] load balancer azure | ||
${activity_logs_count}= RW.CLI.Run Cli | ||
... cmd=START_TIME=$(date -d "${AZ_HISTORY_RANGE} hours ago" '+%Y-%m-%dT%H:%M:%SZ') && END_TIME=$(date '+%Y-%m-%dT%H:%M:%SZ') && az login --service-principal -u $${AZ_USERNAME.key} -p $${AZ_CLIENT_SECRET.key} --tenant $${AZ_TENANT.key} > /dev/null 2>&1 && az monitor activity-log list --start-time $START_TIME --end-time $END_TIME --status Failed --status Error --status Critical --status "In Progress" | jq -r '. | length' | ||
... secret__az_username=${AZ_USERNAME} | ||
... secret__az_client_secret=${AZ_CLIENT_SECRET} | ||
... secret__az_tenant=${AZ_TENANT} | ||
${history}= RW.CLI.Pop Shell History | ||
Log Running: ${history} resulted in the following count: ${activity_logs_count.stdout} | ||
RW.Core.Push Metric ${activity_logs_count.stdout} | ||
|
||
|
||
*** Keywords *** | ||
Suite Initialization | ||
${AZ_USERNAME}= RW.Core.Import Secret | ||
... AZ_USERNAME | ||
... type=string | ||
... description=The azure service principal user ID. | ||
... pattern=\w* | ||
${AZ_CLIENT_SECRET}= RW.Core.Import Secret | ||
... AZ_CLIENT_SECRET | ||
... type=string | ||
... description=The service principal client secret used to authenticate with azure. | ||
... pattern=\w* | ||
${AZ_TENANT}= RW.Core.Import Secret | ||
... AZ_TENANT | ||
... type=string | ||
... description=The azure tenant ID used by the service principal to authenticate with azure. | ||
... pattern=\w* | ||
${AZ_HISTORY_RANGE}= RW.Core.Import User Variable | ||
... AZ_HISTORY_RANGE | ||
... type=string | ||
... description=The range of history to check for incidents in the activity log, in hours. | ||
... pattern=\w* | ||
... default=24 | ||
... example=24 | ||
Set Suite Variable ${AZ_USERNAME} ${AZ_USERNAME} | ||
Set Suite Variable ${AZ_CLIENT_SECRET} ${AZ_CLIENT_SECRET} | ||
Set Suite Variable ${AZ_TENANT} ${AZ_TENANT} | ||
Set Suite Variable ${AZ_HISTORY_RANGE} ${AZ_HISTORY_RANGE} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
# Kubernetes Application Troubleshoot | ||
|
||
This codebundle attempts to identify issues introduced by recent application code changes. It currently focuses on environment misconfigurations. | ||
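As a rough illustration of what scanning for environment misconfigurations can look like, the sketch below pulls candidate variable names out of log lines. It is a simplification, not the script's exact pipeline: the function name `extract_env_candidates` and the sample log lines are made up, while the `[A-Z_]{3,}` pattern is the one used in this commit.

```shell
# Print unique upper-case identifiers found in log lines that mention "env".
extract_env_candidates() {
    # LC_ALL=C keeps the [A-Z_] range limited to ASCII upper-case letters.
    grep -i 'env' | LC_ALL=C grep -oE '[A-Z_]{3,}' | LC_ALL=C sort -u
}

# Made-up log lines for illustration only.
printf '%s\n' \
    'error: environment variable DATABASE_URL is not set' \
    'info: request served in 12ms' \
    'warn: env lookup for API_TOKEN returned empty' | extract_env_candidates
```

The pattern matches any run of three or more upper-case letters or underscores, so upper-case log levels such as `ERROR` would also be reported. Filtering those out would be a reasonable refinement.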
|
||
## Tasks | ||
`Get Resource Logs` | ||
`Scan For Misconfigured Environment` | ||
|
||
## Configuration | ||
The TaskSet requires initialization to import necessary secrets, services, and user variables. The following variables should be set: | ||
|
||
- `kubeconfig`: The kubeconfig secret containing access info for the cluster. | ||
- `kubectl`: The location service used to interpret shell commands. Default value is `kubectl-service.shared`. | ||
- `KUBERNETES_DISTRIBUTION_BINARY`: Which binary to use for Kubernetes CLI commands. Default value is `kubectl`. | ||
- `CONTEXT`: The Kubernetes context to operate within. | ||
- `NAMESPACE`: The name of the namespace to search. Leave it blank to search in all namespaces. | ||
- `LABELS`: The labels used for resource selection, particularly for fetching logs. | ||
- `REPO_URI`: The URI for the git repo used to fetch source code, can be a GitHub URL. | ||
- `NUM_OF_COMMITS`: How many commits to search through into the past to identify potential problems. | ||
|
||
## Requirements | ||
- A kubeconfig with appropriate RBAC permissions to perform the desired command. | ||
|
||
## TODO | ||
- [ ] New keywords for code inspection | ||
- [ ] SPIKE for potential genAI integration | ||
- [ ] Add additional documentation. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,100 @@ | ||
#!/bin/bash | ||
# ----------------------------------------------------------------------------- | ||
# Script Information and Metadata | ||
# ----------------------------------------------------------------------------- | ||
# Author: @jon-funk | ||
# Description: This script checks a resource's logs for errors or potential problems | ||
# related to environment variables and attempts to pinpoint them to recent code changes in the repo. | ||
# ----------------------------------------------------------------------------- | ||
|
||
# Setup error handling | ||
set -Euo pipefail | ||
# Function to handle errors | ||
function handle_error() { | ||
local line_number=$1 | ||
local function_name=$2 | ||
local error_code=$3 | ||
echo "Error occurred in function '$function_name' at line $line_number with error code $error_code" | ||
} | ||
# Trap error signals to error handler function | ||
trap 'handle_error $LINENO $FUNCNAME $?' ERR | ||
|
||
# Check if kubectl is available | ||
if ! command -v kubectl &> /dev/null; then | ||
echo "kubectl command not found!" | ||
exit 1 | ||
fi | ||
|
||
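# Check that the required environment variables are set (defaults avoid an unbound-variable abort under set -u) | ||
if [ -z "${NAMESPACE:-}" ] || [ -z "${CONTEXT:-}" ] || [ -z "${LABELS:-}" ] || [ -z "${REPO_URI:-}" ] || [ -z "${NUM_OF_COMMITS:-}" ]; then | ||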
echo "Please set the NAMESPACE, LABELS, REPO_URI, NUM_OF_COMMITS and CONTEXT environment variables" | ||
exit 1 | ||
fi | ||
|
||
APPLOGS=$(kubectl -n ${NAMESPACE} --context ${CONTEXT} logs deployment,statefulset -l ${LABELS} --all-containers --tail=50 --limit-bytes=256000 | grep -i env || true) | ||
APP_REPO_PATH=/tmp/app_repo | ||
git clone $REPO_URI $APP_REPO_PATH || true | ||
cd $APP_REPO_PATH | ||
|
||
changes_to_investigate="" | ||
for word in $APPLOGS; do | ||
checkpath=$(echo "$word" | tr ' ' '\n' | xargs -I{} grep -rin "{}" | grep -E "environment|env" | grep -oE "[A-Z_]{3,}" | sort | uniq || true) | ||
changes_to_investigate+="${checkpath}\n" | ||
done; | ||
changes_to_investigate=$(echo -e $changes_to_investigate | sed 's/ /\n/g' | sort | uniq | sed 's/ /\n/g') | ||
# echo -e $changes_to_investigate | ||
|
||
GIT_URL=$(git remote get-url origin | sed -E 's/git@github.com:/https:\/\/github.com\//' | sed 's/\.git$//') | ||
BRANCH=$(git rev-parse --abbrev-ref HEAD) | ||
|
||
# Create git changes filter for final result | ||
MODIFIED_FILES=$(mktemp) | ||
for word in $changes_to_investigate; do | ||
git diff HEAD~$NUM_OF_COMMITS HEAD --name-only -S "$word" >> "$MODIFIED_FILES" | ||
done | ||
MODIFIED_FILES=$(cat "$MODIFIED_FILES" | sort | uniq) | ||
# echo -e $MODIFIED_FILES | sed 's/ /\n/g' | ||
|
||
# Temporary file to store results | ||
TEMPFILE=$(mktemp) | ||
# Search for the words and generate GitHub links with line numbers | ||
for word in $changes_to_investigate; do | ||
grep -rn "$word" . | while IFS=: read -r file line content; do | ||
if echo "$MODIFIED_FILES" | sed 's/ /\n/g' | grep -qF "$(basename $file)"; then | ||
echo "$GIT_URL/blob/$BRANCH/$file#L$line" >> "$TEMPFILE" | ||
fi | ||
done | ||
done | ||
|
||
# Sort, make unique and print the results | ||
sort "$TEMPFILE" | uniq | ||
|
||
if [[ -n "$changes_to_investigate" ]]; then | ||
echo -e "We found the following Environment variables in the logs, which may indicate a problem with them.\n" | ||
echo -e $(echo -e $changes_to_investigate | sed 's/ /\n/g') | ||
echo -e "\n\n" | ||
else | ||
echo -e "No potential environment variable issues were found in the recent logs." | ||
fi | ||
|
||
if [[ -n "$MODIFIED_FILES" ]]; then | ||
echo "They appear in the following files changed within the last $NUM_OF_COMMITS commits." | ||
echo -e $MODIFIED_FILES | sed 's/ /\n/g' | ||
echo -e "\n\n" | ||
else | ||
echo -e "No files were found that contain detected potential environment variable issues.\n" | ||
fi | ||
|
||
|
||
repo_links=$(cat "$TEMPFILE") | ||
# echo $repo_links | ||
echo "Suggested Next Steps:" | ||
if [[ -n "$repo_links" ]]; then | ||
echo "Investigate the following files for changes related to environment variables" | ||
echo -e "$repo_links" | ||
else | ||
echo "No root-cause files could be found with the available information; check the repo $REPO_URI manually." | ||
fi | ||
|
||
# Clean up the temporary file | ||
rm "$TEMPFILE" |
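The script above turns the clone's remote URL into browsable GitHub links with line anchors. A minimal sketch of that link-building step follows. The function names `normalize_github_url` and `build_blob_link` and the example repository are hypothetical. Unlike the script, the sketch strips the leading `./` that `grep -rn "$word" .` puts on each path.

```shell
# Convert an SSH-style GitHub remote into an HTTPS URL and strip a trailing ".git".
normalize_github_url() {
    printf '%s\n' "$1" | sed -E -e 's#^git@github\.com:#https://github.com/#' -e 's#\.git$##'
}

# Build a link to a specific line of a file on a branch.
# $1 = repo URL, $2 = branch, $3 = path relative to the repo root, $4 = line number
build_blob_link() {
    printf '%s/blob/%s/%s#L%s\n' "$(normalize_github_url "$1")" "$2" "${3#./}" "$4"
}

# Hypothetical repository and file, for illustration only.
build_blob_link 'git@github.com:example-org/example-repo.git' main ./src/app.py 42
```

Escaping the dot in `\.git$` matters: an unescaped `.` matches any character, so a repository whose name merely ends in "git" would lose its last four characters.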