Add application and Azure troubleshooting codebundles (#225)
* Add app troubleshoot codebundle basis

* Refine env check output

* Add azure loadbalancer triage

* Add azure monitor codebundles
jon-funk authored Oct 30, 2023
1 parent 6d6b459 commit 28f452a
Showing 10 changed files with 621 additions and 0 deletions.
24 changes: 24 additions & 0 deletions codebundles/azure-loadbalancer-triage/README.md
@@ -0,0 +1,24 @@
# Azure LoadBalancer Triage

Queries the activity logs of internal load balancers (AKS ingress) in Azure and optionally inspects internal AKS ingress objects if available.

## Tasks
`Health Check Internal Azure Load Balancer`

## Configuration
The TaskSet requires initialization to import necessary secrets, services, and user variables. The following variables should be set:

- `AZ_USERNAME`: Azure service account username secret used to authenticate.
- `AZ_CLIENT_SECRET`: Azure service account client secret used to authenticate.
- `AZ_TENANT`: Azure tenant ID to authenticate against.
- `AZ_HISTORY_RANGE`: The history range to inspect for incidents in the activity log, in hours. Defaults to 24 hours.
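
As a rough illustration of how `AZ_HISTORY_RANGE` maps to a query window, the sketch below turns a number of hours into the UTC start and end timestamps that `az monitor activity-log list` accepts through `--start-time` and `--end-time`. The `history_window` helper and its optional "now" argument are assumptions made for this example and are not part of the codebundle; it also assumes GNU `date`.

```shell
# Sketch only: history_window is a hypothetical helper, not part of the codebundle.
# Usage: history_window HOURS [NOW_EPOCH]
# Prints "START END" as UTC ISO 8601 timestamps. Assumes GNU date (-d "@epoch").
history_window() {
    local hours="$1"
    local now="${2:-$(date +%s)}"
    local start=$(( now - hours * 3600 ))
    local fmt='+%Y-%m-%dT%H:%M:%SZ'
    printf '%s %s\n' "$(date -u -d "@${start}" "$fmt")" "$(date -u -d "@${now}" "$fmt")"
}

# Example: the default 24-hour window ending now.
window=$(history_window "${AZ_HISTORY_RANGE:-24}")
START_TIME=${window% *}
END_TIME=${window#* }
echo "Would query from ${START_TIME} to ${END_TIME}"
```

Passing `-u` to `date` keeps the trailing `Z` honest, since that suffix denotes UTC.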

## Requirements
- A kubeconfig with appropriate RBAC permissions to perform the desired command.

## TODO
- [ ] Refine issues raised
- [ ] Array support for issues
- [ ] Look at cross az/kubectl for better triage
- [ ] Add additional documentation.

87 changes: 87 additions & 0 deletions codebundles/azure-loadbalancer-triage/runbook.robot
@@ -0,0 +1,87 @@
*** Settings ***
Documentation       Triages issues related to Azure Load Balancers and their activity logs.
Metadata            Author    jon-funk
Metadata            Display Name    Azure Internal LoadBalancer Triage
Metadata            Supports    Kubernetes,AKS,Azure

Library             BuiltIn
Library             RW.Core
Library             RW.CLI
Library             RW.platform
Library             OperatingSystem

Suite Setup         Suite Initialization


*** Tasks ***
Health Check Internal Azure Load Balancer
    [Documentation]    Queries an Azure Load Balancer's activity log to determine whether it is in a healthy state.
    [Tags]    load balancer    azure
    ${lb_id}=    RW.CLI.Run Cli
    ...    cmd=az login --service-principal -u $${AZ_USERNAME.key} -p $${AZ_CLIENT_SECRET.key} --tenant $${AZ_TENANT.key} > /dev/null 2>&1 && az network lb list --query "[?name=='${AZ_LB_NAME}']" | jq -r '.[0].id'
    ...    secret__az_username=${AZ_USERNAME}
    ...    secret__az_client_secret=${AZ_CLIENT_SECRET}
    ...    secret__az_tenant=${AZ_TENANT}
    ${activity_logs}=    RW.CLI.Run Cli
    ...    cmd=START_TIME=$(date -u -d "${AZ_HISTORY_RANGE} hours ago" '+%Y-%m-%dT%H:%M:%SZ') && END_TIME=$(date -u '+%Y-%m-%dT%H:%M:%SZ') && az login --service-principal -u $${AZ_USERNAME.key} -p $${AZ_CLIENT_SECRET.key} --tenant $${AZ_TENANT.key} > /dev/null 2>&1 && az monitor activity-log list --start-time $START_TIME --end-time $END_TIME --query "[?resourceType.value=='MICROSOFT.NETWORK/loadbalancers' && resourceId=='${lb_id.stdout}']" | jq -r '.[] | [(.eventTimestamp // "N/A"), (.status.localizedValue // "N/A"), (.subStatus.localizedValue // "N/A"), (.properties.details // "N/A")] | @tsv' | while IFS=$'\t' read -r timestamp status substatus details; do printf "%-30s | %-30s | %-60s | %s\n" "$timestamp" "$status" "$substatus" "$details"; done
    ...    secret__az_username=${AZ_USERNAME}
    ...    secret__az_client_secret=${AZ_CLIENT_SECRET}
    ...    secret__az_tenant=${AZ_TENANT}
    ${activity_logs_report}=    Set Variable    "Azure Load Balancer Health Report:"
    IF    """${activity_logs.stdout}""" == ""
        ${activity_logs_report}=    Set Variable
        ...    "${activity_logs_report}\n\nNo activity log events could be pulled for this resource. If there are events, consider checking the configured time range."
    ELSE
        ${activity_logs_report}=    Set Variable
        ...    "${activity_logs_report}\ntimestamp status substatus details\n${activity_logs.stdout}"
    END
    RW.CLI.Parse Cli Output By Line
    ...    rsp=${activity_logs}
    ...    set_severity_level=2
    ...    set_issue_expected=No activity logs indicating failures for the resource.
    ...    set_issue_actual=Found activity logs indicating the resource has recently experienced an error.
    ...    set_issue_title=Load Balancer Activity Log Indicates Recent Errors
    ...    set_issue_details=Activity Log History\n\n${activity_logs.stdout}
    ...    set_issue_next_steps=Run 'az aks get-credentials' and, with the credentials/context provided, use 'kubectl describe service -l service.beta.kubernetes.io/azure-load-balancer-internal=true' to get a list of services and inspect their selectors. If the selectors are correct, begin troubleshooting the resource the selectors point to.
    ...    _line__raise_issue_if_contains=Critical
    ${history}=    RW.CLI.Pop Shell History
    RW.Core.Add Pre To Report    ${activity_logs_report}
    RW.Core.Add Pre To Report    Commands Used: ${history}


*** Keywords ***
Suite Initialization
    ${AZ_USERNAME}=    RW.Core.Import Secret
    ...    AZ_USERNAME
    ...    type=string
    ...    description=The azure service principal user ID.
    ...    pattern=\w*
    ${AZ_CLIENT_SECRET}=    RW.Core.Import Secret
    ...    AZ_CLIENT_SECRET
    ...    type=string
    ...    description=The service principal client secret used to authenticate with azure.
    ...    pattern=\w*
    ${AZ_TENANT}=    RW.Core.Import Secret
    ...    AZ_TENANT
    ...    type=string
    ...    description=The azure tenant ID used by the service principal to authenticate with azure.
    ...    pattern=\w*
    ${AZ_HISTORY_RANGE}=    RW.Core.Import User Variable
    ...    AZ_HISTORY_RANGE
    ...    type=string
    ...    description=The range of history to check for incidents in the activity log, in hours.
    ...    pattern=\w*
    ...    default=24
    ...    example=24
    ${AZ_LB_NAME}=    RW.Core.Import User Variable
    ...    AZ_LB_NAME
    ...    type=string
    ...    description=The name of the Azure loadbalancer resource, used to map to activity log events.
    ...    pattern=\w*
    ...    example=kubernetes-internal
    Set Suite Variable    ${AZ_USERNAME}    ${AZ_USERNAME}
    Set Suite Variable    ${AZ_CLIENT_SECRET}    ${AZ_CLIENT_SECRET}
    Set Suite Variable    ${AZ_TENANT}    ${AZ_TENANT}
    Set Suite Variable    ${AZ_HISTORY_RANGE}    ${AZ_HISTORY_RANGE}
    Set Suite Variable    ${AZ_LB_NAME}    ${AZ_LB_NAME}
23 changes: 23 additions & 0 deletions codebundles/azure-monitor-event-triage/README.md
@@ -0,0 +1,23 @@
# Azure Monitor Event Triage

This codebundle queries for general activity log issues and raises them in a tabular report.
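
The tabular report can be pictured with the following sketch, which pads tab-separated rows into fixed-width, pipe-delimited columns in the same spirit as the runbook's `printf` loop. The `format_rows` name and the sample row are hypothetical and only serve the illustration.

```shell
# Sketch only: format_rows is a hypothetical helper name.
# Reads tab-separated rows (timestamp, status, substatus, details) on stdin
# and prints them as fixed-width, pipe-delimited columns.
format_rows() {
    local tab
    tab=$(printf '\t')
    while IFS="$tab" read -r timestamp status substatus details; do
        printf '%-30s | %-30s | %-60s | %s\n' "$timestamp" "$status" "$substatus" "$details"
    done
}

# Hypothetical sample row; real rows would come from the activity log query.
printf '2023-10-30T12:00:00Z\tFailed\tN/A\tExample details\n' | format_rows
```

Because a tab counts as whitespace in `IFS`, `read` collapses consecutive tabs, so an empty field would shift the remaining columns to the left. Substituting a placeholder such as `N/A` for missing values helps avoid that.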

## Tasks
`Run Azure Monitor Activity Log Triage`

## Configuration
The TaskSet requires initialization to import necessary secrets, services, and user variables. The following variables should be set:

- `AZ_USERNAME`: Azure service account username secret used to authenticate.
- `AZ_CLIENT_SECRET`: Azure service account client secret used to authenticate.
- `AZ_TENANT`: Azure tenant ID to authenticate against.
- `AZ_HISTORY_RANGE`: The history range to inspect for incidents in the activity log, in hours. Defaults to 24 hours.

## Requirements
- The Azure service principal should have access to the Azure Monitor API.

## TODO
- [ ] Additional tasks
- [ ] Refine next steps
- [ ] Array support for issues
- [ ] Add additional documentation.
74 changes: 74 additions & 0 deletions codebundles/azure-monitor-event-triage/runbook.robot
@@ -0,0 +1,74 @@
*** Settings ***
Documentation       Triages Azure Monitor activity log events and raises any issues found in a tabular report.
Metadata            Author    jon-funk
Metadata            Display Name    Azure Monitor Event Triage
Metadata            Supports    Kubernetes,AKS,Azure

Library             BuiltIn
Library             RW.Core
Library             RW.CLI
Library             RW.platform
Library             OperatingSystem

Suite Setup         Suite Initialization


*** Tasks ***
Run Azure Monitor Activity Log Triage
    [Documentation]    Queries the Azure Monitor activity log for the configured time range and raises an issue if critical events are found.
    [Tags]    load balancer    azure
    ${activity_logs}=    RW.CLI.Run Cli
    ...    cmd=START_TIME=$(date -u -d "${AZ_HISTORY_RANGE} hours ago" '+%Y-%m-%dT%H:%M:%SZ') && END_TIME=$(date -u '+%Y-%m-%dT%H:%M:%SZ') && az login --service-principal -u $${AZ_USERNAME.key} -p $${AZ_CLIENT_SECRET.key} --tenant $${AZ_TENANT.key} > /dev/null 2>&1 && az monitor activity-log list --start-time $START_TIME --end-time $END_TIME | jq -r '.[] | [(.eventTimestamp // "N/A"), (.status.localizedValue // "N/A"), (.subStatus.localizedValue // "N/A"), (.properties.details // "N/A")] | @tsv' | while IFS=$'\t' read -r timestamp status substatus details; do printf "%-30s | %-30s | %-60s | %s\n" "$timestamp" "$status" "$substatus" "$details"; done
    ...    secret__az_username=${AZ_USERNAME}
    ...    secret__az_client_secret=${AZ_CLIENT_SECRET}
    ...    secret__az_tenant=${AZ_TENANT}
    ${activity_logs_report}=    Set Variable    "Azure Monitor Activity Log Report:"
    IF    """${activity_logs.stdout}""" == ""
        ${activity_logs_report}=    Set Variable
        ...    "${activity_logs_report}\n\nNo activity log events could be pulled in the Azure Tenancy."
    ELSE
        ${activity_logs_report}=    Set Variable
        ...    "${activity_logs_report}\ntimestamp status substatus details\n${activity_logs.stdout}"
    END
    RW.CLI.Parse Cli Output By Line
    ...    rsp=${activity_logs}
    ...    set_severity_level=2
    ...    set_issue_expected=No activity logs indicating failures for the resource.
    ...    set_issue_actual=Found activity logs indicating the resource has recently experienced an error.
    ...    set_issue_title=Azure Monitor Activity Log Indicates Recent Errors
    ...    set_issue_details=Activity Log History\n\n${activity_logs.stdout}
    ...    set_issue_next_steps=Inspect the status, substatus, and details of the activity log report for more details.
    ...    _line__raise_issue_if_contains=Critical
    ${history}=    RW.CLI.Pop Shell History
    RW.Core.Add Pre To Report    ${activity_logs_report}
    RW.Core.Add Pre To Report    Commands Used: ${history}


*** Keywords ***
Suite Initialization
    ${AZ_USERNAME}=    RW.Core.Import Secret
    ...    AZ_USERNAME
    ...    type=string
    ...    description=The azure service principal user ID.
    ...    pattern=\w*
    ${AZ_CLIENT_SECRET}=    RW.Core.Import Secret
    ...    AZ_CLIENT_SECRET
    ...    type=string
    ...    description=The service principal client secret used to authenticate with azure.
    ...    pattern=\w*
    ${AZ_TENANT}=    RW.Core.Import Secret
    ...    AZ_TENANT
    ...    type=string
    ...    description=The azure tenant ID used by the service principal to authenticate with azure.
    ...    pattern=\w*
    ${AZ_HISTORY_RANGE}=    RW.Core.Import User Variable
    ...    AZ_HISTORY_RANGE
    ...    type=string
    ...    description=The range of history to check for incidents in the activity log, in hours.
    ...    pattern=\w*
    ...    default=24
    ...    example=24
    Set Suite Variable    ${AZ_USERNAME}    ${AZ_USERNAME}
    Set Suite Variable    ${AZ_CLIENT_SECRET}    ${AZ_CLIENT_SECRET}
    Set Suite Variable    ${AZ_TENANT}    ${AZ_TENANT}
    Set Suite Variable    ${AZ_HISTORY_RANGE}    ${AZ_HISTORY_RANGE}
57 changes: 57 additions & 0 deletions codebundles/azure-monitor-event-triage/sli.robot
@@ -0,0 +1,57 @@
*** Settings ***
Documentation       Measures the count of error activity log entries as a SLI metric for the Azure tenancy.
Metadata            Author    jon-funk
Metadata            Display Name    Azure Monitor Activity Log SLI
Metadata            Supports    Kubernetes,AKS,Azure

Library             BuiltIn
Library             RW.Core
Library             RW.CLI
Library             RW.platform
Library             OperatingSystem

Suite Setup         Suite Initialization


*** Tasks ***
Run Azure Monitor Activity Log Triage
    [Documentation]    Counts activity log entries with a Failed, Error, Critical, or In Progress status in the configured time range and pushes the count as the SLI metric.
    [Tags]    load balancer    azure
    ${activity_logs_count}=    RW.CLI.Run Cli
    ...    cmd=START_TIME=$(date -u -d "${AZ_HISTORY_RANGE} hours ago" '+%Y-%m-%dT%H:%M:%SZ') && END_TIME=$(date -u '+%Y-%m-%dT%H:%M:%SZ') && az login --service-principal -u $${AZ_USERNAME.key} -p $${AZ_CLIENT_SECRET.key} --tenant $${AZ_TENANT.key} > /dev/null 2>&1 && az monitor activity-log list --start-time $START_TIME --end-time $END_TIME --status Failed --status Error --status Critical --status "In Progress" | jq -r '. | length'
    ...    secret__az_username=${AZ_USERNAME}
    ...    secret__az_client_secret=${AZ_CLIENT_SECRET}
    ...    secret__az_tenant=${AZ_TENANT}
    ${history}=    RW.CLI.Pop Shell History
    Log    Running: ${history} resulted in the following count: ${activity_logs_count}
    RW.Core.Push Metric    ${activity_logs_count}


*** Keywords ***
Suite Initialization
    ${AZ_USERNAME}=    RW.Core.Import Secret
    ...    AZ_USERNAME
    ...    type=string
    ...    description=The azure service principal user ID.
    ...    pattern=\w*
    ${AZ_CLIENT_SECRET}=    RW.Core.Import Secret
    ...    AZ_CLIENT_SECRET
    ...    type=string
    ...    description=The service principal client secret used to authenticate with azure.
    ...    pattern=\w*
    ${AZ_TENANT}=    RW.Core.Import Secret
    ...    AZ_TENANT
    ...    type=string
    ...    description=The azure tenant ID used by the service principal to authenticate with azure.
    ...    pattern=\w*
    ${AZ_HISTORY_RANGE}=    RW.Core.Import User Variable
    ...    AZ_HISTORY_RANGE
    ...    type=string
    ...    description=The range of history to check for incidents in the activity log, in hours.
    ...    pattern=\w*
    ...    default=24
    ...    example=24
    Set Suite Variable    ${AZ_USERNAME}    ${AZ_USERNAME}
    Set Suite Variable    ${AZ_CLIENT_SECRET}    ${AZ_CLIENT_SECRET}
    Set Suite Variable    ${AZ_TENANT}    ${AZ_TENANT}
    Set Suite Variable    ${AZ_HISTORY_RANGE}    ${AZ_HISTORY_RANGE}
28 changes: 28 additions & 0 deletions codebundles/k8s-app-troubleshoot/README.md
@@ -0,0 +1,28 @@
# Kubernetes Application Troubleshoot

This codebundle attempts to identify issues introduced by recent application code changes. It currently focuses on environment misconfigurations.
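
As a simplified picture of the environment check, the sketch below pulls candidate environment variable names out of log lines that mention `env`. It applies the upper-case identifier pattern directly to the log text, which is a simplification of what the bundled script does; the `extract_env_names` helper and the sample log lines are hypothetical.

```shell
# Sketch only: extract_env_names is a hypothetical helper name.
# Reads log lines on stdin, keeps lines mentioning "env" (case-insensitive),
# and prints the unique upper-case identifiers (3+ chars of A-Z or _) in them.
extract_env_names() {
    grep -i 'env' | grep -oE '[A-Z_]{3,}' | LC_ALL=C sort -u
}

# Hypothetical log lines for illustration.
printf '%s\n' \
    'KeyError: DATABASE_URL env var missing' \
    'REDIS_HOST not set in environment' \
    'request completed in 12ms' | extract_env_names
```

The pattern is deliberately loose, so it can also match upper-case words that are not environment variables; the results are candidates to investigate rather than confirmed problems.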

## Tasks
`Get Resource Logs`
`Scan For Misconfigured Environment`

## Configuration
The TaskSet requires initialization to import necessary secrets, services, and user variables. The following variables should be set:

- `kubeconfig`: The kubeconfig secret containing access info for the cluster.
- `kubectl`: The location service used to interpret shell commands. Default value is `kubectl-service.shared`.
- `KUBERNETES_DISTRIBUTION_BINARY`: Which binary to use for Kubernetes CLI commands. Default value is `kubectl`.
- `CONTEXT`: The Kubernetes context to operate within.
- `NAMESPACE`: The name of the namespace to search. Leave it blank to search in all namespaces.
- `LABELS`: The labels used for resource selection, particularly for fetching logs.
- `REPO_URI`: The URI of the git repo used to fetch source code; this can be a GitHub URL.
- `NUM_OF_COMMITS`: How many past commits to search when identifying potential problems.
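
To show how `NUM_OF_COMMITS` bounds the search, here is a minimal sketch that lists the files in which a given word was added or removed within the last N commits, using git's `-S` option. The `files_touching` helper name and the `/tmp/app_repo` path in the usage comment are assumptions for this example.

```shell
# Sketch only: files_touching is a hypothetical helper name.
# Lists files whose number of occurrences of WORD changed within the last
# NUM_COMMITS commits of the git repository at REPO_DIR (git's -S option).
files_touching() {
    local repo_dir="$1"
    local num_commits="$2"
    local word="$3"
    git -C "$repo_dir" diff "HEAD~${num_commits}" HEAD --name-only -S "$word"
}

# Example (assumes /tmp/app_repo is an existing clone with at least 5 commits):
#   files_touching /tmp/app_repo 5 DATABASE_URL
```

A larger `NUM_OF_COMMITS` widens the window of changes considered, at the cost of more files to review.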

## Requirements
- A kubeconfig with appropriate RBAC permissions to perform the desired command.

## TODO
- [ ] New keywords for code inspection
- [ ] SPIKE for potential genAI integration
- [ ] Add additional documentation.

100 changes: 100 additions & 0 deletions codebundles/k8s-app-troubleshoot/env_check.sh
@@ -0,0 +1,100 @@
#!/bin/bash
# -----------------------------------------------------------------------------
# Script Information and Metadata
# -----------------------------------------------------------------------------
# Author: @jon-funk
# Description: This script checks a resource's logs for errors or potential problems
# related to environment variables and attempts to pinpoint them to recent code changes in the repo.
# -----------------------------------------------------------------------------

# Setup error handling
set -Euo pipefail
# Function to handle errors
function handle_error() {
    local line_number=$1
    local function_name=$2
    local error_code=$3
    echo "Error occurred in function '$function_name' at line $line_number with error code $error_code"
}
# Trap error signals to error handler function.
# FUNCNAME may be unset outside a function, so default it to keep `set -u` happy.
trap 'handle_error "$LINENO" "${FUNCNAME[0]:-main}" "$?"' ERR

# Check if kubectl is available
if ! command -v kubectl &> /dev/null; then
    echo "kubectl command not found!"
    exit 1
fi

# Check that the required environment variables are set.
# The `:-` defaults keep `set -u` from aborting before the message is printed.
if [ -z "${NAMESPACE:-}" ] || [ -z "${CONTEXT:-}" ] || [ -z "${LABELS:-}" ] || [ -z "${REPO_URI:-}" ] || [ -z "${NUM_OF_COMMITS:-}" ]; then
    echo "Please set the NAMESPACE, LABELS, REPO_URI, NUM_OF_COMMITS and CONTEXT environment variables"
    exit 1
fi

# Collect recent log lines that mention "env"
APPLOGS=$(kubectl -n "${NAMESPACE}" --context "${CONTEXT}" logs deployment,statefulset -l "${LABELS}" --all-containers --tail=50 --limit-bytes=256000 | grep -i env || true)
APP_REPO_PATH=/tmp/app_repo
git clone "$REPO_URI" "$APP_REPO_PATH" || true
cd "$APP_REPO_PATH"

# For each word in the logs, look for upper-case identifiers on repo lines that mention env
changes_to_investigate=""
for word in $APPLOGS; do
    checkpath=$(echo "$word" | tr ' ' '\n' | xargs -I{} grep -rin "{}" | grep -E "environment|env" | grep -oE "[A-Z_]{3,}" | sort | uniq || true)
    changes_to_investigate+="${checkpath}\n"
done
changes_to_investigate=$(echo -e $changes_to_investigate | sed 's/ /\n/g' | sort | uniq | sed 's/ /\n/g')
# echo -e $changes_to_investigate

GIT_URL=$(git remote get-url origin | sed -E 's/git@github.com:/https:\/\/github.com\//' | sed 's/\.git$//')
BRANCH=$(git rev-parse --abbrev-ref HEAD)

# Create git changes filter for final result
MODIFIED_FILES_TMP=$(mktemp)
for word in $changes_to_investigate; do
    git diff "HEAD~$NUM_OF_COMMITS" HEAD --name-only -S "$word" >> "$MODIFIED_FILES_TMP"
done
MODIFIED_FILES=$(sort -u "$MODIFIED_FILES_TMP")
rm "$MODIFIED_FILES_TMP"
# echo -e $MODIFIED_FILES | sed 's/ /\n/g'

# Temporary file to store results
TEMPFILE=$(mktemp)
# Search for the words and generate GitHub links with line numbers
for word in $changes_to_investigate; do
    grep -rn "$word" . | while IFS=: read -r file line content; do
        if echo "$MODIFIED_FILES" | sed 's/ /\n/g' | grep -qF "$(basename "$file")"; then
            echo "$GIT_URL/blob/$BRANCH/$file#L$line" >> "$TEMPFILE"
        fi
    done
done

# Sort, make unique and print the results
sort "$TEMPFILE" | uniq

if [[ -n "$changes_to_investigate" ]]; then
    echo -e "We found the following environment variables in the logs, which may indicate a problem with them.\n"
    echo -e $(echo -e $changes_to_investigate | sed 's/ /\n/g')
    echo -e "\n\n"
else
    echo -e "No potential environment variable issues were found in the recent logs."
fi

if [[ -n "$MODIFIED_FILES" ]]; then
    echo "They appear in the following files changed within the last $NUM_OF_COMMITS commits."
    echo -e $MODIFIED_FILES | sed 's/ /\n/g'
    echo -e "\n\n"
else
    echo -e "No files were found that contain detected potential environment variable issues.\n"
fi


repo_links=$(cat "$TEMPFILE")
# echo $repo_links
echo "Suggested Next Steps:"
if [[ -n "$repo_links" ]]; then
    echo "Investigate the following files for changes related to environment variables"
    echo -e "$repo_links"
else
    echo "No root-cause files could be found with the available information, check the repo $REPO_URI manually."
fi

# Clean up the temporary file
rm "$TEMPFILE"