Add Application Troubleshooting V2 (#242)
* Add stubs for app troubleshooting

* Add cached diffs for commits

* Add parse log exceptions functionality

* Add parser updates for sli exception monitoring

* Finish report formatter

* Add github issue functionality

* Clone to PWD for runner
jon-funk authored Nov 17, 2023
1 parent 03e8809 commit 5100487
Showing 9 changed files with 1,018 additions and 14 deletions.
11 changes: 9 additions & 2 deletions codebundles/k8s-app-troubleshoot/README.md
@@ -1,10 +1,11 @@
# Kubernetes Application Troubleshoot

This codebundle attempts to identify issues created in application code changes recently. Currently focuses on environment misconfigurations.
This codebundle attempts to identify issues created in application code changes recently.

## Tasks
`Get Resource Logs`
`Scan For Misconfigured Environment`
`Troubleshoot Application Logs`

## Configuration
The TaskSet requires initialization to import necessary secrets, services, and user variables. The following variables should be set:
@@ -17,9 +18,15 @@ The TaskSet requires initialization to import necessary secrets, services, and u
- `LABELS`: The labels used for resource selection, particularly for fetching logs.
- `REPO_URI`: The URI for the git repo used to fetch source code, can be a GitHub URL.
- `NUM_OF_COMMITS`: How many commits to search through into the past to identify potential problems.
- `CREATE_ISSUES`: A flag (`YES` or `NO`) controlling whether to create GitHub issues for the parsed exceptions.
- `LOGS_SINCE`: How far back to scan for logs, e.g. `20m`, `3h`.
- `EXCLUDE_PATTERN`: An extended grep pattern used to filter out log results, such as exceptions/errors that you don't care about.
- `CONTAINER_NAME`: The name of the container within the labeled workload to fetch logs from.
- `MAX_LOG_LINES`: The maximum number of log lines to fetch. Setting this too high can affect performance.
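
As a rough illustration of what `EXCLUDE_PATTERN` does, the runbook pipes the fetched logs through `grep -Eiv "${EXCLUDE_PATTERN}"`. The sketch below approximates that filter in Python. The function name `exclude_lines` and the sample log text are hypothetical, and Python's `re` syntax is not identical to POSIX extended regular expressions, so treat this as an approximation for simple alternations such as the default value rather than an exact equivalent.

```python
import re


def exclude_lines(log_text: str, exclude_pattern: str) -> str:
    """Drop lines matching exclude_pattern, ignoring case (roughly `grep -Eiv`)."""
    # The runbook skips the grep step entirely when the pattern is empty.
    if exclude_pattern == "":
        return log_text
    matcher = re.compile(exclude_pattern, re.IGNORECASE)
    kept = [line for line in log_text.splitlines() if not matcher.search(line)]
    return "\n".join(kept)


# Hypothetical log sample; only the middle line matches the default pattern.
sample = "INFO started\nFalseError: ignorable\nValueError: bad input"
filtered = exclude_lines(sample, "FalseError|SecondErrorToSkip")
print(filtered)
```

Because the filter works line by line, excluding the first line of a multi-line stack trace leaves the rest of the trace in place, which may matter when choosing a pattern.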

## Requirements
- A kubeconfig with appropriate RBAC permissions to perform the desired command.
- A kubeconfig with appropriate RBAC permissions to perform the desired command, particularly exec.
- An OAuth token for GitHub authentication, with read permissions on the repository(ies) and write permissions on issues.

## TODO
- [ ] New keywords for code inspection
2 changes: 1 addition & 1 deletion codebundles/k8s-app-troubleshoot/env_check.sh
@@ -31,7 +31,7 @@ if [ -z "$NAMESPACE" ] || [ -z "$CONTEXT" ] || [ -z "$LABELS" ] || [ -z "$REPO_U
exit 1
fi

APPLOGS=$(kubectl -n ${NAMESPACE} --context ${CONTEXT} logs deployment,statefulset -l ${LABELS} --all-containers --tail=50 --limit-bytes=256000 | grep -i env || true)
APPLOGS=$(kubectl -n ${NAMESPACE} --context ${CONTEXT} logs $(kubectl --context=${CONTEXT} -n ${NAMESPACE} get deployment,statefulset -l ${LABELS} -oname | head -n 1) --all-containers --tail=50 --limit-bytes=256000 | grep -i env || true)
APP_REPO_PATH=/tmp/app_repo
git clone $REPO_URI $APP_REPO_PATH || true
cd $APP_REPO_PATH
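
The changed `APPLOGS` line first resolves a single workload with `kubectl get deployment,statefulset -l ${LABELS} -oname | head -n 1` and then fetches logs from that one resource. The sketch below mimics only the selection step on already-captured output. The function name `first_workload` and the sample resource names are hypothetical, and no cluster is contacted.

```python
from typing import Optional


def first_workload(oname_output: str) -> Optional[str]:
    """Return the first resource name from `-oname` style output, or None if empty."""
    for line in oname_output.splitlines():
        name = line.strip()
        if name:
            return name
    return None


# Hypothetical output for a label selector that matches two workloads.
sample_output = "deployment.apps/frontend\nstatefulset.apps/db\n"
selected = first_workload(sample_output)
print(selected)
```

One consequence worth keeping in mind: when the selector matches several workloads, only the first one listed is inspected.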
131 changes: 120 additions & 11 deletions codebundles/k8s-app-troubleshoot/runbook.robot
@@ -7,6 +7,7 @@ Metadata Supports Kubernetes,AKS,EKS,GKE,OpenShift
Library BuiltIn
Library RW.Core
Library RW.CLI
Library RW.K8sApplications
Library RW.platform
Library RW.NextSteps
Library OperatingSystem
@@ -15,15 +16,15 @@ Suite Setup Suite Initialization


*** Tasks ***
Get Resource Logs
[Documentation] Collects the last approximately 200 lines of logs from the resource before restarting it.
Get Workload Logs
[Documentation] Collects the last approximately 300 lines of logs from the workload before restarting it.
[Tags] resource application workload logs state
${logs}= RW.CLI.Run Cli
... cmd=${KUBERNETES_DISTRIBUTION_BINARY} --context=${CONTEXT} -n ${NAMESPACE} logs deployment,statefulset -l ${LABELS} --tail=200 --limit-bytes=256000
... cmd=${KUBERNETES_DISTRIBUTION_BINARY} --context=${CONTEXT} -n ${NAMESPACE} logs $(${KUBERNETES_DISTRIBUTION_BINARY} --context=${CONTEXT} -n ${NAMESPACE} get deployment,statefulset -l ${LABELS} -oname | head -n 1) --tail=${MAX_LOG_LINES} --limit-bytes=256000 --since=${LOGS_SINCE} --container=${CONTAINER_NAME}
... render_in_commandlist=true
... env=${env}
... secret_file__kubeconfig=${kubeconfig}
RW.Core.Add Pre To Report Resource Logs:\n\n${logs.stdout}
RW.Core.Add Pre To Report Workload Logs:\n\n${logs.stdout}
${history}= RW.CLI.Pop Shell History
RW.Core.Add Pre To Report Commands Used: ${history}

@@ -39,6 +40,74 @@ Scan For Misconfigured Environment
RW.Core.Add Pre To Report Stdout:\n\n${script_run.stdout}
RW.Core.Add Pre To Report Commands Used: ${history}

Troubleshoot Application Logs
[Documentation] Inspects container logs for exceptions, parses them, and attempts to find relevant source code information.
[Tags] application debug errors troubleshoot workload
${cmd}= Set Variable
... ${KUBERNETES_DISTRIBUTION_BINARY} --context=${CONTEXT} -n ${NAMESPACE} logs $(${KUBERNETES_DISTRIBUTION_BINARY} --context=${CONTEXT} -n ${NAMESPACE} get deployment,statefulset -l ${LABELS} -oname | head -n 1) --tail=${MAX_LOG_LINES} --limit-bytes=256000 --since=${LOGS_SINCE} --container=${CONTAINER_NAME}
IF $EXCLUDE_PATTERN != ""
${cmd}= Set Variable
... ${cmd} | grep -Eiv "${EXCLUDE_PATTERN}" || true
END
${logs}= RW.CLI.Run Cli
... cmd=${cmd}
... render_in_commandlist=true
... env=${env}
... secret_file__kubeconfig=${kubeconfig}
${printenv}= RW.CLI.Run Cli
... cmd=${KUBERNETES_DISTRIBUTION_BINARY} --context=${CONTEXT} -n ${NAMESPACE} exec $(${KUBERNETES_DISTRIBUTION_BINARY} --context=${CONTEXT} -n ${NAMESPACE} get all -l ${LABELS} -oname | grep -iE "deploy|stateful" | head -n 1) --container=${CONTAINER_NAME} -- printenv
... render_in_commandlist=true
... include_in_history=False
... env=${env}
... secret_file__kubeconfig=${kubeconfig}
${proc_list}= RW.CLI.Run Cli
... cmd=${KUBERNETES_DISTRIBUTION_BINARY} --context=${CONTEXT} -n ${NAMESPACE} exec $(${KUBERNETES_DISTRIBUTION_BINARY} --context=${CONTEXT} -n ${NAMESPACE} get all -l ${LABELS} -oname | grep -iE "deploy|stateful" | head -n 1) --container=${CONTAINER_NAME} -- ps -eo command --no-header | grep -v "ps -eo"
... render_in_commandlist=true
... include_in_history=False
... env=${env}
... secret_file__kubeconfig=${kubeconfig}

# ${test_data}= RW.K8sApplications.Get Test Data
${proc_list}= RW.K8sApplications.Format Process List ${proc_list.stdout}
${serialized_env}= RW.K8sApplications.Serialize env ${printenv.stdout}
${parsed_exceptions}= RW.K8sApplications.Parse Exceptions ${logs.stdout}
# ${parsed_exceptions}= RW.K8sApplications.Parse Exceptions ${test_data}
${app_repo}= RW.K8sApplications.Clone Repo ${REPO_URI} ${REPO_AUTH_TOKEN} ${NUM_OF_COMMITS}
${repos}= Create List ${app_repo}
${ts_results}= RW.K8sApplications.Troubleshoot Application
... repos=${repos}
... exceptions=${parsed_exceptions}
... env=${serialized_env}
... process_list=${proc_list}
${history}= RW.CLI.Pop Shell History
${full_report}= Evaluate $ts_results.get("report")
${found_exceptions}= Evaluate $ts_results.get("found_exceptions")
${full_report}= Set Variable
... ${full_report}\nHere's the command used to collect the exception data:\n${history}
RW.Core.Add Pre To Report ${full_report}

${issue_link}= Set Variable \n
IF "${CREATE_ISSUES}" == "YES"
${issue_link}= RW.K8sApplications.Create Github Issue ${repos[0]} ${full_report}
RW.Core.Add Pre To Report \n${issue_link}
END
${nextsteps}= Evaluate
... $issue_link if len($issue_link) > 5 else "View the summary in details for possible links to the source code related to the exceptions found in the ${CONTAINER_NAME} application."
IF (len($parsed_exceptions)) > 0
RW.Core.Add Issue
... severity=3
... expected=No exceptions were found in the parsed logs of workload ${CONTAINER_NAME}
... actual=Found exceptions in the workload logs of ${CONTAINER_NAME}
... reproduce_hint=Run:\n${cmd}\n view logs results for exceptions.
... title=Found exception in ${CONTAINER_NAME} logs
... details=${full_report}
... next_steps=${nextsteps}
END

# TODO: implement tasks:
# Troubleshoot Application Endpoints
# Check Database Migrations


*** Keywords ***
Suite Initialization
@@ -48,10 +117,6 @@ Suite Initialization
... description=The kubernetes kubeconfig yaml containing connection configuration used to connect to cluster(s).
... pattern=\w*
... example=For examples, start here https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/
${kubectl}= RW.Core.Import Service kubectl
... description=The location service used to interpret shell commands.
... default=kubectl-service.shared
... example=kubectl-service.shared
${NAMESPACE}= RW.Core.Import User Variable NAMESPACE
... type=string
... description=The name of the Kubernetes namespace to scope actions and searching to.
@@ -76,6 +141,11 @@ Suite Initialization
... pattern=\w*
... example=https://github.com/runwhen-contrib/runwhen-local
... default=https://github.com/runwhen-contrib/runwhen-local
${REPO_AUTH_TOKEN}= RW.Core.Import Secret
... REPO_AUTH_TOKEN
... type=string
... description=The oauth token to be used for authenticating to the repo during cloning.
... pattern=\w*
${LABELS}= RW.Core.Import User Variable LABELS
... type=string
... description=The Kubernetes labels used to select the resource for logs.
@@ -85,15 +155,54 @@ Suite Initialization
... type=string
... description=The number of commits to look through when troubleshooting. Adjust this based on your team's git usage and commit frequency.
... pattern=\w*
... example=3
... default=3
... example=50
... default=50
${CREATE_ISSUES}= RW.Core.Import User Variable CREATE_ISSUES
... type=string
... description=Whether or not the taskset should create github issues when it finds problems.
... enum=[YES,NO]
... example=YES
... default=YES
${LOGS_SINCE}= RW.Core.Import User Variable
... LOGS_SINCE
... type=string
... description=How far back to fetch logs from containers in Kubernetes. Making this too recent and running the codebundle often could cause adverse performance.
... pattern=\w*
... example=15m
... default=15m
${EXCLUDE_PATTERN}= RW.Core.Import User Variable
... EXCLUDE_PATTERN
... type=string
... description=Grep pattern to use to exclude exceptions that don't indicate a critical issue.
... pattern=\w*
... example=FalseError|SecondErrorToSkip
... default=FalseError|SecondErrorToSkip
${CONTAINER_NAME}= RW.Core.Import User Variable
... CONTAINER_NAME
... type=string
... description=The name of the container within the selected pod that represents the application to troubleshoot.
... pattern=\w*
... example=myapp
${MAX_LOG_LINES}= RW.Core.Import User Variable
... MAX_LOG_LINES
... type=string
... description=The max number of log lines to request from Kubernetes workloads to be parsed. Setting this too high can adversely affect performance.
... pattern=\w*
... example=300
... default=300
Set Suite Variable ${kubeconfig} ${kubeconfig}
Set Suite Variable ${kubectl} ${kubectl}
Set Suite Variable ${KUBERNETES_DISTRIBUTION_BINARY} ${KUBERNETES_DISTRIBUTION_BINARY}
Set Suite Variable ${CONTEXT} ${CONTEXT}
Set Suite Variable ${REPO_URI} ${REPO_URI}
Set Suite Variable ${LABELS} ${LABELS}
Set Suite Variable ${NAMESPACE} ${NAMESPACE}
Set Suite Variable ${REPO_AUTH_TOKEN} ${REPO_AUTH_TOKEN}
Set Suite Variable ${CREATE_ISSUES} ${CREATE_ISSUES}
Set Suite Variable ${LOGS_SINCE} ${LOGS_SINCE}
Set Suite Variable ${EXCLUDE_PATTERN} ${EXCLUDE_PATTERN}
Set Suite Variable ${CONTAINER_NAME} ${CONTAINER_NAME}
Set Suite Variable ${NUM_OF_COMMITS} ${NUM_OF_COMMITS}
Set Suite Variable ${MAX_LOG_LINES} ${MAX_LOG_LINES}
Set Suite Variable
... ${env}
... {"NUM_OF_COMMITS":"${NUM_OF_COMMITS}", "REPO_URI":"${REPO_URI}", "LABELS":"${LABELS}", "KUBECONFIG":"./${kubeconfig.key}", "KUBERNETES_DISTRIBUTION_BINARY":"${KUBERNETES_DISTRIBUTION_BINARY}", "CONTEXT":"${CONTEXT}", "NAMESPACE":"${NAMESPACE}"}
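
The `Troubleshoot Application Logs` task runs `printenv` inside the container and hands the output to `RW.K8sApplications.Serialize env`, whose implementation is not shown in this diff. The following is a minimal sketch of how `printenv` output could be turned into a dictionary, not the library's actual code. The function name `parse_printenv` and the sample values are hypothetical, and values that span multiple lines are not handled.

```python
from typing import Dict


def parse_printenv(printenv_output: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines into a dict, skipping lines without a key or '='."""
    env: Dict[str, str] = {}
    for line in printenv_output.splitlines():
        # Split on the first '=' only, since values may contain '=' themselves.
        key, sep, value = line.partition("=")
        if sep and key:
            env[key] = value
    return env


# Hypothetical printenv output from a container.
sample_env = "PATH=/usr/bin\nDB_URL=postgres://u:p@h/db?x=1\nnot a var"
parsed = parse_printenv(sample_env)
print(parsed)
```

Since this data can contain credentials, the runbook keeps the `printenv` command out of the shell history with `include_in_history=False`; any serialized form deserves the same care.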
107 changes: 107 additions & 0 deletions codebundles/k8s-app-troubleshoot/sli.robot
@@ -0,0 +1,107 @@
*** Settings ***
Documentation Measures the number of exception stacktraces present in an application's logs over a time period.
Metadata Author jon-funk
Metadata Display Name Kubernetes Application Monitor
Metadata Supports Kubernetes,AKS,EKS,GKE,OpenShift

Library BuiltIn
Library RW.Core
Library RW.CLI
Library RW.K8sApplications
Library RW.platform
Library RW.NextSteps
Library OperatingSystem

Suite Setup Suite Initialization


*** Tasks ***
Measure Application Exceptions
[Documentation] Examines recent logs for exceptions, providing a count of them.
[Tags] resource application workload logs state exceptions errors
${cmd}= Set Variable
... ${KUBERNETES_DISTRIBUTION_BINARY} --context=${CONTEXT} -n ${NAMESPACE} logs $(${KUBERNETES_DISTRIBUTION_BINARY} --context=${CONTEXT} -n ${NAMESPACE} get deployment,statefulset -l ${LABELS} -oname | head -n 1) --tail=${MAX_LOG_LINES} --limit-bytes=256000 --since=${LOGS_SINCE} --container=${CONTAINER_NAME}
IF $EXCLUDE_PATTERN != ""
${cmd}= Set Variable
... ${cmd} | grep -Eiv "${EXCLUDE_PATTERN}" || true
END

${logs}= RW.CLI.Run Cli
... cmd=${cmd}
... render_in_commandlist=true
... env=${env}
... secret_file__kubeconfig=${kubeconfig}
${parsed_exceptions}= RW.K8sApplications.Parse Exceptions ${logs.stdout}
${count}= Evaluate len($parsed_exceptions)
RW.Core.Push Metric ${count}


*** Keywords ***
Suite Initialization
${kubeconfig}= RW.Core.Import Secret
... kubeconfig
... type=string
... description=The kubernetes kubeconfig yaml containing connection configuration used to connect to cluster(s).
... pattern=\w*
... example=For examples, start here https://kubernetes.io/docs/concepts/configuration/organize-cluster-access-kubeconfig/
${NAMESPACE}= RW.Core.Import User Variable NAMESPACE
... type=string
... description=The name of the Kubernetes namespace to scope actions and searching to.
... pattern=\w*
... example=my-namespace
... default=sock-shop
${CONTEXT}= RW.Core.Import User Variable CONTEXT
... type=string
... description=Which Kubernetes context to operate within.
... pattern=\w*
... example=sandbox-cluster-1
... default=sandbox-cluster-1
${KUBERNETES_DISTRIBUTION_BINARY}= RW.Core.Import User Variable KUBERNETES_DISTRIBUTION_BINARY
... type=string
... description=Which binary to use for Kubernetes CLI commands.
... enum=[kubectl,oc]
... example=kubectl
... default=kubectl
${LABELS}= RW.Core.Import User Variable LABELS
... type=string
... description=The Kubernetes labels used to select the resource for logs.
... pattern=\w*
${LOGS_SINCE}= RW.Core.Import User Variable
... LOGS_SINCE
... type=string
... description=How far back to fetch logs from containers in Kubernetes. Making this too recent and running the codebundle often could cause adverse performance.
... pattern=\w*
... example=15m
... default=15m
${EXCLUDE_PATTERN}= RW.Core.Import User Variable
... EXCLUDE_PATTERN
... type=string
... description=Grep pattern to use to exclude exceptions that don't indicate a critical issue.
... pattern=\w*
... example=FalseError|SecondErrorToSkip
... default=FalseError|SecondErrorToSkip
${CONTAINER_NAME}= RW.Core.Import User Variable
... CONTAINER_NAME
... type=string
... description=The name of the container within the selected pod that represents the application to troubleshoot.
... pattern=\w*
... example=myapp
${MAX_LOG_LINES}= RW.Core.Import User Variable
... MAX_LOG_LINES
... type=string
... description=The max number of log lines to request from Kubernetes workloads to be parsed. Setting this too high can adversely affect performance.
... pattern=\w*
... example=300
... default=300
Set Suite Variable ${kubeconfig} ${kubeconfig}
Set Suite Variable ${KUBERNETES_DISTRIBUTION_BINARY} ${KUBERNETES_DISTRIBUTION_BINARY}
Set Suite Variable ${CONTEXT} ${CONTEXT}
Set Suite Variable ${LABELS} ${LABELS}
Set Suite Variable ${NAMESPACE} ${NAMESPACE}
Set Suite Variable ${LOGS_SINCE} ${LOGS_SINCE}
Set Suite Variable ${EXCLUDE_PATTERN} ${EXCLUDE_PATTERN}
Set Suite Variable ${CONTAINER_NAME} ${CONTAINER_NAME}
Set Suite Variable ${MAX_LOG_LINES} ${MAX_LOG_LINES}
Set Suite Variable
... ${env}
... {"LABELS":"${LABELS}", "KUBECONFIG":"./${kubeconfig.key}", "KUBERNETES_DISTRIBUTION_BINARY":"${KUBERNETES_DISTRIBUTION_BINARY}", "CONTEXT":"${CONTEXT}", "NAMESPACE":"${NAMESPACE}"}
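
The `Measure Application Exceptions` task pushes `len($parsed_exceptions)` as its metric, where the list comes from `RW.K8sApplications.Parse Exceptions`. That keyword's source is not included in this diff, so the sketch below is only an assumed illustration of the idea, limited to Python-style tracebacks. The function name `split_python_tracebacks` and the sample log lines are hypothetical.

```python
from typing import List, Optional

TRACEBACK_START = "Traceback (most recent call last):"


def split_python_tracebacks(log_text: str) -> List[str]:
    """Group log lines into Python tracebacks; each ends at its first unindented line."""
    tracebacks: List[str] = []
    current: Optional[List[str]] = None
    for line in log_text.splitlines():
        if TRACEBACK_START in line:
            # A new traceback begins; flush any unfinished one first.
            if current is not None:
                tracebacks.append("\n".join(current))
            current = [line]
        elif current is not None:
            current.append(line)
            if not line.startswith((" ", "\t")):
                # Unindented line is the "ExceptionType: message" terminator.
                tracebacks.append("\n".join(current))
                current = None
    if current is not None:
        tracebacks.append("\n".join(current))
    return tracebacks


# Hypothetical application logs containing two tracebacks.
sample_logs = "\n".join([
    "INFO request ok",
    "Traceback (most recent call last):",
    '  File "app.py", line 10, in handler',
    "    return 1 / 0",
    "ZeroDivisionError: division by zero",
    "INFO request ok",
    "Traceback (most recent call last):",
    '  File "app.py", line 20, in load',
    "    data[key]",
    "KeyError: 'id'",
])
count = len(split_python_tracebacks(sample_logs))
print(count)
```

The resulting count is the kind of value the SLI would push; a real parser would likely need to cover other languages and chained exceptions as well.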
1 change: 1 addition & 0 deletions libraries/RW/K8sApplications/__init__.py
@@ -0,0 +1 @@
from .k8s_applications import *