Commit
Merge pull request #337 from replicatedhq/danj/troubleshoot-training
Troubleshoot Training with EC
adamancini authored May 16, 2024
2 parents ae42faf + d2fcf44 commit 4bb6dba
Showing 31 changed files with 817 additions and 0 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -4,3 +4,4 @@ make.sh
.envrc
*.DS_Store
*.DS_Store?
.vscode/
50 changes: 50 additions & 0 deletions instruqt/troubleshoot-training/01-introduction/assignment.md
@@ -0,0 +1,50 @@
---
slug: introduction
id: b5dftki3524w
type: challenge
title: Introduction
teaser: Practical Application of Support Bundles and Analyzers
notes:
- type: text
contents: In this track, we'll work together to apply some practical methods for
troubleshooting some Kubernetes problems using Replicated tooling.
tabs:
- title: Workstation
type: terminal
hostname: cloud-client
difficulty: intermediate
timelimit: 600
---

👋 Introduction
===============

* **What you will do**:
* Learn to troubleshoot application & cluster problems
* **Who this is for**:
* This track is for anyone who will build KOTS applications **plus** anyone in a user-facing role who supports these applications:
* Full Stack / DevOps / Product Engineers
* Support Engineers
* Implementation / Field Engineers
* Success / Sales Engineers
* **Prerequisites**:
* Basic working knowledge of Linux and the `bash` shell
* **Outcomes**:
* You will be able to determine if the problem is in your application, in Kubernetes, or in the infrastructure environment
* You will reduce escalations and expedite time to remediation for such issues

# Configure the VM environment

## Set up the Workstation

The environment is prepped for an *embedded cluster* installation.

### Configure your editor

Before we begin, let's choose an editor. The default editor is `nano`, but if you'd like to use `vim` instead, you can switch to it by running the following command and selecting option `2`:

```bash
update-alternatives --config editor
```

Press **Check** when you're ready to begin.
@@ -0,0 +1,3 @@
#!/bin/sh

exit 0
@@ -0,0 +1,3 @@
#!/bin/sh

exit 0
65 changes: 65 additions & 0 deletions instruqt/troubleshoot-training/02-troubleshoot-1/assignment.md
@@ -0,0 +1,65 @@
---
slug: troubleshoot-1
id: araxpgiqal1r
type: challenge
title: Where are my pods?
teaser: "\U0001F914"
notes:
- type: text
contents: The website is down
tabs:
- title: Workstation
type: terminal
hostname: cloud-client
difficulty: basic
timelimit: 3600
---
Let's imagine that our environment belongs to a customer who is now experiencing an issue with their install.

They've raised a rather unclear issue with your support team, suggesting that the application "doesn't work" after one of their users accidentally made a change from the command line.

They've shared a support bundle with you, and you've been asked to help investigate.

Let's use the `sbctl` tool to inspect the support bundle and try to determine what's amiss. `sbctl` should already be installed and the customer's support bundle should be in your home folder. `sbctl` simulates having access to the customer's environment, but all of the data is taken from the support bundle. It lets us use the familiar `kubectl` tool to explore the customer's environment, even without direct access.
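
For example, an `sbctl` session might look something like this (the bundle filename here is an assumption; use whichever archive is actually in your home folder):

```bash
# Start an interactive shell backed by the support bundle (filename assumed)
sbctl shell -s ~/support-bundle.tar.gz

# Inside that shell, the familiar kubectl commands read from the bundle
# instead of a live cluster
kubectl get nodes
kubectl get pods -A
```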

When you've identified the problem, write the command you would use to resolve it into a file at `/root/solution.txt`.

The answer should be one line, on the first line of the file.

(The file does not exist yet; you will have to create it with your preferred text editor.)

💡 Using `sbctl`
=================

- Try `sbctl help` to see what commands are available

💡 Hints
=================

- Try the interactive shell prompt using `sbctl` and make sure to provide the path to the support bundle in your home folder

- How are applications deployed in Kubernetes?

- What controls a pod's lifecycle?

💡 More Hints
=================

- How do I see deployments?

Troubleshooting Procedure
=================

Identify the problematic deployment from `kubectl get deployments -n <namespace>`. Notice any deployments that have 0 replicas but should have 1 or more.
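
As a rough sketch, the output might look something like this (the names and counts below are illustrative, not taken from the actual bundle):

```bash
kubectl get deployments -n default
# NAME        READY   UP-TO-DATE   AVAILABLE   AGE
# some-app    0/0     0            0           12d   <-- zero replicas: a likely suspect
# other-app   1/1     1            1           12d
```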

✔️ Solution
==================

A deployment has been scaled to 0 replicas.

🛠️ Remediation
=================

```bash
kubectl scale deployment <deployment-name> --replicas=1
```
@@ -0,0 +1,23 @@
#!/bin/bash
#
# This script runs when the platform checks the challenge.
#
# The platform determines if the script was successful using the exit code of this
# script. If the exit code is not 0, the script fails.

if [[ ! -f "/root/solution.txt" ]]; then
fail-message "solution.txt not found, please create it and write your answer within"
exit 1
fi

# Normalize the answer: first line only, "=" treated as a space, default-namespace flags dropped, whitespace trimmed and collapsed
solution=$(head -n1 "/root/solution.txt" | sed 's/=/ /g' | sed -e 's/--namespace\ default//g' -e 's/-n\ default//g' | sed -re 's/^[[:blank:]]+|[[:blank:]]+$//g' -e 's/[[:blank:]]+/ /g' )

echo "solution: $solution"
echo "wanted : kubectl scale deployment frontend --replicas 1"

if [[ "$solution" = "kubectl scale deployment frontend --replicas 1" ]]; then
exit 0
fi

fail-message "oops, your solution doesn't quite look correct, try again!"
exit 1
@@ -0,0 +1,3 @@
#!/bin/bash

curl https://spooky.academy/support_bundles/troubleshoot_1_support_bundle.tar.gz -o support-bundle.tar.gz
@@ -0,0 +1,4 @@
#!/bin/bash

rm -rf /root/support-bundle*
rm -rf /root/solution*
100 changes: 100 additions & 0 deletions instruqt/troubleshoot-training/03-troubleshoot-2/assignment.md
@@ -0,0 +1,100 @@
---
slug: troubleshoot-2
id: gzv8orjeqdcg
type: challenge
title: CrashLoopBackOff
teaser: "\U0001F648"
notes:
- type: text
contents: Time to fix another problem...
tabs:
- title: Workstation
type: terminal
hostname: cloud-client
difficulty: intermediate
timelimit: 3600
---
The customer opens another issue, but this time pods seem to be crashing.

Let's investigate our app and see if we can identify the issue. Again, we'll use `sbctl` to explore the support bundle.

To pass this challenge, find the faulty resource, save the YAML spec for that resource to `~/solution.yaml`, correct the problem in the resource, then click "Next" to check your work.

💡 Using `sbctl`
=================

- Remember that you can use the interactive shell prompt with `sbctl shell -s <path-to-support-bundle>`

💡 Using `kubectl`
=================

- How do you make `kubectl` print output in YAML format?
  - What if you wanted to save that output to a file?

💡 Hints
=================

- How do you list pods?

- How do you describe pods?
- What if you wanted to see data from multiple pods at once?

- How do you get logs from a pod?
- What if you wanted to see a previous version of the pod's logs?

- When would you look at `describe` output vs. gathering pod logs?

- Review the [Kubernetes documentation on debugging Pods](https://kubernetes.io/docs/tasks/debug/debug-application/debug-running-pod/)

💡 More Hints
=================

- How do you find the exit code of a Pod?

- What could it mean if a Pod is exiting before it has a chance to emit any logs?

- Review the Linux exit code conventions: `0` means the process exited normally, `1`-`127` generally mean that the process exited because of a crash or error, and >`128` generally means that the process was killed by a signal (think Ctrl-C or the `kill` command).

Troubleshooting Procedure
=================

Identify the problematic Pod from `kubectl get pods -n <namespace>`. Notice any pods that are not in the Running state.

Describe the current state of the Pod with `kubectl describe pod -n <namespace> <pod-name>`. Here are some things to look out for:

- each Container's current **State** and **Reason**
- each Container's **Last State** and **Reason**
- the Last State's **Exit Code**
- each Container's **Ready** status
- the **Events** table

For a Pod that is crashing, expect that the current state will be `Waiting`, `Terminated` or `Error`, and the last state will probably also be `Terminated`. Notice the reason for the termination, and especially notice the exit code. There are common conventions for exit codes (such as the shell's rule of `128 + signal number` for processes killed by a signal), but they are not strictly enforced since applications can always set their own exit codes.

In short, if the exit code is >128, then the application exited as a result of Kubernetes killing the Pod. If that's the case, you'll commonly see code 137 or 143, which is 128 + the value of the `kill` signal sent to the container.

If the exit code is between 1 and 127, then the application crashed or exited abnormally on its own. If the exit code is 0, then the application exited normally (most commonly seen in init containers or Jobs/CronJobs).
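
If you want to pull just the last state and exit code out of the `describe` output, something like the following works (the pod name and namespace are placeholders):

```bash
# Show the last termination state, reason, and exit code for a crashing pod
kubectl describe pod <pod-name> -n <namespace> | grep -A 7 "Last State"

# Or query the exit code directly from the pod's status
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'
```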

Look for any Events that may indicate a problem. Events by default last 1 hour, unless they occur repeatedly. Events in a repetition loop are especially noteworthy:

```plaintext
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 2d19h (x9075 over 4d4h) kubelet Back-off restarting failed container sentry-workers in pod sentry-worker-696456b57c-twpj7_default(82eb1dde-2987-4f58-af64-883470ffcb58)
```

Another way to get even more information about a pod is to use the `-o yaml` option with `kubectl get pods`. This will output the entire pod definition in YAML format. This is useful for debugging issues with the pod definition itself. Here you will see some info that isn't present in `describe pods`, such as Annotations, Tolerations, restart policy, ports, and volumes.
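
For example, to capture that full definition for closer inspection or editing (names are placeholders):

```bash
# Print the full pod definition in YAML
kubectl get pod <pod-name> -n <namespace> -o yaml

# Or redirect it to a file so you can study or edit it later
kubectl get pod <pod-name> -n <namespace> -o yaml > pod.yaml
```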

✔️ Solution
=================

One of the deployments has a memory limit that is too low for the Pod to run successfully.

🛠️ Remediation
=================

Write the YAML spec for the affected deployment into a file at `~/solution.yaml`, then increase the memory limit for the Pod to a reasonable amount. You may have to make an educated guess about what the correct memory limit should be.
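
A minimal sketch of that workflow, assuming the memory limit is the only change needed (the deployment name and the new limit below are placeholders, not the actual answer):

```bash
# Save the affected Deployment's spec to the location the check expects
kubectl get deployment <deployment-name> -n <namespace> -o yaml > ~/solution.yaml

# Then edit ~/solution.yaml and raise the memory limit, for example:
#   resources:
#     limits:
#       memory: 256Mi   # up from a value too small for the container to start
```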

To think about:

- How can we make sure that this doesn't happen again?
@@ -0,0 +1,29 @@
#!/bin/bash

if [[ ! -f /root/solution.yaml ]]; then
fail-message "solution.yaml not found in /root/, please create the file and try again"
exit 1
fi

kind=$(yq -r '.kind' /root/solution.yaml)

if [[ ! "$kind" = "Deployment" ]]; then
fail-message "your solution doesn't look correct, you appear to have saved a resource that we weren't expecting"
exit 1
fi

limits=$(yq '.spec.template.spec.containers[0].resources.limits.memory' /root/solution.yaml -r)

if [[ "$limits" = "5M" ]]; then
fail-message "it looks like your solution is incorrect"
echo "limits = 5M"
exit 1
fi

# Convert the human-readable limit (e.g. "256Mi") to a raw byte count for comparison
rawSize=$(humanfriendly --parse-size "$limits")

if [[ ! "$rawSize" -gt "5000000" ]]; then
fail-message "it looks like your solution is incorrect"
echo "limits < 5M"
exit 1
fi
@@ -0,0 +1,7 @@
#!/bin/bash

set -euxo pipefail
rm -rf /root/support-bundle* || true
rm /root/solution.txt || true

curl https://spooky.academy/support_bundles/troubleshoot_2_support_bundle.tar.gz -o support-bundle.tar.gz
@@ -0,0 +1,4 @@
#!/bin/bash

rm -rf /root/support-bundle*
rm -rf /root/solution*