Commit
Merge pull request #337 from replicatedhq/danj/troubleshoot-training
Troubleshoot Training with EC
adamancini authored May 16, 2024
2 parents ae42faf + d2fcf44 commit 4bb6dba
Showing 31 changed files with 817 additions and 0 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -4,3 +4,4 @@ make.sh
.envrc
*.DS_Store
*.DS_Store?
.vscode/
50 changes: 50 additions & 0 deletions instruqt/troubleshoot-training/01-introduction/assignment.md
@@ -0,0 +1,50 @@
---
slug: introduction
id: b5dftki3524w
type: challenge
title: Introduction
teaser: Practical Application of Support Bundles and Analyzers
notes:
- type: text
contents: In this track, we'll work together to apply some practical methods for
troubleshooting some Kubernetes problems using Replicated tooling.
tabs:
- title: Workstation
type: terminal
hostname: cloud-client
difficulty: intermediate
timelimit: 600
---

👋 Introduction
===============

* **What you will do**:
* Learn to troubleshoot application & cluster problems
* **Who this is for**:
* This track is for anyone who will build KOTS applications **plus** anyone in a user-facing role who supports these applications:
* Full Stack / DevOps / Product Engineers
* Support Engineers
* Implementation / Field Engineers
* Success / Sales Engineers
* **Prerequisites**:
* Basic working knowledge of Linux and the `bash` shell
* **Outcomes**:
* You will be able to determine if the problem is in your application, in Kubernetes, or in the infrastructure environment
* You will reduce escalations and expedite time to remediation for such issues

# Configure the VM environment

## Set up the Workstation

The environment is prepped for an *embedded cluster* installation.

### Configure your editor

Before we begin, let's choose an editor. The default editor is `nano`, but if you'd like to use `vim` instead, you can switch to it by running the following command and selecting option `2`:

```bash
update-alternatives --config editor
```

Press **Check** when you're ready to begin.
@@ -0,0 +1,3 @@
#!/bin/sh

exit 0
@@ -0,0 +1,3 @@
#!/bin/sh

exit 0
65 changes: 65 additions & 0 deletions instruqt/troubleshoot-training/02-troubleshoot-1/assignment.md
@@ -0,0 +1,65 @@
---
slug: troubleshoot-1
id: araxpgiqal1r
type: challenge
title: Where are my pods?
teaser: "\U0001F914"
notes:
- type: text
contents: The website is down
tabs:
- title: Workstation
type: terminal
hostname: cloud-client
difficulty: basic
timelimit: 3600
---
Let's imagine that our environment belongs to a customer who is now experiencing an issue with their install.

They've raised a rather unclear issue with your support team, suggesting that the application "doesn't work" after one of their users accidentally made a change from the command line.

They've shared a support bundle with you, and you've been asked to help investigate.

Let's use the `sbctl` tool to inspect the support bundle and try to determine what's amiss. `sbctl` should already be installed and the customer's support bundle should be in your home folder. `sbctl` simulates having access to the customer's environment, but all of the data is taken from the support bundle. It lets us use the familiar `kubectl` tool to explore the customer's environment, even without direct access.
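
For example, an `sbctl` session might look something like this (the bundle filename here is an assumption; use whichever archive is actually in your home folder):

```bash
# Start an interactive shell backed by the support bundle (filename assumed)
sbctl shell -s ~/support-bundle.tar.gz

# Inside that shell, the familiar kubectl commands read from the bundle
# instead of a live cluster
kubectl get nodes
kubectl get pods -A
```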

When you've identified the problem, write the command you would use to resolve it into a file at `/root/solution.txt`.

The answer should be one line, on the first line of the file.

(The file does not exist yet; you will have to create it with your preferred text editor.)

💡 Using `sbctl`
=================

- Try `sbctl help` to see what commands are available

💡 Hints
=================

- Try the interactive shell prompt using `sbctl` and make sure to provide the path to the support bundle in your home folder

- How are applications deployed in Kubernetes?

- What controls a pod's lifecycle?

💡 More Hints
=================

- How do I see deployments?

Troubleshooting Procedure
=================

Identify the problematic deployment from `kubectl get deployments -n <namespace>`. Notice any deployments that have 0 replicas but should have 1 or more.
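
As a rough sketch, the output might look something like this (the names and counts below are illustrative, not taken from the actual bundle):

```bash
kubectl get deployments -n default
# NAME        READY   UP-TO-DATE   AVAILABLE   AGE
# some-app    0/0     0            0           12d   <-- zero replicas: a likely suspect
# other-app   1/1     1            1           12d
```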

✔️ Solution
==================

A deployment has been scaled to 0 replicas.

🛠️ Remediation
=================

```bash
kubectl scale deployment <deployment-name> --replicas=1
```
@@ -0,0 +1,23 @@
#!/bin/bash
#
# This script runs when the platform checks the challenge.
#
# The platform determines if the script was successful using the exit code of this
# script. If the exit code is not 0, the script fails.

if [[ ! -f "/root/solution.txt" ]]; then
fail-message "solution.txt not found, please create it and write your answer within"
exit 1
fi

# Normalize the answer: first line only, "=" treated as a space, default-namespace flags dropped, whitespace trimmed and collapsed
solution=$(head -n1 "/root/solution.txt" | sed 's/=/ /g' | sed -e 's/--namespace\ default//g' -e 's/-n\ default//g' | sed -re 's/^[[:blank:]]+|[[:blank:]]+$//g' -e 's/[[:blank:]]+/ /g' )

echo "solution: $solution"
echo "wanted : kubectl scale deployment frontend --replicas 1"

if [[ "$solution" = "kubectl scale deployment frontend --replicas 1" ]]; then
exit 0
fi

fail-message "oops, your solution doesn't quite look correct, try again!"
exit 1
@@ -0,0 +1,3 @@
#!/bin/bash

curl https://spooky.academy/support_bundles/troubleshoot_1_support_bundle.tar.gz -o support-bundle.tar.gz
@@ -0,0 +1,4 @@
#!/bin/bash

rm -rf /root/support-bundle*
rm -rf /root/solution*
100 changes: 100 additions & 0 deletions instruqt/troubleshoot-training/03-troubleshoot-2/assignment.md
@@ -0,0 +1,100 @@
---
slug: troubleshoot-2
id: gzv8orjeqdcg
type: challenge
title: CrashLoopBackOff
teaser: "\U0001F648"
notes:
- type: text
contents: Time to fix another problem...
tabs:
- title: Workstation
type: terminal
hostname: cloud-client
difficulty: intermediate
timelimit: 3600
---
The customer opens another issue, but this time pods seem to be crashing.

Let's investigate our app and see if we can identify the issue. Again, we'll use `sbctl` to explore the support bundle.

To pass this challenge, find the faulty resource, save the YAML spec for that resource to `~/solution.yaml`, correct the problem in the resource, then click "Next" to check your work.

💡 Using `sbctl`
=================

- Remember that you can use the interactive shell prompt with `sbctl shell -s <path-to-support-bundle>`

💡 Using `kubectl`
=================

- How do you make `kubectl` print output in YAML format?
  - What if you wanted to save that output to a file?

💡 Hints
=================

- How do you list pods?

- How do you describe pods?
- What if you wanted to see data from multiple pods at once?

- How do you get logs from a pod?
- What if you wanted to see a previous version of the pod's logs?

- When would you look at `describe` output vs. gathering pod logs?

- Review the [Kubernetes documentation on debugging Pods](https://kubernetes.io/docs/tasks/debug/debug-application/debug-running-pod/)

💡 More Hints
=================

- How do you find the exit code of a Pod?

- What could it mean if a Pod is exiting before it has a chance to emit any logs?

- Review the Linux exit code conventions: `0` means the process exited normally, `1`-`127` generally mean that the process exited because of a crash or error, and >`128` generally means that the process was killed by a signal (think Ctrl-C or the `kill` command).

Troubleshooting Procedure
=================

Identify the problematic Pod from `kubectl get pods -n <namespace>`. Notice any pods that are not in the Running state.

Describe the current state of the Pod with `kubectl describe pod -n <namespace> <pod-name>`. Here are some things to look out for:

- each Container's current **State** and **Reason**
- each Container's **Last State** and **Reason**
- the Last State's **Exit Code**
- each Container's **Ready** status
- the **Events** table

For a Pod that is crashing, expect that the current state will be `Waiting`, `Terminated` or `Error`, and the last state will probably also be `Terminated`. Notice the reason for the termination, and especially notice the exit code. There are common conventions for exit codes (such as the shell's rule of `128 + signal number` for processes killed by a signal), but they are not strictly enforced since applications can always set their own exit codes.

In short, if the exit code is >128, then the application exited as a result of Kubernetes killing the Pod. If that's the case, you'll commonly see code 137 or 143, which is 128 + the value of the `kill` signal sent to the container.

If the exit code is between 1 and 127, then the application crashed or exited abnormally on its own. If the exit code is 0, then the application exited normally (most commonly seen in init containers or Jobs/CronJobs).
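
If you want to pull just the last state and exit code out of the `describe` output, something like the following works (the pod name and namespace are placeholders):

```bash
# Show the last termination state, reason, and exit code for a crashing pod
kubectl describe pod <pod-name> -n <namespace> | grep -A 7 "Last State"

# Or query the exit code directly from the pod's status
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'
```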

Look for any Events that may indicate a problem. Events by default last 1 hour, unless they occur repeatedly. Events in a repetition loop are especially noteworthy:

```plaintext
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 2d19h (x9075 over 4d4h) kubelet Back-off restarting failed container sentry-workers in pod sentry-worker-696456b57c-twpj7_default(82eb1dde-2987-4f58-af64-883470ffcb58)
```

Another way to get even more information about a pod is to use the `-o yaml` option with `kubectl get pods`. This will output the entire pod definition in YAML format. This is useful for debugging issues with the pod definition itself. Here you will see some info that isn't present in `describe pods`, such as Annotations, Tolerations, restart policy, ports, and volumes.
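
For example, to capture that full definition for closer inspection or editing (names are placeholders):

```bash
# Print the full pod definition in YAML
kubectl get pod <pod-name> -n <namespace> -o yaml

# Or redirect it to a file so you can study or edit it later
kubectl get pod <pod-name> -n <namespace> -o yaml > pod.yaml
```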

✔️ Solution
=================

One of the deployments has a memory limit that is too low for the Pod to run successfully.

🛠️ Remediation
=================

Write the YAML spec for the affected deployment into a file at `~/solution.yaml`, then increase the memory limit for the Pod to a reasonable amount. You may have to make an educated guess about what the correct memory limit should be.
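
A minimal sketch of that workflow, assuming the memory limit is the only change needed (the deployment name and the new limit below are placeholders, not the actual answer):

```bash
# Save the affected Deployment's spec to the location the check expects
kubectl get deployment <deployment-name> -n <namespace> -o yaml > ~/solution.yaml

# Then edit ~/solution.yaml and raise the memory limit, for example:
#   resources:
#     limits:
#       memory: 256Mi   # up from a value too small for the container to start
```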

To think about:

- How can we make sure that this doesn't happen again?
@@ -0,0 +1,29 @@
#!/bin/bash

if [[ ! -f /root/solution.yaml ]]; then
fail-message "solution.yaml not found in /root/, please create the file and try again"
exit 1
fi

kind=$(yq -r '.kind' /root/solution.yaml)

if [[ ! "$kind" = "Deployment" ]]; then
fail-message "your solution doesn't look correct, you appear to have saved a resource that we weren't expecting"
exit 1
fi

limits=$(yq '.spec.template.spec.containers[0].resources.limits.memory' /root/solution.yaml -r)

if [[ "$limits" = "5M" ]]; then
fail-message "it looks like your solution is incorrect"
echo "limits = 5M"
exit 1
fi

# Convert the human-readable limit (e.g. "256Mi") to a raw byte count for comparison
rawSize=$(humanfriendly --parse-size "$limits")

if [[ ! "$rawSize" -gt "5000000" ]]; then
fail-message "it looks like your solution is incorrect"
echo "limits < 5M"
exit 1
fi
@@ -0,0 +1,7 @@
#!/bin/bash

set -euxo pipefail
rm -rf /root/support-bundle* || true
rm /root/solution.txt || true

curl https://spooky.academy/support_bundles/troubleshoot_2_support_bundle.tar.gz -o support-bundle.tar.gz
@@ -0,0 +1,4 @@
#!/bin/bash

rm -rf /root/support-bundle*
rm -rf /root/solution*