This repo contains the resources used in the Litmus 2.x demonstration shown in this video tutorial, which is part of the Udemy course Configuring Kubernetes for Reliability with LitmusChaos. The resources are referenced in the instructions below to replicate the demo environment. Also included are basic details about the bank-of-anthos & podtato-kill-with-http-probe chaos workflows described therein.
The following steps provide the testbed configuration instructions (representative commands & manifests are sketched after this list).

- Create a multi-node (preferably 3-node) GKE cluster with node image type: Ubuntu with Docker (ubuntu). Configure cluster access for kubectl.
- Deploy the test applications that will be subjected to chaos
- Install the LitmusChaos 2.x control plane (chaos center) & local cluster-mode chaos (self) agent

  Note: Update your GCP firewall rules to allow traffic to/from the litmusportal server NodePort to ensure successful functioning of the chaos (self) agent.
- Set up the observability infrastructure with kube-prometheus-stack
- Deploy the blackbox exporter to track the podtato-head service's operational characteristics
- Create ServiceMonitor custom resources mapped to the chaos exporter and blackbox exporter
- Add the newly created ServiceMonitors to the Prometheus CR instance & apply it to start scraping the metrics (see the ServiceMonitor sketch after this list)
- Launch the chaos-instrumented dashboard on Grafana to visualize service metrics
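A rough sketch of the cluster and LitmusChaos control-plane setup follows. The cluster name, zone, helm release names, manifest paths, and the blanket NodePort firewall rule are illustrative assumptions rather than values taken from the demo; consult the LitmusChaos 2.x docs for the install variant that matches your version.

```bash
# Create a 3-node GKE cluster using the "Ubuntu with Docker" node image and set up kubectl access
gcloud container clusters create litmus-demo \
  --zone us-central1-a \
  --num-nodes 3 \
  --image-type UBUNTU
gcloud container clusters get-credentials litmus-demo --zone us-central1-a

# Deploy the applications that will be subjected to chaos
# (paths are placeholders for the bank-of-anthos & podtato-head manifests)
kubectl apply -f <path-to-bank-of-anthos-manifests>
kubectl apply -f <path-to-podtato-head-manifests>

# Install the LitmusChaos 2.x control plane (chaos center) via the official helm chart;
# the cluster-mode self agent is registered on first login to the portal
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm install chaos litmuschaos/litmus --namespace litmus --create-namespace

# Open the litmusportal NodePorts in the GCP firewall so the self agent can reach the server
# (look up the actual NodePorts assigned to the litmusportal services first, then narrow the rule)
kubectl get svc -n litmus
gcloud compute firewall-rules create litmus-portal-nodeports \
  --allow tcp:30000-32767 \
  --description "Allow LitmusChaos portal NodePort traffic"
```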
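The observability pieces can be installed from the prometheus-community helm charts; the release and namespace names below are illustrative.

```bash
# Prometheus operator, Prometheus, Alertmanager & Grafana
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace

# Blackbox exporter, used to probe the podtato-head service endpoint from the outside
helm install blackbox-exporter prometheus-community/prometheus-blackbox-exporter \
  --namespace monitoring
```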
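A minimal ServiceMonitor sketch for the chaos exporter is shown below. The service labels, port name, and namespaces are assumptions about a default Litmus install, so verify them against `kubectl get svc chaos-exporter -n litmus -o yaml`. Because the kube-prometheus-stack Prometheus CR selects ServiceMonitors by the helm release label by default, labelling the ServiceMonitor with that release name is usually enough to start scraping; otherwise, adjust the `serviceMonitorSelector` on the Prometheus CR. An analogous ServiceMonitor can be created for the blackbox exporter.

```bash
kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: chaos-exporter-monitor
  namespace: monitoring
  labels:
    # matched by the default serviceMonitorSelector of the kube-prometheus-stack Prometheus CR;
    # change this to your helm release name
    release: kube-prometheus-stack
spec:
  namespaceSelector:
    matchNames:
      - litmus
  selector:
    matchLabels:
      app: chaos-exporter        # label on the chaos-exporter service (verify on your install)
  endpoints:
    - port: tcp                  # port *name* on the chaos-exporter service (verify on your install)
      interval: 10s
      path: /metrics
EOF
```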
This section describes the intent & functioning behind the two sample chaos workflows used in the demonstration.
- Period: 0m0s-5m22s
- Objectives:
  - Introduction to the LitmusChaos control plane (chaos center, viz. litmus portal)
  - Feature overview
- Period: 5m23s-11m20s
- Objectives:
  - Creation of a chaos workflow by selecting & tuning an experiment from the integrated chaoshub
  - Execution & visualization of workflow progress
  - Examination of experiment logs & chaosresults
- Usecase: The workflow injects 100% network packet loss into the balancereader pod, causing a degraded user experience and a semi-operational/faulty e-banking app (an experiment-tuning sketch follows this list).
- Possible Mitigation/Resilience Fix: Configure services with liveness probes/health-checks that flag accessibility errors (by killing/restarting the affected containers), along with additional replicas of the microservice at hand to serve requests (sketched after this list). Further fixes could involve middleware that re-routes requests to replicas on other nodes/geo-locations based on the (degraded) performance characteristics of the service.
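For reference, the fault behind this workflow is the pod-network-loss experiment from the generic chaoshub. The key tunables, set through the portal while constructing the workflow, end up in the ChaosEngine that the workflow embeds; an illustrative excerpt follows (the namespace, labels, and service account are assumptions about the bank-of-anthos deployment, not values taken from the demo).

```yaml
# Illustrative ChaosEngine excerpt for the bank-of-anthos workflow (names/labels are assumptions)
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: balancereader-network-loss
  namespace: default
spec:
  engineState: active
  appinfo:
    appns: default
    applabel: app=balancereader      # target the balancereader pod(s)
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-loss
      spec:
        components:
          env:
            - name: NETWORK_PACKET_LOSS_PERCENTAGE
              value: "100"           # 100% packet loss, as described above
            - name: TOTAL_CHAOS_DURATION
              value: "120"           # chaos duration in seconds
            - name: NETWORK_INTERFACE
              value: "eth0"
```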
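A minimal sketch of the mitigation described above, assuming balancereader exposes an HTTP health endpoint (the path and port below are placeholders; use the service's actual health-check route):

```yaml
# Deployment excerpt: run extra replicas and let the kubelet restart unresponsive containers
spec:
  replicas: 2
  template:
    spec:
      containers:
        - name: balancereader
          livenessProbe:
            httpGet:
              path: /healthy         # placeholder health endpoint
              port: 8080             # placeholder port
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3
```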
- Period: 11m21s-16m41s
- Objectives:
  - Creation of a chaos workflow from a pre-existing template
  - Steady-state hypothesis validation through the chaos duration using the Litmus HTTP probe (see the probe sketch after this list)
  - Visualization of chaos impact/manual SLO checks via chaos-interleaved Grafana dashboards
  - Examination of experiment logs & chaosresults (with probe successes/failures)
- Usecase: The workflow injects a pod kill/deletion fault on the single-replica podtato-head application, causing the availability percentage to drop below the set threshold & violating the access latency limits until the pod is rescheduled and initialized.
- Possible Mitigation/Resilience Fix: Follow deployment best practices with multi-replica deployments so that kube-proxy can route requests to other live endpoints (sketched after this list).
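The steady-state hypothesis in this workflow is enforced with a Litmus httpProbe running in Continuous mode against the podtato-head service. A rough sketch of the probe block as it sits under the experiment spec in the ChaosEngine/workflow is shown below; the URL, timings, and criteria are illustrative assumptions rather than the exact values used in the demo.

```yaml
# httpProbe excerpt (Litmus 2.x schema); URL and run properties are illustrative
probe:
  - name: check-podtato-head-availability
    type: httpProbe
    mode: Continuous                 # evaluated repeatedly for the whole chaos duration
    httpProbe/inputs:
      url: http://podtato-head.default.svc.cluster.local:9000
      insecureSkipVerify: false
      method:
        get:
          criteria: ==               # probe passes while the service keeps returning 200
          responseCode: "200"
    runProperties:
      probeTimeout: 5
      interval: 2
      retry: 1
      probePollingInterval: 1
```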
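The corresponding resilience fix is simply to keep more than one replica behind the service, e.g. (deployment name assumed):

```bash
# Scale the podtato-head deployment (name assumed) so the Service retains live endpoints during a pod kill
kubectl scale deployment podtato-head --replicas=2
```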
To learn more about LitmusChaos 2.x, refer to the documentation. Have a look at the Udemy course Configuring Kubernetes for Reliability with LitmusChaos.