AZ-400: Develop a Site Reliability Engineering (SRE) strategy (5-10%)

Develop an actionable alerting strategy

Analyze system behavior with regard to load and failure conditions
Calculate when a system will fail under various conditions
Measure baseline metrics for the system
Leverage Application Insights Smart Detection and Dynamic thresholds in Azure Monitor
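
As a rough illustration of the baseline and failure-prediction items above, the sketch below derives a simple threshold band from baseline samples and extrapolates a usage trend to estimate when a limit would be reached. It is a static, hand-rolled approximation under stated assumptions: the function names, the choice of three standard deviations, and the hourly sampling interval are all hypothetical, and this is not how Smart Detection or Dynamic thresholds work internally.

```python
from statistics import mean, stdev


def baseline_band(samples, k=3.0):
    """Return (low, high) bounds as mean +/- k sample standard deviations.

    Assumption: values outside this band are treated as anomalous.
    """
    mu = mean(samples)
    sigma = stdev(samples)
    return mu - k * sigma, mu + k * sigma


def hours_until_limit(usage, limit):
    """Fit a least-squares line to hourly usage samples and return the
    number of hours after the last sample at which the line reaches
    `limit`. Returns None if usage is not trending upward.
    """
    n = len(usage)
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(usage)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, usage))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    if slope <= 0:
        return None
    intercept = y_bar - slope * x_bar
    # Solve slope * x + intercept = limit, then measure from the last sample.
    return (limit - intercept) / slope - (n - 1)


if __name__ == "__main__":
    # Hypothetical data: response times in ms, and disk usage in percent.
    print(baseline_band([100, 110, 90, 105, 95]))
    print(hours_until_limit([50, 52, 54, 56, 58], limit=100))
```

A linear fit is only a starting point; real failure conditions are often nonlinear (for example, latency that rises sharply near saturation), so load tests remain the better source for the actual breaking point.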

Analyze system dependencies to determine which dependencies should be included in the health check
Calculate healthy response timeouts based on SLO for the service
Design approach for partial health situations
Design approach for piecemeal recovery
- for example, to improve recovery time objective (RTO) strategies
Integrate health check with compute environment
Implement different types of health checks
- container liveness
- startup
- shutdown
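
The health check items above can be sketched in a few lines. This is a minimal sketch, not a reference implementation: the names `Status`, `dependency_timeout`, `aggregate`, and `ProbeState`, the 0.8 safety factor, and the split of dependencies into critical and non-critical are assumptions made for illustration. It also assumes the SLO is expressed as a latency target in seconds.

```python
from enum import Enum


class Status(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"    # partial health: only non-critical checks fail
    UNHEALTHY = "unhealthy"


def dependency_timeout(slo_latency_s, own_work_s, safety=0.8):
    """Time budget for a dependency call: the SLO latency minus the
    service's own processing time, scaled by a safety factor (assumed)."""
    return max(0.0, (slo_latency_s - own_work_s) * safety)


def aggregate(results):
    """Combine per-dependency results into one status.

    `results` maps a dependency name to (ok, critical).
    """
    status = Status.HEALTHY
    for ok, critical in results.values():
        if ok:
            continue
        if critical:
            return Status.UNHEALTHY
        status = Status.DEGRADED
    return status


class ProbeState:
    """Hypothetical in-process flags that probe endpoints could read."""

    def __init__(self):
        self.started = False        # set once initialization completes
        self.shutting_down = False  # set when a termination signal arrives

    def startup_ok(self):
        return self.started

    def liveness_ok(self):
        # Liveness ignores dependencies on purpose: reaching this code
        # shows the process can still respond, and restarting it would
        # not repair a failing downstream service.
        return True

    def ready(self, status):
        # Stop accepting traffic during shutdown or when unhealthy, but
        # keep serving while merely degraded.
        return (
            self.started
            and not self.shutting_down
            and status is not Status.UNHEALTHY
        )


if __name__ == "__main__":
    state = ProbeState()
    state.started = True
    overall = aggregate({"db": (True, True), "cache": (False, False)})
    print(overall, state.ready(overall), dependency_timeout(0.5, 0.1))
```

Keeping liveness separate from dependency health is a design choice: a failed dependency then degrades or removes the instance from rotation rather than triggering a restart loop.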