Systematic approach to troubleshooting failures in deployment #560

mzabaluev · 2024-09-10T18:05:12Z

mzabaluev
Sep 10, 2024
Maintainer

Developers should have the tools to investigate the causes of failures in deployment.

At a minimum, we should be able to get Rust backtraces whenever a service panics.
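One possible way to get backtraces without relying on the `RUST_BACKTRACE` environment variable being set in the deployment is to install a panic hook at service startup. The following is only a minimal sketch using the standard library; the names `install_panic_hook` and `payload_message` are hypothetical, and writing to stderr is an assumption that should be replaced with whatever logging the services actually use.

```rust
use std::any::Any;
use std::backtrace::Backtrace;
use std::panic;

/// Extracts a human-readable message from a panic payload.
/// (Hypothetical helper, named here for illustration only.)
fn payload_message(payload: &(dyn Any + Send)) -> String {
    if let Some(s) = payload.downcast_ref::<&str>() {
        s.to_string()
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.clone()
    } else {
        "<non-string panic payload>".to_string()
    }
}

/// Installs a process-wide panic hook that reports the panic location,
/// the panic message, and a backtrace.
/// (Hypothetical function; stderr stands in for the real log sink.)
pub fn install_panic_hook() {
    panic::set_hook(Box::new(|info| {
        // `force_capture` captures a backtrace regardless of whether
        // RUST_BACKTRACE is set in the environment.
        let backtrace = Backtrace::force_capture();
        let location = info
            .location()
            .map(|l| format!("{}:{}:{}", l.file(), l.line(), l.column()))
            .unwrap_or_else(|| "<unknown location>".to_string());
        let message = payload_message(info.payload());
        eprintln!("service panicked at {location}: {message}\nbacktrace:\n{backtrace}");
    }));
}
```

Calling `install_panic_hook()` at the start of `main` would make every subsequent panic in the process go through the hook. How readable the backtrace is depends on the symbols and debug info available in the deployed binary, so the build profile may need adjusting as well.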

Beyond that, I suggest that a share of on-call incidents be followed up with a failure analysis attached to the incident in the OnCall service. The analysis should trace the cause to a GitHub bug issue containing the post-mortem information or, if the information at hand is insufficient, to an issue proposing additional instrumentation to troubleshoot the problem.

Since incidents tend to be repetitive, the developer on call should not be obliged to investigate every incident that occurred during their shift and was resolved automatically, but at least one incident per shift should be investigated.


Replies: 0 comments
