From 50feedb69835815816b444918c2979568cf254af Mon Sep 17 00:00:00 2001
From: Mary Frances Hull
Date: Fri, 11 Aug 2023 08:20:45 -0700
Subject: [PATCH 1/3] add incident management documentation

---
 .../incident-management.md | 24 +++++++++++++++++++
 1 file changed, 24 insertions(+)
 create mode 100644 content/en/docs/internal-documentation/incident-management.md

diff --git a/content/en/docs/internal-documentation/incident-management.md b/content/en/docs/internal-documentation/incident-management.md
new file mode 100644
index 00000000..e703e765
--- /dev/null
+++ b/content/en/docs/internal-documentation/incident-management.md
@@ -0,0 +1,24 @@
+---
+title: Incident Management
+linkTitle: Incident Management
+---
+
+An incident refers to an event that can happen at any given time and may cause a decrease in the quality or complete outage of one or more of our services. Internal or external customers, our monitoring and alerting systems, or a member of the SRE team can raise an incident.
+
+Preparedness for major incidents is crucial. We have established the following Incident Management processes to ensure SREs can follow predetermined procedures:
+
+- [Incident Management Process](https://source.redhat.com/groups/public/service-delivery/service_delivery_wiki/incident_management_process)
+
+- [Incident Response Cheatsheet](https://github.com/openshift/ops-sop/blob/master/policies/incident_response.asciidoc)
+
+- [Automated Incident Management Process (WebRCA)](https://source.redhat.com/groups/public/service-delivery/service_delivery_wiki/automated_incident_management_process)
+
+
+## Coverage
+Layered Products SRE (LPSRE) provides 24x7 coverage and support
+with primary and secondary on-call SREs responsible for handling production-related issues.
+
+If you need to escalate an incident, please refer to the
+ [Layered Products SRE Escalation Procedure](https://source.redhat.com/groups/public/sre/wiki/cs_sre_escalation_procedure).
+
+

From 18b02c3ad8f8375ecf13512774744d1d929d6c2a Mon Sep 17 00:00:00 2001
From: Mary Frances Hull
Date: Fri, 11 Aug 2023 08:44:27 -0700
Subject: [PATCH 2/3] fix linter issues

---
 .../internal-documentation/incident-management.md | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/content/en/docs/internal-documentation/incident-management.md b/content/en/docs/internal-documentation/incident-management.md
index e703e765..9c41abf0 100644
--- a/content/en/docs/internal-documentation/incident-management.md
+++ b/content/en/docs/internal-documentation/incident-management.md
@@ -3,9 +3,13 @@ title: Incident Management
 linkTitle: Incident Management
 ---
 
-An incident refers to an event that can happen at any given time and may cause a decrease in the quality or complete outage of one or more of our services. Internal or external customers, our monitoring and alerting systems, or a member of the SRE team can raise an incident.
+An incident refers to an event that can happen at any given time and
+ may cause a decrease in the quality or complete outage of one or
+ more of our services. Internal or external customers, our monitoring
+ and alerting systems, or a member of the SRE team can raise an incident.
 
-Preparedness for major incidents is crucial. We have established the following Incident Management processes to ensure SREs can follow predetermined procedures:
+Preparedness for major incidents is crucial. We have established the
+ following Incident Management processes to ensure SREs can follow predetermined procedures:
 
 - [Incident Management Process](https://source.redhat.com/groups/public/service-delivery/service_delivery_wiki/incident_management_process)
 
@@ -13,12 +17,10 @@ Preparedness for major incidents is crucial. We have established the following I
 
 - [Automated Incident Management Process (WebRCA)](https://source.redhat.com/groups/public/service-delivery/service_delivery_wiki/automated_incident_management_process)
 
-
 ## Coverage
+
 Layered Products SRE (LPSRE) provides 24x7 coverage and support
 with primary and secondary on-call SREs responsible for handling production-related issues.
 
 If you need to escalate an incident, please refer to the
  [Layered Products SRE Escalation Procedure](https://source.redhat.com/groups/public/sre/wiki/cs_sre_escalation_procedure).
-
-

From 8ac533357a9385d6624c780f52ff36f9a95831f5 Mon Sep 17 00:00:00 2001
From: Mary Frances Hull
Date: Fri, 11 Aug 2023 12:20:57 -0700
Subject: [PATCH 3/3] pr review

---
 .../docs/internal-documentation/incident-management.md | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/content/en/docs/internal-documentation/incident-management.md b/content/en/docs/internal-documentation/incident-management.md
index 9c41abf0..72a3638f 100644
--- a/content/en/docs/internal-documentation/incident-management.md
+++ b/content/en/docs/internal-documentation/incident-management.md
@@ -3,11 +3,6 @@ title: Incident Management
 linkTitle: Incident Management
 ---
 
-An incident refers to an event that can happen at any given time and
- may cause a decrease in the quality or complete outage of one or
- more of our services. Internal or external customers, our monitoring
- and alerting systems, or a member of the SRE team can raise an incident.
-
 Preparedness for major incidents is crucial. We have established the
  following Incident Management processes to ensure SREs can follow predetermined procedures:
 
@@ -19,8 +14,9 @@ Preparedness for major incidents is crucial. We have established the
 
 ## Coverage
 
-Layered Products SRE (LPSRE) provides 24x7 coverage and support
-with primary and secondary on-call SREs responsible for handling production-related issues.
+Layered Products SRE (LPSRE) provides 24x7 coverage and support.
 
 If you need to escalate an incident, please refer to the
  [Layered Products SRE Escalation Procedure](https://source.redhat.com/groups/public/sre/wiki/cs_sre_escalation_procedure).
+
+**NOTE:** Only escalate an incident if the standard manual notification process using an OHSS ticket has failed.