Add PD performance health check to gce/vm-performance

fix: 358314141
Change-Id: I213fd65af6772c1e08cf9a0c01614faec25720e4
GitOrigin-RevId: 210e78ee71560d71a683ba28001b922ba8de61db
GoogleCloudPlatform · Oct 15, 2024 · 2851aea
1 parent acda703
commit 2851aea

Showing 12 changed files with 387 additions and 9 deletions.
diff --git a/gcpdiag/runbook/gce/generalized_steps.py b/gcpdiag/runbook/gce/generalized_steps.py
@@ -94,9 +94,9 @@ def execute(self):
     elif mark_no_ops_agent:
       op.add_skipped(vm,
                      reason='Ops Agent not installed on the VM, '
-                     'Unable to fetch memory utilisation data via metrics'
+                     'Unable to fetch memory utilisation data via metrics\n'
                      'Falling back to check for Memory related error messages '
-                     'in Serial logs')
+                     'in Serial logs\n')
     else:
       op.add_ok(vm, reason=op.prep_msg(op.SUCCESS_REASON))
 
@@ -161,7 +161,7 @@ def execute(self):
                      reason='Ops Agent not installed on the VM, '
                      'Unable to fetch disk utilisation data via metrics.\n'
                      'Falling back to check for filesystem utilization related'
-                     ' messages in Serial logs')
+                     ' messages in Serial logs\n')
       # Fallback to check for filesystem utilization related messages in Serial logs
       fs_util = VmSerialLogsCheck()
       fs_util.project_id = self.project_id

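Note on the hunks above: the added `\n` matters because Python joins adjacent string literals with no separator, so the two sentences of the skip reason previously ran together. A minimal sketch of the effect, reusing the message text from the first hunk (the variable names `before` and `after` are illustrative only):

```python
# Adjacent string literals are concatenated with no separator between them.
before = ('Ops Agent not installed on the VM, '
          'Unable to fetch memory utilisation data via metrics'
          'Falling back to check for Memory related error messages '
          'in Serial logs')
after = ('Ops Agent not installed on the VM, '
         'Unable to fetch memory utilisation data via metrics\n'
         'Falling back to check for Memory related error messages '
         'in Serial logs\n')

# Without the newline, the two sentences run together.
assert 'metricsFalling' in before
# With it, the fallback notice starts on its own line.
assert after.splitlines()[1].startswith('Falling back')
```
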
diff --git a/gcpdiag/runbook/gce/snapshots/vm_performance.txt b/gcpdiag/runbook/gce/snapshots/vm_performance.txt
@@ -13,6 +13,7 @@ gce/vm-performance:  Google Compute Engine VM performance checks
     - Disk space high utilisation
     - High Disk IOPS utilisation
     - High Disk Throughput utilisation
+    - Disk Health check
     - Check for Live Migrations
     - Usualy Error checks in Serial console logs
 
@@ -77,6 +78,31 @@ gce/vm-performance:  Google Compute Engine VM performance checks
      You may check if VM is facing high memory utilisation from GuestOS side using `free -m`
      or `cat /proc/meminfo` commands.
 
+[AUTOMATED STEP]: Checking if instance disks are healthy
+
+   - gcpdiag-gce-vm-performance/faulty-linux-ssh                          [FAIL]
+     [REASON]
+     You might experience slower/poor performance with your disk 'persistent-disk-0' due to an
+     ongoing issue with our Compute Engine or Persistent Disk infrastructure. We're working
+     to resolve this as quickly as possible.
+
+     [REMEDIATION]
+     To better understand the situation with your Compute Engine or Persistent Disks,
+     could you please take a look at the Google Cloud Status page:
+
+     https://status.cloud.google.com
+
+     This page provides real-time updates on the health of Google Cloud services.
+
+     Additionally, it may be helpful to check the Service Health dashboard in your
+     Google Cloud Console for any reported incidents:
+
+     https://console.cloud.google.com/servicehealth/incidents
+
+     If you don't find any information about an ongoing issue related to your concern,
+     please don't hesitate to reach out to Google Cloud Support by creating a support case.
+     They'll be happy to investigate further and assist you.
+
 [AUTOMATED STEP]: Checking if VM's Boot disk space utilization is within optimal levels.
 
    - gcpdiag-gce-vm-performance/faulty-linux-ssh                          [FAIL]
@@ -170,6 +196,7 @@ gce/vm-performance:  Google Compute Engine VM performance checks
     - Disk space high utilisation
     - High Disk IOPS utilisation
     - High Disk Throughput utilisation
+    - Disk Health check
     - Check for Live Migrations
     - Usualy Error checks in Serial console logs
 
@@ -261,6 +288,31 @@ gce/vm-performance:  Google Compute Engine VM performance checks
      You may check if VM is facing high memory utilisation from GuestOS side using `free -m`
      or `cat /proc/meminfo` commands.
 
+[AUTOMATED STEP]: Checking if instance disks are healthy
+
+   - gcpdiag-gce-vm-performance/faulty-windows-ssh                        [FAIL]
+     [REASON]
+     You might experience slower/poor performance with your disk 'persistent-disk-0' due to an
+     ongoing issue with our Compute Engine or Persistent Disk infrastructure. We're working
+     to resolve this as quickly as possible.
+
+     [REMEDIATION]
+     To better understand the situation with your Compute Engine or Persistent Disks,
+     could you please take a look at the Google Cloud Status page:
+
+     https://status.cloud.google.com
+
+     This page provides real-time updates on the health of Google Cloud services.
+
+     Additionally, it may be helpful to check the Service Health dashboard in your
+     Google Cloud Console for any reported incidents:
+
+     https://console.cloud.google.com/servicehealth/incidents
+
+     If you don't find any information about an ongoing issue related to your concern,
+     please don't hesitate to reach out to Google Cloud Support by creating a support case.
+     They'll be happy to investigate further and assist you.
+
 [AUTOMATED STEP]: Checking if VM's Boot disk space utilization is within optimal levels.
 
    - gcpdiag-gce-vm-performance/faulty-windows-ssh                        [FAIL]

diff --git a/gcpdiag/runbook/gce/templates/vm_performance.jinja b/gcpdiag/runbook/gce/templates/vm_performance.jinja
@@ -272,3 +272,36 @@ To fix this issue:
     https://cloud.google.com/compute/docs/disks/modify-persistent-disk#disk_type
 
 {% endblock disk_io_usage_check_failure_remediation %}
+
+
+{% block  disk_health_check_step_message %}
+Checking if instance disks are healthy
+{% endblock  disk_health_check_step_message %}
+
+{% block  disk_health_check_success_reason %}
+Instance disk "{disk_name}" is healthy.
+{% endblock  disk_health_check_success_reason %}
+
+{% block  disk_health_check_failure_reason %}
+You might experience slower/poor performance with your disk '{disk_name}' due to an
+ongoing issue with our Compute Engine or Persistent Disk infrastructure. We're working
+to resolve this as quickly as possible.
+{% endblock  disk_health_check_failure_reason %}
+
+{% block  disk_health_check_failure_remediation %}
+To better understand the situation with your Compute Engine or Persistent Disks,
+could you please take a look at the Google Cloud Status page:
+
+https://status.cloud.google.com
+
+This page provides real-time updates on the health of Google Cloud services.
+
+Additionally, it may be helpful to check the Service Health dashboard in your
+Google Cloud Console for any reported incidents:
+
+https://console.cloud.google.com/servicehealth/incidents
+
+If you don't find any information about an ongoing issue related to your concern,
+please don't hesitate to reach out to Google Cloud Support by creating a support case.
+They'll be happy to investigate further and assist you.
+{% endblock  disk_health_check_failure_remediation %}
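
Note on the template blocks above: the `{disk_name}` placeholder appears filled in as 'persistent-disk-0' in the snapshot output. The sketch below assumes the substitution behaves like Python's `str.format`; this is an assumption, since the internals of `op.prep_msg` are not part of this diff, and `render` is a hypothetical stand-in rather than a gcpdiag function.

```python
# Text copied from the disk_health_check_failure_reason block.
FAILURE_REASON_TEMPLATE = (
    "You might experience slower/poor performance with your disk "
    "'{disk_name}' due to an\n"
    "ongoing issue with our Compute Engine or Persistent Disk "
    "infrastructure. We're working\n"
    "to resolve this as quickly as possible.")


def render(template: str, **values: str) -> str:
    """Hypothetical stand-in for op.prep_msg, assuming str.format semantics."""
    return template.format(**values)


message = render(FAILURE_REASON_TEMPLATE, disk_name='persistent-disk-0')
print(message)

# str.format ignores keyword arguments the template never references, so
# passing extra values such as start_time and end_time is harmless here.
same_message = render(FAILURE_REASON_TEMPLATE,
                      disk_name='persistent-disk-0',
                      start_time='2024/10/15 00:00:00',
                      end_time='2024/10/15 08:00:00')
```

Under that assumption, the `start_time` and `end_time` values that the new step passes for the failure reason would simply go unused, because the block references only `{disk_name}`.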
diff --git a/gcpdiag/runbook/gce/vm_performance.py b/gcpdiag/runbook/gce/vm_performance.py
@@ -40,6 +40,7 @@ class VmPerformance(runbook.DiagnosticTree):
     - Disk space high utilisation
     - High Disk IOPS utilisation
     - High Disk Throughput utilisation
+    - Disk Health check
     - Check for Live Migrations
     - Usualy Error checks in Serial console logs
   """
@@ -110,6 +111,7 @@ def build_tree(self):
     self.add_step(parent=start, child=cpu_check)
     self.add_step(parent=start, child=mem_check)
     self.add_step(parent=cpu_check, child=CpuOvercommitmentCheck())
+    self.add_step(parent=start, child=DiskHealthCheck())
     self.add_step(parent=start, child=disk_util_check)
 
     # Check for PD slow Reads/Writes
@@ -220,6 +222,53 @@ def execute(self):
       self.add_child(DiskIopsThroughputUtilisationChecks())
 
 
+class DiskHealthCheck(runbook.Step):
+  """Disk Health check"""
+
+  template = 'vm_performance::disk_health_check'
+
+  def execute(self):
+    """Instance Disk health check"""
+
+    vm = gce.get_instance(project_id=op.get(flags.PROJECT_ID),
+                          zone=op.get(flags.ZONE),
+                          instance_name=op.get(flags.NAME))
+
+    start_formatted_string = op.get(
+        flags.START_TIME_UTC).strftime('%Y/%m/%d %H:%M:%S')
+    end_formatted_string = op.get(
+        flags.END_TIME_UTC).strftime('%Y/%m/%d %H:%M:%S')
+    within_str = f'within d\'{start_formatted_string}\', d\'{end_formatted_string}\''
+
+    for disk in vm.disks:
+      pd_health_metrics = monitoring.query(
+          op.get(flags.PROJECT_ID), """
+            fetch gce_instance
+              | metric 'compute.googleapis.com/instance/disk/performance_status'
+              | filter (metric.performance_status != 'Healthy')
+              | filter (resource.instance_id == '{}') &&
+                (metric.device_name == '{}')
+              | group_by 3m,
+                  [value_performance_status_fraction_true:
+                    fraction_true(value.performance_status)]
+              | every 3m
+              | filter value_performance_status_fraction_true > 0
+              | {}
+            """.format(vm.id, disk['deviceName'], within_str))
+
+      if pd_health_metrics:
+        op.add_failed(vm,
+                      reason=op.prep_msg(op.FAILURE_REASON,
+                                         disk_name=disk['deviceName'],
+                                         start_time=start_formatted_string,
+                                         end_time=end_formatted_string),
+                      remediation=op.prep_msg(op.FAILURE_REMEDIATION))
+      else:
+        op.add_ok(vm,
+                  reason=op.prep_msg(op.SUCCESS_REASON,
+                                     disk_name=disk['deviceName']))
+
+
 class CpuOvercommitmentCheck(runbook.Step):
   """Checking if CPU overcommited beyond threshold"""
 
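
Note on `DiskHealthCheck` above: the step runs one MQL query per attached disk and treats any returned series as a failure. The sketch below only reproduces how the query text is assembled, so it can be inspected without calling Cloud Monitoring. `build_pd_health_query` and `MQL_TEMPLATE` are hypothetical names, not part of gcpdiag, and named placeholders are used here instead of the positional `{}` in the commit.

```python
from datetime import datetime, timezone

# Same query shape as in the commit, with named placeholders.
MQL_TEMPLATE = """
fetch gce_instance
  | metric 'compute.googleapis.com/instance/disk/performance_status'
  | filter (metric.performance_status != 'Healthy')
  | filter (resource.instance_id == '{instance_id}') &&
    (metric.device_name == '{device_name}')
  | group_by 3m,
      [value_performance_status_fraction_true:
        fraction_true(value.performance_status)]
  | every 3m
  | filter value_performance_status_fraction_true > 0
  | within d'{start}', d'{end}'
"""


def build_pd_health_query(instance_id: str, device_name: str,
                          start: datetime, end: datetime) -> str:
    """Return the MQL text for one disk over the given time window."""
    fmt = '%Y/%m/%d %H:%M:%S'  # same timestamp format as the commit
    return MQL_TEMPLATE.format(instance_id=instance_id,
                               device_name=device_name,
                               start=start.strftime(fmt),
                               end=end.strftime(fmt))


query = build_pd_health_query(
    '1234567890', 'persistent-disk-0',
    datetime(2024, 10, 15, 0, 0, 0, tzinfo=timezone.utc),
    datetime(2024, 10, 15, 8, 0, 0, tzinfo=timezone.utc))
print(query)
```

The instance ID and time window in the usage lines are made-up sample values.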
@@ -244,7 +293,8 @@ def execute(self):
       start_dt_utc_plus_5_mins = start_dt_utc + timedelta(minutes=5)
       current_time_utc = datetime.now(timezone.utc)
       within_hours = 9
-      if start_dt_utc_plus_5_mins > current_time_utc and vm.laststoptimestamp():
+      if (start_dt_utc_plus_5_mins > current_time_utc or
+          not vm.is_running) and vm.laststoptimestamp():
         # Instance just starting up, CpuCount might not be available currently via metrics.
         # Use instance's last stop time as EndTime for monitoring query
         stop_dt_pst = datetime.strptime(vm.laststoptimestamp(),
@@ -280,10 +330,10 @@ def execute(self):
         if cpu_count_query:
           cpu_count = int(list(cpu_count_query.values())[0]['values'][0][0])
         else:
-          op.info(
-              ('CPU count info not available for the instance.\n'
-               'Please start the VM {} if it is not in running state.').format(
-                   vm.short_path))
+          op.info((
+              'CPU count info not available for the instance.\n'
+              'Please start the VM {} if it is not in running state.\n').format(
+                  vm.short_path))
           return
 
       # an acceptable average Scheduler Wait Time is 20 ms/s per vCPU.
@@ -357,7 +407,8 @@ def execute(self):
     start_dt_utc_plus_5_mins = start_dt_utc + timedelta(minutes=5)
     current_time_utc = datetime.now(timezone.utc)
     within_hours = 9
-    if start_dt_utc_plus_5_mins > current_time_utc and vm.laststoptimestamp():
+    if (start_dt_utc_plus_5_mins > current_time_utc or
+        not vm.is_running) and vm.laststoptimestamp():
       # Instance just starting up, CpuCount might not be available currently via metrics.
       # Use instance's last stop time as EndTime for monitoring query
       stop_dt_pst = datetime.strptime(vm.laststoptimestamp(),

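Note on the two condition changes above: the fallback to the last stop time now also applies when the VM is not running, not only when it started within the last five minutes. A minimal sketch of the new predicate follows; `use_last_stop_time` is a hypothetical helper, and treating `start_dt_utc` as the VM's most recent start time is an assumption based on the surrounding comment.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def use_last_stop_time(start_dt_utc: datetime, now_utc: datetime,
                       is_running: bool,
                       last_stop_timestamp: Optional[str]) -> bool:
    """Mirror the new condition: (just started OR not running) AND a stop time exists."""
    just_started = start_dt_utc + timedelta(minutes=5) > now_utc
    return bool((just_started or not is_running) and last_stop_timestamp)


now = datetime(2024, 10, 15, 12, 0, tzinfo=timezone.utc)
started_yesterday = now - timedelta(days=1)
stop_ts = 'placeholder-stop-timestamp'  # any non-empty value; format not modelled

# A stopped VM that started long ago: the old condition skipped the fallback,
# the new one takes it.
stopped_vm = use_last_stop_time(started_yesterday, now, False, stop_ts)
# A long-running VM still uses the normal query window.
running_vm = use_last_stop_time(started_yesterday, now, True, stop_ts)
```
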
diff --git a/website/content/en/runbook/diagnostic-trees/gce/vm-performance.md b/website/content/en/runbook/diagnostic-trees/gce/vm-performance.md
@@ -23,6 +23,7 @@ This runbook is designed to assist you in investigating and understanding the un
     - Disk space high utilisation
     - High Disk IOPS utilisation
     - High Disk Throughput utilisation
+    - Disk Health check
     - Check for Live Migrations
     - Usualy Error checks in Serial console logs
 
@@ -67,6 +68,8 @@ gcpdiag runbook --help
 
   - [Vm Serial Logs Check](/runbook/steps/gce/vm-serial-logs-check)
 
+  - [Disk Health Check](/runbook/steps/gce/disk-health-check)
+
   - [High Vm Disk Utilization](/runbook/steps/gce/high-vm-disk-utilization)
 
   - [Vm Serial Logs Check](/runbook/steps/gce/vm-serial-logs-check)

diff --git a/website/content/en/runbook/steps/gce/disk-health-check.md b/website/content/en/runbook/steps/gce/disk-health-check.md
@@ -0,0 +1,52 @@
+---
+title: "gce/Disk Health Check"
+linkTitle: "Disk Health Check"
+weight: 3
+type: docs
+description: >
+  Disk Health check
+---
+
+**Product**: [Compute Engine](https://cloud.google.com/compute)\
+**Step Type**: AUTOMATED STEP
+
+### Description
+
+None
+
+### Failure Reason
+
+You might experience slower/poor performance with your disk '{disk_name}' due to an
+ongoing issue with our Compute Engine or Persistent Disk infrastructure. We're working
+to resolve this as quickly as possible.
+
+### Failure Remediation
+
+To better understand the situation with your Compute Engine or Persistent Disks,
+could you please take a look at the Google Cloud Status page:
+
+https://status.cloud.google.com
+
+This page provides real-time updates on the health of Google Cloud services.
+
+Additionally, it may be helpful to check the Service Health dashboard in your
+Google Cloud Console for any reported incidents:
+
+https://console.cloud.google.com/servicehealth/incidents
+
+If you don't find any information about an ongoing issue related to your concern,
+please don't hesitate to reach out to Google Cloud Support by creating a support case.
+They'll be happy to investigate further and assist you.
+
+### Success Reason
+
+Instance disk "{disk_name}" is healthy.
+
+
+
+<!--
+This file is auto-generated. DO NOT EDIT
+
+Make page changes in the corresponding jinja template
+or python code
+-->
diff --git a/website/content/en/runbook/steps/logs/_index.md b/website/content/en/runbook/steps/logs/_index.md
@@ -0,0 +1,8 @@
+---
+title: "LOGS"
+linkTitle: "logs"
+type: docs
+weight: 2
+---
+
+All steps available in logs
diff --git a/website/content/en/runbook/steps/logs/logs-check.md b/website/content/en/runbook/steps/logs/logs-check.md
@@ -0,0 +1,54 @@
+---
+title: "logs/Logs Check"
+linkTitle: "Logs Check"
+weight: 3
+type: docs
+description: >
+  Assess if a given log query is present or not.
+---
+
+**Product**: \
+**Step Type**: AUTOMATED STEP
+
+### Description
+
+Checks if a log attribute has a bad or good pattern
+
+### Failure Reason
+
+A known bad value is present within the checked log entry indicating a problem
+
+### Failure Remediation
+
+View Cloud Logging to get more details about what is causing this issue.
+
+Run the following cloud logging query in GCP.
+
+Query:
+{query}
+
+### Success Reason
+
+The expected good value is present within the checked log entry.
+
+### Uncertain Reason
+
+We are not sure of the outcome; manually check this Cloud Logging query.
+
+### Uncertain Remediation
+
+View Cloud Logging to get more details about what is causing this issue.
+
+Run the following cloud logging query in GCP.
+
+Query:
+{query}
+
+
+
+<!--
+This file is auto-generated. DO NOT EDIT
+
+Make page changes in the corresponding jinja template
+or python code
+-->
diff --git a/website/content/en/runbook/steps/nat/_index.md b/website/content/en/runbook/steps/nat/_index.md
@@ -0,0 +1,8 @@
+---
+title: "NAT"
+linkTitle: "nat"
+type: docs
+weight: 2
+---
+
+All steps available in nat