Add PD performance health check to gce/vm-performance
fix: 358314141
Change-Id: I213fd65af6772c1e08cf9a0c01614faec25720e4
GitOrigin-RevId: 210e78ee71560d71a683ba28001b922ba8de61db
vinay-vgs authored and copybara-github committed Oct 15, 2024
1 parent acda703 commit 2851aea
Showing 12 changed files with 387 additions and 9 deletions.
6 changes: 3 additions & 3 deletions gcpdiag/runbook/gce/generalized_steps.py
@@ -94,9 +94,9 @@ def execute(self):
elif mark_no_ops_agent:
op.add_skipped(vm,
reason='Ops Agent not installed on the VM, '
- 'Unable to fetch memory utilisation data via metrics'
+ 'Unable to fetch memory utilisation data via metrics\n'
'Falling back to check for Memory related error messages '
- 'in Serial logs')
+ 'in Serial logs\n')
else:
op.add_ok(vm, reason=op.prep_msg(op.SUCCESS_REASON))
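
The hunk above appends `\n` to two message fragments. A minimal sketch of why that matters, using only standard Python: adjacent string literals are joined with no separator, so without an explicit newline the two sentences run together. The variable names `before` and `after` are illustrative and do not appear in the commit.

```python
# Adjacent string literals are concatenated by the Python parser with no
# separator, so sentence boundaries have to be written explicitly.
before = ('Ops Agent not installed on the VM, '
          'Unable to fetch memory utilisation data via metrics'
          'Falling back to check for Memory related error messages '
          'in Serial logs')

after = ('Ops Agent not installed on the VM, '
         'Unable to fetch memory utilisation data via metrics\n'
         'Falling back to check for Memory related error messages '
         'in Serial logs\n')

print(before)  # the two sentences are fused at "metricsFalling"
print(after)   # the second sentence starts on its own line
```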

@@ -161,7 +161,7 @@ def execute(self):
reason='Ops Agent not installed on the VM, '
'Unable to fetch disk utilisation data via metrics.\n'
'Falling back to check for filesystem utilization related'
- ' messages in Serial logs')
+ ' messages in Serial logs\n')
# Fallback to check for filesystem utilization related messages in Serial logs
fs_util = VmSerialLogsCheck()
fs_util.project_id = self.project_id
52 changes: 52 additions & 0 deletions gcpdiag/runbook/gce/snapshots/vm_performance.txt
@@ -13,6 +13,7 @@ gce/vm-performance: Google Compute Engine VM performance checks
- Disk space high utilisation
- High Disk IOPS utilisation
- High Disk Throughput utilisation
- Disk Health check
- Check for Live Migrations
- Usualy Error checks in Serial console logs

@@ -77,6 +78,31 @@ gce/vm-performance: Google Compute Engine VM performance checks
You may check if VM is facing high memory utilisation from GuestOS side using `free -m`
or `cat /proc/meminfo` commands.

[AUTOMATED STEP]: Checking if instance disks are healthy

- gcpdiag-gce-vm-performance/faulty-linux-ssh [FAIL]
[REASON]
You might experience slower/poor performance with your disk 'persistent-disk-0' due to an
ongoing issue with our Compute Engine or Persistent Disk infrastructure. We're working
to resolve this as quickly as possible.

[REMEDIATION]
To better understand the situation with your Compute Engine or Persistent Disks,
could you please take a look at the Google Cloud Status page:

https://status.cloud.google.com

This page provides real-time updates on the health of Google Cloud services.

Additionally, it may be helpful to check the Service Health dashboard in your
Google Cloud Console for any reported incidents:

https://console.cloud.google.com/servicehealth/incidents

If you don't find any information about an ongoing issue related to your concern,
please don't hesitate to reach out to Google Cloud Support by creating a support case.
They'll be happy to investigate further and assist you.

[AUTOMATED STEP]: Checking if VM's Boot disk space utilization is within optimal levels.

- gcpdiag-gce-vm-performance/faulty-linux-ssh [FAIL]
@@ -170,6 +196,7 @@ gce/vm-performance: Google Compute Engine VM performance checks
- Disk space high utilisation
- High Disk IOPS utilisation
- High Disk Throughput utilisation
- Disk Health check
- Check for Live Migrations
- Usualy Error checks in Serial console logs

@@ -261,6 +288,31 @@ gce/vm-performance: Google Compute Engine VM performance checks
You may check if VM is facing high memory utilisation from GuestOS side using `free -m`
or `cat /proc/meminfo` commands.

[AUTOMATED STEP]: Checking if instance disks are healthy

- gcpdiag-gce-vm-performance/faulty-windows-ssh [FAIL]
[REASON]
You might experience slower/poor performance with your disk 'persistent-disk-0' due to an
ongoing issue with our Compute Engine or Persistent Disk infrastructure. We're working
to resolve this as quickly as possible.

[REMEDIATION]
To better understand the situation with your Compute Engine or Persistent Disks,
could you please take a look at the Google Cloud Status page:

https://status.cloud.google.com

This page provides real-time updates on the health of Google Cloud services.

Additionally, it may be helpful to check the Service Health dashboard in your
Google Cloud Console for any reported incidents:

https://console.cloud.google.com/servicehealth/incidents

If you don't find any information about an ongoing issue related to your concern,
please don't hesitate to reach out to Google Cloud Support by creating a support case.
They'll be happy to investigate further and assist you.

[AUTOMATED STEP]: Checking if VM's Boot disk space utilization is within optimal levels.

- gcpdiag-gce-vm-performance/faulty-windows-ssh [FAIL]
33 changes: 33 additions & 0 deletions gcpdiag/runbook/gce/templates/vm_performance.jinja
@@ -272,3 +272,36 @@ To fix this issue:
https://cloud.google.com/compute/docs/disks/modify-persistent-disk#disk_type

{% endblock disk_io_usage_check_failure_remediation %}


{% block disk_health_check_step_message %}
Checking if instance disks are healthy
{% endblock disk_health_check_step_message %}

{% block disk_health_check_success_reason %}
Instance disk "{disk_name}" is healthy.
{% endblock disk_health_check_success_reason %}

{% block disk_health_check_failure_reason %}
You might experience slower/poor performance with your disk '{disk_name}' due to an
ongoing issue with our Compute Engine or Persistent Disk infrastructure. We're working
to resolve this as quickly as possible.
{% endblock disk_health_check_failure_reason %}

{% block disk_health_check_failure_remediation %}
To better understand the situation with your Compute Engine or Persistent Disks,
could you please take a look at the Google Cloud Status page:

https://status.cloud.google.com

This page provides real-time updates on the health of Google Cloud services.

Additionally, it may be helpful to check the Service Health dashboard in your
Google Cloud Console for any reported incidents:

https://console.cloud.google.com/servicehealth/incidents

If you don't find any information about an ongoing issue related to your concern,
please don't hesitate to reach out to Google Cloud Support by creating a support case.
They'll be happy to investigate further and assist you.
{% endblock disk_health_check_failure_remediation %}
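
The template blocks above carry a `{disk_name}` placeholder. As a rough sketch only, assuming the placeholder is filled with Python-style `str.format` semantics (the commit does not show how `op.prep_msg` renders templates, so this is an assumption), the substitution could look like the following. `SUCCESS_REASON_TEMPLATE` and `render_message` are hypothetical names used for illustration.

```python
# Hypothetical stand-in for a rendered template block; the text is copied
# from the disk_health_check_success_reason block.
SUCCESS_REASON_TEMPLATE = 'Instance disk "{disk_name}" is healthy.'


def render_message(template: str, **kwargs) -> str:
    """Fill named placeholders; str.format ignores unused keyword arguments."""
    return template.format(**kwargs)


# Extra keyword arguments (such as a time window) are silently ignored when
# the template does not reference them.
message = render_message(SUCCESS_REASON_TEMPLATE,
                         disk_name='persistent-disk-0',
                         start_time='2024/10/15 00:00:00',
                         end_time='2024/10/15 09:00:00')
print(message)
```

Under this assumption, passing keyword arguments that a block does not reference has no visible effect on the message.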
63 changes: 57 additions & 6 deletions gcpdiag/runbook/gce/vm_performance.py
@@ -40,6 +40,7 @@ class VmPerformance(runbook.DiagnosticTree):
- Disk space high utilisation
- High Disk IOPS utilisation
- High Disk Throughput utilisation
- Disk Health check
- Check for Live Migrations
- Usualy Error checks in Serial console logs
"""
@@ -110,6 +111,7 @@ def build_tree(self):
self.add_step(parent=start, child=cpu_check)
self.add_step(parent=start, child=mem_check)
self.add_step(parent=cpu_check, child=CpuOvercommitmentCheck())
self.add_step(parent=start, child=DiskHealthCheck())
self.add_step(parent=start, child=disk_util_check)

# Check for PD slow Reads/Writes
@@ -220,6 +222,53 @@ def execute(self):
self.add_child(DiskIopsThroughputUtilisationChecks())


class DiskHealthCheck(runbook.Step):
  """Disk Health check"""

  template = 'vm_performance::disk_health_check'

  def execute(self):
    """Instance Disk health check"""

    vm = gce.get_instance(project_id=op.get(flags.PROJECT_ID),
                          zone=op.get(flags.ZONE),
                          instance_name=op.get(flags.NAME))

    start_formatted_string = op.get(
        flags.START_TIME_UTC).strftime('%Y/%m/%d %H:%M:%S')
    end_formatted_string = op.get(
        flags.END_TIME_UTC).strftime('%Y/%m/%d %H:%M:%S')
    within_str = f'within d\'{start_formatted_string}\', d\'{end_formatted_string}\''

    for disk in vm.disks:
      pd_health_metrics = monitoring.query(
          op.get(flags.PROJECT_ID), """
          fetch gce_instance
          | metric 'compute.googleapis.com/instance/disk/performance_status'
          | filter (metric.performance_status != 'Healthy')
          | filter (resource.instance_id == '{}') &&
          (metric.device_name == '{}')
          | group_by 3m,
          [value_performance_status_fraction_true:
          fraction_true(value.performance_status)]
          | every 3m
          | filter value_performance_status_fraction_true > 0
          | {}
          """.format(vm.id, disk['deviceName'], within_str))

      if pd_health_metrics:
        op.add_failed(vm,
                      reason=op.prep_msg(op.FAILURE_REASON,
                                         disk_name=disk['deviceName'],
                                         start_time=start_formatted_string,
                                         end_time=end_formatted_string),
                      remediation=op.prep_msg(op.FAILURE_REMEDIATION))
      else:
        op.add_ok(vm,
                  reason=op.prep_msg(op.SUCCESS_REASON,
                                     disk_name=disk['deviceName']))
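
The query assembly in `DiskHealthCheck.execute` can be isolated for inspection. The following is a minimal sketch, not the project's code: `build_pd_health_query` is a hypothetical helper that reproduces the same query text and `within` clause from plain arguments, so the string can be examined without calling any gcpdiag or Cloud Monitoring API.

```python
from datetime import datetime, timezone

TIME_FORMAT = '%Y/%m/%d %H:%M:%S'


def build_pd_health_query(instance_id: str, device_name: str,
                          start: datetime, end: datetime) -> str:
    """Return the query text for non-healthy disk status in [start, end]."""
    within = (f"within d'{start.strftime(TIME_FORMAT)}', "
              f"d'{end.strftime(TIME_FORMAT)}'")
    return f"""
fetch gce_instance
| metric 'compute.googleapis.com/instance/disk/performance_status'
| filter (metric.performance_status != 'Healthy')
| filter (resource.instance_id == '{instance_id}') &&
  (metric.device_name == '{device_name}')
| group_by 3m,
  [value_performance_status_fraction_true:
    fraction_true(value.performance_status)]
| every 3m
| filter value_performance_status_fraction_true > 0
| {within}
"""


# Illustrative values only; the instance id and time window are made up.
query = build_pd_health_query(
    '1234567890', 'persistent-disk-0',
    datetime(2024, 10, 15, 0, 0, 0, tzinfo=timezone.utc),
    datetime(2024, 10, 15, 9, 0, 0, tzinfo=timezone.utc))
print(query)
```

In the commit, any non-empty result from this query marks the disk as failed, since the final filter keeps only windows in which a non-healthy status was reported.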


class CpuOvercommitmentCheck(runbook.Step):
"""Checking if CPU overcommited beyond threshold"""

@@ -244,7 +293,8 @@ def execute(self):
start_dt_utc_plus_5_mins = start_dt_utc + timedelta(minutes=5)
current_time_utc = datetime.now(timezone.utc)
within_hours = 9
- if start_dt_utc_plus_5_mins > current_time_utc and vm.laststoptimestamp():
+ if (start_dt_utc_plus_5_mins > current_time_utc or
+ not vm.is_running) and vm.laststoptimestamp():
# Instance just starting up, CpuCount might not be available currently via metrics.
# Use instance's last stop time as EndTime for monitoring query
stop_dt_pst = datetime.strptime(vm.laststoptimestamp(),
@@ -280,10 +330,10 @@ def execute(self):
if cpu_count_query:
cpu_count = int(list(cpu_count_query.values())[0]['values'][0][0])
else:
- op.info(
- ('CPU count info not available for the instance.\n'
- 'Please start the VM {} if it is not in running state.').format(
- vm.short_path))
+ op.info((
+ 'CPU count info not available for the instance.\n'
+ 'Please start the VM {} if it is not in running state.\n').format(
+ vm.short_path))
return

# an acceptable average Scheduler Wait Time is 20 ms/s per vCPU.
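
The hunk above reads the CPU count from the first value of the first series and returns early when the query yields nothing. A small sketch of that pattern follows. The result shape (a dict of series, each with a `values` list of rows) is inferred only from the indexing expression in the diff, and `first_metric_value` is a hypothetical helper name.

```python
from typing import Optional


def first_metric_value(query_result: dict) -> Optional[int]:
    """Return the first value of the first series as an int, or None if empty."""
    if not query_result:
        return None
    first_series = next(iter(query_result.values()))
    return int(first_series['values'][0][0])


# Made-up results mimicking the assumed shape.
empty_result = {}
sample_result = {'series-a': {'values': [[4.0]]}}

print(first_metric_value(empty_result))
print(first_metric_value(sample_result))
```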
@@ -357,7 +407,8 @@ def execute(self):
start_dt_utc_plus_5_mins = start_dt_utc + timedelta(minutes=5)
current_time_utc = datetime.now(timezone.utc)
within_hours = 9
- if start_dt_utc_plus_5_mins > current_time_utc and vm.laststoptimestamp():
+ if (start_dt_utc_plus_5_mins > current_time_utc or
+ not vm.is_running) and vm.laststoptimestamp():
# Instance just starting up, CpuCount might not be available currently via metrics.
# Use instance's last stop time as EndTime for monitoring query
stop_dt_pst = datetime.strptime(vm.laststoptimestamp(),
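
The condition changed in the two hunks above decides when to fall back to the instance's last stop time as the end of the monitoring window. A minimal sketch of the new predicate, written as a standalone function so both branches can be exercised; `should_use_last_stop_time` and its parameters are hypothetical names standing in for the values used in the diff.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def should_use_last_stop_time(start_dt_utc: datetime, now_utc: datetime,
                              is_running: bool,
                              last_stop_timestamp: Optional[str]) -> bool:
    """True when the VM started under 5 minutes ago or is not running,
    and a last stop timestamp is available to fall back on."""
    just_started = start_dt_utc + timedelta(minutes=5) > now_utc
    return bool((just_started or not is_running) and last_stop_timestamp)


now = datetime(2024, 10, 15, 12, 0, tzinfo=timezone.utc)
an_hour_ago = now - timedelta(hours=1)

# A stopped VM now takes the fallback even if it last started long ago;
# the previous condition only covered the "just started" case.
print(should_use_last_stop_time(an_hour_ago, now, False, '2024-10-15T03:00:00'))
print(should_use_last_stop_time(an_hour_ago, now, True, '2024-10-15T03:00:00'))
```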
@@ -23,6 +23,7 @@ This runbook is designed to assist you in investigating and understanding the un
- Disk space high utilisation
- High Disk IOPS utilisation
- High Disk Throughput utilisation
- Disk Health check
- Check for Live Migrations
- Usualy Error checks in Serial console logs

@@ -67,6 +68,8 @@ gcpdiag runbook --help

- [Vm Serial Logs Check](/runbook/steps/gce/vm-serial-logs-check)

- [Disk Health Check](/runbook/steps/gce/disk-health-check)

- [High Vm Disk Utilization](/runbook/steps/gce/high-vm-disk-utilization)

- [Vm Serial Logs Check](/runbook/steps/gce/vm-serial-logs-check)
52 changes: 52 additions & 0 deletions website/content/en/runbook/steps/gce/disk-health-check.md
@@ -0,0 +1,52 @@
---
title: "gce/Disk Health Check"
linkTitle: "Disk Health Check"
weight: 3
type: docs
description: >
Disk Health check
---

**Product**: [Compute Engine](https://cloud.google.com/compute)\
**Step Type**: AUTOMATED STEP

### Description

None

### Failure Reason

You might experience slower/poor performance with your disk '{disk_name}' due to an
ongoing issue with our Compute Engine or Persistent Disk infrastructure. We're working
to resolve this as quickly as possible.

### Failure Remediation

To better understand the situation with your Compute Engine or Persistent Disks,
could you please take a look at the Google Cloud Status page:

https://status.cloud.google.com

This page provides real-time updates on the health of Google Cloud services.

Additionally, it may be helpful to check the Service Health dashboard in your
Google Cloud Console for any reported incidents:

https://console.cloud.google.com/servicehealth/incidents

If you don't find any information about an ongoing issue related to your concern,
please don't hesitate to reach out to Google Cloud Support by creating a support case.
They'll be happy to investigate further and assist you.

### Success Reason

Instance disk "{disk_name}" is healthy.



<!--
This file is auto-generated. DO NOT EDIT
Make pages changes in the corresponding jinja template
or python code
-->
8 changes: 8 additions & 0 deletions website/content/en/runbook/steps/logs/_index.md
@@ -0,0 +1,8 @@
---
title: "LOGS"
linkTitle: "logs"
type: docs
weight: 2
---

All steps available in logs
54 changes: 54 additions & 0 deletions website/content/en/runbook/steps/logs/logs-check.md
@@ -0,0 +1,54 @@
---
title: "logs/Logs Check"
linkTitle: "Logs Check"
weight: 3
type: docs
description: >
Assess if a given log query is present or not.
---

**Product**: \
**Step Type**: AUTOMATED STEP

### Description

Checks if a log attribute has a bad or good pattern

### Failure Reason

A known bad value is present within the checked log entry indicating a problem

### Failure Remediation

View Cloud Logging for more details about what is causing this issue.

Run the following cloud logging query in GCP.

Query:
{query}

### Success Reason

The expected good value is present within the checked log entry.

### Uncertain Reason

We are not sure of the outcome; manually check Cloud Logging.

### Uncertain Remediation

View Cloud Logging for more details about what is causing this issue.

Run the following cloud logging query in GCP.

Query:
{query}



<!--
This file is auto-generated. DO NOT EDIT
Make pages changes in the corresponding jinja template
or python code
-->
8 changes: 8 additions & 0 deletions website/content/en/runbook/steps/nat/_index.md
@@ -0,0 +1,8 @@
---
title: "NAT"
linkTitle: "nat"
type: docs
weight: 2
---

All steps available in nat