Add GCE recommended alerts for GPU VMs #774

LujieDuan · 2024-06-26T18:27:04Z

b/343920635
This PR adds 4 new alert templates covering:

GPU utilization, per instance and across all instances;
GPU memory utilization (i.e., the percentage of memory used), per instance and across all instances. Memory utilization is calculated from agent.googleapis.com/gpu/memory/bytes_used as the ratio of the "used" series to the sum of the "used" and "free" series (distinguished by the memory_state label). The queries are implemented in MQL.
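The ratio described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not code from this PR: the function names and the 90% default (taken from the alert descriptions) are assumptions for the example.

```python
def gpu_memory_utilization_percent(used_bytes: float, free_bytes: float) -> float:
    """Return used / (used + free) as a percentage.

    Hypothetical helper: mirrors the ratio the alert queries compute from the
    "used" and "free" memory_state series of gpu/memory/bytes_used.
    """
    total = used_bytes + free_bytes
    if total <= 0:
        raise ValueError("used_bytes + free_bytes must be positive")
    return 100.0 * used_bytes / total


def exceeds_threshold(used_bytes: float, free_bytes: float,
                      threshold_percent: float = 90.0) -> bool:
    """True when utilization is strictly above the threshold (assumed 90%)."""
    return gpu_memory_utilization_percent(used_bytes, free_bytes) > threshold_percent
```

For example, a GPU with 30 GiB used and 10 GiB free is at 75% utilization and would not trip a 90% threshold.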

Screenshots of alert notifications:

yqlu · 2024-06-27T18:30:02Z

alerts/google-gce/metadata.yaml

@@ -69,3 +69,31 @@ alert_policy_templates:
  related_integrations:
    - id: gce
      platform: GCP
+-
+  id: gpu-utilization-too-high
+  description: "Monitors GPU utilization across all GCE VMs in the current project and will notify you if the GPU utilization on any VM instance rises above 90% for 5 minutes or more. This requires the Ops Agent to be installed on VMs to collect the gpu utilization metric."


gpu -> GPU (update for all descriptions please)

yqlu · 2024-06-27T18:35:07Z

alerts/google-gce/gpu-memory-utilization-too-high-within-vm.v1.json

+        "trigger": {
+          "count": 1
+        },
+        "query": "{ fetch gce_instance\n  | metric 'agent.googleapis.com/gpu/memory/bytes_used'\n  | filter (metadata.system_labels.name == '${INSTANCE_NAME}')\n  | filter metric.memory_state == 'used'\n  | group_by 5m, [value_bytes_used_mean: mean(value.bytes_used)]\n  | every 5m\n  | group_by [metric.gpu_number, metric.model, metric.uuid, resource.instance_id, resource.project_id, resource.zone, metadata.system_labels.name], [value_bytes_used_mean_aggregate: aggregate(value_bytes_used_mean)]\n; fetch gce_instance\n  | metric 'agent.googleapis.com/gpu/memory/bytes_used' \n  | filter (metadata.system_labels.name == '${INSTANCE_NAME}')\n  | group_by 5m, [value_bytes_used_mean: mean(value.bytes_used)]\n  | every 5m\n  | group_by [metric.gpu_number, metric.model, metric.uuid, resource.instance_id, resource.project_id, resource.zone, metadata.system_labels.name], [value_bytes_used_mean_aggregate: aggregate(value_bytes_used_mean)] }\n| ratio\n| mul (100)\n| cast_units ('%')\n| every 5m\n| condition val() > 0.9 '10^2.%'"


Mild preference to express these new recommended alerts as equivalent PromQL instead of MQL going forward

cc @lyanco

lyanco · 2024-06-27

Strong preference. We're announcing the MQL deprecation on July 17th.
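As a rough illustration of the PromQL alternative requested above, the following Python sketch assembles a PromQL expression for the all-instances memory-utilization ratio. It is an untested assumption, not the query from this PR: the helper name, the grouping labels, and the 90% / 5m defaults are chosen for the example, and the metric name follows Cloud Monitoring's usual mapping of `agent.googleapis.com/gpu/memory/bytes_used` to `agent_googleapis_com:gpu_memory_bytes_used`. Verify the expression in Metrics Explorer before using it in a template.

```python
# Assumed PromQL name for agent.googleapis.com/gpu/memory/bytes_used.
METRIC = "agent_googleapis_com:gpu_memory_bytes_used"

# Assumed grouping labels, mirroring the group_by list in the MQL query.
GROUP_BY = ("project_id", "zone", "instance_id", "gpu_number", "model", "uuid")


def gpu_memory_utilization_promql(threshold_percent: float = 90,
                                  window: str = "5m") -> str:
    """Build a PromQL alert expression: 100 * used / (all states) > threshold.

    Hypothetical helper for illustration only.
    """
    by = ", ".join(GROUP_BY)
    base = 'monitored_resource="gce_instance"'
    # Numerator: only the "used" memory_state series, averaged over the window.
    used = (f'sum by ({by}) (avg_over_time('
            f'{METRIC}{{{base}, memory_state="used"}}[{window}]))')
    # Denominator: every memory_state series (used + free) summed together.
    total = f'sum by ({by}) (avg_over_time({METRIC}{{{base}}}[{window}]))'
    return f"100 * {used} / {total} > {threshold_percent:g}"


if __name__ == "__main__":
    print(gpu_memory_utilization_promql())
```

The numerator and denominator are grouped by the same labels so that PromQL's one-to-one vector matching pairs each GPU's "used" series with its own total.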

LujieDuan marked this pull request as ready for review June 26, 2024 18:29

Add GCE recommended alerts for GPU VMs

d33b8b0

LujieDuan force-pushed the lujieduan-gce-gpu-recommended-alerts branch from a5b648d to d33b8b0 Compare June 27, 2024 14:44

LujieDuan marked this pull request as draft June 27, 2024 14:48

LujieDuan marked this pull request as ready for review June 27, 2024 14:48

yqlu requested changes Jun 27, 2024

lyanco Jun 27, 2024
