Add GCE recommended alerts for GPU VMs #774

Open
LujieDuan wants to merge 1 commit into master from lujieduan-gce-gpu-recommended-alerts
Contributor

@LujieDuan LujieDuan commented Jun 26, 2024

b/343920635
This PR adds 4 new alert templates covering:

  • GPU utilization - per instance and all instances;
  • GPU memory utilization (i.e., percentage of memory used) - per instance and all instances. Memory utilization is calculated from agent.googleapis.com/gpu/memory/bytes_used as the ratio of the "used" memory_state value to the sum of the "used" and "free" values. The queries are implemented in MQL.
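
For reviewers, the memory utilization arithmetic described above can be sketched in Python. This is only an illustration of the ratio and the 90% threshold used by the alert condition, not the query itself; the function names and the sample byte counts are hypothetical.

```python
# Illustrative sketch only: mirrors the used / (used + free) ratio described
# in this PR. Function names and sample values are hypothetical.

THRESHOLD_PERCENT = 90.0  # the alert condition fires above 90%


def gpu_memory_utilization_percent(used_bytes: float, free_bytes: float) -> float:
    """Return used / (used + free) expressed as a percentage."""
    total = used_bytes + free_bytes
    if total <= 0:
        raise ValueError("used_bytes + free_bytes must be positive")
    return 100.0 * used_bytes / total


def exceeds_threshold(used_bytes: float, free_bytes: float,
                      threshold_percent: float = THRESHOLD_PERCENT) -> bool:
    """Return True when utilization is strictly above the threshold."""
    return gpu_memory_utilization_percent(used_bytes, free_bytes) > threshold_percent


if __name__ == "__main__":
    gib = 2**30
    # Hypothetical GPU with 15 GiB used and 1 GiB free.
    print(gpu_memory_utilization_percent(15 * gib, 1 * gib))
    print(exceeds_threshold(15 * gib, 1 * gib))
```

The comparison is strict, matching the `>` in the alert condition, so a GPU sitting at exactly 90% would not trigger it in this sketch.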

Screenshots of alert notifications:

@LujieDuan LujieDuan marked this pull request as ready for review June 26, 2024 18:29
@LujieDuan LujieDuan force-pushed the lujieduan-gce-gpu-recommended-alerts branch from a5b648d to d33b8b0 Compare June 27, 2024 14:44
@LujieDuan LujieDuan marked this pull request as draft June 27, 2024 14:48
@LujieDuan LujieDuan marked this pull request as ready for review June 27, 2024 14:48
@@ -69,3 +69,31 @@ alert_policy_templates:
related_integrations:
- id: gce
platform: GCP
-
id: gpu-utilization-too-high
description: "Monitors GPU utilization across all GCE VMs in the current project and will notify you if the GPU utilization on any VM instance rises above 90% for 5 minutes or more. This requires the Ops Agent to be installed on VMs to collect the gpu utilization metric."
Collaborator


gpu -> GPU (update for all descriptions please)

"trigger": {
"count": 1
},
"query": "{ fetch gce_instance\n | metric 'agent.googleapis.com/gpu/memory/bytes_used'\n | filter (metadata.system_labels.name == '${INSTANCE_NAME}')\n | filter metric.memory_state == 'used'\n | group_by 5m, [value_bytes_used_mean: mean(value.bytes_used)]\n | every 5m\n | group_by [metric.gpu_number, metric.model, metric.uuid, resource.instance_id, resource.project_id, resource.zone, metadata.system_labels.name], [value_bytes_used_mean_aggregate: aggregate(value_bytes_used_mean)]\n; fetch gce_instance\n | metric 'agent.googleapis.com/gpu/memory/bytes_used' \n | filter (metadata.system_labels.name == '${INSTANCE_NAME}')\n | group_by 5m, [value_bytes_used_mean: mean(value.bytes_used)]\n | every 5m\n | group_by [metric.gpu_number, metric.model, metric.uuid, resource.instance_id, resource.project_id, resource.zone, metadata.system_labels.name], [value_bytes_used_mean_aggregate: aggregate(value_bytes_used_mean)] }\n| ratio\n| mul (100)\n| cast_units ('%')\n| every 5m\n| condition val() > 0.9 '10^2.%'"
Collaborator


Mild preference: express these new recommended alerts as equivalent PromQL rather than MQL going forward.

cc @lyanco


Strong preference. We're announcing MQL Deprecation on July 17th.

3 participants