added guids to HPC/AI alerts
jhajduk-microsoft committed Oct 14, 2024
1 parent 22c234f commit 3135e40
Showing 2 changed files with 8 additions and 4 deletions.
@@ -5,7 +5,3 @@ There are numerous ways to implement AI solution on Azure, and each comes with i
## AI on Infrastructure (BYOM)

Running AI workloads on Azure infrastructure involves monitoring each component of the solution, including virtual machines, storage, and networking. Refer to the defined metrics in [HPC](../../specialized/hpc/Alerting-and-Monitoring.md). For monitoring the GPU/CPU metrics, use [Moneo](https://github.com/Azure/Moneo).
-
-### Azure OpenAI with RAG
-
-### Azure OpenAI without RAG
8 changes: 8 additions & 0 deletions services/StorageCache/AmlFilesystems/alerts.yaml
@@ -53,6 +53,7 @@
autoMitigate: true
autoResolve: true
autoResolveTime: 0:10:00
+guid: 9d086772-1887-4893-8b9f-7e5169398bae
references:
- name: OST Files Free
description: Log an alert if OSTFilesFree is below 15%
@@ -109,6 +110,7 @@
autoMitigate: true
autoResolve: true
autoResolveTime: 0:10:00
+guid: 8f231351-c123-4e4c-8631-9978e641a3ca
references:
- name: OST Bytes Available
description: Log an alert if OSTBytesAvailable is below 15%
@@ -165,6 +167,7 @@
autoMitigate: true
autoResolve: true
autoResolveTime: 0:10:00
+guid: 4eeca790-a804-4453-b339-73ea425610bc
references:
- name: OST Bytes Used
description: Log an alert if OSTBytesUsed is above 85%
@@ -221,6 +224,7 @@
autoMitigate: true
autoResolve: true
autoResolveTime: 0:10:00
+guid: 59298086-ec77-4f47-b2ef-b853b79e31cb
references:
- name: MDT Files Free
description: Log an alert if MDTFilesFree is below 15%
@@ -277,6 +281,7 @@
autoMitigate: true
autoResolve: true
autoResolveTime: 0:10:00
+guid: 2feba8fd-ff1e-4f48-bc01-6e2996edafa6
references:
- name: MDT Files Used
description: Log an alert if MDTFilesUsed is above 85%
@@ -333,6 +338,7 @@
autoMitigate: true
autoResolve: true
autoResolveTime: 0:10:00
+guid: 48fc094d-8a00-4d3c-86d3-3230c7e5881a
references:
- name: MDT Files Available
description: Log an alert if MDTBytesAvailable is below 15%
@@ -389,6 +395,7 @@
autoMitigate: true
autoResolve: true
autoResolveTime: 0:10:00
+guid: ecec6f93-af7e-4071-b35d-cd70b3f16581
references:
- name: MDT Bytes Used
description: Log an alert if MDTBytesUsed is above 85%
@@ -445,6 +452,7 @@
autoMitigate: true
autoResolve: true
autoResolveTime: 0:10:00
+guid: ebd68fdd-9672-43e8-b7d5-6e479210535d
references:
- name: Uptime
description: Total number of client input/output operations per second
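Since this commit adds one `guid` per alert rule, it may be worth checking that every value is well-formed and that none is repeated. The following is a minimal Python sketch and is not part of the repository: the helper names `extract_guids` and `check_guids` are hypothetical, and it assumes each GUID sits on its own `guid:` line, so it scans lines with a regular expression instead of parsing the YAML.

```python
import re
import uuid

# Matches a line of the form "guid: <value>", with optional indentation.
GUID_LINE = re.compile(r"^\s*guid:\s*(\S+)\s*$")


def extract_guids(yaml_text):
    """Return every value that follows a `guid:` key, in file order."""
    guids = []
    for line in yaml_text.splitlines():
        match = GUID_LINE.match(line)
        if match:
            guids.append(match.group(1))
    return guids


def check_guids(guids):
    """Return a list of problems: malformed values and repeated values."""
    problems = []
    seen = set()
    for value in guids:
        try:
            parsed = uuid.UUID(value)
        except ValueError:
            problems.append(f"malformed: {value}")
            continue
        if parsed in seen:
            problems.append(f"duplicate: {value}")
        seen.add(parsed)
    return problems


# The eight GUIDs added in this commit, copied from the hunks above.
SAMPLE = """\
guid: 9d086772-1887-4893-8b9f-7e5169398bae
guid: 8f231351-c123-4e4c-8631-9978e641a3ca
guid: 4eeca790-a804-4453-b339-73ea425610bc
guid: 59298086-ec77-4f47-b2ef-b853b79e31cb
guid: 2feba8fd-ff1e-4f48-bc01-6e2996edafa6
guid: 48fc094d-8a00-4d3c-86d3-3230c7e5881a
guid: ecec6f93-af7e-4071-b35d-cd70b3f16581
guid: ebd68fdd-9672-43e8-b7d5-6e479210535d
"""

if __name__ == "__main__":
    found = extract_guids(SAMPLE)
    for problem in check_guids(found):
        print(problem)
    print(f"checked {len(found)} guid values")
```

To run the same check against the real file, one could read `services/StorageCache/AmlFilesystems/alerts.yaml` into a string and pass it to `extract_guids`. Comparing parsed `uuid.UUID` objects rather than raw strings means two values that differ only in letter case are still reported as duplicates.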