Alert Generator Compliance Specification 1.0

Introduction

This document outlines the specification that software must follow to be Prometheus Alert-Generator compliant. It covers:

Input:

  1. The format of the alerting rules.

Output:

  1. Format of the alert.
  2. Payload format of the alerts sent to the Alertmanager. (Alertmanager is described in the “Sending Alerts to Alertmanager” section.)
  3. GET APIs to support, with their respective format.

Between Input and Output:

  1. How to maintain different states and lifecycles of an alert.
  2. When to send an alert out to the Alertmanager.

This document follows the RFC 2119 language.

Software MUST pass the test suite at {} to be called “Prometheus alert-generator compliant”.

The Setup

The setup is made up of 3 different components:

  1. Sample receiver: to which samples are sent via the Prometheus remote write protocol.
  2. Sample querier: which allows querying samples via PromQL using Prometheus style query APIs. Used to query the ALERTS series generated by the alert-generator.
  3. Alert-generator: which does everything described below in this document: accepts the alerting rules, executes them, maintains the alert states, sends alerts to the Alertmanager, and supports GET /api/v1/alerts and GET /api/v1/rules.

Only the alert-generator needs to follow the specification below, while the sample receiver and sample querier facilitate ingestion and querying of time series data. They need not be part of the same software; all 3 components can be a single piece of software or separate ones.

Alert Format

An alert in JSON MUST follow this format:

{
  "labels": {
    "alertname": "<alertname>",
    "label1": "value1",
    "label2": "value2",
    "..."
  },
  "annotations": {
    "label1": "value1",
    "label2": "value2",
    "..."
  },
  "startsAt": "<RFC3339Millis time>",
  "endsAt": "<RFC3339Millis time>",
  "generatorURL": "<string>"
}
  • labels: MUST be present. The labels uniquely identify an alert.
  • annotations: MUST be present IF the alert has annotations. Annotations provide additional details about the alert which can change over time for the same alert.
  • startsAt: SHOULD be present. It is the time when the alert was triggered.
  • endsAt: SHOULD be present. It is the time when the alert MUST be considered inactive. Note that future alert updates MAY change this value.
  • generatorURL: SHOULD be present. It is a URL that takes the user to the query page for the source expression of the alert.
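For illustration, a minimal sketch of this structure in Go; the field and type choices (string timestamps, flat label maps) are assumptions of the sketch, not requirements of this specification.

// Alert models the alert format above.
type Alert struct {
    Labels       map[string]string `json:"labels"`
    Annotations  map[string]string `json:"annotations,omitempty"`
    StartsAt     string            `json:"startsAt,omitempty"` // RFC3339 with milliseconds
    EndsAt       string            `json:"endsAt,omitempty"`   // RFC3339 with milliseconds
    GeneratorURL string            `json:"generatorURL,omitempty"`
}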

Alerting Rules

The Format

The alert-generator MUST accept the Prometheus style alerting rules configuration as described in the v2.33 docs for Alerting Rules with the following structure. It MUST be in either YAML format or an equivalent JSON format. Alert-generator MAY accept them as files on disk or via an API.

groups:
  [ - <rule_group> ]

<rule_group>

# The name of the group. MUST be unique within a file.
name: <string>

# How often rules in the group are evaluated.
[ interval: <duration> | default = 1m ]

rules:
  [ - <rule> ... ]

<rule>

# The name of the alert. MUST be a valid label value.
alert: <string>

# The PromQL expression to evaluate. Every evaluation cycle this is
# evaluated at the current time, and all resultant time series become
# pending/firing alerts.
expr: <string>

# Alerts are considered firing once they have been returned for this long.
# Alerts which have not yet fired for long enough are considered pending.
[ for: <duration> | default = 0s ]

# Labels to add or overwrite for each alert.
labels:
  [ <labelname>: <template_string> ]

# Annotations to add to each alert.
annotations:
  [ <labelname>: <template_string> ]

Example config

groups:
- name: example
  rules:
  - alert: HighRequestLatency
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    for: 10m
    labels:
      severity: page
    annotations:
      summary: High request latency
  - alert: VeryHighRequestLatency
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 1

Constraints

  1. Results of a rule evaluation MUST be available, during the same evaluation cycle, to any subsequent rules in the group that depend on these results. The order of rules MUST be the same as provided in the config.

  2. Labels and annotations in alerting rules MUST support, for their values, all template variables and functions as described in the v2.33 template reference, with the following exceptions:

    • graphLink, tableLink: MAY be supported if the sample querier supports having a UI link for graph and table respectively.
    • tmpl, pathPrefix, safeHtml: MAY be supported if the software supports console template files.
    • strvalue: MAY be supported if there is a use case.
    • .ExternalLabels/$externalLabels, .ExternalURL/$externalURL: MAY be supported if the software supports configuring external labels and external URL.

    The config MUST NOT be rejected if it contains the optional template variables and/or functions listed above. They MUST result in an empty string if not supported.
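One way to satisfy the empty-string requirement, sketched below with Go's text/template, is to register the unsupported names as stubs. The stub signatures here are assumptions of this sketch, not the exact Prometheus ones.

package main

import (
    "os"
    "text/template"
)

func main() {
    // Stub out the optional template functions so a config that uses them
    // is accepted and they expand to an empty string.
    funcs := template.FuncMap{
        "graphLink": func(string) string { return "" },
        "tableLink": func(string) string { return "" },
    }
    t := template.Must(template.New("anno").Funcs(funcs).Parse(`graph: {{ graphLink "up" }}`))
    t.Execute(os.Stdout, nil) // prints "graph: " with the link empty
}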

Executing an Alerting Rule

The PromQL expression expr of the alerting rule MUST be executed against the Sample Querier as an instant query for the current time (called the “evaluation time” or the “group evaluation time”). This MUST be done at regular intervals and the interval MUST be the interval from the parent <rule_group> of this alerting rule, which MUST default to 1 minute if not specified in the config.
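A minimal sketch of this evaluation cycle in Go; the Rule and RuleGroup types are illustrative stand-ins, not part of this specification.

package alertgen

import (
    "context"
    "time"
)

// Rule stands in for one alerting rule; Eval runs its expr as an instant
// query against the sample querier at evalTime.
type Rule interface {
    Eval(evalTime time.Time)
}

// RuleGroup holds the rules of one <rule_group>.
type RuleGroup struct {
    Interval time.Duration // 0 means "not specified in the config"
    Rules    []Rule
}

// evalLoop evaluates every rule of the group, in config order, at the same
// evaluation time, repeating at the group interval (defaulting to 1 minute).
func evalLoop(ctx context.Context, g *RuleGroup) {
    interval := g.Interval
    if interval == 0 {
        interval = time.Minute // spec default
    }
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            evalTime := time.Now()
            for _, r := range g.Rules {
                r.Eval(evalTime)
            }
        }
    }
}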

Processing Instant Query Result

Follow these steps, in order, to process the query result.

Step 1

Each element in the result vector of the instant query MUST produce a distinct alert, and labels of the element MUST become the labels of the alert.

For example, if the result vector was

my_metric_total{job="foo", status="500"} => 10
my_metric_total{job="foo", status="400"} => 18

Then the corresponding alerts produced at this step MUST be

Alert1 = { "labels": {"__name__":"my_metric_total","job":"foo","status":"500"}, ... }
Alert2 = { "labels": {"__name__":"my_metric_total","job":"foo","status":"400"}, ... }

Step 2

The label and annotation templates from the alerting rule MUST be run for each of these alerts individually, with the label-value data for the templates coming from the corresponding element of the result vector. The output of the template execution MUST be added to the alert as labels and annotations respectively. Labels from template execution MUST override the existing labels in the alert.

Step 3

The alert name from the alerting rule (HighRequestLatency from the example above) MUST be added to the labels of the alert with the label name as alertname. It MUST override any existing alertname label.

The labels of the alert at the end of step 3 MUST uniquely identify an alert.

An alert MUST be in pending, firing or inactive state. The “pending State Conditions”, “firing State Conditions”, “inactive State Conditions” and “Time Series to Create” sections MUST be checked after this step 3.

Step 4

The execution of an alerting rule MUST error out immediately if there is more than one alert with the same labels at the end of step 3, and in that case it MUST NOT send any alerts as described in the “Sending Alerts to Alertmanager” section or add samples to the sample receiver as described in the “Time Series to Create” section. This error MUST be reflected in the output of the GET /api/v1/rules API as described in the “APIs to Support” section below.
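Steps 1 to 4 can be sketched in Go as follows; Sample, renderTemplate and fingerprint are hypothetical helpers introduced only for this illustration.

package alertgen

import (
    "fmt"
    "sort"
    "strings"
)

// Sample stands in for one element of the instant query result vector.
type Sample struct {
    Metric map[string]string
    Value  float64
}

// renderTemplate is a placeholder for executing a label/annotation template
// with the sample's label-value data.
func renderTemplate(tmpl string, s Sample) string { return tmpl }

// fingerprint builds a canonical key for a label set so that duplicates can
// be detected.
func fingerprint(labels map[string]string) string {
    keys := make([]string, 0, len(labels))
    for k := range labels {
        keys = append(keys, k)
    }
    sort.Strings(keys)
    var b strings.Builder
    for _, k := range keys {
        b.WriteString(k)
        b.WriteByte('=')
        b.WriteString(labels[k])
        b.WriteByte(';')
    }
    return b.String()
}

// buildAlertLabels applies steps 1-4 for one rule evaluation.
func buildAlertLabels(alertName string, ruleLabels map[string]string, vec []Sample) ([]map[string]string, error) {
    seen := map[string]bool{}
    out := make([]map[string]string, 0, len(vec))
    for _, s := range vec {
        labels := map[string]string{}
        for k, v := range s.Metric { // step 1: series labels become alert labels
            labels[k] = v
        }
        for k, t := range ruleLabels { // step 2: templated rule labels override
            labels[k] = renderTemplate(t, s)
        }
        labels["alertname"] = alertName // step 3: alertname overrides everything
        key := fingerprint(labels)
        if seen[key] { // step 4: identical label sets are a rule error
            return nil, fmt.Errorf("duplicate alert labels: %s", key)
        }
        seen[key] = true
        out = append(out, labels)
    }
    return out, nil
}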

pending State Conditions

Alerts MUST start in pending state if the for duration is non-zero. This evaluation time when the alert is first created is referred to as ActiveAt for that alert.

The alert MUST stay in pending state during an evaluation if the difference between evaluation time and ActiveAt is less than for duration (as specified in the alerting rule).

If the annotation values change at any evaluation, the alert's annotations MUST be updated immediately with the latest values.

firing State Conditions

If the difference between the current evaluation time and ActiveAt is greater than or equal to the for duration (as specified by the alerting rule), the alert MUST go into firing state immediately. This evaluation time when it first went into firing state is referred to as FiredAt.

For a zero for duration, the alert MUST directly go into firing state the first time the alert was created and skip the initial pending state. This evaluation time when the alert is first created is referred to as ActiveAt.

For a non-zero for duration that is less than the group evaluation interval, the alert MUST go into firing state during the next evaluation after it went into pending state (and not in between evaluations), provided the alert does not become inactive in that next evaluation.

If the annotation values change at any evaluation, the alert's annotations MUST be updated immediately with the latest values.

inactive State Conditions

If an existing pending or firing state alert was not produced by the current evaluation of the rule, that alert MUST immediately go into inactive state. The evaluation time when the alert got resolved is referred to as ResolvedAt.

Any alert in a future evaluation with the same labels as an inactive alert MUST be considered a new alert and MUST follow the pending and firing state conditions stated above. The ActiveAt and ResolvedAt MUST be set again according to the conditions above for the pending and firing states.
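The three states and their transitions can be sketched in Go as follows; AlertInstance and its fields are illustrative, not part of this specification.

package alertgen

import "time"

type State int

const (
    StateInactive State = iota
    StatePending
    StateFiring
)

type AlertInstance struct {
    State      State
    ActiveAt   time.Time // set when the alert is first created
    FiredAt    time.Time // set when the alert enters firing state
    ResolvedAt time.Time // set when the alert becomes inactive
}

// newAlert creates an alert at its first evaluation; a zero for duration
// skips pending and goes straight to firing.
func newAlert(evalTime time.Time, forDur time.Duration) *AlertInstance {
    a := &AlertInstance{State: StatePending, ActiveAt: evalTime}
    if forDur == 0 {
        a.State = StateFiring
        a.FiredAt = evalTime
    }
    return a
}

// transition updates an existing alert after one evaluation cycle.
func transition(a *AlertInstance, presentInResult bool, evalTime time.Time, forDur time.Duration) {
    switch {
    case !presentInResult:
        if a.State != StateInactive { // a pending/firing alert disappeared: resolve it
            a.State = StateInactive
            a.ResolvedAt = evalTime
        }
    case a.State == StatePending && evalTime.Sub(a.ActiveAt) >= forDur:
        a.State = StateFiring
        a.FiredAt = evalTime
    }
}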

Time Series to Create

At the end of a single alerting rule evaluation, for each active alert (i.e. pending and firing state alerts), the alert-generator MUST produce the following time series with a sample value of 1 and a timestamp matching the evaluation time and send it over to the sample receiver. The sample MUST be immediately available via the sample querier for the evaluation of subsequent rules in the parent rule group during the same evaluation cycle.

The series labels (sorted) MUST consist of these labels only:

{
  "__name__": "ALERTS",
  "alertstate": "pending" or "firing",
  <all labels from the alert including "alertname">
}

The alertstate MUST be "pending" for a pending state alert and MUST be "firing" for a firing state alert.

The __name__ and alertstate labels MUST override any existing labels in the alert with the values above.

For example, if the alert labels of a firing alert at the end of step 3 of processing the instant query result were { "__name__":"my_metric_name", "alertstate":"very_critical", "alertname":"HighRequestLatency", "severity":"page" }, then the labels for the time series would be { "__name__":"ALERTS", "alertstate":"firing", "alertname":"HighRequestLatency", "severity":"page" }.

Series MUST NOT be created for an alert that is in inactive state.
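A short Go sketch of this label construction; the map-based label representation is an assumption of the sketch.

// alertsSeriesLabels copies all alert labels, then force-overrides
// __name__ and alertstate as required above.
func alertsSeriesLabels(alertLabels map[string]string, state string) map[string]string {
    out := make(map[string]string, len(alertLabels)+2)
    for k, v := range alertLabels {
        out[k] = v
    }
    out["__name__"] = "ALERTS" // overrides any existing __name__
    out["alertstate"] = state  // "pending" or "firing"; overrides any existing alertstate
    return out
}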

Sending Alerts to Alertmanager

Alertmanager is any software that accepts alerts for further processing in the format described in the “Payload Format to Send from the Alert Generator” section below, for example, the Prometheus Alertmanager.

Alert-generators MUST send only firing and inactive state alerts to an alertmanager. The alerts MUST be sent only after the respective rule evaluations and not in between two evaluations.

The “Conditions for Sending firing Alerts” and “Conditions for Sending inactive Alerts” MUST be checked after the “pending State Conditions”, “firing State Conditions” and “inactive State Conditions” steps.

The ResendDelay used for resending the alert SHOULD be configurable and MUST default to 1 minute.

Conditions for Sending firing Alerts

firing alerts MUST be sent to Alertmanager in the following scenarios:

  1. When it first went into firing state.
  2. The difference between current evaluation time and the last time the firing alert was sent to the alertmanager is more than ResendDelay.

This implies that the firing alert MUST be sent continuously with a fixed interval until it becomes inactive.

ResendDelay acts as a minimum interval while the actual interval MUST be the first >0 multiple of the group interval that is more than or equal to ResendDelay.

Conditions for Sending inactive Alerts

inactive alerts MUST be sent to Alertmanager in the following scenarios:

  1. When it first went into inactive state.
  2. The difference between current evaluation time and the last time the inactive alert was sent to the Alertmanager is more than ResendDelay AND the difference between current evaluation time and ResolvedAt is less than 15 minutes AND there is no new active alert (pending or firing state alert) with the same labels.

This implies that the inactive alert MUST be sent continuously with a fixed interval until 15 minutes after the ResolvedAt of the alert, or until a new alert is created with the same labels.

ResendDelay acts as a minimum interval while the actual interval MUST be the first >0 multiple of the group interval that is more than or equal to ResendDelay.
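A minimal Go sketch of this interval computation, which applies to both firing and inactive alerts:

package alertgen

import "time"

// resendInterval returns the smallest positive multiple of the group
// interval that is greater than or equal to ResendDelay.
func resendInterval(groupInterval, resendDelay time.Duration) time.Duration {
    n := int64((resendDelay + groupInterval - 1) / groupInterval) // ceiling division
    if n < 1 {
        n = 1
    }
    return time.Duration(n) * groupInterval
}

For example, with a group interval of 25s and the default ResendDelay of 1m, alerts would be resent every 75s, the first multiple of 25s that is at least 1m.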

Payload Format to Send from the Alert Generator

The alerts can be sent out in any format required by the software, but the format MUST be translatable to the following JSON format:

[
  <alert 1>,
  <alert 2>,
  ... 
]

Where the structure of each <alert i> MUST be the same as described in the “Alert Format” section above.
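A minimal Go sketch of sending this payload, reusing the Alert type sketched in the “Alert Format” section; the target URL and the use of HTTP are assumptions of the sketch, since this specification does not mandate a transport.

package alertgen

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

// sendAlerts marshals the alerts into the JSON array above and POSTs it to
// the given Alertmanager URL.
func sendAlerts(url string, alerts []Alert) error {
    body, err := json.Marshal(alerts)
    if err != nil {
        return err
    }
    resp, err := http.Post(url, "application/json", bytes.NewReader(body))
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode/100 != 2 {
        return fmt.Errorf("alertmanager returned %s", resp.Status)
    }
    return nil
}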

The parameters of each alert MUST be set as follows:

APIs to Support

GET /api/v1/rules

This API returns all the rules along with their health and associated alerts. The alert-generator MUST support the GET /api/v1/rules API. The API MUST return a JSON containing the following fields and it MAY add additional custom fields anywhere in the JSON.

{
  "status" : "success",
  "data": {
    "groups": [ <group>, ]
  }
}

<group>

{
  "name": "<string>",
  "interval": <float>,
  "lastEvaluation": "<RFC3339Millis time>",
  "rules": [ <rule>, ]
}
  • name is the group name as present in the config.
  • interval is the group evaluation interval in float seconds as present in the config.
  • lastEvaluation is the timestamp of the last time the group was evaluated.

An example of a custom field here, used by Prometheus, is "file": "<string>", which tells where on disk the rule file containing this group is located.

<rule>

{
  "type": "alerting",
  "name": "<string>",
  "query": "<string>",
  "duration": <float>,
  "labels": {
    "label1": "value1",
    "label2": "value2",
    "..."
  },
  "annotations": {
    "label1": "value1",
    "label2": "value2",
    "..."
  },
  "lastEvaluation": "<RFC3339Millis time>",
  "evaluationTime": <float>,
  "health": "<string>",
  "state": "<string>",
  "alerts": [ <alert>, ],
  [ "lastError": "<string>" ]
}
  • name, query, labels, annotations are exactly the same as present in the alerting rule config.
  • duration is the rule's for duration in float seconds.
  • lastEvaluation is the timestamp of the last time the rule was evaluated.
  • evaluationTime is the time taken to completely evaluate the rule in float seconds.
  • health is the health of rule evaluation. It MUST be one of "ok", "err", "unknown".
  • state MUST be set as follows, under the following scenarios (see the sketch after this list):
    • "pending": at least 1 alert in the rule is in pending state and no alert is in firing state.
    • "firing": at least 1 alert in the rule is in firing state.
    • "inactive": no alert in the rule is in firing or pending state.
  • alerts is the list of all the alerts in this rule that are currently pending or firing.
  • lastError MUST be omitted or empty ("") when health is "ok". For other health states, lastError MUST be non-empty and contain the error encountered while executing the rule.
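A minimal Go sketch of the state derivation referenced in the list above; alert states are passed as plain strings for brevity.

// ruleState derives the rule-level state from its alert states: "firing"
// wins over "pending", which wins over "inactive".
func ruleState(alertStates []string) string {
    state := "inactive"
    for _, s := range alertStates {
        switch s {
        case "firing":
            return "firing"
        case "pending":
            state = "pending"
        }
    }
    return state
}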

<alert>

{
  "activeAt": "<RFC3339Millis time>",
  "state": "firing",
  "value": "<string>",
  "labels": {
    "label1": "value1",
    "label2": "value2",
    "..."
  },
  "annotations": {
    "label1": "value1",
    "label2": "value2",
    "..."
  }
}
  • activeAt is the time when the alert was created, referred to above as ActiveAt.
  • state MUST be one of "pending", "firing" or "inactive".
  • value is the stringified float value of the instant query sample that created this alert.
  • labels are the labels of the alert.
  • annotations are the annotations of the alert.

GET /api/v1/alerts

This API returns the union of all alerts across all the rules as seen in GET /api/v1/rules. The alert-generator MUST support the GET /api/v1/alerts API. The API MUST return a JSON containing the following fields and it MAY add additional custom fields anywhere in the JSON.

{
  "status": "success",
  "data": {
    "alerts": [ <alert>, ]
  }
}

<alert>

{
  "activeAt": "<RFC3339Millis time>",
  "state": "firing",
  "value": "<string>",
  "labels": {
    "label1": "value1",
    "label2": "value2",
    "..."
  },
  "annotations": {
    "label1": "value1",
    "label2": "value2",
    "..."
  }
}
  • activeAt is the time when the alert was created, referred to above as ActiveAt.
  • state MUST be one of "pending", "firing" or "inactive".
  • value is the stringified float value of the sample that created this alert.
  • labels are the labels of the alert.
  • annotations are the annotations of the alert.