Add custom metrics and histogram SLIs and partition by option #3208

Merged
merged 3 commits into from
Sep 7, 2023
148 changes: 106 additions & 42 deletions docs/en/observability/slo-create.asciidoc
@@ -14,51 +14,115 @@ To create an SLO, go to *Observability → SLOs*:
* If you're creating your first SLO, you'll see an introductory page. Click the *Create SLO* button.
* If you've created SLOs before, click the *Create new SLO* button in the upper-right corner of the page.

From here, complete the following:
From here, complete the following steps:

* <<define-sli,Define your service-level indicator (SLI)>>
* <<set-slo>>
* <<slo-describe>>
. <<define-sli,Define your service-level indicator (SLI)>>.
. <<set-slo>>.
. <<slo-describe>>.

[discrete]
[[define-sli]]
== Define your SLI
= Define your SLI

The type of SLI to use depends on the location of your data. If you're creating an SLO based on raw logs coming from your services, you can use a custom KQL SLI. If you're creating an SLO based on services using application performance monitoring (APM), you can use an APM latency or APM availability SLI. See the following table for more on each type of SLI:
The type of SLI to use depends on the location of your data:

[cols="1,1"]
|===
* <<custom-kql-sli, Custom KQL>> — create an SLI based on raw logs coming from your services.
* <<custom-metric-sli, Custom metric>> — create an SLI to define custom equations from metric fields in your indices.
* <<histogram-metric-sli, Histogram metric>> — create an SLI based on histogram metrics.
* <<apm-latency-and-availability-sli, APM latency and APM availability>> — create an SLI based on services using application performance monitoring (APM).

|*Custom KQL*
|This indicator can be based on any elasticsearch index or index pattern you have. You define two queries, one that yields the good events and one that yields the total events from your index.
[discrete]
[[custom-kql-sli]]
== Custom KQL

*Example:* You could define a custom KQL indicator based on the `service-logs` with the good query defined as `nested.field.response.latency <= 100 and nested.field.env : “production”` and the total query defined as `nested.field.env : “production”`.
|*APM latency*
|This indicator is based on the APM data that we receive from your instrumented services and a latency threshold.
Create an indicator based on any of your {es} indices or index patterns. You define two queries: one that yields the good events from your index, and one that yields the total events from your index.

*Example:* You could define an indicator on an APM service named `banking-service` for the `production` environment, and the transaction name `POST /deposit` with a latency threshold value of 300ms.
|*APM availability*
|This indicator is based on the APM data that we receive from your instrumented services.
*Example:* You can define a custom KQL indicator based on the `service-logs` index with the *good query* defined as `nested.field.response.latency <= 100 and nested.field.env : "production"` and the *total query* defined as `nested.field.env : "production"`.

*Example:* You could define an indicator on an APM service named `search-service` for the `prod` environment, and the transaction name `POST /search`.
|===
When defining a custom KQL SLI, set the following fields:

* *Index* — The index or index pattern you want to base the SLI on. For example, `service-logs`.
* *Timestamp field* — The timestamp field used by the index.
* *Query filter* — A KQL filter to specify relevant criteria by which to filter the index documents.
* *Good query* — The query yielding events that are considered good or successful. For example, `nested.field.response.latency <= 100 and nested.field.env : "production"`.
* *Total query* — The query yielding all events to take into account for computing the SLI. For example, `nested.field.env : "production"`.
* *Partition by* — Create an SLO for each value of the field you enter.
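
As a rough illustration of how the good and total queries relate, the following Python sketch applies the two example queries to a handful of in-memory documents. The sample documents and the flattened `latency` and `env` keys are hypothetical stand-ins for the `service-logs` index; this is not how the SLO is computed internally.

```python
# Hypothetical sample documents standing in for the `service-logs` index.
docs = [
    {"latency": 80, "env": "production"},
    {"latency": 250, "env": "production"},
    {"latency": 40, "env": "production"},
    {"latency": 10, "env": "staging"},
]


def total_query(doc):
    # Mirrors: nested.field.env : "production"
    return doc["env"] == "production"


def good_query(doc):
    # Mirrors: nested.field.response.latency <= 100 and nested.field.env : "production"
    return doc["latency"] <= 100 and doc["env"] == "production"


def sli(documents):
    # The SLI is the ratio of good events to total events.
    total = sum(1 for d in documents if total_query(d))
    good = sum(1 for d in documents if good_query(d))
    return good / total if total else None


kql_sli = sli(docs)
print(kql_sli)
```

Because the good query repeats the environment condition, every good event is also counted in the total, so the ratio cannot exceed 1.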

[discrete]
[[custom-kql-sli]]
=== Custom KQL
When defining a custom KQL SLI, you can set the following fields:
[[custom-metric-sli]]
== Custom metric

* *Index* — The index or index pattern you want to base the SLI upon.
* *Timestamp field* — The timestamp field used by the index.
* *Query filter* — A filter to apply on the index.
* *Good query* — The KQL query yielding the good events to take into account for computing the SLI.
* *Total query* — The KQL query yielding all events to take into account for computing the SLI.
Create an indicator to define custom equations from metric fields in your indices.

*Example:* You can define *Good events* as the sum of the field `processor.processed` with a filter of `"processor.outcome: \"success\""`, and the *Total events* as the sum of `processor.processed` with a filter of `"processor.outcome: *"`.

When defining a custom metric SLI, set the following fields:

* *Source*
** *Index* — The index or index pattern you want to base the SLI on. For example, `my-service-*`.
** *Timestamp field* — The timestamp field used by the index.
** *Query filter* — A KQL filter to specify relevant criteria by which to filter the index documents. For example, `field.environment : "production" and service.name : "my-service"`.
* *Good events*
** *Metric [A-Z]* — The field that is aggregated using the `sum` aggregation for good events. For example, `processor.processed`.
** *Filter [A-Z]* — The filter to apply to the metric for good events. For example, `"processor.outcome: \"success\""`.
** *Equation* — The equation that calculates the good metric. For example, `A`.
* *Total events*
** *Metric [A-Z]* — The field that is aggregated using the `sum` aggregation for total events. For example, `processor.processed`.
** *Filter [A-Z]* — The filter to apply to the metric for total events. For example, `"processor.outcome: *"`.
** *Equation* — The equation that calculates the total metric. For example, `A`.
* *Partition by* — Create an SLO for each value of the field you enter.
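
The sketch below shows, under simplifying assumptions, how the example settings combine: metric `A` is the sum of `processor.processed`, the good filter keeps only successful outcomes, the equation is `A` on both sides, and *Partition by* yields one ratio per field value. The sample documents and the `host` partition field are hypothetical.

```python
from collections import defaultdict

# Hypothetical documents; `host` is an assumed partition field.
docs = [
    {"processor.processed": 90, "processor.outcome": "success", "host": "a"},
    {"processor.processed": 10, "processor.outcome": "failure", "host": "a"},
    {"processor.processed": 50, "processor.outcome": "success", "host": "b"},
]


def metric_sli_by_partition(documents, partition_field):
    good = defaultdict(float)
    total = defaultdict(float)
    for d in documents:
        key = d[partition_field]
        # Total events: sum of the metric where processor.outcome exists.
        if "processor.outcome" in d:
            total[key] += d["processor.processed"]
        # Good events: sum of the metric where processor.outcome is "success".
        if d.get("processor.outcome") == "success":
            good[key] += d["processor.processed"]
    # Equation `A` for both good and total, so the SLI is simply good / total.
    return {k: good[k] / total[k] for k in total if total[k] > 0}


metric_slis = metric_sli_by_partition(docs, "host")
print(metric_slis)
```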

[discrete]
[[histogram-metric-sli]]
== Histogram metric

Histograms record data in a compressed format and can record latency and delay metrics. You can create an SLI based on histogram metrics using a `range` aggregation or a `value_count` aggregation for both the good and total events. Filtering with KQL queries is supported on both event types.

When using a `range` aggregation, both the `from` and `to` thresholds are required, and the events counted are those that fall within that range. The range includes the `from` value and excludes the `to` value.

*Example:* You can define your *Good events* using the `processor.latency` field with a filter of `"processor.outcome: \"success\""`, and your *Total events* using the `processor.latency` field with a filter of `"processor.outcome: *"`.

When defining a histogram metric SLI, set the following fields:

* *Source*
** *Index* — The index or index pattern you want to base the SLI on. For example, `my-service-*`.
** *Timestamp field* — The timestamp field used by the index.
** *Query filter* — A KQL filter to specify relevant criteria by which to filter the index documents. For example, `field.environment : "production" and service.name : "my-service"`.
* *Good events*
** *Aggregation* — The type of aggregation to use for good events, either *Value count* or *Range*.
** *Field* — The field used to aggregate events considered good or successful. For example, `processor.latency`.
** *From* — (`range` aggregation only) The starting value of the range for good events. For example, `0`.
** *To* — (`range` aggregation only) The ending value of the range for good events. For example, `100`.
** *KQL filter* — The filter for good events. For example, `"processor.outcome: \"success\""`.
* *Total events*
** *Aggregation* — The type of aggregation to use for total events, either *Value count* or *Range*.
** *Field* — The field used to aggregate total events. For example, `processor.latency`.
** *From* — (`range` aggregation only) The starting value of the range for total events. For example, `0`.
** *To* — (`range` aggregation only) The ending value of the range for total events. For example, `100`.
** *KQL filter* — The filter for total events. For example, `"processor.outcome : *"`.
* *Partition by* — Create an SLO for each value of the field you enter.
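
To make the range boundaries concrete, here is a minimal Python sketch that counts events in a pre-aggregated histogram. The `values` and `counts` lists are hypothetical sample data, and the helper functions only imitate the described behavior of the `range` and `value_count` aggregations.

```python
# Hypothetical pre-aggregated histogram: each value has a recorded count.
histogram = {"values": [10, 50, 100, 250], "counts": [4, 3, 2, 1]}


def range_count(hist, from_value, to_value):
    # Count events whose value v satisfies from_value <= v < to_value
    # (the `from` bound is included, the `to` bound is excluded).
    return sum(
        count
        for value, count in zip(hist["values"], hist["counts"])
        if from_value <= value < to_value
    )


def value_count(hist):
    # Count every event recorded in the histogram.
    return sum(hist["counts"])


good_events = range_count(histogram, 0, 100)  # the events recorded at 100 are excluded
total_events = value_count(histogram)
histogram_sli = good_events / total_events
print(good_events, total_events, histogram_sli)
```

Note that events recorded exactly at the `to` value do not count as good, so choose the upper threshold with that exclusion in mind.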

[discrete]
[[apm-latency-and-availability-sli]]
== APM latency and APM availability

[discrete]
[[apm-latency-sli]]
=== APM latency

Create an indicator based on the APM data received from your instrumented services and a latency threshold.

*Example:* You can define an indicator on an APM service named `banking-service` for the `production` environment, and the transaction name `POST /deposit` with a latency threshold value of 300ms.

[discrete]
[[apm-availability-sli]]
=== APM availability

Create an indicator based on the APM data received from your instrumented services.

*Example:* You can define an indicator on an APM service named `search-service` for the `production` environment, and the transaction name `POST /search`.

=== APM latency or availability
When defining an APM latency or APM availability SLI, you can set the following fields:
When defining an APM latency or APM availability SLI, set the following fields:

* *Service name* — The APM service name.
* *Service environment* — Either `all` or the specific environment.
@@ -69,54 +133,54 @@ When defining an APM latency or APM availability SLI, you can set the following

[discrete]
[[set-slo]]
== Set your objectives
= Set your objectives
After defining your SLI, you need to set your objectives. To set your objectives, complete the following:

* <<slo-budgeting-method, Select your budgeting method>>
* <<slo-time-window, Set your time window>>
* <<slo-target, Set your target/SLO percentage>>
. <<slo-budgeting-method, Select your budgeting method>>
. <<slo-time-window, Set your time window>>
. <<slo-target, Set your target/SLO percentage>>

[discrete]
[[slo-budgeting-method]]
=== Select your budgeting method
== Select your budgeting method
You can select either an *occurrences* or a *timeslices* budgeting method:

[cols="1,1"]
|===
|*Occurrences*
| Uses the number of good events and the number of total events to compute the SLO.

*Example:* You have a 30 day rolling SLO with a 95% target, and over the past 30 days there were 1,355,700 total events. The error budget is `100-95 = 5%`, or about 66,785 bad events tolerated before violating the SLO.
*Example:* You have a 30-day rolling SLO with a 95% target, and, over the past 30 days, there were 1,355,700 total events. The error budget is `100-95 = 5%`, so `1,355,700 * 0.05 = 67,785` bad events are tolerated before violating the SLO.

If we had 1,300,000 good events over the same period, the observed value is `Good Events / Total Events = 0.95891421 => 95.89%`.
If you had 1,300,000 good events over the same period, the observed value is `Good Events / Total Events = 0.95891421 => 95.89%`.
|*Timeslices*
| Breaks the overall time window into smaller slices of a defined duration and uses the number of good slices over the number of total slices to compute the SLO.
| Breaks the overall time window into smaller slices of a defined duration, and uses the number of good slices over the number of total slices to compute the SLO.

*Timeslice target (%)* - Individual timeslices target that determines if the slice is good or bad.
*Timeslice window (in minutes)* - The size of the timeslice window.

*Example:* A 30 days rolling SLO defined with 5 min slices has a total of `30*24*12 = 8640` slices.
*Example:* A 30-day rolling SLO defined with five-minute slices has a total of `30*24*12 = 8640` slices.
If the SLO target is 98%, you have a `100-98 = 2%` error budget, or `8640 * 0.02 = 172.8`, which means 172 whole bad slices are available before you violate the SLO.
|===
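
The arithmetic in both budgeting examples can be checked with a short Python sketch. The function names are hypothetical and exist only for illustration; the inputs are the figures from the examples above.

```python
def occurrences_budget(total_events, target_pct):
    # Number of bad events tolerated for an occurrences-based SLO.
    return total_events * (100 - target_pct) / 100


def timeslices_budget(days, slice_minutes, target_pct):
    # Total number of slices in the window and how many of them may be bad.
    total_slices = days * 24 * 60 // slice_minutes
    allowed_bad = total_slices * (100 - target_pct) / 100
    return total_slices, allowed_bad


# Occurrences: 30-day window, 95% target, 1,355,700 total events.
bad_events_allowed = occurrences_budget(1_355_700, 95)
observed = 1_300_000 / 1_355_700  # Good Events / Total Events

# Timeslices: 30-day window, five-minute slices, 98% target.
total_slices, bad_slices_allowed = timeslices_budget(30, 5, 98)

print(bad_events_allowed, round(observed, 4), total_slices, bad_slices_allowed)
```

With 1,300,000 good events there are 55,700 bad events, which is within the occurrences budget, so the observed value stays above the 95% target.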

[discrete]
[[slo-time-window]]
=== Set your time window
Select the durations over which you want to compute your SLO. Then time window uses the data from the defined rolling period. For example, the last 30 days.
== Set your time window
Select the duration over which you want to compute your SLO. The time window uses the data from the defined rolling period, for example, the last 30 days.

[discrete]
[[slo-target]]
=== Set your target/SLO (%)
== Set your target/SLO (%)
The SLO target objective, expressed as a percentage.

[discrete]
[[slo-describe]]
== Describe your SLO
= Describe your SLO
After setting your objectives, give your SLO a name, a short description, and add any relevant tags.

[discrete]
[[slo-alert-checkbox]]
== SLO burn rate alert rule
= SLO burn rate alert rule
When the *Create an SLO burn rate alert rule* checkbox is selected, the *Create rule* window opens immediately after you click the *Create SLO* button.
Here you can define your SLO burn rate alert rule.
For more information, see <<slo-burn-rate-alert, Create an SLO burn rate rule>>.