Azure Monitor may skip a metric collection depending on the timing #37204

zmoog · 2023-11-26T16:19:51Z

I found a regression on Azure Metrics while testing the PR #37177 to backport #36823 to 8.11.

The metricset does not collect all the metric values during each iteration.

Here's what I can see on the Compute VM dashboard:

Two consecutive collections, and then one skipped; then repeat.
As a workaround, if I set the collection period to a lower value than the smallest time grain — for example, from 300s to 60s — I get data for every metric with a time grain of PT5M or greater.

zmoog · 2023-11-26T16:29:25Z

The root cause of this behavior is setting referenceTime to "now" at the beginning of every metric processing.

To avoid being affected by timing, I set referenceTime once (truncated to seconds) at the beginning of each collection, and use this value only when the metricset need to decide if collect the values or not.

This approach is working fine in the PR #37203

tetianakravchenko · 2023-12-08T11:30:12Z

I've tested version 8.11.2 and still can reproduce the issue:

I've enabled all datastreams with the default collection period (300s) and could see the periodic gaps:

After changing collection period to 60s on the compute_vm datastream and running it for 2h:
TODO

additionally:

it would be helpful during the debugging to add the metricset name to the structured logs

zmoog · 2023-12-09T07:40:52Z

@tetianakravchenko, I ported the PR #37365 changes to 8.11 and build a a custom agent for testing:

version: 8.11.3-SNAPSHOT
image id: 51ad6522096f

Changes:

Round reference time to the nearest second instead of truncating it.

zmoog · 2023-12-09T07:45:55Z

It's a drop in replacement for the agent running on your local stack.

You can stop or assign another policy to the current agent, and start a new zmoog/elastic-agent:8.11.3-SNAPSHOT agent with:

export ENROLLMENT_TOKEN="<token>"

docker run  --rm \
  --network elastic-package-stack_default \
  -e FLEET_URL=https://fleet-server:8220 \
  -e FLEET_ENROLL=true \
  -e FLEET_INSECURE=true \
  -e FLEET_ENROLLMENT_TOKEN=${ENROLLMENT_TOKEN} \
  -i zmoog/elastic-agent:8.11.3-SNAPSHOT

zmoog · 2023-12-09T07:57:51Z

I am still not able to reproduce the issue running agent 8.11.2 on my three testing agents:

local stack: macOS Sonoma / arm
Mac mini: macOS Sonoma / intel
EC2

I have no gaps.

I ran 8.11.3-SNAPSHOT on the local stack and EC2 using two different Azure account and I get no gaps:

zmoog · 2023-12-09T07:59:03Z

Can you try 8.11.3-SNAPSHOT in the environment where you get gaps?

zmoog · 2023-12-12T14:00:28Z

@tetianakravchenko, could you test this image on your local stack?

version: 8.11.3-SNAPSHOT
image id: d1ef3377af3b

Changes:

remove rounding and switch to timestamp comparison with a 1-second jitter.

In this build, I'm keeping the jitter value to 1-second.

tetianakravchenko · 2023-12-12T19:30:24Z

@zmoog
after about 1,5h of running - no gaps:

needs_update: false I see only for the metrics with time grain 1H

@timestamp

### What Change the `MetricRegistry.NeedsUpdate()` method to decide whether to collect the metrics by comparing the collection interval with the time grain. If the time since the last collection < time grain duration, then the metrics skip the collection. For example, given the following scenario: #### Scenario A: collect PT1M metrics every 60s - time grain: PT1M (one minute, or 60s) - collection interval: 60s In this case, the time since the last collection is never shorter than the time grain, so the metricset fetch metric values on every collection. #### Scenario B: collect PT15M metrics every 60s - time grain: PT5M (five minutes, or 300s) - collection interval: 60s In this case, the time since the last collection is shorter (60s, 120s, 180s, 240s) than the time grain for four collections. The metricset fetch metric values every five collections. #### The jitter During our tests, we noticed the collection scheduling had some variations, causing the time since the last collection to be shorter than expected by a few milliseconds. To compensate for these scheduling fluctuations, the function also adds a short jitter duration (1 second) to avoid false positives due to small fluctuations in collection scheduling. ### Why During a testing session on 8.11.2, we [noticed](#37204 (comment)) one out of four agents skipped some metrics collections. The debug logs revealed the metricset skipped collections due to a 1-second difference between the reference time for the current and previous collections (299s instead of 300s). ![CleanShot 2023-12-08 at 20 13 19](https://github.com/elastic/beats/assets/25941/dc3d5040-c89b-47d2-a86a-124eb838ca36) The 1-second difference may happen due to an inaccurate rounding in the reference time. For example, suppose the following two events occur: 1. Metricbeat calls `Fetch()` on the metricset a few milliseconds earlier than in the previous collection. 2. The timestamp is 2023-12-08T10:58:32.999Z. In this case, the reference time becomes 2023-12-08T10:58:32.000Z due to the truncation. This problem happened to one test agent. However, if it happens to one agent, it can happen to others. ### Extended Structured Logging We also added new fields to the debug structured logs: ```shell $ cat metricbeat.log.ndjson | grep "MetricRegistry" | head -n 1 | jq { "log.level": "debug", "@timestamp": "2024-01-05T15:03:12.235+0100", "log.logger": "azure monitor client", "log.origin": { "function": "github.com/elastic/beats/v7/x-pack/metricbeat/module/azure.(*MetricRegistry).NeedsUpdate", "file.name": "azure/metric_registry.go", "file.line": 80 }, "message": "MetricRegistry: Metric needs an update", "service.name": "metricbeat", "needs_update": true, "reference_time": "2024-01-05T14:03:07.197Z", "last_collection_time": "2024-01-05T14:02:07.199Z", "time_since_last_collection_seconds": 66.035681, "time_grain": "PT1M", "time_grain_duration_seconds": 60, "resource_id": "/subscriptions/123/resourceGroups/crest-test-lens-migration/providers/Microsoft.Compute/virtualMachines/rajvi-test-vm", "namespace": "Microsoft.Compute/virtualMachines", "aggregation": "Total", "names": "Network In,Network Out,Disk Read Bytes,Disk Write Bytes,Network In Total,Network Out Total", "ecs.version": "1.6.0" } ``` Here's an example using `jq`: ```shell $ cat metricbeat.log.ndjson | grep "MetricRegistry" | jq -r '[.namespace, .aggregation, .needs_update, .reference_time, .last_collection_time//"na", .time_since_last_collection_seconds//"na", .time_grain_duration_seconds//"na", .time_grain] | @TSV' | grep Microsoft.Compute/virtualMachines .aggregation aggregation .needs_update .reference_time .last_collection_time time_since_last_collection_seconds .time_grain_duration_seconds .time_grain Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 60.999661 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 61.795341 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 62.080088 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 64.929579 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 65.632209 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 67.832918 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 68.576239 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 69.927988 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 70.351148 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 70.872058 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 72.47401 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 72.971242 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 73.143605 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 74.831489 60 PT1M ```

@timestamp

### What Change the `MetricRegistry.NeedsUpdate()` method to decide whether to collect the metrics by comparing the collection interval with the time grain. If the time since the last collection < time grain duration, then the metrics skip the collection. For example, given the following scenario: #### Scenario A: collect PT1M metrics every 60s - time grain: PT1M (one minute, or 60s) - collection interval: 60s In this case, the time since the last collection is never shorter than the time grain, so the metricset fetch metric values on every collection. #### Scenario B: collect PT15M metrics every 60s - time grain: PT5M (five minutes, or 300s) - collection interval: 60s In this case, the time since the last collection is shorter (60s, 120s, 180s, 240s) than the time grain for four collections. The metricset fetch metric values every five collections. #### The jitter During our tests, we noticed the collection scheduling had some variations, causing the time since the last collection to be shorter than expected by a few milliseconds. To compensate for these scheduling fluctuations, the function also adds a short jitter duration (1 second) to avoid false positives due to small fluctuations in collection scheduling. ### Why During a testing session on 8.11.2, we [noticed](#37204 (comment)) one out of four agents skipped some metrics collections. The debug logs revealed the metricset skipped collections due to a 1-second difference between the reference time for the current and previous collections (299s instead of 300s). ![CleanShot 2023-12-08 at 20 13 19](https://github.com/elastic/beats/assets/25941/dc3d5040-c89b-47d2-a86a-124eb838ca36) The 1-second difference may happen due to an inaccurate rounding in the reference time. For example, suppose the following two events occur: 1. Metricbeat calls `Fetch()` on the metricset a few milliseconds earlier than in the previous collection. 2. The timestamp is 2023-12-08T10:58:32.999Z. In this case, the reference time becomes 2023-12-08T10:58:32.000Z due to the truncation. This problem happened to one test agent. However, if it happens to one agent, it can happen to others. ### Extended Structured Logging We also added new fields to the debug structured logs: ```shell $ cat metricbeat.log.ndjson | grep "MetricRegistry" | head -n 1 | jq { "log.level": "debug", "@timestamp": "2024-01-05T15:03:12.235+0100", "log.logger": "azure monitor client", "log.origin": { "function": "github.com/elastic/beats/v7/x-pack/metricbeat/module/azure.(*MetricRegistry).NeedsUpdate", "file.name": "azure/metric_registry.go", "file.line": 80 }, "message": "MetricRegistry: Metric needs an update", "service.name": "metricbeat", "needs_update": true, "reference_time": "2024-01-05T14:03:07.197Z", "last_collection_time": "2024-01-05T14:02:07.199Z", "time_since_last_collection_seconds": 66.035681, "time_grain": "PT1M", "time_grain_duration_seconds": 60, "resource_id": "/subscriptions/123/resourceGroups/crest-test-lens-migration/providers/Microsoft.Compute/virtualMachines/rajvi-test-vm", "namespace": "Microsoft.Compute/virtualMachines", "aggregation": "Total", "names": "Network In,Network Out,Disk Read Bytes,Disk Write Bytes,Network In Total,Network Out Total", "ecs.version": "1.6.0" } ``` Here's an example using `jq`: ```shell $ cat metricbeat.log.ndjson | grep "MetricRegistry" | jq -r '[.namespace, .aggregation, .needs_update, .reference_time, .last_collection_time//"na", .time_since_last_collection_seconds//"na", .time_grain_duration_seconds//"na", .time_grain] | @TSV' | grep Microsoft.Compute/virtualMachines .aggregation aggregation .needs_update .reference_time .last_collection_time time_since_last_collection_seconds .time_grain_duration_seconds .time_grain Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 60.999661 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 61.795341 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 62.080088 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 64.929579 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 65.632209 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 67.832918 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 68.576239 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 69.927988 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 70.351148 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 70.872058 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 72.47401 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 72.971242 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 73.143605 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 74.831489 60 PT1M ``` (cherry picked from commit 824dd04)

@timestamp

### What Change the `MetricRegistry.NeedsUpdate()` method to decide whether to collect the metrics by comparing the collection interval with the time grain. If the time since the last collection < time grain duration, then the metrics skip the collection. For example, given the following scenario: #### Scenario A: collect PT1M metrics every 60s - time grain: PT1M (one minute, or 60s) - collection interval: 60s In this case, the time since the last collection is never shorter than the time grain, so the metricset fetch metric values on every collection. #### Scenario B: collect PT15M metrics every 60s - time grain: PT5M (five minutes, or 300s) - collection interval: 60s In this case, the time since the last collection is shorter (60s, 120s, 180s, 240s) than the time grain for four collections. The metricset fetch metric values every five collections. #### The jitter During our tests, we noticed the collection scheduling had some variations, causing the time since the last collection to be shorter than expected by a few milliseconds. To compensate for these scheduling fluctuations, the function also adds a short jitter duration (1 second) to avoid false positives due to small fluctuations in collection scheduling. ### Why During a testing session on 8.11.2, we [noticed](#37204 (comment)) one out of four agents skipped some metrics collections. The debug logs revealed the metricset skipped collections due to a 1-second difference between the reference time for the current and previous collections (299s instead of 300s). ![CleanShot 2023-12-08 at 20 13 19](https://github.com/elastic/beats/assets/25941/dc3d5040-c89b-47d2-a86a-124eb838ca36) The 1-second difference may happen due to an inaccurate rounding in the reference time. For example, suppose the following two events occur: 1. Metricbeat calls `Fetch()` on the metricset a few milliseconds earlier than in the previous collection. 2. The timestamp is 2023-12-08T10:58:32.999Z. In this case, the reference time becomes 2023-12-08T10:58:32.000Z due to the truncation. This problem happened to one test agent. However, if it happens to one agent, it can happen to others. ### Extended Structured Logging We also added new fields to the debug structured logs: ```shell $ cat metricbeat.log.ndjson | grep "MetricRegistry" | head -n 1 | jq { "log.level": "debug", "@timestamp": "2024-01-05T15:03:12.235+0100", "log.logger": "azure monitor client", "log.origin": { "function": "github.com/elastic/beats/v7/x-pack/metricbeat/module/azure.(*MetricRegistry).NeedsUpdate", "file.name": "azure/metric_registry.go", "file.line": 80 }, "message": "MetricRegistry: Metric needs an update", "service.name": "metricbeat", "needs_update": true, "reference_time": "2024-01-05T14:03:07.197Z", "last_collection_time": "2024-01-05T14:02:07.199Z", "time_since_last_collection_seconds": 66.035681, "time_grain": "PT1M", "time_grain_duration_seconds": 60, "resource_id": "/subscriptions/123/resourceGroups/crest-test-lens-migration/providers/Microsoft.Compute/virtualMachines/rajvi-test-vm", "namespace": "Microsoft.Compute/virtualMachines", "aggregation": "Total", "names": "Network In,Network Out,Disk Read Bytes,Disk Write Bytes,Network In Total,Network Out Total", "ecs.version": "1.6.0" } ``` Here's an example using `jq`: ```shell $ cat metricbeat.log.ndjson | grep "MetricRegistry" | jq -r '[.namespace, .aggregation, .needs_update, .reference_time, .last_collection_time//"na", .time_since_last_collection_seconds//"na", .time_grain_duration_seconds//"na", .time_grain] | @TSV' | grep Microsoft.Compute/virtualMachines .aggregation aggregation .needs_update .reference_time .last_collection_time time_since_last_collection_seconds .time_grain_duration_seconds .time_grain Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 60.999661 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 61.795341 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 62.080088 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 64.929579 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 65.632209 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 67.832918 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 68.576239 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 69.927988 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 70.351148 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 70.872058 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 72.47401 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 72.971242 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 73.143605 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 74.831489 60 PT1M ``` (cherry picked from commit 824dd04)

@timestamp

…ics (#37557) * Fix the reference time rounding on Azure Metrics (#37365) ### What Change the `MetricRegistry.NeedsUpdate()` method to decide whether to collect the metrics by comparing the collection interval with the time grain. If the time since the last collection < time grain duration, then the metrics skip the collection. For example, given the following scenario: #### Scenario A: collect PT1M metrics every 60s - time grain: PT1M (one minute, or 60s) - collection interval: 60s In this case, the time since the last collection is never shorter than the time grain, so the metricset fetch metric values on every collection. #### Scenario B: collect PT15M metrics every 60s - time grain: PT5M (five minutes, or 300s) - collection interval: 60s In this case, the time since the last collection is shorter (60s, 120s, 180s, 240s) than the time grain for four collections. The metricset fetch metric values every five collections. #### The jitter During our tests, we noticed the collection scheduling had some variations, causing the time since the last collection to be shorter than expected by a few milliseconds. To compensate for these scheduling fluctuations, the function also adds a short jitter duration (1 second) to avoid false positives due to small fluctuations in collection scheduling. ### Why During a testing session on 8.11.2, we [noticed](#37204 (comment)) one out of four agents skipped some metrics collections. The debug logs revealed the metricset skipped collections due to a 1-second difference between the reference time for the current and previous collections (299s instead of 300s). ![CleanShot 2023-12-08 at 20 13 19](https://github.com/elastic/beats/assets/25941/dc3d5040-c89b-47d2-a86a-124eb838ca36) The 1-second difference may happen due to an inaccurate rounding in the reference time. For example, suppose the following two events occur: 1. Metricbeat calls `Fetch()` on the metricset a few milliseconds earlier than in the previous collection. 2. The timestamp is 2023-12-08T10:58:32.999Z. In this case, the reference time becomes 2023-12-08T10:58:32.000Z due to the truncation. This problem happened to one test agent. However, if it happens to one agent, it can happen to others. ### Extended Structured Logging We also added new fields to the debug structured logs: ```shell $ cat metricbeat.log.ndjson | grep "MetricRegistry" | head -n 1 | jq { "log.level": "debug", "@timestamp": "2024-01-05T15:03:12.235+0100", "log.logger": "azure monitor client", "log.origin": { "function": "github.com/elastic/beats/v7/x-pack/metricbeat/module/azure.(*MetricRegistry).NeedsUpdate", "file.name": "azure/metric_registry.go", "file.line": 80 }, "message": "MetricRegistry: Metric needs an update", "service.name": "metricbeat", "needs_update": true, "reference_time": "2024-01-05T14:03:07.197Z", "last_collection_time": "2024-01-05T14:02:07.199Z", "time_since_last_collection_seconds": 66.035681, "time_grain": "PT1M", "time_grain_duration_seconds": 60, "resource_id": "/subscriptions/123/resourceGroups/crest-test-lens-migration/providers/Microsoft.Compute/virtualMachines/rajvi-test-vm", "namespace": "Microsoft.Compute/virtualMachines", "aggregation": "Total", "names": "Network In,Network Out,Disk Read Bytes,Disk Write Bytes,Network In Total,Network Out Total", "ecs.version": "1.6.0" } ``` Here's an example using `jq`: ```shell $ cat metricbeat.log.ndjson | grep "MetricRegistry" | jq -r '[.namespace, .aggregation, .needs_update, .reference_time, .last_collection_time//"na", .time_since_last_collection_seconds//"na", .time_grain_duration_seconds//"na", .time_grain] | @TSV' | grep Microsoft.Compute/virtualMachines .aggregation aggregation .needs_update .reference_time .last_collection_time time_since_last_collection_seconds .time_grain_duration_seconds .time_grain Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 60.999661 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 61.795341 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 62.080088 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 64.929579 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 65.632209 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 67.832918 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 68.576239 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 69.927988 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 70.351148 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 70.872058 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 72.47401 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 72.971242 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 73.143605 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 74.831489 60 PT1M ``` (cherry picked from commit 824dd04) * Remove extra changelog entry --------- Co-authored-by: Maurizio Branca <maurizio.branca@elastic.co>

@timestamp

### What Change the `MetricRegistry.NeedsUpdate()` method to decide whether to collect the metrics by comparing the collection interval with the time grain. If the time since the last collection < time grain duration, then the metrics skip the collection. For example, given the following scenario: #### Scenario A: collect PT1M metrics every 60s - time grain: PT1M (one minute, or 60s) - collection interval: 60s In this case, the time since the last collection is never shorter than the time grain, so the metricset fetch metric values on every collection. #### Scenario B: collect PT15M metrics every 60s - time grain: PT5M (five minutes, or 300s) - collection interval: 60s In this case, the time since the last collection is shorter (60s, 120s, 180s, 240s) than the time grain for four collections. The metricset fetch metric values every five collections. #### The jitter During our tests, we noticed the collection scheduling had some variations, causing the time since the last collection to be shorter than expected by a few milliseconds. To compensate for these scheduling fluctuations, the function also adds a short jitter duration (1 second) to avoid false positives due to small fluctuations in collection scheduling. ### Why During a testing session on 8.11.2, we [noticed](#37204 (comment)) one out of four agents skipped some metrics collections. The debug logs revealed the metricset skipped collections due to a 1-second difference between the reference time for the current and previous collections (299s instead of 300s). ![CleanShot 2023-12-08 at 20 13 19](https://github.com/elastic/beats/assets/25941/dc3d5040-c89b-47d2-a86a-124eb838ca36) The 1-second difference may happen due to an inaccurate rounding in the reference time. For example, suppose the following two events occur: 1. Metricbeat calls `Fetch()` on the metricset a few milliseconds earlier than in the previous collection. 2. The timestamp is 2023-12-08T10:58:32.999Z. In this case, the reference time becomes 2023-12-08T10:58:32.000Z due to the truncation. This problem happened to one test agent. However, if it happens to one agent, it can happen to others. ### Extended Structured Logging We also added new fields to the debug structured logs: ```shell $ cat metricbeat.log.ndjson | grep "MetricRegistry" | head -n 1 | jq { "log.level": "debug", "@timestamp": "2024-01-05T15:03:12.235+0100", "log.logger": "azure monitor client", "log.origin": { "function": "github.com/elastic/beats/v7/x-pack/metricbeat/module/azure.(*MetricRegistry).NeedsUpdate", "file.name": "azure/metric_registry.go", "file.line": 80 }, "message": "MetricRegistry: Metric needs an update", "service.name": "metricbeat", "needs_update": true, "reference_time": "2024-01-05T14:03:07.197Z", "last_collection_time": "2024-01-05T14:02:07.199Z", "time_since_last_collection_seconds": 66.035681, "time_grain": "PT1M", "time_grain_duration_seconds": 60, "resource_id": "/subscriptions/123/resourceGroups/crest-test-lens-migration/providers/Microsoft.Compute/virtualMachines/rajvi-test-vm", "namespace": "Microsoft.Compute/virtualMachines", "aggregation": "Total", "names": "Network In,Network Out,Disk Read Bytes,Disk Write Bytes,Network In Total,Network Out Total", "ecs.version": "1.6.0" } ``` Here's an example using `jq`: ```shell $ cat metricbeat.log.ndjson | grep "MetricRegistry" | jq -r '[.namespace, .aggregation, .needs_update, .reference_time, .last_collection_time//"na", .time_since_last_collection_seconds//"na", .time_grain_duration_seconds//"na", .time_grain] | @TSV' | grep Microsoft.Compute/virtualMachines .aggregation aggregation .needs_update .reference_time .last_collection_time time_since_last_collection_seconds .time_grain_duration_seconds .time_grain Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 60.999661 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 61.795341 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 62.080088 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 64.929579 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 65.632209 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 67.832918 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 68.576239 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 69.927988 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 70.351148 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 70.872058 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 72.47401 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 72.971242 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 73.143605 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 74.831489 60 PT1M ``` (cherry picked from commit 824dd04) Co-authored-by: Maurizio Branca <maurizio.branca@elastic.co>

@timestamp

### What Change the `MetricRegistry.NeedsUpdate()` method to decide whether to collect the metrics by comparing the collection interval with the time grain. If the time since the last collection < time grain duration, then the metrics skip the collection. For example, given the following scenario: #### Scenario A: collect PT1M metrics every 60s - time grain: PT1M (one minute, or 60s) - collection interval: 60s In this case, the time since the last collection is never shorter than the time grain, so the metricset fetch metric values on every collection. #### Scenario B: collect PT15M metrics every 60s - time grain: PT5M (five minutes, or 300s) - collection interval: 60s In this case, the time since the last collection is shorter (60s, 120s, 180s, 240s) than the time grain for four collections. The metricset fetch metric values every five collections. #### The jitter During our tests, we noticed the collection scheduling had some variations, causing the time since the last collection to be shorter than expected by a few milliseconds. To compensate for these scheduling fluctuations, the function also adds a short jitter duration (1 second) to avoid false positives due to small fluctuations in collection scheduling. ### Why During a testing session on 8.11.2, we [noticed](elastic#37204 (comment)) one out of four agents skipped some metrics collections. The debug logs revealed the metricset skipped collections due to a 1-second difference between the reference time for the current and previous collections (299s instead of 300s). ![CleanShot 2023-12-08 at 20 13 19](https://github.com/elastic/beats/assets/25941/dc3d5040-c89b-47d2-a86a-124eb838ca36) The 1-second difference may happen due to an inaccurate rounding in the reference time. For example, suppose the following two events occur: 1. Metricbeat calls `Fetch()` on the metricset a few milliseconds earlier than in the previous collection. 2. The timestamp is 2023-12-08T10:58:32.999Z. In this case, the reference time becomes 2023-12-08T10:58:32.000Z due to the truncation. This problem happened to one test agent. However, if it happens to one agent, it can happen to others. ### Extended Structured Logging We also added new fields to the debug structured logs: ```shell $ cat metricbeat.log.ndjson | grep "MetricRegistry" | head -n 1 | jq { "log.level": "debug", "@timestamp": "2024-01-05T15:03:12.235+0100", "log.logger": "azure monitor client", "log.origin": { "function": "github.com/elastic/beats/v7/x-pack/metricbeat/module/azure.(*MetricRegistry).NeedsUpdate", "file.name": "azure/metric_registry.go", "file.line": 80 }, "message": "MetricRegistry: Metric needs an update", "service.name": "metricbeat", "needs_update": true, "reference_time": "2024-01-05T14:03:07.197Z", "last_collection_time": "2024-01-05T14:02:07.199Z", "time_since_last_collection_seconds": 66.035681, "time_grain": "PT1M", "time_grain_duration_seconds": 60, "resource_id": "/subscriptions/123/resourceGroups/crest-test-lens-migration/providers/Microsoft.Compute/virtualMachines/rajvi-test-vm", "namespace": "Microsoft.Compute/virtualMachines", "aggregation": "Total", "names": "Network In,Network Out,Disk Read Bytes,Disk Write Bytes,Network In Total,Network Out Total", "ecs.version": "1.6.0" } ``` Here's an example using `jq`: ```shell $ cat metricbeat.log.ndjson | grep "MetricRegistry" | jq -r '[.namespace, .aggregation, .needs_update, .reference_time, .last_collection_time//"na", .time_since_last_collection_seconds//"na", .time_grain_duration_seconds//"na", .time_grain] | @TSV' | grep Microsoft.Compute/virtualMachines .aggregation aggregation .needs_update .reference_time .last_collection_time time_since_last_collection_seconds .time_grain_duration_seconds .time_grain Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 60.999661 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 61.795341 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 62.080088 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 64.929579 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 65.632209 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 67.832918 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 68.576239 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 69.927988 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 70.351148 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 70.872058 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 72.47401 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 72.971242 60 PT1M Microsoft.Compute/virtualMachines Average true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 73.143605 60 PT1M Microsoft.Compute/virtualMachines Total true 2024-01-05T14:16:07.193Z 2024-01-05T14:15:07.193Z 74.831489 60 PT1M ```

zmoog self-assigned this Nov 26, 2023

zmoog added the bug label Nov 26, 2023

botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 26, 2023

zmoog mentioned this issue Nov 26, 2023

Fix unintended skip in metric collection on Azure Monitor #37203

Merged

6 tasks

zmoog closed this as completed in #37203 Nov 28, 2023

tetianakravchenko reopened this Dec 8, 2023

zmoog mentioned this issue Dec 8, 2023

Fix the reference time rounding on Azure Metrics #37365

Merged

6 tasks

zmoog added the Team:Cloud-Monitoring Label for the Cloud Monitoring team label Dec 12, 2023

botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Dec 12, 2023

zmoog closed this as completed in #37365 Jan 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Azure Monitor may skip a metric collection depending on the timing #37204

Azure Monitor may skip a metric collection depending on the timing #37204

zmoog commented Nov 26, 2023 •

edited

Loading

zmoog commented Nov 26, 2023 •

edited

Loading

tetianakravchenko commented Dec 8, 2023 •

edited

Loading

zmoog commented Dec 9, 2023

zmoog commented Dec 9, 2023 •

edited

Loading

zmoog commented Dec 9, 2023

zmoog commented Dec 9, 2023

zmoog commented Dec 12, 2023

tetianakravchenko commented Dec 12, 2023

Azure Monitor may skip a metric collection depending on the timing #37204

Azure Monitor may skip a metric collection depending on the timing #37204

Comments

zmoog commented Nov 26, 2023 • edited Loading

zmoog commented Nov 26, 2023 • edited Loading

tetianakravchenko commented Dec 8, 2023 • edited Loading

zmoog commented Dec 9, 2023

zmoog commented Dec 9, 2023 • edited Loading

zmoog commented Dec 9, 2023

zmoog commented Dec 9, 2023

zmoog commented Dec 12, 2023

tetianakravchenko commented Dec 12, 2023

zmoog commented Nov 26, 2023 •

edited

Loading

zmoog commented Nov 26, 2023 •

edited

Loading

tetianakravchenko commented Dec 8, 2023 •

edited

Loading

zmoog commented Dec 9, 2023 •

edited

Loading