Azure Monitor may skip a metric collection depending on the timing #37204
The root cause of this behavior is setting […]. To avoid being affected by timing, I set […]. This approach is working fine in the PR #37203.
@tetianakravchenko, I ported the PR #37365 changes to 8.11 and built a custom agent for testing:
Changes:
It's a drop-in replacement for the agent running on your local stack. You can stop or assign another policy to the current agent, and start a new one.
Can you try 8.11.3-SNAPSHOT in the environment where you get gaps?
@tetianakravchenko, could you test this image on your local stack?
Changes:
In this build, I'm keeping the jitter value at 1 second.
@zmoog
### What

Change the `MetricRegistry.NeedsUpdate()` method to decide whether to collect the metrics by comparing the time since the last collection with the time grain. If the time since the last collection is shorter than the time grain duration, the metricset skips the collection. For example, consider the following scenarios:

#### Scenario A: collect PT1M metrics every 60s

- time grain: PT1M (one minute, or 60s)
- collection interval: 60s

In this case, the time since the last collection is never shorter than the time grain, so the metricset fetches metric values on every collection.

#### Scenario B: collect PT5M metrics every 60s

- time grain: PT5M (five minutes, or 300s)
- collection interval: 60s

In this case, the time since the last collection is shorter than the time grain (60s, 120s, 180s, 240s) for four collections. The metricset fetches metric values every fifth collection.

#### The jitter

During our tests, we noticed the collection scheduling had some variations, causing the time since the last collection to be shorter than expected by a few milliseconds. To compensate for these scheduling fluctuations, the function also adds a short jitter duration (1 second), avoiding false positives due to small fluctuations in collection scheduling.

### Why

During a testing session on 8.11.2, we [noticed](#37204 (comment)) that one out of four agents skipped some metrics collections. The debug logs revealed the metricset skipped collections due to a 1-second difference between the reference times of the current and previous collections (299s instead of 300s).

![CleanShot 2023-12-08 at 20 13 19](https://github.com/elastic/beats/assets/25941/dc3d5040-c89b-47d2-a86a-124eb838ca36)

The 1-second difference can happen due to an inaccurate rounding of the reference time. For example, suppose the following two events occur:

1. Metricbeat calls `Fetch()` on the metricset a few milliseconds earlier than in the previous collection.
2. The timestamp is 2023-12-08T10:58:32.999Z.

In this case, the reference time becomes 2023-12-08T10:58:32.000Z due to the truncation.

This problem happened to one test agent. However, if it happens to one agent, it can happen to others.
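To make the decision concrete, here is a minimal, self-contained sketch of the check described above. The names (`needsUpdate`, `jitter`) and the example values are illustrative only; the actual `MetricRegistry.NeedsUpdate()` implementation in `azure/metric_registry.go` differs in its details.

```go
// Minimal sketch of the skip-or-collect decision and the jitter.
// Names and values are illustrative, not the actual beats code.
package main

import (
	"fmt"
	"time"
)

// jitter compensates for small fluctuations in collection scheduling.
const jitter = 1 * time.Second

// needsUpdate reports whether the metric should be collected again, by
// comparing the time elapsed since the last collection (plus the jitter)
// with the metric's time grain duration (e.g. 300s for PT5M).
func needsUpdate(referenceTime, lastCollectionTime time.Time, timeGrain time.Duration) bool {
	elapsed := referenceTime.Sub(lastCollectionTime) + jitter
	return elapsed >= timeGrain
}

func main() {
	last := time.Date(2023, 12, 8, 10, 53, 33, 0, time.UTC)

	// A reference time that is 1 second "early" (299s instead of 300s)
	// would skip the collection without the jitter.
	early := last.Add(299 * time.Second)
	fmt.Println(needsUpdate(early, last, 5*time.Minute)) // true, thanks to the 1s jitter

	// Truncation vs. rounding of the reference time: truncating
	// 10:58:32.999Z drops almost a full second.
	t, _ := time.Parse(time.RFC3339Nano, "2023-12-08T10:58:32.999Z")
	fmt.Println(t.Truncate(time.Second)) // 2023-12-08 10:58:32 +0000 UTC
	fmt.Println(t.Round(time.Second))    // 2023-12-08 10:58:33 +0000 UTC
}
```

The jitter is a pragmatic trade-off: it absorbs millisecond-level scheduling drift without meaningfully shortening the effective time grain.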
### Extended Structured Logging

We also added new fields to the debug structured logs:

```shell
$ cat metricbeat.log.ndjson | grep "MetricRegistry" | head -n 1 | jq
{
  "log.level": "debug",
  "@timestamp": "2024-01-05T15:03:12.235+0100",
  "log.logger": "azure monitor client",
  "log.origin": {
    "function": "github.com/elastic/beats/v7/x-pack/metricbeat/module/azure.(*MetricRegistry).NeedsUpdate",
    "file.name": "azure/metric_registry.go",
    "file.line": 80
  },
  "message": "MetricRegistry: Metric needs an update",
  "service.name": "metricbeat",
  "needs_update": true,
  "reference_time": "2024-01-05T14:03:07.197Z",
  "last_collection_time": "2024-01-05T14:02:07.199Z",
  "time_since_last_collection_seconds": 66.035681,
  "time_grain": "PT1M",
  "time_grain_duration_seconds": 60,
  "resource_id": "/subscriptions/123/resourceGroups/crest-test-lens-migration/providers/Microsoft.Compute/virtualMachines/rajvi-test-vm",
  "namespace": "Microsoft.Compute/virtualMachines",
  "aggregation": "Total",
  "names": "Network In,Network Out,Disk Read Bytes,Disk Write Bytes,Network In Total,Network Out Total",
  "ecs.version": "1.6.0"
}
```

Here's an example using `jq`:

```shell
$ cat metricbeat.log.ndjson | grep "MetricRegistry" | jq -r '[.namespace, .aggregation, .needs_update, .reference_time, .last_collection_time//"na", .time_since_last_collection_seconds//"na", .time_grain_duration_seconds//"na", .time_grain] | @TSV' | grep Microsoft.Compute/virtualMachines
.namespace                          .aggregation  .needs_update  .reference_time           .last_collection_time     .time_since_last_collection_seconds  .time_grain_duration_seconds  .time_grain
Microsoft.Compute/virtualMachines   Average       true           2024-01-05T14:16:07.193Z  2024-01-05T14:15:07.193Z  60.999661                            60                            PT1M
Microsoft.Compute/virtualMachines   Total         true           2024-01-05T14:16:07.193Z  2024-01-05T14:15:07.193Z  61.795341                            60                            PT1M
Microsoft.Compute/virtualMachines   Average       true           2024-01-05T14:16:07.193Z  2024-01-05T14:15:07.193Z  62.080088                            60                            PT1M
Microsoft.Compute/virtualMachines   Total         true           2024-01-05T14:16:07.193Z  2024-01-05T14:15:07.193Z  64.929579                            60                            PT1M
Microsoft.Compute/virtualMachines   Average       true           2024-01-05T14:16:07.193Z  2024-01-05T14:15:07.193Z  65.632209                            60                            PT1M
Microsoft.Compute/virtualMachines   Total         true           2024-01-05T14:16:07.193Z  2024-01-05T14:15:07.193Z  67.832918                            60                            PT1M
Microsoft.Compute/virtualMachines   Average       true           2024-01-05T14:16:07.193Z  2024-01-05T14:15:07.193Z  68.576239                            60                            PT1M
Microsoft.Compute/virtualMachines   Total         true           2024-01-05T14:16:07.193Z  2024-01-05T14:15:07.193Z  69.927988                            60                            PT1M
Microsoft.Compute/virtualMachines   Total         true           2024-01-05T14:16:07.193Z  2024-01-05T14:15:07.193Z  70.351148                            60                            PT1M
Microsoft.Compute/virtualMachines   Average       true           2024-01-05T14:16:07.193Z  2024-01-05T14:15:07.193Z  70.872058                            60                            PT1M
Microsoft.Compute/virtualMachines   Average       true           2024-01-05T14:16:07.193Z  2024-01-05T14:15:07.193Z  72.47401                             60                            PT1M
Microsoft.Compute/virtualMachines   Total         true           2024-01-05T14:16:07.193Z  2024-01-05T14:15:07.193Z  72.971242                            60                            PT1M
Microsoft.Compute/virtualMachines   Average       true           2024-01-05T14:16:07.193Z  2024-01-05T14:15:07.193Z  73.143605                            60                            PT1M
Microsoft.Compute/virtualMachines   Total         true           2024-01-05T14:16:07.193Z  2024-01-05T14:15:07.193Z  74.831489                            60                            PT1M
```
I found a regression on Azure Metrics while testing PR #37177, which backports #36823 to 8.11.
The metricset does not collect all the metric values during each iteration.
Here's what I can see on the Compute VM dashboard: