Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Azure Monitor may skip a metric collection depending on the timing #37204

Closed
zmoog opened this issue Nov 26, 2023 · 8 comments · Fixed by #37203 or #37365
Closed

Azure Monitor may skip a metric collection depending on the timing #37204

zmoog opened this issue Nov 26, 2023 · 8 comments · Fixed by #37203 or #37365
Assignees
Labels
bug Team:Cloud-Monitoring Label for the Cloud Monitoring team

Comments

@zmoog
Copy link
Contributor

zmoog commented Nov 26, 2023

I found a regression on Azure Metrics while testing the PR #37177 to backport #36823 to 8.11.

CleanShot 2023-11-26 at 09 27 09@2x

The metricset does not collect all the metric values during each iteration.

Here's what I can see on the Compute VM dashboard:

  1. Two consecutive collections, and then one skipped; then repeat.
  2. As a workaround, if I set the collection period to a lower value than the smallest time grain — for example, from 300s to 60s — I get data for every metric with a time grain of PT5M or greater.
@zmoog zmoog self-assigned this Nov 26, 2023
@zmoog zmoog added the bug label Nov 26, 2023
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 26, 2023
@zmoog
Copy link
Contributor Author

zmoog commented Nov 26, 2023

The root cause of this behavior is setting referenceTime to "now" at the beginning of every metric processing.

To avoid being affected by timing, I set referenceTime once (truncated to seconds) at the beginning of each collection, and use this value only when the metricset need to decide if collect the values or not.

This approach is working fine in the PR #37203

@tetianakravchenko
Copy link
Contributor

tetianakravchenko commented Dec 8, 2023

I've tested version 8.11.2 and still can reproduce the issue:

  • I've enabled all datastreams with the default collection period (300s) and could see the periodic gaps:
Screenshot 2023-12-08 at 12 20 20

After changing collection period to 60s on the compute_vm datastream and running it for 2h:
TODO

additionally:

  • it would be helpful during the debugging to add the metricset name to the structured logs

@zmoog
Copy link
Contributor Author

zmoog commented Dec 9, 2023

@tetianakravchenko, I ported the PR #37365 changes to 8.11 and build a a custom agent for testing:

  • version: 8.11.3-SNAPSHOT
  • image id: 51ad6522096f

Changes:

  • Round reference time to the nearest second instead of truncating it.

@zmoog
Copy link
Contributor Author

zmoog commented Dec 9, 2023

It's a drop in replacement for the agent running on your local stack.

You can stop or assign another policy to the current agent, and start a new zmoog/elastic-agent:8.11.3-SNAPSHOT agent with:

export ENROLLMENT_TOKEN="<token>"

docker run  --rm \
  --network elastic-package-stack_default \
  -e FLEET_URL=https://fleet-server:8220 \
  -e FLEET_ENROLL=true \
  -e FLEET_INSECURE=true \
  -e FLEET_ENROLLMENT_TOKEN=${ENROLLMENT_TOKEN} \
  -i zmoog/elastic-agent:8.11.3-SNAPSHOT

@zmoog
Copy link
Contributor Author

zmoog commented Dec 9, 2023

I am still not able to reproduce the issue running agent 8.11.2 on my three testing agents:

  • local stack: macOS Sonoma / arm
  • Mac mini: macOS Sonoma / intel
  • EC2

I have no gaps.

I ran 8.11.3-SNAPSHOT on the local stack and EC2 using two different Azure account and I get no gaps:

CleanShot 2023-12-09 at 08 48 39@2x

CleanShot 2023-12-09 at 08 49 13@2x

@zmoog
Copy link
Contributor Author

zmoog commented Dec 9, 2023

Can you try 8.11.3-SNAPSHOT in the environment where you get gaps?

@zmoog
Copy link
Contributor Author

zmoog commented Dec 12, 2023

@tetianakravchenko, could you test this image on your local stack?

  • version: 8.11.3-SNAPSHOT
  • image id: d1ef3377af3b

Changes:

  • remove rounding and switch to timestamp comparison with a 1-second jitter.

In this build, I'm keeping the jitter value to 1-second.

@zmoog zmoog added the Team:Cloud-Monitoring Label for the Cloud Monitoring team label Dec 12, 2023
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Dec 12, 2023
@tetianakravchenko
Copy link
Contributor

@zmoog
after about 1,5h of running - no gaps:
Screenshot 2023-12-12 at 20 28 05

needs_update: false I see only for the metrics with time grain 1H

zmoog added a commit that referenced this issue Jan 5, 2024
### What

Change the `MetricRegistry.NeedsUpdate()` method to decide whether to collect the metrics by comparing the collection interval with the time grain. 

If the time since the last collection < time grain duration, then the metrics skip the collection.

For example, given the following scenario:

#### Scenario A: collect PT1M metrics every 60s

- time grain: PT1M (one minute, or 60s)
- collection interval: 60s

In this case, the time since the last collection is never shorter than the time grain, so the metricset fetch metric values on every collection.

#### Scenario B: collect PT15M metrics every 60s

- time grain: PT5M (five minutes, or 300s)
- collection interval: 60s

In this case, the time since the last collection is shorter (60s, 120s, 180s, 240s) than the time grain for four collections. The metricset fetch metric values every five collections.

#### The jitter

During our tests, we noticed the collection scheduling had some variations, causing the time since the last collection to be shorter than expected by a few milliseconds. To compensate for these scheduling fluctuations, the function also adds a short jitter duration (1 second) to avoid false positives due to small fluctuations in collection scheduling. 

### Why

During a testing session on 8.11.2, we [noticed](#37204 (comment)) one out of four agents skipped some metrics collections.

The debug logs revealed the metricset skipped collections due to a 1-second difference between the reference time for the current and previous collections (299s instead of 300s).

![CleanShot 2023-12-08 at 20 13 19](https://github.com/elastic/beats/assets/25941/dc3d5040-c89b-47d2-a86a-124eb838ca36)

The 1-second difference may happen due to an inaccurate rounding in the reference time.

For example, suppose the following two events occur:

1. Metricbeat calls `Fetch()` on the metricset a few milliseconds earlier than in the previous collection.
2. The timestamp is 2023-12-08T10:58:32.999Z. 

In this case, the reference time becomes 2023-12-08T10:58:32.000Z due to the truncation.

This problem happened to one test agent. However, if it happens to one agent, it can happen to others.

### Extended Structured Logging

We also added new fields to the debug structured logs:

```shell
$ cat metricbeat.log.ndjson | grep "MetricRegistry" | head -n 1 | jq                                                                                                                                                    
{
  "log.level": "debug",
  "@timestamp": "2024-01-05T15:03:12.235+0100",
  "log.logger": "azure monitor client",
  "log.origin": {
    "function": "github.com/elastic/beats/v7/x-pack/metricbeat/module/azure.(*MetricRegistry).NeedsUpdate",
    "file.name": "azure/metric_registry.go",
    "file.line": 80
  },
  "message": "MetricRegistry: Metric needs an update",
  "service.name": "metricbeat",
  "needs_update": true,
  "reference_time": "2024-01-05T14:03:07.197Z",
  "last_collection_time": "2024-01-05T14:02:07.199Z",
  "time_since_last_collection_seconds": 66.035681,
  "time_grain": "PT1M",
  "time_grain_duration_seconds": 60,
  "resource_id": "/subscriptions/123/resourceGroups/crest-test-lens-migration/providers/Microsoft.Compute/virtualMachines/rajvi-test-vm",
  "namespace": "Microsoft.Compute/virtualMachines",
  "aggregation": "Total",
  "names": "Network In,Network Out,Disk Read Bytes,Disk Write Bytes,Network In Total,Network Out Total",
  "ecs.version": "1.6.0"
}
```

Here's an example using `jq`:

```shell
$ cat metricbeat.log.ndjson | grep "MetricRegistry" | jq -r  '[.namespace, .aggregation, .needs_update, .reference_time, .last_collection_time//"na", .time_since_last_collection_seconds//"na", .time_grain_duration_seconds//"na", .time_grain] | @TSV' | grep Microsoft.Compute/virtualMachines

.aggregation                            aggregation   .needs_update   .reference_time                 .last_collection_time           time_since_last_collection_seconds .time_grain_duration_seconds .time_grain
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        60.999661                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        61.795341                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        62.080088                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        64.929579                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        65.632209                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        67.832918                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        68.576239                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        69.927988                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        70.351148                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        70.872058                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        72.47401                           60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        72.971242                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        73.143605                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        74.831489                          60                           PT1M
```
mergify bot pushed a commit that referenced this issue Jan 5, 2024
### What

Change the `MetricRegistry.NeedsUpdate()` method to decide whether to collect the metrics by comparing the collection interval with the time grain.

If the time since the last collection < time grain duration, then the metrics skip the collection.

For example, given the following scenario:

#### Scenario A: collect PT1M metrics every 60s

- time grain: PT1M (one minute, or 60s)
- collection interval: 60s

In this case, the time since the last collection is never shorter than the time grain, so the metricset fetch metric values on every collection.

#### Scenario B: collect PT15M metrics every 60s

- time grain: PT5M (five minutes, or 300s)
- collection interval: 60s

In this case, the time since the last collection is shorter (60s, 120s, 180s, 240s) than the time grain for four collections. The metricset fetch metric values every five collections.

#### The jitter

During our tests, we noticed the collection scheduling had some variations, causing the time since the last collection to be shorter than expected by a few milliseconds. To compensate for these scheduling fluctuations, the function also adds a short jitter duration (1 second) to avoid false positives due to small fluctuations in collection scheduling.

### Why

During a testing session on 8.11.2, we [noticed](#37204 (comment)) one out of four agents skipped some metrics collections.

The debug logs revealed the metricset skipped collections due to a 1-second difference between the reference time for the current and previous collections (299s instead of 300s).

![CleanShot 2023-12-08 at 20 13 19](https://github.com/elastic/beats/assets/25941/dc3d5040-c89b-47d2-a86a-124eb838ca36)

The 1-second difference may happen due to an inaccurate rounding in the reference time.

For example, suppose the following two events occur:

1. Metricbeat calls `Fetch()` on the metricset a few milliseconds earlier than in the previous collection.
2. The timestamp is 2023-12-08T10:58:32.999Z.

In this case, the reference time becomes 2023-12-08T10:58:32.000Z due to the truncation.

This problem happened to one test agent. However, if it happens to one agent, it can happen to others.

### Extended Structured Logging

We also added new fields to the debug structured logs:

```shell
$ cat metricbeat.log.ndjson | grep "MetricRegistry" | head -n 1 | jq
{
  "log.level": "debug",
  "@timestamp": "2024-01-05T15:03:12.235+0100",
  "log.logger": "azure monitor client",
  "log.origin": {
    "function": "github.com/elastic/beats/v7/x-pack/metricbeat/module/azure.(*MetricRegistry).NeedsUpdate",
    "file.name": "azure/metric_registry.go",
    "file.line": 80
  },
  "message": "MetricRegistry: Metric needs an update",
  "service.name": "metricbeat",
  "needs_update": true,
  "reference_time": "2024-01-05T14:03:07.197Z",
  "last_collection_time": "2024-01-05T14:02:07.199Z",
  "time_since_last_collection_seconds": 66.035681,
  "time_grain": "PT1M",
  "time_grain_duration_seconds": 60,
  "resource_id": "/subscriptions/123/resourceGroups/crest-test-lens-migration/providers/Microsoft.Compute/virtualMachines/rajvi-test-vm",
  "namespace": "Microsoft.Compute/virtualMachines",
  "aggregation": "Total",
  "names": "Network In,Network Out,Disk Read Bytes,Disk Write Bytes,Network In Total,Network Out Total",
  "ecs.version": "1.6.0"
}
```

Here's an example using `jq`:

```shell
$ cat metricbeat.log.ndjson | grep "MetricRegistry" | jq -r  '[.namespace, .aggregation, .needs_update, .reference_time, .last_collection_time//"na", .time_since_last_collection_seconds//"na", .time_grain_duration_seconds//"na", .time_grain] | @TSV' | grep Microsoft.Compute/virtualMachines

.aggregation                            aggregation   .needs_update   .reference_time                 .last_collection_time           time_since_last_collection_seconds .time_grain_duration_seconds .time_grain
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        60.999661                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        61.795341                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        62.080088                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        64.929579                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        65.632209                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        67.832918                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        68.576239                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        69.927988                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        70.351148                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        70.872058                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        72.47401                           60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        72.971242                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        73.143605                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        74.831489                          60                           PT1M
```

(cherry picked from commit 824dd04)
mergify bot pushed a commit that referenced this issue Jan 5, 2024
### What

Change the `MetricRegistry.NeedsUpdate()` method to decide whether to collect the metrics by comparing the collection interval with the time grain.

If the time since the last collection < time grain duration, then the metrics skip the collection.

For example, given the following scenario:

#### Scenario A: collect PT1M metrics every 60s

- time grain: PT1M (one minute, or 60s)
- collection interval: 60s

In this case, the time since the last collection is never shorter than the time grain, so the metricset fetch metric values on every collection.

#### Scenario B: collect PT15M metrics every 60s

- time grain: PT5M (five minutes, or 300s)
- collection interval: 60s

In this case, the time since the last collection is shorter (60s, 120s, 180s, 240s) than the time grain for four collections. The metricset fetch metric values every five collections.

#### The jitter

During our tests, we noticed the collection scheduling had some variations, causing the time since the last collection to be shorter than expected by a few milliseconds. To compensate for these scheduling fluctuations, the function also adds a short jitter duration (1 second) to avoid false positives due to small fluctuations in collection scheduling.

### Why

During a testing session on 8.11.2, we [noticed](#37204 (comment)) one out of four agents skipped some metrics collections.

The debug logs revealed the metricset skipped collections due to a 1-second difference between the reference time for the current and previous collections (299s instead of 300s).

![CleanShot 2023-12-08 at 20 13 19](https://github.com/elastic/beats/assets/25941/dc3d5040-c89b-47d2-a86a-124eb838ca36)

The 1-second difference may happen due to an inaccurate rounding in the reference time.

For example, suppose the following two events occur:

1. Metricbeat calls `Fetch()` on the metricset a few milliseconds earlier than in the previous collection.
2. The timestamp is 2023-12-08T10:58:32.999Z.

In this case, the reference time becomes 2023-12-08T10:58:32.000Z due to the truncation.

This problem happened to one test agent. However, if it happens to one agent, it can happen to others.

### Extended Structured Logging

We also added new fields to the debug structured logs:

```shell
$ cat metricbeat.log.ndjson | grep "MetricRegistry" | head -n 1 | jq
{
  "log.level": "debug",
  "@timestamp": "2024-01-05T15:03:12.235+0100",
  "log.logger": "azure monitor client",
  "log.origin": {
    "function": "github.com/elastic/beats/v7/x-pack/metricbeat/module/azure.(*MetricRegistry).NeedsUpdate",
    "file.name": "azure/metric_registry.go",
    "file.line": 80
  },
  "message": "MetricRegistry: Metric needs an update",
  "service.name": "metricbeat",
  "needs_update": true,
  "reference_time": "2024-01-05T14:03:07.197Z",
  "last_collection_time": "2024-01-05T14:02:07.199Z",
  "time_since_last_collection_seconds": 66.035681,
  "time_grain": "PT1M",
  "time_grain_duration_seconds": 60,
  "resource_id": "/subscriptions/123/resourceGroups/crest-test-lens-migration/providers/Microsoft.Compute/virtualMachines/rajvi-test-vm",
  "namespace": "Microsoft.Compute/virtualMachines",
  "aggregation": "Total",
  "names": "Network In,Network Out,Disk Read Bytes,Disk Write Bytes,Network In Total,Network Out Total",
  "ecs.version": "1.6.0"
}
```

Here's an example using `jq`:

```shell
$ cat metricbeat.log.ndjson | grep "MetricRegistry" | jq -r  '[.namespace, .aggregation, .needs_update, .reference_time, .last_collection_time//"na", .time_since_last_collection_seconds//"na", .time_grain_duration_seconds//"na", .time_grain] | @TSV' | grep Microsoft.Compute/virtualMachines

.aggregation                            aggregation   .needs_update   .reference_time                 .last_collection_time           time_since_last_collection_seconds .time_grain_duration_seconds .time_grain
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        60.999661                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        61.795341                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        62.080088                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        64.929579                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        65.632209                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        67.832918                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        68.576239                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        69.927988                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        70.351148                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        70.872058                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        72.47401                           60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        72.971242                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        73.143605                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        74.831489                          60                           PT1M
```

(cherry picked from commit 824dd04)
zmoog added a commit that referenced this issue Jan 6, 2024
…ics (#37557)

* Fix the reference time rounding on Azure Metrics (#37365)

### What

Change the `MetricRegistry.NeedsUpdate()` method to decide whether to collect the metrics by comparing the collection interval with the time grain.

If the time since the last collection < time grain duration, then the metrics skip the collection.

For example, given the following scenario:

#### Scenario A: collect PT1M metrics every 60s

- time grain: PT1M (one minute, or 60s)
- collection interval: 60s

In this case, the time since the last collection is never shorter than the time grain, so the metricset fetch metric values on every collection.

#### Scenario B: collect PT15M metrics every 60s

- time grain: PT5M (five minutes, or 300s)
- collection interval: 60s

In this case, the time since the last collection is shorter (60s, 120s, 180s, 240s) than the time grain for four collections. The metricset fetch metric values every five collections.

#### The jitter

During our tests, we noticed the collection scheduling had some variations, causing the time since the last collection to be shorter than expected by a few milliseconds. To compensate for these scheduling fluctuations, the function also adds a short jitter duration (1 second) to avoid false positives due to small fluctuations in collection scheduling.

### Why

During a testing session on 8.11.2, we [noticed](#37204 (comment)) one out of four agents skipped some metrics collections.

The debug logs revealed the metricset skipped collections due to a 1-second difference between the reference time for the current and previous collections (299s instead of 300s).

![CleanShot 2023-12-08 at 20 13 19](https://github.com/elastic/beats/assets/25941/dc3d5040-c89b-47d2-a86a-124eb838ca36)

The 1-second difference may happen due to an inaccurate rounding in the reference time.

For example, suppose the following two events occur:

1. Metricbeat calls `Fetch()` on the metricset a few milliseconds earlier than in the previous collection.
2. The timestamp is 2023-12-08T10:58:32.999Z.

In this case, the reference time becomes 2023-12-08T10:58:32.000Z due to the truncation.

This problem happened to one test agent. However, if it happens to one agent, it can happen to others.

### Extended Structured Logging

We also added new fields to the debug structured logs:

```shell
$ cat metricbeat.log.ndjson | grep "MetricRegistry" | head -n 1 | jq
{
  "log.level": "debug",
  "@timestamp": "2024-01-05T15:03:12.235+0100",
  "log.logger": "azure monitor client",
  "log.origin": {
    "function": "github.com/elastic/beats/v7/x-pack/metricbeat/module/azure.(*MetricRegistry).NeedsUpdate",
    "file.name": "azure/metric_registry.go",
    "file.line": 80
  },
  "message": "MetricRegistry: Metric needs an update",
  "service.name": "metricbeat",
  "needs_update": true,
  "reference_time": "2024-01-05T14:03:07.197Z",
  "last_collection_time": "2024-01-05T14:02:07.199Z",
  "time_since_last_collection_seconds": 66.035681,
  "time_grain": "PT1M",
  "time_grain_duration_seconds": 60,
  "resource_id": "/subscriptions/123/resourceGroups/crest-test-lens-migration/providers/Microsoft.Compute/virtualMachines/rajvi-test-vm",
  "namespace": "Microsoft.Compute/virtualMachines",
  "aggregation": "Total",
  "names": "Network In,Network Out,Disk Read Bytes,Disk Write Bytes,Network In Total,Network Out Total",
  "ecs.version": "1.6.0"
}
```

Here's an example using `jq`:

```shell
$ cat metricbeat.log.ndjson | grep "MetricRegistry" | jq -r  '[.namespace, .aggregation, .needs_update, .reference_time, .last_collection_time//"na", .time_since_last_collection_seconds//"na", .time_grain_duration_seconds//"na", .time_grain] | @TSV' | grep Microsoft.Compute/virtualMachines

.aggregation                            aggregation   .needs_update   .reference_time                 .last_collection_time           time_since_last_collection_seconds .time_grain_duration_seconds .time_grain
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        60.999661                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        61.795341                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        62.080088                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        64.929579                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        65.632209                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        67.832918                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        68.576239                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        69.927988                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        70.351148                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        70.872058                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        72.47401                           60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        72.971242                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        73.143605                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        74.831489                          60                           PT1M
```

(cherry picked from commit 824dd04)

* Remove extra changelog entry

---------

Co-authored-by: Maurizio Branca <maurizio.branca@elastic.co>
zmoog added a commit that referenced this issue Jan 6, 2024
### What

Change the `MetricRegistry.NeedsUpdate()` method to decide whether to collect the metrics by comparing the collection interval with the time grain.

If the time since the last collection < time grain duration, then the metrics skip the collection.

For example, given the following scenario:

#### Scenario A: collect PT1M metrics every 60s

- time grain: PT1M (one minute, or 60s)
- collection interval: 60s

In this case, the time since the last collection is never shorter than the time grain, so the metricset fetch metric values on every collection.

#### Scenario B: collect PT15M metrics every 60s

- time grain: PT5M (five minutes, or 300s)
- collection interval: 60s

In this case, the time since the last collection is shorter (60s, 120s, 180s, 240s) than the time grain for four collections. The metricset fetch metric values every five collections.

#### The jitter

During our tests, we noticed the collection scheduling had some variations, causing the time since the last collection to be shorter than expected by a few milliseconds. To compensate for these scheduling fluctuations, the function also adds a short jitter duration (1 second) to avoid false positives due to small fluctuations in collection scheduling.

### Why

During a testing session on 8.11.2, we [noticed](#37204 (comment)) one out of four agents skipped some metrics collections.

The debug logs revealed the metricset skipped collections due to a 1-second difference between the reference time for the current and previous collections (299s instead of 300s).

![CleanShot 2023-12-08 at 20 13 19](https://github.com/elastic/beats/assets/25941/dc3d5040-c89b-47d2-a86a-124eb838ca36)

The 1-second difference may happen due to an inaccurate rounding in the reference time.

For example, suppose the following two events occur:

1. Metricbeat calls `Fetch()` on the metricset a few milliseconds earlier than in the previous collection.
2. The timestamp is 2023-12-08T10:58:32.999Z.

In this case, the reference time becomes 2023-12-08T10:58:32.000Z due to the truncation.

This problem happened to one test agent. However, if it happens to one agent, it can happen to others.

### Extended Structured Logging

We also added new fields to the debug structured logs:

```shell
$ cat metricbeat.log.ndjson | grep "MetricRegistry" | head -n 1 | jq
{
  "log.level": "debug",
  "@timestamp": "2024-01-05T15:03:12.235+0100",
  "log.logger": "azure monitor client",
  "log.origin": {
    "function": "github.com/elastic/beats/v7/x-pack/metricbeat/module/azure.(*MetricRegistry).NeedsUpdate",
    "file.name": "azure/metric_registry.go",
    "file.line": 80
  },
  "message": "MetricRegistry: Metric needs an update",
  "service.name": "metricbeat",
  "needs_update": true,
  "reference_time": "2024-01-05T14:03:07.197Z",
  "last_collection_time": "2024-01-05T14:02:07.199Z",
  "time_since_last_collection_seconds": 66.035681,
  "time_grain": "PT1M",
  "time_grain_duration_seconds": 60,
  "resource_id": "/subscriptions/123/resourceGroups/crest-test-lens-migration/providers/Microsoft.Compute/virtualMachines/rajvi-test-vm",
  "namespace": "Microsoft.Compute/virtualMachines",
  "aggregation": "Total",
  "names": "Network In,Network Out,Disk Read Bytes,Disk Write Bytes,Network In Total,Network Out Total",
  "ecs.version": "1.6.0"
}
```

Here's an example using `jq`:

```shell
$ cat metricbeat.log.ndjson | grep "MetricRegistry" | jq -r  '[.namespace, .aggregation, .needs_update, .reference_time, .last_collection_time//"na", .time_since_last_collection_seconds//"na", .time_grain_duration_seconds//"na", .time_grain] | @TSV' | grep Microsoft.Compute/virtualMachines

.aggregation                            aggregation   .needs_update   .reference_time                 .last_collection_time           time_since_last_collection_seconds .time_grain_duration_seconds .time_grain
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        60.999661                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        61.795341                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        62.080088                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        64.929579                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        65.632209                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        67.832918                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        68.576239                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        69.927988                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        70.351148                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        70.872058                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        72.47401                           60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        72.971242                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        73.143605                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        74.831489                          60                           PT1M
```

(cherry picked from commit 824dd04)

Co-authored-by: Maurizio Branca <maurizio.branca@elastic.co>
Scholar-Li pushed a commit to Scholar-Li/beats that referenced this issue Feb 5, 2024
### What

Change the `MetricRegistry.NeedsUpdate()` method to decide whether to collect the metrics by comparing the collection interval with the time grain. 

If the time since the last collection < time grain duration, then the metrics skip the collection.

For example, given the following scenario:

#### Scenario A: collect PT1M metrics every 60s

- time grain: PT1M (one minute, or 60s)
- collection interval: 60s

In this case, the time since the last collection is never shorter than the time grain, so the metricset fetch metric values on every collection.

#### Scenario B: collect PT15M metrics every 60s

- time grain: PT5M (five minutes, or 300s)
- collection interval: 60s

In this case, the time since the last collection is shorter (60s, 120s, 180s, 240s) than the time grain for four collections. The metricset fetch metric values every five collections.

#### The jitter

During our tests, we noticed the collection scheduling had some variations, causing the time since the last collection to be shorter than expected by a few milliseconds. To compensate for these scheduling fluctuations, the function also adds a short jitter duration (1 second) to avoid false positives due to small fluctuations in collection scheduling. 

### Why

During a testing session on 8.11.2, we [noticed](elastic#37204 (comment)) one out of four agents skipped some metrics collections.

The debug logs revealed the metricset skipped collections due to a 1-second difference between the reference time for the current and previous collections (299s instead of 300s).

![CleanShot 2023-12-08 at 20 13 19](https://github.com/elastic/beats/assets/25941/dc3d5040-c89b-47d2-a86a-124eb838ca36)

The 1-second difference may happen due to an inaccurate rounding in the reference time.

For example, suppose the following two events occur:

1. Metricbeat calls `Fetch()` on the metricset a few milliseconds earlier than in the previous collection.
2. The timestamp is 2023-12-08T10:58:32.999Z. 

In this case, the reference time becomes 2023-12-08T10:58:32.000Z due to the truncation.

This problem happened to one test agent. However, if it happens to one agent, it can happen to others.

### Extended Structured Logging

We also added new fields to the debug structured logs:

```shell
$ cat metricbeat.log.ndjson | grep "MetricRegistry" | head -n 1 | jq                                                                                                                                                    
{
  "log.level": "debug",
  "@timestamp": "2024-01-05T15:03:12.235+0100",
  "log.logger": "azure monitor client",
  "log.origin": {
    "function": "github.com/elastic/beats/v7/x-pack/metricbeat/module/azure.(*MetricRegistry).NeedsUpdate",
    "file.name": "azure/metric_registry.go",
    "file.line": 80
  },
  "message": "MetricRegistry: Metric needs an update",
  "service.name": "metricbeat",
  "needs_update": true,
  "reference_time": "2024-01-05T14:03:07.197Z",
  "last_collection_time": "2024-01-05T14:02:07.199Z",
  "time_since_last_collection_seconds": 66.035681,
  "time_grain": "PT1M",
  "time_grain_duration_seconds": 60,
  "resource_id": "/subscriptions/123/resourceGroups/crest-test-lens-migration/providers/Microsoft.Compute/virtualMachines/rajvi-test-vm",
  "namespace": "Microsoft.Compute/virtualMachines",
  "aggregation": "Total",
  "names": "Network In,Network Out,Disk Read Bytes,Disk Write Bytes,Network In Total,Network Out Total",
  "ecs.version": "1.6.0"
}
```

Here's an example using `jq`:

```shell
$ cat metricbeat.log.ndjson | grep "MetricRegistry" | jq -r  '[.namespace, .aggregation, .needs_update, .reference_time, .last_collection_time//"na", .time_since_last_collection_seconds//"na", .time_grain_duration_seconds//"na", .time_grain] | @TSV' | grep Microsoft.Compute/virtualMachines

.aggregation                            aggregation   .needs_update   .reference_time                 .last_collection_time           time_since_last_collection_seconds .time_grain_duration_seconds .time_grain
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        60.999661                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        61.795341                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        62.080088                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        64.929579                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        65.632209                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        67.832918                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        68.576239                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        69.927988                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        70.351148                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        70.872058                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        72.47401                           60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        72.971242                          60                           PT1M
Microsoft.Compute/virtualMachines       Average       true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        73.143605                          60                           PT1M
Microsoft.Compute/virtualMachines       Total         true            2024-01-05T14:16:07.193Z        2024-01-05T14:15:07.193Z        74.831489                          60                           PT1M
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Team:Cloud-Monitoring Label for the Cloud Monitoring team
Projects
None yet
2 participants