Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

telemetry: periodic reporting #2721

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 9 additions & 0 deletions kubernetes/customresourcedefinitions.gen.yaml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

22 changes: 22 additions & 0 deletions proto.lock
Original file line number Diff line number Diff line change
Expand Up @@ -45106,6 +45106,28 @@
"name": "access_logging",
"type": "AccessLogging",
"is_repeated": true
},
{
"id": 5,
"name": "reporting_interval",
"type": "ReportingInterval"
}
],
"messages": [
{
"name": "ReportingInterval",
"fields": [
{
"id": 1,
"name": "tcp",
"type": "google.protobuf.Duration"
},
{
"id": 2,
"name": "http",
"type": "google.protobuf.Duration"
}
]
}
]
},
Expand Down
600 changes: 351 additions & 249 deletions telemetry/v1alpha1/telemetry.pb.go

Large diffs are not rendered by default.

60 changes: 58 additions & 2 deletions telemetry/v1alpha1/telemetry.pb.html

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

18 changes: 18 additions & 0 deletions telemetry/v1alpha1/telemetry.proto
Original file line number Diff line number Diff line change
Expand Up @@ -261,6 +261,23 @@ message Telemetry {
// Optional. AccessLogging configures the access logging behavior for all
// selected workloads.
repeated AccessLogging access_logging = 4;

// Reporting interval allows configuration of the time between the access log
// reports for the individual streams. Duration set to 0 disables the perodic
// reporting, and only the ends of the streams are reported.
message ReportingInterval {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be part of AccessLogging? IIUC, this config is only applicable for access logs. If access logs is not enabled, this config would not be useful, correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, worth to explain why user would want to customize this for what type of workloads. And why they would consider disable it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to have a discussion about what we plan to to about OTel and all the standards and integrations they are driving.

https://opentelemetry.io/docs/reference/specification/sdk-environment-variables/ has the set of options OTel supports - OTEL_METRIC_EXPORT_INTERVAL is the equivalent setting that an app using otel natively would use.

In otel it applies to all push-based metric exporters - logs and tracers don't seem to have this ( but have batch and other settings ).

IMO we should consider the API surface of OTel and not try to have more complicated settings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@linsun AccessLogging is a list, and the setting is global in Envoy. Hence I pulled it to the top. Users shouldn't generally touch this unless they want to disable it (because they want to limit telemetry). It cannot be made Otel specific, since the log production is driven by Envoy and the push is driven by OTel.

@costinm That env seems unrelated - it's about metric push, not production again. See the discussion about IO flush - they are not related at all.

The closest semantic convention I could find is modeling stream events in Otel https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/semantic_conventions/events.md. We'd define an event.name for stream open/mid/close, and report the generic attributes associated with the stream.

// Optional. Reporting interval for TCP streams. The default value is 5s.
google.protobuf.Duration tcp = 1;

// Optional. Reporting interval for HTTP streams. Not enabled by default.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both http 1 and 2?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why this would be here or why it matters what protocol is used. You want 30 seconds - using whatever protocol is available.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any HTTP, Envoy treats them the same.

Specialization by protocol is for backwards compatibility only - Istio does periodic reporting for TCP only right now.

30s is too long - users consider 10s extra latency unacceptable and without any telemetry during that time, we can't debug.

google.protobuf.Duration http = 2;
}

// Configuration for the interval reporting for the access logging and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean to "report" and access log at a reporting_interval: 5s? Does that mean if I open a connection and send no data, it will report and access log every 5s?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's what Istio does today?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I know access logs are only printed once per connection...?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the demo stdout log. The production stackdriver log reports periodically.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we note, in the proto, this? Stdout is not seen as "demo" as well, its pretty commonly used in production (https://12factor.net/logs)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does any vendor rely on stdout for production? It's demo because you need another system to collect logs and we don't include it in Istio.

The point of this PR is to generalize what we do for stackdriver as a good production practive to every other logging sink including Otel and stdout. Is that what you want me to say?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does any vendor rely on stdout for production?

Yes, we do 🙂

I want the PR to explain what the API does, and give guidance to a user on how they can configure it, which is the standard for API documentation.

I am still not sure I even understand. "including Otel and stdout" - I thought we were saying this does not apply to stdout?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Google doesn't depend on stdout logging, not sure why you claim so. It shouldn't since stdout will hurt performance.

The setting is global - it applies to all Envoy telemetry sinks, including stdout and otel. SD implementation is an odd ball but it's Google's problem and will be fixed later.

The API is about enabling periodic reporting for long living streams, and it's opt-in. What is not clear about it? Are you questioning the value of the periodic reporting or the shape of the API?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depends - many apps report logs to stdout and this gets collected and pushed to stackdriver.

I think we need to be compatible ( and adopt ) OTel, including integration with their collectors if available - and not expose APIs that would not be implementable or needed.

Maybe in mesh config for the other integrations - but even for that I doubt we should expose them.

All APIs require testing, support and are hard to deprecated.

// metrics. Telemetry is reported independently per each data stream, at
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this for metrics also? Earlier only access logs were mentioned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some metric sinks are implemented as access loggers. It's deliberately vague since it's implementation-dependent.

// the specified rate. This setting is independent of the flush rate of the
// telemetry sinks.
ReportingInterval reporting_interval = 5;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be in top level or inside each struct? You may want different settings for each. Stdout logs we set flush interval to 1s for example, but that would be aggressive for a write to a remote server.

Does this overwrite --file-flush-interval-msec?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not supported. Most people have only one sink enabled (telemetry costs $$$) and we should cater to them and not complicate things for imaginary use cases.

This setting has nothing to do with IO flush.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you might be misunderstanding periodic reporting. It's reporting per data stream, not per sink stream

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am misunderstanding the surely our users will 🙂 . Can we clarify the comments a bit so it's understandable for someone that doesn't have prior context?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also if it's not related to IO flush that seems like something to be called out in the comments. I guess that means it I use Telemetry to turn on stdout logging and set this field it has no impact?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned that the IO flush rate is completely independent. Istio reports telemetry per stream, not per sink.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading the .proto file I still have no clue what this does

  // Reporting interval allows configuration of the time between the access log
  // reports for the individual streams.

But you are saying its NOT related to the IO flush rate, which is, by my understanding, the interval at which access logs are written to stdout?

In Envoy, is this config global or something? What happens if I have 2 metrics backends configured by 2 Telemetry with 2 different configs? Or 2 telemetry, 1 configuring metrics and 1 configuring access logs.

TBH I have read this proto quite a few times and I don't think I am any closer to understanding what it does, and how or why I should set it. Can we write the comments from the perspective of a user to help guide them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a new feature - this existed in Istio since 1.0 for TCP. IO flush rate has nothing to do with telemetry production - it's a rate of delivering telemetry not production. Users never complained about this being confusing :/

In Envoy, is this config global or something? What happens if I have 2 metrics backends configured by 2 Telemetry with 2 different configs? Or 2 telemetry, 1 configuring metrics and 1 configuring access logs.

It is a global setting per listener.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a list of telemetry providers - and how many can implement this feature ? From what I've seen, if we use OTel this will not be configurable per app - but the per-node or cluster collector will handle most of this.

For metric push - I see OTel has it so I don't mind adding it, but for the others I would not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a list of telemetry providers - and how many can implement this feature ? From what I've seen, if we use OTel this will not be configurable per app - but the per-node or cluster collector will handle most of this.

I think all of them would be supported, but they need to be modified to properly report mid-stream telemetry. OTel would need a special field to indicate whether it's end of stream or mid-stream.

For metric push - I see OTel has it so I don't mind adding it, but for the others I would not.

Metrics are somewhat irrelevant actually, it's an artifact of Istio quirky stackdriver/prometheus.
Standard envoy metrics report immediately, and Istio metric sinks act like access loggers, so they get snapshot of the data at the same rate as this value.

}

// Tracing configures tracing behavior for workloads within a mesh.
Expand Down Expand Up @@ -401,6 +418,7 @@ message Metrics {
// Optional. Reporting interval allows configuration of the time between calls out to for metrics reporting.
// This currently only supports TCP metrics but we may use this for long duration HTTP streams in the future.
// The default duration is `5s`.
// DEPRECATED. Please use Telemetry.reporting_interval.tcp field instead.
google.protobuf.Duration reporting_interval = 3;
}

Expand Down
21 changes: 21 additions & 0 deletions telemetry/v1alpha1/telemetry_deepcopy.gen.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

11 changes: 11 additions & 0 deletions telemetry/v1alpha1/telemetry_json.gen.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.