Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs(proposal): monitoring of custom resources #133

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

sebastiangaiser
Copy link

This proposal is related to strimzi/strimzi-kafka-operator#10276.
As this is my first proposal, so please let me know if any changes (also to the format) are needed.

@im-konge @scholzj can you please help me filling the TODO parts.

Thank you for your help and your work on this project.

Signed-off-by: Sebastian Gaiser <sebastiangaiser@users.noreply.github.com>
Copy link
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the proposal. I tried to answer the TODOs and left some more minor comments. But I think it looks good overall. Thanks.

## Current situation

Strimzi provides metrics for several use cases but currently not for the state of `KafkaTopic` and `KafkaUser` CRs.
TODO need input for the status of other CRs and their metrics, e.g. `Kafka`, `StrimziPodSet`, `KafkaRebalance`...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metric name is strimzi_resource_state. It is currently issued only by the Cluster Operator and only for the user level resources (e.g. Kafka, KafkaConnect ... but not for StrimziPodSet resources). If you wanna include some examples here:

$ curl -s localhost:8080/metrics | grep strimzi_resource_state
# HELP strimzi_resource_state Current state of the resource: 1 ready, 0 fail
# TYPE strimzi_resource_state gauge
strimzi_resource_state{kind="Kafka",name="my-cluster",reason="none",resource_namespace="myproject",} 1.0
strimzi_resource_state{kind="KafkaConnect",name="my-connect",reason="none",resource_namespace="myproject",} 1.0

Or as a screenshot:
Screenshot 2024-09-23 at 16 19 56


## Current situation

Strimzi provides metrics for several use cases but currently not for the state of `KafkaTopic` and `KafkaUser` CRs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can expand this a bit?

Suggested change
Strimzi provides metrics for several use cases but currently not for the state of `KafkaTopic` and `KafkaUser` CRs.
All Strimzi operators (Cluster, User and Topic operators) provide metrics in Prometheus format.
However, currently, only the Cluster Operator provides a metric (`strimzi_resource_state`) describing the state of the custom resources (whether they are ready or not).
It provides this metric for the `Kafka`, `KafkaConnect`, `KafkaBridge`, KafkaMirrorMaker`, `KafkaMirrorMaker2`, and `KafkaRebalance` resources.
It does not currently provide this metric for `KafkaNodePool`, `KafkaConnector`, and `StrimziPodSet` custom resources.
The User and Topic Operators currently do not provide this metric for the `KafkaUser` and `KafkaTopic` resources.

## Motivation

Strimzi already provides example configurations being compatible with the prometheus-operator via e.g. `PodMonitor` for scraping the metrics and e.g. `PrometheusRules` for alerting.
This is implemented by metrics-exporter via JMX or kafka-exporter.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, as we are talking about operators here, they handle it on their own. So I would probably delete this sentence.

Comment on lines +46 to +49
The only affected projects are:
- CO
- TO
- UO
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the Cluster Operator is the only affected project -> the others do not provide such a metric so nothing will change there.


The previously elaborated metrics should be deprecated and replaced in favour of KSM based metrics.
This also applies for PrometheusRules which should be replaced.
The proposed way would be implementing the KSM and deprecating the current metrics in Strimzi CO/TO/UO in version XX and removing them in version XZ to give users enough time (2 releases) adjusting their monitoring if needed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I would leave 4 releases? 2 are quite a short time period. Also, it should be only the Cluster Operator that gets mentioned here. I would also fill in some expected versions here. Maybe 0.45 for the deprecation and 0.49 for the removal. if the proposal gets approved and implemented earlier, we can simply update it.


## Compatibility

TODO need input about already implemented metrics, e.g. for `Kafka` resources
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what you have in the paragraph below is mostly good. In general:

  • You should make it clear that there is no immediate impact right away and users will have the 4 releases to adjust
  • That after the 4 releases, the strimzi_resource_state metric will be removed from the Cluster Operator completely


So in order to have a monitoring in place for all CRs and implementing them in a scalable and Kubernetes native way KSM can be used.
KSM is (mostly) deployed via a simple Kubernetes `Deployment` and querries the Kubernetes api for e.g. the status object of a Kubernetes resource indicating that the CR having a problem or contains a deprecated configuration.
This can also be done for CRs exclusivly by using the `--custom-resource-state-config` flag which is helpful for e.g. `KafkaTopic`, `KafkaUser`, ... CRs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you should clarify that in the initial phase, the support in the examples will be implemented at least for the existing resources for which we publish the existing metric + for KafkaUser and KafkaTopic.

In addition, example `PrometheusRules` should be defined for all CRs for having a problem or uses a deprecated configuration in a simple way, so that a cluster operator can have a look for an error message in the status object of the CR.
[This PR](https://github.com/strimzi/strimzi-kafka-operator/pull/10277) could be used as an implementation idea.

## Affected/not affected projects
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you checked if the current metric is used in any Grafana Dashboards? I think it is not, so there should be nothing to worry about, but I did not searched for it really, just basing it on my memory.

Strimzi should provide guidance for existing fields in status object in order to indicate if a CR is having a problem or uses a deprecated configuration in the official documentation.
By this users can extend monitoring of CRs on their own requirements.

In general, an example ConfigMap should be provided in the examples directory of [strimzi/strimzi-kafka-operator](https://github.com/strimzi/strimzi-kafka-operator/tree/main/examples) specifying a good starting configuration.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be more specific here? I guess it will be somewhere as https://github.com/strimzi/strimzi-kafka-operator/tree/main/examples/metrics/kube-state-metrics probably?

Maybe you could include a simple sample of what the config map will look like to give a better idea? E.g. just for a single resource.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants