-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs(proposal): monitoring of custom resources #133
base: main
Are you sure you want to change the base?
docs(proposal): monitoring of custom resources #133
Conversation
Signed-off-by: Sebastian Gaiser <sebastiangaiser@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the proposal. I tried to answer the TODOs and left some more minor comments. But I think it looks good overall. Thanks.
## Current situation | ||
|
||
Strimzi provides metrics for several use cases but currently not for the state of `KafkaTopic` and `KafkaUser` CRs. | ||
TODO need input for the status of other CRs and their metrics, e.g. `Kafka`, `StrimziPodSet`, `KafkaRebalance`... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metric name is strimzi_resource_state
. It is currently issued only by the Cluster Operator and only for the user level resources (e.g. Kafka, KafkaConnect ... but not for StrimziPodSet resources). If you wanna include some examples here:
$ curl -s localhost:8080/metrics | grep strimzi_resource_state
# HELP strimzi_resource_state Current state of the resource: 1 ready, 0 fail
# TYPE strimzi_resource_state gauge
strimzi_resource_state{kind="Kafka",name="my-cluster",reason="none",resource_namespace="myproject",} 1.0
strimzi_resource_state{kind="KafkaConnect",name="my-connect",reason="none",resource_namespace="myproject",} 1.0
|
||
## Current situation | ||
|
||
Strimzi provides metrics for several use cases but currently not for the state of `KafkaTopic` and `KafkaUser` CRs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can expand this a bit?
Strimzi provides metrics for several use cases but currently not for the state of `KafkaTopic` and `KafkaUser` CRs. | |
All Strimzi operators (Cluster, User and Topic operators) provide metrics in Prometheus format. | |
However, currently, only the Cluster Operator provides a metric (`strimzi_resource_state`) describing the state of the custom resources (whether they are ready or not). | |
It provides this metric for the `Kafka`, `KafkaConnect`, `KafkaBridge`, KafkaMirrorMaker`, `KafkaMirrorMaker2`, and `KafkaRebalance` resources. | |
It does not currently provide this metric for `KafkaNodePool`, `KafkaConnector`, and `StrimziPodSet` custom resources. | |
The User and Topic Operators currently do not provide this metric for the `KafkaUser` and `KafkaTopic` resources. |
## Motivation | ||
|
||
Strimzi already provides example configurations being compatible with the prometheus-operator via e.g. `PodMonitor` for scraping the metrics and e.g. `PrometheusRules` for alerting. | ||
This is implemented by metrics-exporter via JMX or kafka-exporter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, as we are talking about operators here, they handle it on their own. So I would probably delete this sentence.
The only affected projects are: | ||
- CO | ||
- TO | ||
- UO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the Cluster Operator is the only affected project -> the others do not provide such a metric so nothing will change there.
|
||
The previously elaborated metrics should be deprecated and replaced in favour of KSM based metrics. | ||
This also applies for PrometheusRules which should be replaced. | ||
The proposed way would be implementing the KSM and deprecating the current metrics in Strimzi CO/TO/UO in version XX and removing them in version XZ to give users enough time (2 releases) adjusting their monitoring if needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I would leave 4 releases? 2 are quite a short time period. Also, it should be only the Cluster Operator that gets mentioned here. I would also fill in some expected versions here. Maybe 0.45 for the deprecation and 0.49 for the removal. if the proposal gets approved and implemented earlier, we can simply update it.
|
||
## Compatibility | ||
|
||
TODO need input about already implemented metrics, e.g. for `Kafka` resources |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what you have in the paragraph below is mostly good. In general:
- You should make it clear that there is no immediate impact right away and users will have the 4 releases to adjust
- That after the 4 releases, the
strimzi_resource_state
metric will be removed from the Cluster Operator completely
|
||
So in order to have a monitoring in place for all CRs and implementing them in a scalable and Kubernetes native way KSM can be used. | ||
KSM is (mostly) deployed via a simple Kubernetes `Deployment` and querries the Kubernetes api for e.g. the status object of a Kubernetes resource indicating that the CR having a problem or contains a deprecated configuration. | ||
This can also be done for CRs exclusivly by using the `--custom-resource-state-config` flag which is helpful for e.g. `KafkaTopic`, `KafkaUser`, ... CRs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you should clarify that in the initial phase, the support in the examples will be implemented at least for the existing resources for which we publish the existing metric + for KafkaUser
and KafkaTopic
.
In addition, example `PrometheusRules` should be defined for all CRs for having a problem or uses a deprecated configuration in a simple way, so that a cluster operator can have a look for an error message in the status object of the CR. | ||
[This PR](https://github.com/strimzi/strimzi-kafka-operator/pull/10277) could be used as an implementation idea. | ||
|
||
## Affected/not affected projects |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you checked if the current metric is used in any Grafana Dashboards? I think it is not, so there should be nothing to worry about, but I did not searched for it really, just basing it on my memory.
Strimzi should provide guidance for existing fields in status object in order to indicate if a CR is having a problem or uses a deprecated configuration in the official documentation. | ||
By this users can extend monitoring of CRs on their own requirements. | ||
|
||
In general, an example ConfigMap should be provided in the examples directory of [strimzi/strimzi-kafka-operator](https://github.com/strimzi/strimzi-kafka-operator/tree/main/examples) specifying a good starting configuration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be more specific here? I guess it will be somewhere as https://github.com/strimzi/strimzi-kafka-operator/tree/main/examples/metrics/kube-state-metrics
probably?
Maybe you could include a simple sample of what the config map will look like to give a better idea? E.g. just for a single resource.
This proposal is related to strimzi/strimzi-kafka-operator#10276.
As this is my first proposal, so please let me know if any changes (also to the format) are needed.
@im-konge @scholzj can you please help me filling the TODO parts.
Thank you for your help and your work on this project.