This proposal introduces changes to the connector auto-restart functionality which was introduced in Proposal #7 and implemented in Strimzi 0.33.0.
Strimzi 0.33.0 introduced support for automatically restarting connectors in Apache Kafka Connect clusters (when managed using the KafkaConnector resources). This feature is disabled by default, but can be enabled in the .spec section of the KafkaConnector resource:
    apiVersion: strimzi.io/v1beta2
    kind: KafkaConnector
    metadata:
      name: my-source-connector
    spec:
      # ...
      autoRestart:
        enabled: true
      # ...
When enabled, it works like this:
- The connector is restarted up to 7 times
- The times of the auto-restarts are:
  - Immediate
  - 2+ minutes after previous restart
  - 6+ minutes after previous restart
  - 12+ minutes …
  - 20+ minutes …
  - 30+ minutes …
  - 42+ minutes …
- The times are not exact: the connector is restarted in the first reconciliation after the time condition is fulfilled (e.g. a restart due after 6+ minutes might happen after 7 minutes, in the next suitable periodic reconciliation).
- After the 7th restart, the connector is not automatically restarted anymore
- The restart counter (number of restarts already done) is tracked in the .status section of the KafkaConnector custom resource. When the connector runs successfully for at least the backoff interval corresponding to the number of restarts already done, the restart counter is reset to 0. For example, if it has already been restarted 4 times, it needs to run successfully for at least 20 minutes for the restart counter to be reset to 0, so that the restart sequence starts from the beginning on the next failure.
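The timing rule in the list above can be pictured with a small sketch. This is only an illustration under assumptions: the function name, the constants, and the idea of passing in the minutes since the previous restart are hypothetical and are not taken from the operator's code.

```python
# Hypothetical sketch of the current restart timing check.
# Names are illustrative only; this is not the operator's actual code.

MAX_RESTARTS = 7
# Minimum minutes since the previous restart, indexed by restarts already done.
BACKOFF_MINUTES = [0, 2, 6, 12, 20, 30, 42]


def restart_due(restart_count: int, minutes_since_last_restart: float) -> bool:
    """Check run on each periodic reconciliation of a failed connector."""
    if restart_count >= MAX_RESTARTS:
        return False  # limit reached: a manual restart is needed
    return minutes_since_last_restart >= BACKOFF_MINUTES[restart_count]
```

With two restarts already done, `restart_due(2, 5)` is false while `restart_due(2, 7)` is true, which matches the "6+ minutes" wording: the restart happens in the first reconciliation after the threshold has passed.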
The auto-restart functionality works well. But in some cases, it would be useful to have more flexibility, including the possibility to retry the restarts indefinitely. Infinite restarts would allow the auto-restart feature to cover extended outages, such as those spanning an entire weekend.
A new field maxRestarts will be added to the autoRestart section of the KafkaConnector CRD:
    apiVersion: strimzi.io/v1beta2
    kind: KafkaConnector
    metadata:
      name: my-source-connector
    spec:
      # ...
      autoRestart:
        enabled: true
        maxRestarts: 20
      # ...
The maxRestarts field will default to null (not set).
When it is not set, the operator will keep restarting the connector indefinitely.
When a user sets maxRestarts to a specific value, the operator will restart the connector at most that many times and then give up on it.
Once the limit is reached, the user needs to restart the connector manually, as happens with the current functionality.
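The intended semantics of the field can be summarized in a few lines. This is a minimal sketch, assuming a hypothetical helper `may_auto_restart`; it only illustrates how an unset value differs from a numeric limit and does not reflect how the operator is implemented.

```python
from typing import Optional


def may_auto_restart(restart_count: int, max_restarts: Optional[int]) -> bool:
    """Return True if another automatic restart may be attempted.

    restart_count: restarts already done (the counter kept in .status).
    max_restarts:  value of the maxRestarts field, or None when not set.
    """
    if max_restarts is None:
        return True  # not set: keep restarting indefinitely
    return restart_count < max_restarts
```

For example, with `maxRestarts: 20` the call `may_auto_restart(19, 20)` still allows a restart, `may_auto_restart(20, 20)` does not, and `may_auto_restart(1000, None)` always does.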
The solution described in this proposal is not fully backwards compatible. Please read the Compatibility section for more details.
The current timing will be unchanged.
It will be calculated using the following formula, where restartCount is the number of restarts that have already happened.
The back-off time will be capped at 60 minutes.
backoff_in_minutes = minimum((restartCount * restartCount) + restartCount, 60)
This means the connector will be restarted at the following times:
- Immediate
- 2+ minutes after previous restart
- 6+ minutes after previous restart
- 12+ minutes …
- 20+ minutes …
- 30+ minutes …
- 42+ minutes …
- 56+ minutes …
- 60+ minutes …
- 60+ minutes …
- …
The times are not exact: the connector is restarted in the first reconciliation after the time condition is fulfilled (e.g. a restart due after 6+ minutes might happen after 7 minutes, in the next suitable periodic reconciliation).
When the connector runs successfully for at least the backoff interval corresponding to the number of restarts already done, the restart counter is reset to 0. E.g. if it was restarted 4 times, it needs to run for 20 minutes for the restart counter to be reset to 0 and the restart sequence to start from the beginning in case of the next failure. This is unchanged compared to the current state.
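The capped formula and the reset rule can be checked with a short sketch. The function and constant names here are hypothetical and chosen only for illustration; the arithmetic follows the formula given above.

```python
MAX_BACKOFF_MINUTES = 60  # cap from the formula above


def backoff_minutes(restart_count: int) -> int:
    """Back-off in minutes after `restart_count` restarts have happened."""
    return min(restart_count * restart_count + restart_count, MAX_BACKOFF_MINUTES)


def should_reset_counter(restart_count: int, minutes_running_ok: float) -> bool:
    """True once the connector has run successfully for the current back-off."""
    return minutes_running_ok >= backoff_minutes(restart_count)


print([backoff_minutes(n) for n in range(10)])
# [0, 2, 6, 12, 20, 30, 42, 56, 60, 60]
```

The printed values match the schedule listed above, and `should_reset_counter(4, 20)` returns true, in line with the 20-minute example.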
This affects only the Strimzi Kafka Operators repository and its Cluster Operator.
This proposal maintains full API (= the Kubernetes CRDs) compatibility.
While the API is fully backwards compatible, the semantics of how the API is used and how the auto-restart works are different. In previous versions, a connector with auto-restart enabled would be restarted up to 7 times after it fails:
    apiVersion: strimzi.io/v1beta2
    kind: KafkaConnector
    metadata:
      name: my-source-connector
    spec:
      # ...
      autoRestart:
        enabled: true
      # ...
But after this proposal is implemented, the same connector would be restarted indefinitely.
Existing users who want to maintain the original behavior would need to change their KafkaConnector custom resources and add the maxRestarts: 7 option:
    apiVersion: strimzi.io/v1beta2
    kind: KafkaConnector
    metadata:
      name: my-source-connector
    spec:
      # ...
      autoRestart:
        enabled: true
        maxRestarts: 7
      # ...
Once the back-off reaches its maximum, the infinite restarts will happen at intervals of 60 minutes. So they should cause only minimal disruption to users who end up with infinite restarts unintentionally.
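To get a rough sense of scale, the sketch below counts how many restarts the formula from this proposal would allow during an outage of a given length. It is an idealized estimate under stated assumptions: the connector fails again right after every restart, the counter is never reset, and the delay until the next reconciliation is ignored. The function names are hypothetical.

```python
def backoff_minutes(restart_count: int) -> int:
    """Capped back-off formula from this proposal."""
    return min(restart_count * restart_count + restart_count, 60)


def restarts_within(outage_minutes: int) -> int:
    """Upper bound on restarts attempted during an outage of the given length."""
    elapsed = 0
    count = 0
    while elapsed + backoff_minutes(count) <= outage_minutes:
        elapsed += backoff_minutes(count)
        count += 1
    return count


print(restarts_within(48 * 60))  # 53
```

Under these assumptions, a 48-hour outage results in at most 53 restart attempts: 8 in roughly the first 3 hours, then one per hour.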
While this change is not backwards compatible, it gives us a clean API for the long-term future, where the maxRestarts field would have no default and the following configuration would mean infinite restarts.
    autoRestart:
      enabled: true
This API is easier to read and understand, and provides a better user experience.
The alternative would be to have the maxRestarts field default to 7.
That might be confusing for new users who start using Strimzi later and might not expect the restart limit to be 7 restarts when nothing is configured.
In this alternative, the maxRestarts field would be added as described in this proposal, but its default value would be 7.
Thanks to that, existing users would not be affected, and the change would provide both API and semantic backwards compatibility.
However:
- The default value of 7 will stay in Strimzi forever and might not provide the best user experience in the long term.
- The infinite restarts would need to be represented by setting maxRestarts to some special value (for example 0) or by setting it to some large number such as 1000000, which again is not the most user-friendly way in the long term.