Infinite auto-restart of Apache Kafka connectors

This proposal introduces changes to the connector auto-restart functionality which was introduced in Proposal #7 and implemented in Strimzi 0.33.0.

Current situation

Strimzi 0.33.0 introduced support for automatically restarting connectors in Apache Kafka Connect clusters (when they are managed using KafkaConnector resources). This feature is disabled by default, but can be enabled in the .spec section of the KafkaConnector resource:

apiVersion: strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: my-source-connector
spec:
  # ...
  autoRestart:
    enabled: true
  # ...

When enabled, it works like this:

  • The connector is restarted up to 7 times
  • The times of the auto-restarts are:
    • Immediate
    • 2+ minutes after previous restart
    • 6+ minutes after previous restart
    • 12+ minutes …
    • 20+ minutes …
    • 30+ minutes …
    • 42+ minutes …
    • These times are not exact: the connector is restarted in the first reconciliation after the time condition is fulfilled (e.g. a restart due after 6+ minutes might actually happen after 7 minutes, in the next suitable periodic reconciliation).
  • After the 7th restart, the connector is not automatically restarted anymore
  • The restart counter (number of restarts already done) is tracked in the .status section of the KafkaConnector custom resource. When the connector runs successfully for at least the backoff interval corresponding to the number of restarts already done, the restart counter is reset to 0. For example, if it was already restarted 4 times, it needs to run successfully for at least 20 minutes for the restart counter to be reset to 0, so that the restart sequence starts from the beginning in case of the next failure.

Motivation

The auto-restart functionality works well. But in some cases, it would be useful to have more flexibility, including the possibility to retry the restarts indefinitely. Infinite restarts allow the auto-restart feature to cover extended outages, such as those spanning an entire weekend, which makes the feature more valuable.

Proposal

A new field maxRestarts will be added to the autoRestart section of the KafkaConnector CRD:

apiVersion: strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: my-source-connector
spec:
  # ...
  autoRestart:
    enabled: true
    maxRestarts: 20
  # ...

The maxRestarts field will default to null (not set). When it is not set, the operator will keep trying to restart the connector indefinitely. When a user sets maxRestarts to a specific value, the operator will attempt to restart the connector at most that many times and then give up. If the connector is still failing at that point, the user needs to restart it manually, as with the current functionality.
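
The intended semantics of maxRestarts can be sketched in Python as follows. This is only an illustration under stated assumptions, not the operator's actual implementation: the function name should_auto_restart and its parameters are hypothetical, and the back-off timing is deliberately left out.

```python
from typing import Optional


def should_auto_restart(restart_count: int, max_restarts: Optional[int]) -> bool:
    """Decide whether another automatic restart may be attempted.

    restart_count: number of restarts already done (as tracked in .status).
    max_restarts:  value of .spec.autoRestart.maxRestarts, or None when not set.
    (Both parameter names are hypothetical, chosen for this sketch.)
    """
    if max_restarts is None:
        # maxRestarts is not set: keep restarting indefinitely
        return True
    # maxRestarts is set: give up once that many restarts have been done
    return restart_count < max_restarts
```

With maxRestarts: 20 as in the example above, this sketch allows a restart while fewer than 20 restarts have been done and refuses the 21st.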

The solution described in this proposal is not fully backwards compatible. Please read the Compatibility section for more details.

The current timing will remain unchanged. The back-off is calculated using the following formula, where restartCount is the number of restarts that have already happened. The back-off time is capped at 60 minutes.

backoff_in_minutes = minimum((restartCount * restartCount) + restartCount, 60)

This means the connector will be restarted at the following times:

  • Immediate
  • 2+ minutes after previous restart
  • 6+ minutes after previous restart
  • 12+ minutes …
  • 20+ minutes …
  • 30+ minutes …
  • 42+ minutes …
  • 56+ minutes …
  • 60+ minutes …
  • 60+ minutes …

These times are not exact: the connector is restarted in the first reconciliation after the time condition is fulfilled (e.g. a restart due after 6+ minutes might actually happen after 7 minutes, in the next suitable periodic reconciliation).
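
As a quick check, the back-off formula can be evaluated for the first ten values of restartCount. The snippet below is a minimal Python sketch; the names backoff_minutes and MAX_BACKOFF_MINUTES are illustrative and not taken from the Strimzi code base.

```python
MAX_BACKOFF_MINUTES = 60  # cap on the back-off time from this proposal


def backoff_minutes(restart_count: int) -> int:
    """Minimum wait in minutes before the next restart.

    restart_count is the number of restarts that have already happened.
    """
    return min(restart_count * restart_count + restart_count, MAX_BACKOFF_MINUTES)


# Back-off before the 1st, 2nd, ..., 10th restart
schedule = [backoff_minutes(n) for n in range(10)]
print(schedule)  # [0, 2, 6, 12, 20, 30, 42, 56, 60, 60]
```

The first seven values match the current behavior. The cap only takes effect from restartCount 8 onwards, where the uncapped formula would give 72 minutes.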

When the connector runs successfully for at least the backoff interval corresponding to the number of restarts already done, the restart counter is reset to 0. E.g. if it was restarted 4 times, it needs to run for at least 20 minutes for the restart counter to be reset to 0, so that the restart sequence starts from the beginning in case of the next failure. This is unchanged compared to the current state.
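
The reset rule can be sketched in the same way. This is a hedged illustration only: should_reset_counter and minutes_running are hypothetical names, and the real operator evaluates the condition during reconciliation rather than continuously.

```python
def backoff_minutes(restart_count: int) -> int:
    """Back-off formula from this proposal, capped at 60 minutes."""
    return min(restart_count * restart_count + restart_count, 60)


def should_reset_counter(restart_count: int, minutes_running: float) -> bool:
    """Return True when the restart counter may be reset to 0.

    restart_count:   number of restarts already done.
    minutes_running: how long the connector has been running successfully
                     since the last restart (hypothetical input).
    """
    if restart_count == 0:
        # Nothing to reset
        return False
    return minutes_running >= backoff_minutes(restart_count)
```

For a connector that was restarted 4 times, the sketch resets the counter after 20 minutes of successful running, but not after 19.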

Affected/not affected projects

This affects only the Strimzi Kafka Operators repository and its Cluster Operator.

Compatibility

API Compatibility

This proposal maintains full API (= the Kubernetes CRDs) compatibility.

Changes to the semantics

While the API is fully backwards compatible, the semantics of how the API is used and how the auto-restart works are different. In previous versions, a connector with auto-restart enabled would be restarted up to 7 times after it fails:

apiVersion: strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: my-source-connector
spec:
  # ...
  autoRestart:
    enabled: true
  # ...

But after this proposal is implemented, the same connector would be restarted indefinitely. Existing users who want to maintain the original behavior would need to change their KafkaConnector custom resources and add the maxRestarts: 7 option:

apiVersion: strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: my-source-connector
spec:
  # ...
  autoRestart:
    enabled: true
    maxRestarts: 7
  # ...

Once the back-off reaches its maximum, the infinite restarts happen at most once every 60 minutes. So they should cause only minimal disruption to users who have this activated by mistake.
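
To give a rough sense of scale, the following Python sketch computes an upper bound on the number of restarts within a given time window. It assumes the worst case, in which the connector fails again immediately after every restart and each restart happens exactly when its back-off expires. The 48-hour window and the helper name max_restarts_within are assumptions made for this illustration.

```python
def backoff_minutes(restart_count: int) -> int:
    """Back-off formula from this proposal, capped at 60 minutes."""
    return min(restart_count * restart_count + restart_count, 60)


def max_restarts_within(window_minutes: int) -> int:
    """Upper bound on restarts in the window for a permanently failing connector."""
    elapsed = 0   # minutes since the first failure
    restarts = 0  # restarts done so far
    while True:
        # Wait for the back-off that applies before the next restart
        elapsed += backoff_minutes(restarts)
        if elapsed > window_minutes:
            return restarts
        restarts += 1


weekend = 48 * 60  # a 48-hour outage, in minutes (assumed for illustration)
print(max_restarts_within(weekend))  # 53
```

Under these assumptions, the first 7 restarts happen within the first 112 minutes, and a permanently failing connector is restarted at most 53 times over the 48 hours. In practice the number is lower, because restarts only happen during reconciliations.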

While this change is not backwards compatible, it gives us a clean API for the long-term future, where the maxRestarts field has no default and the following configuration means infinite restarts:

  autoRestart:
    enabled: true

This API is easier to read and understand, and provides a better user experience.

The alternative would be to have the maxRestarts field default to 7. That might be confusing for new users who start using Strimzi later and might not expect the restart limit to be 7 restarts when nothing is configured.

Rejected alternatives

Defaulting to 7 restarts

In this alternative, the maxRestarts field would be added as described in this proposal, but its default value would be 7. Existing users would not be affected, and the change would provide both API and semantic backwards compatibility. However:

  • The default value of 7 would stay in Strimzi forever and might not provide the best user experience in the long term.
  • Infinite restarts would need to be represented by setting maxRestarts to some special value (for example 0) or to some large number such as 1000000, which again is not the most user-friendly approach in the long term.