etcd: add flag for checking member data consistency #2676

Open
3 of 6 tasks
neolit123 opened this issue Mar 28, 2022 · 11 comments
Labels: area/etcd, kind/bug, kind/feature, priority/important-longterm

@neolit123
Member

neolit123 commented Mar 28, 2022

xref etcd-io/etcd#13766

announcement of the bug's impact at k-dev:
https://groups.google.com/a/kubernetes.io/d/msgid/dev/CAJs3Yt0WKgyUFL%3D1V13ojZPvV9_Qsa8WXLeohUxhRjEK07P25g%40mail.gmail.com?utm_medium=email&utm_source=footer

With recent reproduction of data inconsistency issues in #13766, etcd maintainers are no longer recommending v3.5 releases for production. In our testing we have found that if the etcd process is killed under high load, occasionally some committed transactions are not reflected on all the members. The problem affects versions v3.5.0, v3.5.1, v3.5.2.

(note: the bug in question is relatively rare)

add the etcd flag --experimental-initial-corrupt-check to all kubeadm versions that use etcd 3.5.x by default:
1.24 (master), 1.23, 1.22.

--experimental-initial-corrupt-check 'false'
  Enable to check data corruption before serving any client/peer traffic.

1.24 timeframe

add the --experimental-initial-corrupt-check flag to versions of kubeadm that use etcd 3.5.[0-2]
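
A minimal sketch of the version gating described above, written in Python for illustration only. It is not kubeadm's actual implementation. The flag name and the affected range (3.5.0 to 3.5.2) come from this issue; the helper names `parse_version`, `etcd_extra_args`, and `render_flags` are hypothetical.

```python
import re

# Flag name and affected versions are taken from the issue text.
CORRUPT_CHECK_FLAG = "experimental-initial-corrupt-check"
AFFECTED_VERSIONS = {(3, 5, 0), (3, 5, 1), (3, 5, 2)}


def parse_version(version: str) -> tuple:
    """Parse strings such as 'v3.5.1', '3.5.1' or '3.5.1-0' into (3, 5, 1)."""
    match = re.match(r"^v?(\d+)\.(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized etcd version: {version!r}")
    return tuple(int(part) for part in match.groups())


def etcd_extra_args(version: str) -> dict:
    """Hypothetical helper: extra etcd arguments for a given etcd version."""
    if parse_version(version) in AFFECTED_VERSIONS:
        # Ask etcd to check for data corruption before serving traffic.
        return {CORRUPT_CHECK_FLAG: "true"}
    return {}


def render_flags(args: dict) -> list:
    """Render an argument map as command-line flags, sorted for stable output."""
    return [f"--{name}={value}" for name, value in sorted(args.items())]


if __name__ == "__main__":
    for candidate in ("v3.5.1", "3.5.3-0"):
        print(candidate, render_flags(etcd_extra_args(candidate)))
```

Gating on an explicit set of versions keeps the check easy to audit. Enabling the flag for every 3.5.x release is the simpler alternative that the first paragraph of this plan describes.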

1.25 timeframe

backport a 3.5.3+ bump to kubeadm versions in support.
aligns with #2567 too.

** future work **

etcd 3.6 (or 4.0?) may graduate the flag / feature to --initial-corrupt-check, thus once we upgrade kubeadm to that etcd version we should switch to the new flag as well.

@neolit123 neolit123 added the priority/important-longterm, kind/feature, and area/etcd labels Mar 28, 2022
@neolit123 neolit123 added this to the v1.24 milestone Mar 28, 2022
@neolit123 neolit123 self-assigned this Mar 28, 2022
@neolit123 neolit123 added the kind/bug label Mar 28, 2022
@neolit123 neolit123 modified the milestones: v1.24, v1.25 Mar 29, 2022
smira added a commit to smira/talos that referenced this issue Mar 29, 2022
See:

- etcd-io/etcd#13766
- kubernetes/kubeadm#2676

Signed-off-by: Andrey Smirnov <andrey.smirnov@talos-systems.com>
@neolit123
Member Author

neolit123 commented Mar 30, 2022

It looks like the flag check doesn't always work, but it is the better option until 3.5.3:
etcd-io/etcd#13766 (comment)

@pacoxu
Member

pacoxu commented Apr 14, 2022

kubernetes/kubernetes#109470 was merged for etcd v3.5.3.

@neolit123
Member Author

neolit123 commented Apr 14, 2022

we can backport kubernetes/kubernetes#109471 to 1.21, 1.22, 1.23.
i think we can backport the whole PR; if not, we can backport only the kubeadm parts.

kubernetes/kubernetes#109532
kubernetes/kubernetes#109533

@neolit123 neolit123 modified the milestones: v1.25, v1.26 Aug 25, 2022
@neolit123 neolit123 modified the milestones: v1.26, v1.27 Nov 21, 2022
@pacoxu
Member

pacoxu commented Dec 23, 2022

The current work is done. There may be some more work after we upgrade to etcd 4.0 (possibly a long time from now).

See more roadmap at etcd-io/etcd#9190.

@neolit123
Member Author

etcd 4.0 Plan

ack, thanks for the info.

@neolit123 neolit123 modified the milestones: v1.27, v1.28 Apr 17, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Jul 16, 2023
@neolit123 neolit123 removed the lifecycle/stale label Jul 16, 2023
@SataQiu
Member

SataQiu commented Jul 16, 2023

/remove-lifecycle stale

@neolit123 neolit123 modified the milestones: v1.28, v1.29 Jul 21, 2023
@neolit123 neolit123 modified the milestones: v1.29, v1.30 Nov 1, 2023
@neolit123 neolit123 removed their assignment Nov 8, 2023
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Feb 6, 2024
@neolit123
Member Author

neolit123 commented Feb 6, 2024

etcd 3.6 may graduate the flag / feature to --initial-corrupt-check, thus once we upgrade kubeadm to that etcd version we should switch to the new flag as well.

FWIW, 3.5.11 (currently the default in kubeadm 1.30) does not have the non-experimental flag yet:

flag provided but not defined: -initial-corrupt-check
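
A small sketch of how tooling could react to the error above and fall back to the experimental flag, written in Python for illustration only. The error text is the line quoted in this comment. The name `initial-corrupt-check` is only the graduation proposed earlier in this issue, so it is treated as an assumption here, and the helpers `undefined_flag` and `pick_corrupt_check_flag` are hypothetical.

```python
import re
from typing import Optional

# Matches the error text quoted above, with one or two leading dashes.
_UNDEFINED_FLAG = re.compile(r"flag provided but not defined: -+([\w-]+)")


def undefined_flag(stderr: str) -> Optional[str]:
    """Return the flag name the binary rejected as undefined, or None."""
    match = _UNDEFINED_FLAG.search(stderr)
    return match.group(1) if match else None


def pick_corrupt_check_flag(probe_stderr: str) -> str:
    """Hypothetical helper: choose a flag name from a probe run's stderr."""
    preferred = "initial-corrupt-check"  # assumption: proposed graduated name
    fallback = "experimental-initial-corrupt-check"  # name used in etcd 3.5.x
    if undefined_flag(probe_stderr) == preferred:
        return fallback
    return preferred


if __name__ == "__main__":
    error = "flag provided but not defined: -initial-corrupt-check"
    print(pick_corrupt_check_flag(error))
```

Probing and parsing stderr is brittle; a version-based choice is the more predictable option once the target etcd release is known.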

@neolit123 neolit123 removed the lifecycle/stale label Feb 6, 2024
@neolit123 neolit123 modified the milestones: v1.30, v1.31 Apr 5, 2024
@neolit123 neolit123 modified the milestones: v1.31, v1.32 Aug 7, 2024
@pacoxu
Member

pacoxu commented Oct 22, 2024

etcd-io/etcd#18478
KEP-4578: migrate experimental-initial-corrupt-check flag to feature gate.
