K8SPSMDB-1014: update cert-manager certs and issuers #1383

Merged: 26 commits, Apr 23, 2024

Contributor

@pooknull pooknull commented Nov 27, 2023

https://jira.percona.com/browse/K8SPSMDB-1014

DESCRIPTION

Problem:
When .spec.updateStrategy is set to SmartUpdate and the cluster is updated from crVersion 1.14.0 to 1.15.0, the operator gets stuck on the smart update after the certificates are renewed.

Cause:
In version 1.15.0 we switched to the new certificate schema. For more info, check the description of PR #1287. That PR didn't implement the update of existing certificates to the new schema.

As a result, certificates are not updated and we still have the same problem we had in https://jira.percona.com/browse/K8SPSMDB-956.

Solution:
First of all, the operator should update the certificates. To do that, we should check if the cert-manager is installed. If it is, we should try to apply our changes.

After the changes, the operator will still face issues with SmartUpdate, so if any changes are made to the CA we need a migration mechanism like the one described in this guide: https://docs.percona.com/percona-operator-for-mongodb/TLS.html#update-certificates-without-downtime.

So, the migration will consist of the following actions:

  1. Check if the cert-manager exists.
  2. If true, check if any changes will be applied to the certificates.
  3. If true, then we should create copies of cluster1-ssl and cluster1-ssl-internal secrets named cluster1-ssl-old and cluster1-ssl-internal-old.
  4. Apply the changes to the certificates and wait for new secrets.
  5. Get ca.crt from both old secrets and merge them into new secrets. Set values of tls.key and tls.crt from old secrets to the new ones.
  6. Wait until the next reconcile.
  7. On the next reconcile, we will check if any changes will be applied to the certificates.
  8. If certificates remain untouched, the operator will check if ca.crt was merged from old secrets.
  9. If true, it will delete old secrets.
  10. Wait until all statefulsets are ready.
  11. Compare the ca.crt of current secrets with the ca.crt from cluster1-ca-cert.
  12. If it's different, set the percona.com/update-mongos-first annotation on the cluster and delete the secrets so that cert-manager recreates them.
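
Steps 5 and 8 both come down to combining or comparing the PEM bundles stored in ca.crt. Below is a minimal sketch of that logic in Python, not the operator's actual code (which is written in Go). It assumes each secret is a plain dict of already-decoded strings. The helper names (split_pem, merge_ca, migrate_secret, ca_was_merged) are hypothetical, and putting the old certificates first in the merged bundle is an assumption inferred from the remark that mongos only accepts the first part of the CA.

```python
from __future__ import annotations

import re

# Matches one PEM-encoded certificate block, including its BEGIN/END lines.
_PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def split_pem(bundle: str) -> list[str]:
    """Return the individual certificate blocks found in a PEM bundle."""
    return [block.strip() for block in _PEM_BLOCK.findall(bundle)]


def merge_ca(old_ca: str, new_ca: str) -> str:
    """Combine two CA bundles, old certificates first, dropping duplicates."""
    merged: list[str] = []
    for block in split_pem(old_ca) + split_pem(new_ca):
        if block not in merged:
            merged.append(block)
    return "\n".join(merged) + "\n"


def migrate_secret(old: dict[str, str], new: dict[str, str]) -> dict[str, str]:
    """Step 5: merged ca.crt, with tls.crt and tls.key kept from the old secret."""
    return {
        "ca.crt": merge_ca(old["ca.crt"], new["ca.crt"]),
        "tls.crt": old["tls.crt"],
        "tls.key": old["tls.key"],
    }


def ca_was_merged(current_ca: str, old_ca: str) -> bool:
    """Step 8: true if every old CA certificate is present in the current bundle."""
    current = set(split_pem(current_ca))
    return all(block in current for block in split_pem(old_ca))
```

With the merged bundle in place, pods that still present certificates signed by the old CA and pods that already present ones signed by the new CA can validate each other, which is what the linked "update certificates without downtime" guide relies on.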

The percona.com/update-mongos-first annotation has been added to force the next smart update to be applied to mongos before mongod.
This is necessary because mongos pods only accept the first part of the CA. After the secret recreation, all mongod pods will have the last part of the CA, and mongos won't be able to connect to them. So we should update the mongos pods before the mongod pods.
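
A minimal sketch of how such an annotation could change the update order. Only the annotation name comes from the description above. The function name, the component names, the default order, and the choice to check only for the annotation's presence (not its value) are assumptions for illustration, not the operator's implementation.

```python
from __future__ import annotations

UPDATE_MONGOS_FIRST = "percona.com/update-mongos-first"


def update_order(annotations: dict[str, str], components: list[str]) -> list[str]:
    """Return the components in the order they should be restarted.

    `components` is the default order. When the annotation is present,
    "mongos" is moved to the front and the rest keep their relative order.
    """
    if UPDATE_MONGOS_FIRST not in annotations or "mongos" not in components:
        return list(components)
    return ["mongos"] + [c for c in components if c != "mongos"]
```

For example, with a hypothetical default order of `["cfg", "rs0", "mongos"]`, setting the annotation yields `["mongos", "cfg", "rs0"]`, so mongos picks up the recreated secrets before any mongod pod does.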

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are the manifests (crd/bundle) regenerated if needed?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported MongoDB version?
  • Does the change support oldest and newest supported Kubernetes version?

@pull-request-size pull-request-size bot added the size/L 100-499 lines label Nov 27, 2023
@pull-request-size pull-request-size bot added size/XL 500-999 lines and removed size/L 100-499 lines labels Dec 6, 2023
@pull-request-size pull-request-size bot added size/XXL 1000+ lines and removed size/XL 500-999 lines labels Mar 28, 2024
@pooknull pooknull marked this pull request as ready for review March 28, 2024 16:48
tplavcic previously approved these changes Apr 3, 2024
nmarukovich previously approved these changes Apr 10, 2024
@pooknull pooknull dismissed stale reviews from nmarukovich and tplavcic via e9699d5 April 12, 2024 14:05
inelpandzic previously approved these changes Apr 16, 2024
egegunes previously approved these changes Apr 16, 2024
@egegunes egegunes added this to the v1.16.0 milestone Apr 17, 2024
@pooknull pooknull dismissed stale reviews from egegunes and inelpandzic via e225c56 April 18, 2024 12:22
@pooknull
Contributor Author

It seems that the https://docs.percona.com/percona-operator-for-mongodb/TLS.html#update-certificates-without-downtime approach doesn't work with mongos.

After the final recreation of secrets (step 12), the operator updates the cfg pods with new secrets. After all cfg pods have been updated, all mongos pods become unready with the following error in the logs:

{"t":{"$date":"2024-04-22T07:35:34.003+00:00"},"s":"W","c":"NETWORK","id":23235,"ctx":"conn2449","msg":"SSL peer certificate validation failed","attr":{"reason":"self-signed certificate"}}

This is why I removed lines in this discussion: #1383 (comment). We shouldn't remove them. But we also need to find a way to update mongos correctly.

My guess is that mongos only accepts the first part of the CA.
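
For anyone reproducing this, here is a small sketch that spots the warning quoted above in structured JSON log output. The field names (msg, attr.reason) are taken from that log line; the function name is hypothetical.

```python
from __future__ import annotations

import json
from typing import Optional

PEER_CERT_FAILED = "SSL peer certificate validation failed"


def peer_cert_failure_reason(line: str) -> Optional[str]:
    """Return the failure reason if `line` is a peer-certificate validation
    warning, an empty string if it is one without a reason, else None."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return None  # not a structured log line
    if not isinstance(entry, dict) or entry.get("msg") != PEER_CERT_FAILED:
        return None
    attr = entry.get("attr")
    if isinstance(attr, dict):
        return str(attr.get("reason", ""))
    return ""
```

Applied to the line above, it returns "self-signed certificate".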

@pooknull
Contributor Author

The issue mentioned here: #1383 (comment) has been fixed in 8264390

Description has been updated.

egegunes previously approved these changes Apr 22, 2024
@JNKPercona
Collaborator

Test name Status
arbiter passed
balancer passed
custom-replset-name passed
cross-site-sharded passed
data-at-rest-encryption passed
data-sharded passed
demand-backup passed
demand-backup-eks-credentials passed
demand-backup-physical passed
demand-backup-physical-sharded passed
demand-backup-sharded passed
expose-sharded passed
ignore-labels-annotations passed
init-deploy passed
finalizer passed
ldap passed
ldap-tls passed
limits passed
liveness passed
mongod-major-upgrade passed
mongod-major-upgrade-sharded passed
monitoring-2-0 passed
multi-cluster-service passed
non-voting passed
one-pod passed
operator-self-healing-chaos passed
pitr passed
pitr-sharded passed
pitr-physical passed
pvc-resize passed
recover-no-primary passed
rs-shard-migration passed
scaling passed
scheduled-backup passed
security-context passed
self-healing-chaos passed
service-per-pod passed
serviceless-external-nodes passed
smart-update passed
split-horizon passed
storage passed
tls-issue-cert-manager passed
upgrade passed
upgrade-consistency passed
upgrade-consistency-sharded-tls passed
upgrade-sharded passed
users passed
version-service passed
We run 48 out of 48

commit: f7f2d8d
image: perconalab/percona-server-mongodb-operator:PR-1383-f7f2d8d2

@hors hors merged commit 282394a into main Apr 23, 2024
14 checks passed
@hors hors deleted the dev/K8SPSMDB-1014 branch April 23, 2024 14:08