diff --git a/docs/modules/ROOT/pages/how-tos/network/migrate-to-cilium.adoc b/docs/modules/ROOT/pages/how-tos/network/migrate-to-cilium.adoc
new file mode 100644
index 00000000..45232078
--- /dev/null
+++ b/docs/modules/ROOT/pages/how-tos/network/migrate-to-cilium.adoc
@@ -0,0 +1,264 @@
+= Migrate to Cilium CNI
+
+== Prerequisites
+
+* `cluster-admin` privileges
+* `kubectl`
+* `jq`
+* `curl`
+* Working `commodore` command
+
+== Prepare for migration
+
+IMPORTANT: Make sure that your `$KUBECONFIG` points to the cluster you want to migrate before starting.
+
+:duration: +120 minutes
+include::partial$create-alertmanager-silence-all-projectsyn.adoc[]
+
+. Select cluster
++
+[source,bash]
+----
+export CLUSTER_ID=c-cluster-id-1234 <1>
+export COMMODORE_API_URL=https://api.syn.vshn.net <2>
+export TENANT_ID=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" \
+  "${COMMODORE_API_URL}/clusters/${CLUSTER_ID}" | jq -r '.tenant')
+----
+<1> Replace with the Project Syn cluster ID of the cluster to migrate
+<2> Replace with the Lieutenant API on which the cluster is registered
+
+. Disable ArgoCD auto sync for the root app and for components `openshift4-nodes` and `openshift-upgrade-controller`
++
+[source,bash]
+----
+kubectl --as=cluster-admin -n syn patch apps root --type=json \
+  -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
+kubectl --as=cluster-admin -n syn patch apps openshift4-nodes --type=json \
+  -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
+kubectl --as=cluster-admin -n syn patch apps openshift-upgrade-controller --type=json \
+  -p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
+----
+
+. Disable the cluster-network-operator.
+This ensures that the cluster-network-operator can't interfere with the migration to Cilium.
+We also scale down the upgrade controller, so that we can patch the `ClusterVersion` object.
++
+[source,bash]
+----
+kubectl --as=cluster-admin -n appuio-openshift-upgrade-controller \
+  scale deployment openshift-upgrade-controller-controller-manager --replicas=0
+----
++
+[source,bash]
+----
+kubectl --as=cluster-admin patch clusterversion version \
+  --type=merge \
+  -p '
+  {"spec":{"overrides":[
+    {
+      "kind": "Deployment",
+      "group": "apps",
+      "name": "network-operator",
+      "namespace": "openshift-network-operator",
+      "unmanaged": true
+    }
+  ]}}'
+----
++
+[source,bash]
+----
+kubectl --as=cluster-admin -n openshift-network-operator \
+  scale deploy network-operator --replicas=0
+----
+
+. Verify that the network operator has been scaled down.
++
+[source,bash]
+----
+kubectl -n openshift-network-operator get pods <1>
+----
+<1> This should return `No resources found in openshift-network-operator namespace`.
++
+[TIP]
+====
+If the operator is still running, check the following conditions:
+
+* The APPUiO OpenShift upgrade controller must be scaled down.
+* The `ClusterVersion` object must have an override to make the network operator deployment unmanaged.
+====
+
+. Remove the network operator's applied state
++
+[source,bash]
+----
+kubectl --as=cluster-admin -n openshift-network-operator \
+  delete configmap applied-cluster
+----
+
+. Pause all machine config pools
++
+[source,bash]
+----
+for mcp in $(kubectl get mcp -o name); do
+  kubectl --as=cluster-admin patch "$mcp" --type=merge -p '{"spec": {"paused": true}}'
+done
+----
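+
+. Verify that all machine config pools are paused.
+As a quick sanity check, list each pool together with the `paused` flag set by the previous step.
++
+[source,bash]
+----
+kubectl get mcp \
+  -o custom-columns=NAME:.metadata.name,PAUSED:.spec.paused <1>
+----
+<1> Every pool must show `true` in the `PAUSED` column before you continue.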
+
+== Migrate to Cilium
+
+. Get local cluster working directory
++
+[source,bash]
+----
+commodore catalog compile "$CLUSTER_ID" <1>
+----
+<1> We recommend switching to an empty directory to run this command.
+Alternatively, switch to your existing directory for the cluster.
+
+. Enable component `cilium`
++
+[source,bash]
+----
+pushd inventory/classes/"${TENANT_ID}"
+yq -i '.applications += "cilium"' "${CLUSTER_ID}.yml"
+----
+
+. Update `upstreamRules` for monitoring
++
+[source,bash]
+----
+yq -i ".parameters.openshift4_monitoring.upstreamRules.networkPlugin = \"cilium\"" \
+  "${CLUSTER_ID}.yml"
+----
+
+. Update component `networkpolicy` config
++
+[source,bash]
+----
+yq eval -i '.parameters.networkpolicy.networkPlugin = "cilium"' \
+  "${CLUSTER_ID}.yml"
+yq eval -i '.parameters.networkpolicy.ignoredNamespaces = ["openshift-oauth-apiserver"]' \
+  "${CLUSTER_ID}.yml"
+----
+
+. Configure component `cilium`.
+We explicitly configure the K8s API endpoint to ensure that the Cilium operator doesn't access the API through the cluster network.
++
+TIP: When running Cilium with `kubeProxyReplacement=partial`, the API endpoint configuration can be removed after the migration is completed.
++
+.Explicitly configure the K8s API endpoint
+[source,bash]
+----
+yq -i '.parameters.cilium.cilium_helm_values.k8sServiceHost="api-int.${openshift:baseDomain}"' \
+  "${CLUSTER_ID}.yml" <1>
+yq -i '.parameters.cilium.cilium_helm_values.k8sServicePort="6443"' \
+  "${CLUSTER_ID}.yml"
+----
+<1> On vSphere clusters, you may need to use `api.${openshift:baseDomain}`.
++
+.Configure the cluster Pod and Service CIDRs
+[source,bash]
+----
+POD_CIDR=$(kubectl get network.config cluster \
+  -o jsonpath='{.spec.clusterNetwork[0].cidr}')
+HOST_PREFIX=$(kubectl get network.config cluster \
+  -o jsonpath='{.spec.clusterNetwork[0].hostPrefix}')
+
+yq -i '.parameters.cilium.cilium_helm_values.ipam.operator.clusterPoolIPv4MaskSize = "'"${HOST_PREFIX}"'"' \
+  "${CLUSTER_ID}.yml"
+yq -i '.parameters.cilium.cilium_helm_values.ipam.operator.clusterPoolIPv4PodCIDR = "'"${POD_CIDR}"'"' \
+  "${CLUSTER_ID}.yml"
+----
+
+. Commit changes
++
+[source,bash]
+----
+git commit -am "Migrate ${CLUSTER_ID} to Cilium"
+git push origin master
+popd
+----
+
+. Compile catalog
++
+[source,bash]
+----
+commodore catalog compile "${CLUSTER_ID}"
+----
+
+. Patch cluster network config
++
+[source,bash]
+----
+kubectl --as=cluster-admin patch network.config cluster \
+  --type=merge -p '{"spec":{"networkType":"Cilium"},"status":null}'
+kubectl --as=cluster-admin patch network.operator cluster \
+  --type=merge -p '{"spec":{"defaultNetwork":{"type":"Cilium"}},"status":null}'
+----
+
+. Apply Cilium manifests.
+We need to execute the `apply` twice, since the first apply fails to create the `CiliumConfig` resource before its CRD has been registered.
++
+[source,bash]
+----
+kubectl --as=cluster-admin apply -Rf catalog/manifests/cilium/
+----
++
+[source,bash]
+----
+kubectl --as=cluster-admin apply -Rf catalog/manifests/cilium/
+----
+
+. Wait until Cilium CNI is up and running
++
+[source,bash]
+----
+kubectl -n cilium get pods -w
+----
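+
+. Optionally, verify the Cilium agent health.
+If the component deploys the agent DaemonSet as `cilium` in the `cilium` namespace (adjust the names otherwise), you can ask one agent for a brief status summary.
++
+[source,bash]
+----
+kubectl -n cilium exec ds/cilium -c cilium-agent -- cilium status --brief <1>
+----
+<1> Prints `OK` once the agent on that node reports healthy.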
+
+== Finalize migration
+
+. Re-enable cluster network operator
++
+[IMPORTANT]
+====
+This will remove the previously active CNI plugin and will deploy the `kube-proxy` DaemonSet.
+As soon as you complete this step, existing pods may go into `CrashLoopBackOff` since they were started with CNI IPs managed by the old network plugin.
+====
++
+[source,bash]
+----
+kubectl --as=cluster-admin -n openshift-network-operator \
+  scale deployment network-operator --replicas=1
+kubectl --as=cluster-admin patch clusterversion version \
+  --type=merge -p '{"spec":{"overrides":null}}'
+----
+
+. Unpause all machine config pools
++
+[source,bash]
+----
+for mcp in $(kubectl get mcp -o name); do
+  kubectl --as=cluster-admin patch "$mcp" --type=merge -p '{"spec":{"paused":false}}'
+done
+----
++
+[NOTE]
+====
+You may need to grab the cluster-admin credentials to complete this step, since the OpenShift OAuth components may be unavailable until they're restarted with Cilium-managed IPs.
+====
++
+[TIP]
+====
+It may be necessary to force drain nodes manually to allow the machine-config-operator to reboot the nodes.
+Use `kubectl --as=cluster-admin drain <node> --ignore-daemonsets --delete-emptydir-data --force --disable-eviction` to circumvent PDB violations if necessary.
+
+Start with a master node, and ensure that the machine-config-operator is running on that master node after it's been drained and rebooted.
+====
+
+include::partial$enable-argocd-autosync.adoc[]
+
+== Cleanup alert silence
+
+include::partial$remove-alertmanager-silence-all-projectsyn.adoc[]
diff --git a/docs/modules/ROOT/partials/alertmanager-silence-job.adoc b/docs/modules/ROOT/partials/alertmanager-silence-job.adoc
index 45139245..ab6240e4 100644
--- a/docs/modules/ROOT/partials/alertmanager-silence-job.adoc
+++ b/docs/modules/ROOT/partials/alertmanager-silence-job.adoc
@@ -19,7 +19,7 @@ spec:
       restartPolicy: Never
       containers:
       - name: silence
-        image: quay.io/appuio/oc:v4.6
+        image: quay.io/appuio/oc:v4.13
         command:
         - bash
         - -c
@@ -41,6 +41,12 @@ ifeval::["{argo_app}" == ""]
             "name": "syn",
             "value": "true",
             "isRegex": false
+          },
+          {
+            "name": "alertname",
+            "value": "Watchdog",
+            "isRegex": false,
+            "isEqual": false
           }
 endif::[]
         ],
@@ -65,12 +71,22 @@ endif::[]
           - mountPath: /etc/ssl/certs/serving-certs/
             name: ca-bundle
             readOnly: true
+          - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
+            name: kube-api-access
+            readOnly: true
       serviceAccountName: prometheus-k8s
       volumes:
       - name: ca-bundle
        configMap:
           defaultMode: 288
           name: serving-certs-ca-bundle
+      - name: kube-api-access
+        projected:
+          defaultMode: 420
+          sources:
+          - serviceAccountToken:
+              expirationSeconds: 3607
+              path: 'token'
       EOJ
 ----
 ifeval::["{http-method}" == "POST"]
diff --git a/docs/modules/ROOT/partials/create-alertmanager-silence-all-projectsyn.adoc b/docs/modules/ROOT/partials/create-alertmanager-silence-all-projectsyn.adoc
new file mode 100644
index 00000000..c7677318
--- /dev/null
+++ b/docs/modules/ROOT/partials/create-alertmanager-silence-all-projectsyn.adoc
@@ -0,0 +1,25 @@
+// NOTE: this snippet only works correctly at the beginning of a numbered
+// list. I was unable to figure out how to define the page attributes in a way
+// that works for the alertmanager-silence-job.adoc partial without breaking
+// the list flow.
+:silence-target: all
+ifndef::duration[]
+:duration: +60 minutes
+endif::[]
+:http-method: POST
+:alertmanager-endpoint: /api/v2/silences
+
+. Silence all Project Syn alerts
++
+TIP: If customer alerts are routed through the cluster-monitoring alertmanager, you should inform the customer that their alerts will be silenced during the migration.
++
+include::partial$alertmanager-silence-job.adoc[]
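++
+Before extracting the silence ID, wait for the silence job to complete.
+This assumes the snippet above stored the job's name in `$job_name`.
++
+[source,bash]
+----
+kubectl --as=cluster-admin -n openshift-monitoring wait \
+  --for=condition=complete "job/${job_name}" --timeout=120s
+----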
+
+. Extract Alertmanager silence ID from job logs
++
+[source,bash]
+----
+silence_id=$(kubectl --as=cluster-admin -n openshift-monitoring logs jobs/${job_name} | \
+  jq -r '.silenceID')
+----
+
diff --git a/docs/modules/ROOT/partials/nav.adoc b/docs/modules/ROOT/partials/nav.adoc
index e6f8749a..51a5fc20 100644
--- a/docs/modules/ROOT/partials/nav.adoc
+++ b/docs/modules/ROOT/partials/nav.adoc
@@ -100,7 +100,8 @@
 ** xref:oc4:ROOT:how-tos/authentication/disable-self-provisioning.adoc[Disable project self-provisioning]
 ** xref:oc4:ROOT:explanations/sudo.adoc[]
 
-// Networking
+* Networking
+** xref:oc4:ROOT:how-tos/network/migrate-to-cilium.adoc[]
 
 * Ingress
 ** xref:oc4:ROOT:how-tos/ingress/self-signed-ingress-cert.adoc[]
diff --git a/docs/modules/ROOT/partials/remove-alertmanager-silence-all-projectsyn.adoc b/docs/modules/ROOT/partials/remove-alertmanager-silence-all-projectsyn.adoc
new file mode 100644
index 00000000..a7f74de4
--- /dev/null
+++ b/docs/modules/ROOT/partials/remove-alertmanager-silence-all-projectsyn.adoc
@@ -0,0 +1,18 @@
+// NOTE: this snippet only works correctly at the beginning of a numbered
+// list. I was unable to figure out how to define the page attributes in a way
+// that works for the alertmanager-silence-job.adoc partial without breaking
+// the list flow.
+:alertmanager-endpoint: /api/v2/silence/${silence_id}
+:silence-target: all
+:http-method: DELETE
+
+. Remove silence in Alertmanager
++
+include::partial$alertmanager-silence-job.adoc[]
+
+. Clean up Alertmanager silence jobs
++
+[source,bash,subs="attributes+"]
+----
+kubectl --as=cluster-admin -n openshift-monitoring delete jobs -l app=silence-{silence-target}-alerts
+----
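+
+. Verify that the silence jobs are gone.
+This optional check uses the same label selector as the cleanup command above.
++
+[source,bash,subs="attributes+"]
+----
+kubectl --as=cluster-admin -n openshift-monitoring get jobs -l app=silence-{silence-target}-alerts <1>
+----
+<1> This should return `No resources found in openshift-monitoring namespace`.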