Skip to content

Commit

Permalink
Add how-to for migrating to Cilium
Browse files Browse the repository at this point in the history
  • Loading branch information
simu committed Oct 18, 2023
1 parent b8b3b71 commit 81bae03
Show file tree
Hide file tree
Showing 5 changed files with 326 additions and 2 deletions.
264 changes: 264 additions & 0 deletions docs/modules/ROOT/pages/how-tos/network/migrate-to-cilium.adoc
Original file line number Diff line number Diff line change
@@ -0,0 +1,264 @@
= Migrate to Cilium CNI

== Prerequisites

* `cluster-admin` privileges
* `kubectl`
* `jq`
* `curl`
* Working `commodore` command

== Prepare for migration

IMPORTANT: Make sure that your `$KUBECONFIG` points to the cluster you want to migrate before starting.

:duration: +120 minutes
include::partial$create-alertmanager-silence-all-projectsyn.adoc[]

. Select cluster
+
[source,bash]
----
export CLUSTER_ID=c-cluster-id-1234 <1>
export COMMODORE_API_URL=https://api.syn.vshn.net <2>
export TENANT_ID=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" \
"${COMMODORE_API_URL}/clusters/${CLUSTER_ID}" | jq -r '.tenant')
----
<1> Replace with the Project Syn cluster ID of the cluster to migrate
<2> Replace with the Lieutenant API on which the cluster is registered

. Disable ArgoCD auto sync for components `openshift4-nodes` and `openshift-upgrade-controller`
+
[source,bash]
----
kubectl --as=cluster-admin -n syn patch apps root --type=json \
-p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
kubectl --as=cluster-admin -n syn patch apps openshift4-nodes --type=json \
-p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
kubectl --as=cluster-admin -n syn patch apps openshift-upgrade-controller --type=json \
-p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
----

. Disable the cluster-network-operator.
This is necessary to ensure that we can migrate to Cilium without the cluster-network-operator trying to interfere.
We also need to scale down the upgrade controller, so that we can patch the `ClusterVersion` object.
+
[source,bash]
----
kubectl --as=cluster-admin -n appuio-openshift-upgrade-controller \
scale deployment openshift-upgrade-controller-controller-manager --replicas=0
----
+
[source,bash]
----
kubectl --as=cluster-admin patch clusterversion version \
--type=merge \
-p '
{"spec":{"overrides":[
{
"kind": "Deployment",
"group": "apps",
"name": "network-operator",
"namespace": "openshift-network-operator",
"unmanaged": true
}
]}}'
----
+
[source,bash]
----
kubectl --as=cluster-admin -n openshift-network-operator \
scale deploy network-operator --replicas=0
----

. Verify that the network operator has been scaled down.
+
[source,bash]
----
kubectl -n openshift-network-operator get pods <1>
----
<1> This should return `No resources found in openshift-network-operator namespace`.
+
[TIP]
====
If the operator is still running, check the following conditions:
* The APPUiO OpenShift upgrade controller must be scaled down.
* The `ClusterVersion` object must have an override to make the network operator deployment unmanaged.
====

. Remove network operator applied state
+
[source,bash]
----
kubectl --as=cluster-admin -n openshift-network-operator \
delete configmap applied-cluster
----

. Pause all machine config pools
+
[source,bash]
----
for mcp in $(kubectl get mcp -o name); do
kubectl --as=cluster-admin patch $mcp --type=merge -p '{"spec": {"paused": true}}'
done
----

== Migrate to Cilium

. Get local cluster working directory
+
[source,bash]
----
commodore catalog compile "$CLUSTER_ID" <1>
----
<1> We recommend switching to an empty directory to run this command.
Alternatively, switch to your existing directory for the cluster.

. Enable component `cilium`
+
[source,bash]
----
pushd inventory/classes/"${TENANT_ID}"
yq -i '.applications += "cilium"' "${CLUSTER_ID}.yml"
----

. Update `upstreamRules` for monitoring
+
[source,bash]
----
yq -i ".parameters.openshift4_monitoring.upstreamRules.networkPlugin = \"cilium\"" \
"${CLUSTER_ID}.yml"
----

. Update component `networkpolicy` config
+
[source,bash]
----
yq eval -i '.parameters.networkpolicy.networkPlugin = "cilium"' \
"${CLUSTER_ID}.yml"
yq eval -i '.parameters.networkpolicy.ignoredNamespaces = ["openshift-oauth-apiserver"]' \
"${CLUSTER_ID}.yml"
----

. Configure component `cilium`.
We explicitly configure the K8s API endpoint to ensure that the Cilium operator doesn't access the API through the cluster network.
+
TIP: When running Cilium with `kubeProxyReplacement=partial`, the API endpoint configuration can be removed after the migration is completed.
+
.Explicitly configure the K8s API endpoint
[source,bash]
----
yq -i '.parameters.cilium.cilium_helm_values.k8sServiceHost="api-int.${openshift:baseDomain}"' \
"${CLUSTER_ID}.yml" <1>
yq -i '.parameters.cilium.cilium_helm_values.k8sServicePort="6443"' \
"${CLUSTER_ID}.yml"
----
<1> On vSphere clusters, you may need to use `api.${openshift:baseDomain}`.
+
.Configure the cluster Pod and Service CIDRs
[source,bash]
----
POD_CIDR=$(kubectl get network.config cluster \
-o jsonpath='{.spec.clusterNetwork[0].cidr}')
HOST_PREFIX=$(kubectl get network.config cluster \
-o jsonpath='{.spec.clusterNetwork[0].hostPrefix}')
yq -i '.parameters.cilium.cilium_helm_values.ipam.operator.clusterPoolIPv4MaskSize = "'"${HOST_PREFIX}"'"' \
"${CLUSTER_ID}.yml"
yq -i '.parameters.cilium.cilium_helm_values.ipam.operator.clusterPoolIPv4PodCIDR = "'"${POD_CIDR}"'"' \
"${CLUSTER_ID}.yml"
----

. Commit changes
+
[source,bash]
----
git commit -am "Migrate ${CLUSTER_ID} to Cilium"
git push origin master
popd
----

. Compile catalog
+
[source,yaml]
----
commodore catalog compile "${CLUSTER_ID}"
----

. Patch cluster network config
+
[source,bash]
----
kubectl --as=cluster-admin patch network.config cluster \
--type=merge -p '{"spec":{"networkType":"Cilium"},"status":null}'
kubectl --as=cluster-admin patch network.operator cluster \
--type=merge -p '{"spec":{"defaultNetwork":{"type":"Cilium"}},"status":null}'
----

. Apply Cilium manifests.
We need to execute the `apply` twice, since the first apply will fail to create the `CiliumConfig` resource.
+
[source,bash]
----
kubectl --as=cluster-admin apply -Rf catalog/manifests/cilium/
----
+
[source,bash]
----
kubectl --as=cluster-admin apply -Rf catalog/manifests/cilium/
----

. Wait until Cilium CNI is up and running
+
[source,bash]
----
kubectl -n cilium get pods -w
----

== Finalize migration

. Re-enable cluster network operator
+
[IMPORTANT]
====
This will remove the previously active CNI plugin and will deploy the kube-proxy daemonset.
As soon as you complete this step, existing pods may go into `CrashLoopBackOff` since they were started with CNI IPs managed by the old network plugin.
====

+
[source,bash]
----
kubectl --as=cluster-admin -n openshift-network-operator \
scale deployment network-operator --replicas=1
kubectl --as=cluster-admin patch clusterversion version \
--type=merge -p '{"spec":{"overrides":null}}'
----

. Unpause MCPs
+
[source,bash]
----
for mcp in $(kubectl get mcp -o name); do
kubectl --as=cluster-admin patch $mcp --type=merge -p '{"spec":{"paused":false}}'
done
----
+
[NOTE]
====
You may need to grab the cluster-admin credentials to complete this step since the OpenShift OAuth components may be unavailable until they're restarted with Cilium-managed IPs.
====
+
[TIP]
====
It may be necessary to force drain nodes manually to allow the machine-config-operator to reboot the nodes.
Use `kubectl --as=cluster-admin drain --ignore-daemonsets --delete-emptydir-data --force --disable-eviction` to circumvent PDB violations if necessary.
Start with a master node, and ensure that the machine-config-operator is running on that master node after it's been drained and rebooted.
====

include::partial$enable-argocd-autosync.adoc[]

== Cleanup alert silence

include::partial$remove-alertmanager-silence-all-projectsyn.adoc[]
18 changes: 17 additions & 1 deletion docs/modules/ROOT/partials/alertmanager-silence-job.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@ spec:
restartPolicy: Never
containers:
- name: silence
image: quay.io/appuio/oc:v4.6
image: quay.io/appuio/oc:v4.13
command:
- bash
- -c
Expand All @@ -41,6 +41,12 @@ ifeval::["{argo_app}" == ""]
"name": "syn",
"value": "true",
"isRegex": false
},
{
"name": "alertname",
"value": "Watchdog",
"isRegex": false,
"isEqual": false
}
endif::[]
],
Expand All @@ -65,12 +71,22 @@ endif::[]
- mountPath: /etc/ssl/certs/serving-certs/
name: ca-bundle
readOnly: true
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access
readOnly: true
serviceAccountName: prometheus-k8s
volumes:
- name: ca-bundle
configMap:
defaultMode: 288
name: serving-certs-ca-bundle
- name: kube-api-access
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: 'token'
EOJ
----
ifeval::["{http-method}" == "POST"]
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
// NOTE: this snippet only works correctly at the beginning of a numbered
// list. I was unable to figure out how to define the page attributes in a way
// that works for the alertmanager-silence-job.adoc partial without breaking
// the list flow.
:silence-target: all
ifndef::duration[]
:duration: +60 minutes
endif::[]
:http-method: POST
:alertmanager-endpoint: /api/v2/silences

. Silence all Project Syn alerts
+
TIP: If customer alerts are routed through the cluster-monitoring alertmanager, you should inform the customer that their alerts will be silenced during the migration.
+
include::partial$alertmanager-silence-job.adoc[]

. Extract Alertmanager silence ID from job logs
+
[source,bash]
----
silence_id=$(kubectl --as=cluster-admin -n openshift-monitoring logs jobs/${job_name} | \
jq -r '.silenceID')
----

3 changes: 2 additions & 1 deletion docs/modules/ROOT/partials/nav.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -100,7 +100,8 @@
** xref:oc4:ROOT:how-tos/authentication/disable-self-provisioning.adoc[Disable project self-provisioning]
** xref:oc4:ROOT:explanations/sudo.adoc[]
// Networking
* Networking
** xref:oc4:ROOT:how-tos/network/migrate-to-cilium.adoc[]
* Ingress
** xref:oc4:ROOT:how-tos/ingress/self-signed-ingress-cert.adoc[]
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
// NOTE: this snippet only works correctly at the beginning of a numbered
// list. I was unable to figure out how to define the page attributes in a way
// that works for the alertmanager-silence-job.adoc partial without breaking
// the list flow.
:alertmanager-endpoint: /api/v2/silence/${silence_id}
:silence-target: all
:http-method: DELETE

. Remove silence in Alertmanager
+
include::partial$alertmanager-silence-job.adoc[]

. Clean up Alertmanager silence jobs
+
[source,bash,subs="attributes+"]
----
kubectl --as=cluster-admin -n openshift-monitoring delete jobs -l app=silence-{silence-target}-alerts
----

0 comments on commit 81bae03

Please sign in to comment.