Add how-to for migrating to Cilium #281

Merged 2 commits on Oct 19, 2023
docs/modules/ROOT/pages/how-tos/network/migrate-to-cilium.adoc (new file, 264 additions)

= Migrate to Cilium CNI

== Prerequisites

* `cluster-admin` privileges
* `kubectl`
* `jq`
* `curl`
* Working `commodore` command

== Prepare for migration

IMPORTANT: Make sure that your `$KUBECONFIG` points to the cluster you want to migrate before starting.

:duration: +120 minutes
include::partial$create-alertmanager-silence-all-projectsyn.adoc[]

. Select cluster
+
[source,bash]
----
export CLUSTER_ID=c-cluster-id-1234 <1>
export COMMODORE_API_URL=https://api.syn.vshn.net <2>
export TENANT_ID=$(curl -sH "Authorization: Bearer $(commodore fetch-token)" \
"${COMMODORE_API_URL}/clusters/${CLUSTER_ID}" | jq -r '.tenant')
----
<1> Replace with the Project Syn cluster ID of the cluster to migrate
<2> Replace with the Lieutenant API on which the cluster is registered
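+
The Lieutenant API returns the cluster object as JSON, and `jq -r '.tenant'` extracts the tenant ID from it. A minimal sketch with an illustrative response body (the IDs below are made-up examples, not real cluster or tenant IDs):

```shell
# Illustrative response shape for the Lieutenant clusters endpoint;
# the field values here are examples only.
response='{"id":"c-cluster-id-1234","tenant":"t-tenant-id-5678"}'
echo "$response" | jq -r '.tenant'   # prints t-tenant-id-5678
```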

. Disable ArgoCD auto-sync for the `root` app and for components `openshift4-nodes` and `openshift-upgrade-controller`
+
[source,bash]
----
kubectl --as=cluster-admin -n syn patch apps root --type=json \
-p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
kubectl --as=cluster-admin -n syn patch apps openshift4-nodes --type=json \
-p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
kubectl --as=cluster-admin -n syn patch apps openshift-upgrade-controller --type=json \
-p '[{"op":"replace", "path":"/spec/syncPolicy", "value": {}}]'
----

. Disable the cluster-network-operator.
This ensures that the cluster-network-operator can't interfere with the migration.
Also scale down the upgrade controller so that the `ClusterVersion` object can be patched.
+
[source,bash]
----
kubectl --as=cluster-admin -n appuio-openshift-upgrade-controller \
scale deployment openshift-upgrade-controller-controller-manager --replicas=0
----
+
[source,bash]
----
kubectl --as=cluster-admin patch clusterversion version \
--type=merge \
-p '
{"spec":{"overrides":[
{
"kind": "Deployment",
"group": "apps",
"name": "network-operator",
"namespace": "openshift-network-operator",
"unmanaged": true
}
]}}'
----
+
[source,bash]
----
kubectl --as=cluster-admin -n openshift-network-operator \
scale deploy network-operator --replicas=0
----

. Verify that the network operator has been scaled down.
+
[source,bash]
----
kubectl -n openshift-network-operator get pods <1>
----
<1> This should return `No resources found in openshift-network-operator namespace`.
+
[TIP]
====
If the operator is still running, check the following conditions:

* The APPUiO OpenShift upgrade controller must be scaled down.
* The `ClusterVersion` object must have an override to make the network operator deployment unmanaged.
====
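+
The second condition in the tip above can be checked non-interactively. A minimal sketch, assuming `jq` is available; the helper name `override_present` is illustrative and not part of any official tooling:

```shell
# Hypothetical helper: succeeds when the ClusterVersion JSON marks the
# network-operator Deployment as unmanaged. Pipe in the output of
# `kubectl get clusterversion version -o json`.
override_present() {
  jq -e '.spec.overrides[]?
         | select(.kind == "Deployment"
                  and .name == "network-operator"
                  and .unmanaged == true)' >/dev/null
}

# Example usage against the live object:
#   kubectl get clusterversion version -o json | override_present \
#     && echo "network-operator is unmanaged"
```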

. Remove network operator applied state
+
[source,bash]
----
kubectl --as=cluster-admin -n openshift-network-operator \
delete configmap applied-cluster
----

. Pause all machine config pools
+
[source,bash]
----
for mcp in $(kubectl get mcp -o name); do
kubectl --as=cluster-admin patch $mcp --type=merge -p '{"spec": {"paused": true}}'
done
----

== Migrate to Cilium

. Get local cluster working directory
+
[source,bash]
----
commodore catalog compile "$CLUSTER_ID" <1>
----
<1> We recommend switching to an empty directory to run this command.
Alternatively, switch to your existing directory for the cluster.

. Enable component `cilium`
+
[source,bash]
----
pushd inventory/classes/"${TENANT_ID}"
yq -i '.applications += "cilium"' "${CLUSTER_ID}.yml"
----

. Update `upstreamRules` for monitoring
+
[source,bash]
----
yq -i ".parameters.openshift4_monitoring.upstreamRules.networkPlugin = \"cilium\"" \
"${CLUSTER_ID}.yml"
----

. Update component `networkpolicy` config
+
[source,bash]
----
yq eval -i '.parameters.networkpolicy.networkPlugin = "cilium"' \
"${CLUSTER_ID}.yml"
yq eval -i '.parameters.networkpolicy.ignoredNamespaces = ["openshift-oauth-apiserver"]' \
"${CLUSTER_ID}.yml"
----

. Configure component `cilium`.
We explicitly configure the K8s API endpoint to ensure that the Cilium operator doesn't access the API through the cluster network during the migration.
+
TIP: When running Cilium with `kubeProxyReplacement=partial`, the API endpoint configuration can be removed after the migration is completed.
+
.Explicitly configure the K8s API endpoint
[source,bash]
----
yq -i '.parameters.cilium.cilium_helm_values.k8sServiceHost="api-int.${openshift:baseDomain}"' \
"${CLUSTER_ID}.yml" <1>
yq -i '.parameters.cilium.cilium_helm_values.k8sServicePort="6443"' \
"${CLUSTER_ID}.yml"
----
<1> On vSphere clusters, you may need to use `api.${openshift:baseDomain}`.
+
.Configure the cluster Pod and Service CIDRs
[source,bash]
----
POD_CIDR=$(kubectl get network.config cluster \
-o jsonpath='{.spec.clusterNetwork[0].cidr}')
HOST_PREFIX=$(kubectl get network.config cluster \
-o jsonpath='{.spec.clusterNetwork[0].hostPrefix}')

yq -i '.parameters.cilium.cilium_helm_values.ipam.operator.clusterPoolIPv4MaskSize = "'"${HOST_PREFIX}"'"' \
"${CLUSTER_ID}.yml"
yq -i '.parameters.cilium.cilium_helm_values.ipam.operator.clusterPoolIPv4PodCIDR = "'"${POD_CIDR}"'"' \
"${CLUSTER_ID}.yml"
----
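+
As a sanity check, `hostPrefix` determines the size of the pod-IP block each node receives from the cluster pool: a node gets 2^(32 - hostPrefix) IPv4 addresses. A quick sketch, using the common OpenShift default of `23` as an example value rather than reading it from your cluster:

```shell
# Pod addresses available per node for a given hostPrefix (IPv4).
HOST_PREFIX=23   # example value; on a real cluster use the HOST_PREFIX queried above
echo $(( 2 ** (32 - HOST_PREFIX) ))   # prints 512 for hostPrefix 23
```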

. Commit changes
+
[source,bash]
----
git commit -am "Migrate ${CLUSTER_ID} to Cilium"
git push origin master
popd
----

. Compile catalog
+
[source,bash]
----
commodore catalog compile "${CLUSTER_ID}"
----

. Patch cluster network config
+
[source,bash]
----
kubectl --as=cluster-admin patch network.config cluster \
--type=merge -p '{"spec":{"networkType":"Cilium"},"status":null}'
kubectl --as=cluster-admin patch network.operator cluster \
--type=merge -p '{"spec":{"defaultNetwork":{"type":"Cilium"}},"status":null}'
----

. Apply Cilium manifests.
The `apply` must be executed twice: the first run fails to create the `CiliumConfig` resource because its CRD isn't established yet, and the second run succeeds.
+
[source,bash]
----
kubectl --as=cluster-admin apply -Rf catalog/manifests/cilium/
----
+
[source,bash]
----
kubectl --as=cluster-admin apply -Rf catalog/manifests/cilium/
----

. Wait until Cilium CNI is up and running
+
[source,bash]
----
kubectl -n cilium get pods -w
----

== Finalize migration

. Re-enable cluster network operator
+
[IMPORTANT]
====
This will remove the previously active CNI plugin and will deploy the kube-proxy daemonset.
As soon as you complete this step, existing pods may go into `CrashLoopBackOff` since they were started with CNI IPs managed by the old network plugin.
====
+
[source,bash]
----
kubectl --as=cluster-admin -n openshift-network-operator \
scale deployment network-operator --replicas=1
kubectl --as=cluster-admin patch clusterversion version \
--type=merge -p '{"spec":{"overrides":null}}'
----

. Unpause MCPs
+
[source,bash]
----
for mcp in $(kubectl get mcp -o name); do
kubectl --as=cluster-admin patch $mcp --type=merge -p '{"spec":{"paused":false}}'
done
----
+
[NOTE]
====
You may need to grab the cluster-admin credentials to complete this step since the OpenShift OAuth components may be unavailable until they're restarted with Cilium-managed IPs.
====
+
[TIP]
====
It may be necessary to force drain nodes manually to allow the machine-config-operator to reboot the nodes.
Use `kubectl --as=cluster-admin drain <node> --ignore-daemonsets --delete-emptydir-data --force --disable-eviction` to circumvent PDB violations if necessary.

Start with a master node, and ensure that the machine-config-operator is running on that master node after it's been drained and rebooted.
====
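+
When several nodes are affected, it can help to stage the force-drain by printing the commands first and running them one node at a time. A sketch, with illustrative node names:

```shell
# Print -- rather than run -- the force-drain commands so they can be
# reviewed node by node. The node names below are examples; on a real
# cluster use: kubectl get nodes -o name
nodes="node/master-1 node/master-2 node/master-3"
for node in $nodes; do
  echo kubectl --as=cluster-admin drain "$node" \
    --ignore-daemonsets --delete-emptydir-data --force --disable-eviction
done
```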

include::partial$enable-argocd-autosync.adoc[]

== Clean up alert silence

include::partial$remove-alertmanager-silence-all-projectsyn.adoc[]
docs/modules/ROOT/partials/alertmanager-silence-job.adoc (29 additions, 4 deletions)

[source,bash,subs="attributes+"]
----
if [[ "$OSTYPE" == "darwin"* ]]; then alias date=gdate; fi
job_name=$(printf "{http-method}-silence-{silence-target}-alerts-$(date +%s)" | tr '[:upper:]' '[:lower:]')
ifeval::["{http-method}" == "POST"]
silence_duration='{duration}' <1>
endif::[]
kind: Job
metadata:
name: ${job_name}
labels:
app: silence-{silence-target}-alerts
spec:
backoffLimit: 0
template:
spec:
restartPolicy: Never
containers:
- name: silence
image: quay.io/appuio/oc:v4.13
command:
- bash
- -c
ifeval::["{http-method}" == "POST"]
read -d "" body << EOF
{
"matchers": [
ifeval::["{argo_app}" != ""]
{
"name": "syn_component",
"value": "{argo_app}",
"isRegex": false
}
endif::[]
ifeval::["{argo_app}" == ""]
{
"name": "syn",
"value": "true",
"isRegex": false
},
{
"name": "alertname",
"value": "Watchdog",
"isRegex": false,
"isEqual": false
}
endif::[]
],
"startsAt": "$(date -u +'%Y-%m-%dT%H:%M:%S')",
"endsAt": "$(date -u +'%Y-%m-%dT%H:%M:%S' --date "${silence_duration}")",
"createdBy": "$(kubectl config current-context | cut -d/ -f3)",
"comment": "Silence {silence-target} alerts"
}
EOF

endif::[]
- mountPath: /etc/ssl/certs/serving-certs/
name: ca-bundle
readOnly: true
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access
readOnly: true
serviceAccountName: prometheus-k8s
volumes:
- name: ca-bundle
configMap:
defaultMode: 288
name: serving-certs-ca-bundle
- name: kube-api-access
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: 'token'
EOJ
----
ifeval::["{http-method}" == "POST"]

New file (25 additions)

// NOTE: this snippet only works correctly at the beginning of a numbered
// list. I was unable to figure out how to define the page attributes in a way
// that works for the alertmanager-silence-job.adoc partial without breaking
// the list flow.
:silence-target: all
ifndef::duration[]
:duration: +60 minutes
endif::[]
:http-method: POST
:alertmanager-endpoint: /api/v2/silences

. Silence all Project Syn alerts
+
TIP: If customer alerts are routed through the cluster-monitoring alertmanager, you should inform the customer that their alerts will be silenced during the migration.
+
include::partial$alertmanager-silence-job.adoc[]

. Extract Alertmanager silence ID from job logs
+
[source,bash]
----
silence_id=$(kubectl --as=cluster-admin -n openshift-monitoring logs jobs/${job_name} | \
jq -r '.silenceID')
----

Existing file (1 addition)
// the list flow.
:http-method: POST
:alertmanager-endpoint: /api/v2/silences
:silence-target: {argo_app}

. Set a silence in Alertmanager for all {argo_app} alerts
+

docs/modules/ROOT/partials/nav.adoc (2 additions, 1 deletion)

** xref:oc4:ROOT:how-tos/authentication/disable-self-provisioning.adoc[Disable project self-provisioning]
** xref:oc4:ROOT:explanations/sudo.adoc[]

* Networking
** xref:oc4:ROOT:how-tos/network/migrate-to-cilium.adoc[]

* Ingress
** xref:oc4:ROOT:how-tos/ingress/self-signed-ingress-cert.adoc[]