diff --git a/content/en/docs/features/monitoring-and-diagnostics/disaster-recovery.md b/content/en/docs/features/monitoring-and-diagnostics/disaster-recovery.md index c4be6514..f5278fce 100644 --- a/content/en/docs/features/monitoring-and-diagnostics/disaster-recovery.md +++ b/content/en/docs/features/monitoring-and-diagnostics/disaster-recovery.md @@ -74,7 +74,7 @@ To back up the database, two approaches can be used: ![active-backup-replica-set](/docs/features/monitoring-and-diagnostics/static/disaster-recovery-active-replica-set-backup.drawio.svg) - The primary and standby cluster MongoDB members are in the same MongoDB replica set. The standby cluster members are configured as [hidden](https://www.mongodb.com/docs/manual/core/replica-set-hidden-member), [delayed](https://www.mongodb.com/docs/manual/core/replica-set-delayed-member/), and with [zero priority](https://www.mongodb.com/docs/manual/core/replica-set-priority-0-member/). When the primary cluster goes down, the standby cluster MongoDB members are promoted to standby state—one of them will become primary by administrator. After the primary is back online, the primary cluster members will be demoted to hidden. For switching back, the primary cluster members will be promoted to secondary MongoDB members and standby cluster members will be demoted. **This approach is supported by the plgd hub helm chart because it complies with the MongoDB Community Server license.** For setup instructions, please refer to this [tutorial](). + The primary and standby cluster MongoDB members are in the same MongoDB replica set. The standby cluster members are configured as [hidden](https://www.mongodb.com/docs/manual/core/replica-set-hidden-member), [delayed](https://www.mongodb.com/docs/manual/core/replica-set-delayed-member/), and with [zero priority](https://www.mongodb.com/docs/manual/core/replica-set-priority-0-member/). 
When the primary cluster goes down, the standby cluster MongoDB members are promoted to standby state—one of them will become primary by administrator. After the primary is back online, the primary cluster members will be demoted to hidden. For switching back, the primary cluster members will be promoted to secondary MongoDB members and standby cluster members will be demoted. **This approach is supported by the plgd hub helm chart because it complies with the MongoDB Community Server license.** For setup instructions, please refer to this [tutorial](/docs/tutorials/disaster-recovery-replica-set/). * **Cluster to cluster synchronization** diff --git a/content/en/docs/tutorials/disaster-recovery-replica-set.md b/content/en/docs/tutorials/disaster-recovery-replica-set.md new file mode 100644 index 00000000..87068168 --- /dev/null +++ b/content/en/docs/tutorials/disaster-recovery-replica-set.md @@ -0,0 +1,681 @@ +--- +title: 'Disaster Recovery via Replica Set' +description: 'How to perform disaster recovery with a Replica Set?' +date: '2024-06-17' +categories: [architecture, d2c, provisioning, disaster-recovery] +keywords: [architecture, d2c, provisioning, disaster-recovery] +weight: 15 +--- + +The plgd-hub Helm charts support disaster recovery via a MongoDB replica set because the source of truth is stored in the MongoDB database. It is required that devices have configured **device provisioning endpoints** for both clusters' device provisioning services. In this tutorial, we have two MicroK8s clusters: primary and standby. Each of them uses three root CA certificates: + +- `external CA certificate pair`: Used for public APIs (CoAP, HTTPS, gRPC) and is the same for both clusters. +- `internal CA certificate pair`: Used for plgd services to communicate with each other, MongoDB, and NATs. Each cluster has its own internal CA certificate. +- `storage CA certificate pair`: Used for MongoDB. Each cluster has its own storage CA certificate. 
+ +We also use an `authorization CA certificate` to communicate with the OAuth2 authorization server. In this tutorial, `mock-oauth-server` and its certificate are signed by the `external CA certificate pair`. Thus, we have only one `authorization CA certificate` for both clusters, which is the `external CA certificate`. + +The goal is to ensure that only the MongoDB instances of the primary and standby clusters can communicate with each other, while plgd services can only connect to the MongoDB in their own cluster. All APIs will be available on the root domain `primary.plgd.cloud` for the primary cluster and `standby.plgd.cloud` for the standby cluster. Additionally, MongoDB members are exposed via the LoadBalancer service type, and each member needs its own DNS name. + +| For the primary cluster we have | For the standby cluster we have | +| --- | --- | +| `mongodb-0.primary.plgd.cloud` | `mongodb-0.standby.plgd.cloud` | +| `mongodb-1.primary.plgd.cloud` | `mongodb-1.standby.plgd.cloud` | +| `mongodb-2.primary.plgd.cloud` | `mongodb-2.standby.plgd.cloud` | +| `mongodb.primary.plgd.cloud` | `----` | + +The `mongodb.primary.plgd.cloud` name is used by the standby cluster for external access to the MongoDB replica set. This DNS record is an alias for all members of the primary cluster. + +Each DNS name must resolve to the external IP address of its LoadBalancer service, because the other cluster uses that address to connect to the MongoDB replica set. For clouds, you can use the [external-dns](https://github.com/kubernetes-sigs/external-dns/) tool to create DNS records in AWS Route53 / Google Cloud DNS / Azure DNS. +In this tutorial, we show how to get the IPs of the MongoDB services, add them manually to `/etc/hosts` on the computer with IP 192.168.1.1, and then restart the dnsmasq daemon to load the changes. 
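The manual DNS step can be sketched as a small text filter. This is only an illustration: `to_hosts_entries` is a hypothetical helper, and it assumes its input has the `service-name:external-ip` form (for example `mongodb-0-external:192.168.1.202`) that a `kubectl get services` listing can be reduced to with `awk`. It prints `/etc/hosts` entries and does not modify any file.

```bash
# Hypothetical helper: convert "service-name:external-ip" lines read from
# stdin into /etc/hosts entries for the given root domain.
# Assumes IPv4 addresses and service names ending in "-external".
to_hosts_entries() {
  local domain="$1"
  awk -F: -v domain="$domain" '
    # Skip lines whose second field is not an IPv4-looking address,
    # e.g. a LoadBalancer that is still "<pending>".
    NF == 2 && $2 ~ /^[0-9.]+$/ {
      name = $1
      sub(/-external$/, "", name)   # mongodb-0-external -> mongodb-0
      print $2, name "." domain
    }'
}

# Example (prints the entry, nothing is written to /etc/hosts):
# printf 'mongodb-0-external:192.168.1.202\n' | to_hosts_entries primary.plgd.cloud
```

The helper does not add the `mongodb.primary.plgd.cloud` alias. For the primary cluster, that name still has to be appended to each entry by hand.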
+ +{{< note >}} +It is also recommended to set up a firewall between clusters with source IP address filtering to mitigate DDoS attacks on MongoDB. The default port for MongoDB is 27017. Alternatively, use a VPN to interconnect clusters. +{{< /note >}} + +## Installation + +### MicroK8s Prerequisites + +The following addons are expected to be enabled on both clusters, with **Kubernetes v1.24+** installed. + +```yaml +addons: + enabled: + cert-manager # (core) Cloud-native certificate management + dns # (core) CoreDNS + ha-cluster # (core) Configure high availability on the current node + helm # (core) Helm - the package manager for Kubernetes + helm3 # (core) Helm 3 - the package manager for Kubernetes + hostpath-storage # (core) Storage class; allocates storage from host directory + ingress # (core) Ingress controller for external access + metallb # (core) Loadbalancer for your Kubernetes cluster +``` + +The [dns](https://microk8s.io/docs/addon-dns) addon is configured to use a DNS server that hosts all records for the `primary.plgd.cloud` and `standby.plgd.cloud` domains. To configure DNS in MicroK8s, you can use the following commands: + +```bash +microk8s disable dns +microk8s enable dns:192.168.1.1 +``` + +For [metallb](https://microk8s.io/docs/addon-metallb), we need to set up the IP address pool for the LoadBalancer service type. The IP address pool needs to be accessible from the network where MicroK8s is running. It is important that the addresses in the pool are not used by any other device on the network and that the DHCP server does not assign them to any device. 
+ +Example for the primary cluster: + +```bash +microk8s disable metallb +microk8s enable metallb:192.168.1.200-192.168.1.219 +``` + +Example for the standby cluster: + +```bash +microk8s disable metallb +microk8s enable metallb:192.168.1.220-192.168.1.239 +``` + +### Creating Certificates + +To create certificates, you can use the cert-tool Docker image to generate root CA certificates for the services. + +1. Create the external CA certificate pair (same for both clusters): + + ```bash + mkdir -p .tmp/certs/external + docker run \ + --rm -v $(pwd)/.tmp/certs/external:/certs \ + --user $(id -u):$(id -g) \ + ghcr.io/plgd-dev/hub/cert-tool:vnext \ + --cmd.generateRootCA --outCert=/certs/tls.crt --outKey=/certs/tls.key \ + --cert.subject.cn=external.root.ca --cert.validFor=876000h + ``` + +2. Create the internal CA certificate pair for the primary cluster: + + ```bash + mkdir -p .tmp/primary/certs/internal + docker run \ + --rm -v $(pwd)/.tmp/primary/certs/internal:/certs \ + --user $(id -u):$(id -g) \ + ghcr.io/plgd-dev/hub/cert-tool:vnext \ + --cmd.generateRootCA --outCert=/certs/tls.crt --outKey=/certs/tls.key \ + --cert.subject.cn=primary.internal.root.ca --cert.validFor=876000h + ``` + +3. Create the storage CA certificate pair for the primary cluster: + + ```bash + mkdir -p .tmp/primary/certs/storage + docker run \ + --rm -v $(pwd)/.tmp/primary/certs/storage:/certs \ + --user $(id -u):$(id -g) \ + ghcr.io/plgd-dev/hub/cert-tool:vnext \ + --cmd.generateRootCA --outCert=/certs/tls.crt --outKey=/certs/tls.key \ + --cert.subject.cn=primary.storage.root.ca --cert.validFor=876000h + ``` + +4. 
Create the internal CA certificate pair for the standby cluster: + + ```bash + mkdir -p .tmp/standby/certs/internal + docker run \ + --rm -v $(pwd)/.tmp/standby/certs/internal:/certs \ + --user $(id -u):$(id -g) \ + ghcr.io/plgd-dev/hub/cert-tool:vnext \ + --cmd.generateRootCA --outCert=/certs/tls.crt --outKey=/certs/tls.key \ + --cert.subject.cn=standby.internal.root.ca --cert.validFor=876000h + ``` + +5. Create the storage CA certificate pair for the standby cluster: + + ```bash + mkdir -p .tmp/standby/certs/storage + docker run \ + --rm -v $(pwd)/.tmp/standby/certs/storage:/certs \ + --user $(id -u):$(id -g) \ + ghcr.io/plgd-dev/hub/cert-tool:vnext \ + --cmd.generateRootCA --outCert=/certs/tls.crt --outKey=/certs/tls.key \ + --cert.subject.cn=standby.storage.root.ca --cert.validFor=876000h + ``` + +### Setting up cert-manager on the Primary Cluster + +Ensure that you have cert-manager installed. + +1. Create an external TLS secret for issuing certificates: + + ```bash + kubectl -n cert-manager create secret tls external-plgd-ca-tls \ + --cert=.tmp/certs/external/tls.crt \ + --key=.tmp/certs/external/tls.key + ``` + +2. 
Create a ClusterIssuer that points to `external-plgd-ca-tls`: + + ```bash + cat < values.yaml +global: + domain: "$DOMAIN" + hubId: "$HUB_ID" + ownerClaim: "$OWNER_CLAIM" + standby: $STANDBY + extraCAPool: + authorization: | + $AUTHORIZATION_CA_IN_PEM + internal: | + $INTERNAL_CA_IN_PEM + $STORAGE_PRIMARY_CA_IN_PEM + $EXTERNAL_CA_IN_PEM + storage: | + $STORAGE_PRIMARY_CA_IN_PEM + $STORAGE_STANDBY_CA_IN_PEM + $INTERNAL_CA_IN_PEM +mockoauthserver: + enabled: true + oauth: + - name: "plgd.dps" + clientID: "test" + clientSecret: "test" + grantType: "clientCredentials" + redirectURL: "https://$DOMAIN/things" + scopes: ['openid'] + - name: "plgd.web" + clientID: "test" + clientSecret: "test" + redirectURL: "https://$DOMAIN/things" + scopes: ['openid'] + useInUi: true +mongodb: + tls: + extraDnsNames: + - "mongodb.$DOMAIN" + externalAccess: + enabled: true + service: + type: LoadBalancer + publicNames: + - "mongodb-0.$DOMAIN" + - "mongodb-1.$DOMAIN" + - "mongodb-2.$DOMAIN" + annotationsList: + - external-dns.alpha.kubernetes.io/hostname: "mongodb-0.$DOMAIN" + - external-dns.alpha.kubernetes.io/hostname: "mongodb-1.$DOMAIN" + - external-dns.alpha.kubernetes.io/hostname: "mongodb-2.$DOMAIN" +certmanager: + storage: + issuer: + kind: ClusterIssuer + name: storage-plgd-ca-issuer + internal: + issuer: + kind: ClusterIssuer + name: internal-plgd-ca-issuer + default: + ca: + issuerRef: + kind: ClusterIssuer + name: external-plgd-ca-issuer +httpgateway: + apiDomain: "$DOMAIN" +grpcgateway: + domain: "$DOMAIN" +certificateauthority: + domain: "$DOMAIN" +coapgateway: + service: + type: NodePort + nodePort: 15684 +resourcedirectory: + publicConfiguration: + coapGateway: "coaps+tcp://$DOMAIN:15684" +deviceProvisioningService: + apiDomain: "$DOMAIN" + service: + type: NodePort + image: + dockerConfigSecret: | + { + "auths": { + "ghcr.io": { + "auth": "$DOCKER_AUTH_TOKEN" + } + } + } + enrollmentGroups: + - id: "5db6ccde-05e1-480b-a522-c1591ad7dfd2" + owner: "1" + attestationMechanism: 
+ x509: + certificateChain: |- + $MANUFACTURER_CERTIFICATE_CA + hub: + coapGateway: "$DOMAIN:15684" + certificateAuthority: + grpc: + address: "$DOMAIN:443" + authorization: + provider: + name: "plgd.dps" + clientId: "test" + clientSecret: "test" + audience: "https://$DOMAIN" +EOF +helm upgrade -i -n plgd --create-namespace -f values.yaml hub plgd/plgd-hub +helm upgrade -i -n plgd --create-namespace -f values.yaml dps plgd/plgd-dps +``` + +Now we need to get the IP addresses of the MongoDB members and set them to the DNS. The external IP address of the LoadBalancer is used to connect to the MongoDB replica set from the other cluster. + +```bash +kubectl -n plgd get services | grep mongodb | grep LoadBalancer | awk '{print $1 ":" $4}' +mongodb-0-external:192.168.1.202 +mongodb-1-external:192.168.1.200 +mongodb-2-external:192.168.1.201 +``` + +Next, we need to set the DNS records for the primary cluster to the DNS server running on `192.168.1.1`. The `mongodb.primary.plgd.cloud` is an alias to all members of the primary cluster. + +```bash +echo " +192.168.1.202 mongodb-0.primary.plgd.cloud mongodb.primary.plgd.cloud +192.168.1.200 mongodb-1.primary.plgd.cloud mongodb.primary.plgd.cloud +192.168.1.201 mongodb-2.primary.plgd.cloud mongodb.primary.plgd.cloud +" | sudo tee -a /etc/hosts +sudo systemctl restart dnsmasq +``` + +After some time for the pods to start, you can access the Hub at `https://primary.plgd.cloud`. + +### Deploy plgd on Standby Cluster + +Deploying plgd to the standby cluster is similar to deploying it to the primary cluster. The differences are that the domain is `standby.plgd.cloud`, different internal and storage certificates are used, the standby flag is set to `true`, MongoDB is configured to use the master DB at `mongodb.primary.plgd.cloud`, and the `mongodb-standby-tool` job is enabled to configure the MongoDB replica set. 
+ +```bash +# Set variables +DOMAIN="standby.plgd.cloud" +PRIMARY_MONGO_DB="mongodb.primary.plgd.cloud" +HUB_ID="d03a1bb4-0a77-428c-b78c-1c46efe6a38e" +OWNER_CLAIM="https://plgd.dev/owner" +STANDBY=true +DOCKER_AUTH_TOKEN="" + +# Read certificate files +AUTHORIZATION_CA_IN_PEM=$(cat .tmp/certs/external/tls.crt) +EXTERNAL_CA_IN_PEM=$(cat .tmp/certs/external/tls.crt) +INTERNAL_CA_IN_PEM=$(cat .tmp/standby/certs/internal/tls.crt) +STORAGE_PRIMARY_CA_IN_PEM=$(cat .tmp/primary/certs/storage/tls.crt) +STORAGE_STANDBY_CA_IN_PEM=$(cat .tmp/standby/certs/storage/tls.crt) +MANUFACTURER_CERTIFICATE_CA="" + +# Create values.yaml file +cat <<EOF > values.yaml +global: + domain: "$DOMAIN" + hubId: "$HUB_ID" + ownerClaim: "$OWNER_CLAIM" + standby: $STANDBY + extraCAPool: + authorization: | + $AUTHORIZATION_CA_IN_PEM + internal: | + $INTERNAL_CA_IN_PEM + $STORAGE_STANDBY_CA_IN_PEM + $EXTERNAL_CA_IN_PEM + storage: | + $STORAGE_PRIMARY_CA_IN_PEM + $STORAGE_STANDBY_CA_IN_PEM + $INTERNAL_CA_IN_PEM +mockoauthserver: + enabled: true + oauth: + - name: "plgd.dps" + clientID: "test" + clientSecret: "test" + grantType: "clientCredentials" + redirectURL: "https://$DOMAIN/things" + scopes: ['openid'] + - name: "plgd.web" + clientID: "test" + clientSecret: "test" + redirectURL: "https://$DOMAIN/things" + scopes: ['openid'] + useInUi: true +mongodb: + standbyTool: + enabled: true + replicaSet: + standby: + members: + - "mongodb-0.$DOMAIN:27017" + - "mongodb-1.$DOMAIN:27017" + - "mongodb-2.$DOMAIN:27017" + externalAccess: + enabled: true + externalMaster: + enabled: true + host: "$PRIMARY_MONGO_DB" + service: + type: LoadBalancer + publicNames: + - "mongodb-0.$DOMAIN" + - "mongodb-1.$DOMAIN" + - "mongodb-2.$DOMAIN" + annotationsList: + - external-dns.alpha.kubernetes.io/hostname: "mongodb-0.$DOMAIN" + - external-dns.alpha.kubernetes.io/hostname: "mongodb-1.$DOMAIN" + - external-dns.alpha.kubernetes.io/hostname: "mongodb-2.$DOMAIN" +certmanager: + storage: + issuer: + kind: ClusterIssuer + name: 
storage-plgd-ca-issuer + internal: + issuer: + kind: ClusterIssuer + name: internal-plgd-ca-issuer + default: + ca: + issuerRef: + kind: ClusterIssuer + name: external-plgd-ca-issuer +httpgateway: + apiDomain: "$DOMAIN" +grpcgateway: + domain: "$DOMAIN" +certificateauthority: + domain: "$DOMAIN" +coapgateway: + service: + type: NodePort + nodePort: 15684 +resourcedirectory: + publicConfiguration: + coapGateway: "coaps+tcp://$DOMAIN:15684" +deviceProvisioningService: + apiDomain: "$DOMAIN" + service: + type: NodePort + image: + dockerConfigSecret: | + { + "auths": { + "ghcr.io": { + "auth": "$DOCKER_AUTH_TOKEN" + } + } + } + enrollmentGroups: + - id: "5db6ccde-05e1-480b-a522-c1591ad7dfd2" + owner: "1" + attestationMechanism: + x509: + certificateChain: |- + $MANUFACTURER_CERTIFICATE_CA + hub: + coapGateway: "$DOMAIN:15684" + certificateAuthority: + grpc: + address: "$DOMAIN:443" + authorization: + provider: + name: "plgd.dps" + clientId: "test" + clientSecret: "test" + audience: "https://$DOMAIN" +EOF +helm upgrade -i -n plgd --create-namespace -f values.yaml hub plgd/plgd-hub +helm upgrade -i -n plgd --create-namespace -f values.yaml dps plgd/plgd-dps +``` + +Next, we need to get the IP addresses of the MongoDB members and set them to the DNS server running on `192.168.1.1`, similar to the primary cluster. + +```bash +kubectl -n plgd get services | grep mongodb | grep LoadBalancer | awk '{print $1 ":" $4}' +echo " +192.168.1.222 mongodb-0.standby.plgd.cloud +192.168.1.220 mongodb-1.standby.plgd.cloud +192.168.1.221 mongodb-2.standby.plgd.cloud +" | sudo tee -a /etc/hosts +sudo systemctl restart dnsmasq +``` + +{{< note >}} +It is important that the `global.standby` flag is set to `true`, which means that plgd pods are not running on the standby cluster. +{{< /note >}} + +Once the MongoDB pods are running, we need to run the `mongodb-standby-tool` job to configure the MongoDB replica set. This configuration demotes the secondary members to hidden members. 
+ +```bash +kubectl -n plgd patch job/$(kubectl -n plgd get jobs | grep mongodb-standby-tool | awk '{print $1}') --type=strategic --patch '{"spec":{"suspend":false}}' +``` + +Now the job will create the pod and configure the MongoDB replica set. + +## Disaster Recovery + +{{< note >}} +These steps can also be used for planned maintenance. +{{< /note >}} + +### How to Switch to the Standby Cluster + +When the primary cluster is down, you need to switch to the standby cluster. + +#### Promote the Standby Cluster + +First, promote the hidden members to secondary members. To do this, upgrade the Helm chart with the `mongodb.standbyTool.mode` set to `active`. The active mode reconfigures the MongoDB replica set, promoting hidden members to secondary members and demoting the previous members to hidden. + +```bash +helm upgrade -i -n plgd --create-namespace -f values.yaml --set mongodb.standbyTool.mode=active hub plgd/plgd-hub +``` + +Next, delete the `mongodb-standby-tool` job and resume it to configure the MongoDB replica set. + +```bash +kubectl -n plgd delete job/$(kubectl -n plgd get jobs | grep mongodb-standby-tool | awk '{print $1}') +kubectl -n plgd patch job/$(kubectl -n plgd get jobs | grep mongodb-standby-tool | awk '{print $1}') --type=strategic --patch '{"spec":{"suspend":false}}' +``` + +The final step is to run plgd pods on the standby cluster. Set the `global.standby` flag to `false` and upgrade the Helm chart. + +```bash +helm upgrade -i -n plgd --create-namespace -f values.yaml --set global.standby=false hub plgd/plgd-hub +helm upgrade -i -n plgd --create-namespace -f values.yaml --set global.standby=false dps plgd/plgd-dps +``` + +After rotating the device provisioning endpoints, the devices will connect to the standby cluster. + +#### Turn Off plgd Pods on the Primary Cluster + +When the primary cluster is back up, set the `global.standby` flag to `true` and upgrade the Helm chart. 
+ +```bash +helm upgrade -i -n plgd --create-namespace -f values.yaml --set global.standby=true hub plgd/plgd-hub +helm upgrade -i -n plgd --create-namespace -f values.yaml --set global.standby=true dps plgd/plgd-dps +``` + +### How to Switch Back to the Primary Cluster + +When the primary cluster is ready for devices, switch back to the primary cluster. + +#### Demote the Standby Cluster + +First, promote the primary cluster's MongoDB hidden members to secondary members and demote the standby cluster's MongoDB secondary members to hidden. Upgrade the Helm chart with the `mongodb.standbyTool.mode` set to `standby`. + +```bash +helm upgrade -i -n plgd --create-namespace -f values.yaml --set mongodb.standbyTool.mode=standby hub plgd/plgd-hub +``` + +Next, delete the `mongodb-standby-tool` job and resume it to configure the MongoDB replica set. + +```bash +kubectl -n plgd delete job/$(kubectl -n plgd get jobs | grep mongodb-standby-tool | awk '{print $1}') +kubectl -n plgd patch job/$(kubectl -n plgd get jobs | grep mongodb-standby-tool | awk '{print $1}') --type=strategic --patch '{"spec":{"suspend":false}}' +``` + +The final step is to turn off the plgd pods on the standby cluster. Set the `global.standby` flag to `true` and upgrade the Helm chart. + +```bash +helm upgrade -i -n plgd --create-namespace -f values.yaml --set global.standby=true hub plgd/plgd-hub +helm upgrade -i -n plgd --create-namespace -f values.yaml --set global.standby=true dps plgd/plgd-dps +``` + +#### Turn On plgd Pods on the Primary Cluster + +Once the standby cluster has been demoted, turn the plgd pods on the primary cluster back on. Set the `global.standby` flag to `false` and upgrade the Helm chart. 
+ +```bash +helm upgrade -i -n plgd --create-namespace -f values.yaml --set global.standby=false hub plgd/plgd-hub +helm upgrade -i -n plgd --create-namespace -f values.yaml --set global.standby=false dps plgd/plgd-dps +``` + +After rotating the device provisioning endpoints, the devices will connect to the primary cluster.
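Every switch-over step above ends with the same pair of `helm upgrade` commands, differing only in the value of `global.standby`. A minimal sketch of a wrapper, assuming the release names, namespace, and `values.yaml` used in this tutorial, is shown below. The `set_standby` function and the `HELM` override variable are hypothetical names introduced here for illustration and are not part of the plgd charts.

```bash
# Hypothetical helper: toggle global.standby for both releases on the
# cluster that the current kubectl/helm context points to.
# HELM can be overridden (e.g. HELM=echo) to preview the commands.
set_standby() {
  local flag="$1" release
  case "$flag" in
    true|false) ;;
    *) echo "usage: set_standby true|false" >&2; return 1 ;;
  esac
  for release in hub dps; do
    # Same command as in the tutorial, parameterized by release and flag.
    ${HELM:-helm} upgrade -i -n plgd --create-namespace -f values.yaml \
      --set "global.standby=$flag" "$release" "plgd/plgd-$release" || return 1
  done
}

# Preview the commands without touching the cluster:
# HELM=echo set_standby false
```

The function only validates its argument. It still has to be run with the `values.yaml` that belongs to the cluster being changed.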