GKE - Restore Gets Stuck due to Cluster Reconciliation #3522

codeviking · 2021-03-04T18:07:47Z

codeviking
Mar 4, 2021

I'm using Velero with a GKE cluster with a moderate number of workloads (~130 namespaces, where each has ~2 deployments with a couple of replicas per each).

I'm trying to use Velero to do a full restore of the cluster on a newly provisioned GKE cluster. The first time I attempt the restore it always fails and ends up stuck in the InProgress state. This occurs because the GKE cluster enters the RECONCILING state while the backup is occurring, which causes the Kube API to temporarily become inaccessible (which in turn causes the restore to fail).

I can "fix" this by executing the same restore after the cluster exits the RECONCILING state, but I'm left with a Velero restore that's forever stuck in the InProgress state. Maybe that's ok and I should just ignore it, but I don't like leaving things in an indeterminate state 😅.

I'm curious if others are running into this and if there's a workaround. I also wonder if this could be solved by adding a velero restore resume command, which would allow an operator to manually restart a restore. In my mind it's ok if it simply starts from the beginning since Velero is good about not overwriting things that already exist.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GKE - Restore Gets Stuck due to Cluster Reconciliation #3522

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 0 comments

Select a reply

GKE - Restore Gets Stuck due to Cluster Reconciliation #3522

codeviking Mar 4, 2021

Replies: 0 comments

codeviking
Mar 4, 2021