feat: force pods with volumes to be scheduled on Cloud servers (#743)
Due to a bug in the scheduler, a node with no driver instance might be
picked, leaving the volume stuck in pending because the "no capacity ->
reschedule" recovery is never triggered
[[0]](kubernetes/kubernetes#122109),
[[1]](kubernetes-csi/external-provisioner#544).

- See #400

---------

Co-authored-by: lukasmetzner <lukas@metzner.io>
Co-authored-by: Julian Tölle <julian.toelle@hetzner-cloud.de>
3 people authored Oct 29, 2024
1 parent 7211dd8 commit 702fe01
Showing 9 changed files with 66 additions and 2 deletions.
4 changes: 4 additions & 0 deletions chart/.snapshots/default.yaml
@@ -142,6 +142,10 @@ spec:
operator: NotIn
values:
- "true"
- key: instance.hetzner.cloud/provided-by
operator: NotIn
values:
- robot
tolerations:
- effect: NoExecute
operator: Exists
2 changes: 2 additions & 0 deletions chart/.snapshots/full.values.yaml
@@ -370,3 +370,5 @@ storageClasses:
- name: foobar
defaultStorageClass: false
reclaimPolicy: Keep
allowedTopologyCloudServer: false

4 changes: 4 additions & 0 deletions chart/.snapshots/full.yaml
@@ -245,6 +245,10 @@ spec:
operator: NotIn
values:
- "true"
- key: instance.hetzner.cloud/provided-by
operator: NotIn
values:
- robot
nodeSelector:
foo: bar
tolerations:
7 changes: 7 additions & 0 deletions chart/templates/core/storageclass.yaml
@@ -10,6 +10,13 @@ provisioner: csi.hetzner.cloud
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: {{ $val.reclaimPolicy | quote }}
{{- if $val.allowedTopologyCloudServer }}
allowedTopologies:
- matchLabelExpressions:
  - key: instance.hetzner.cloud/provided-by
    values:
    - "cloud"
{{- end }}
---
{{- end }}
{{- end }}
6 changes: 6 additions & 0 deletions chart/values.yaml
@@ -556,6 +556,10 @@ node:
operator: NotIn
values:
- "true"
- key: "instance.hetzner.cloud/provided-by"
operator: NotIn
values:
- "robot"

## @param node.nodeSelector Node labels for node pods assignment
## ref: https://kubernetes.io/docs/user-guide/node-selection/
@@ -724,3 +728,5 @@ storageClasses:
- name: hcloud-volumes
defaultStorageClass: true
reclaimPolicy: Delete
## @param storageClass.allowedTopologyCloudServer Prevents pods from being scheduled on nodes, specifically Robot servers, where Hetzner volumes are unavailable
allowedTopologyCloudServer: false
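
The node affinity rule added to `chart/values.yaml` uses the `NotIn` operator, and it may help to see what that implies for nodes that carry no `instance.hetzner.cloud/provided-by` label at all. The sketch below is a simplified model of `NotIn` label matching, not the scheduler's actual code; the helper name `notIn` is hypothetical.

```go
package main

import "fmt"

// notIn is a hypothetical helper that models the `NotIn` operator:
// it matches when the key is absent, or when its value is not listed.
func notIn(labels map[string]string, key string, values ...string) bool {
	got, ok := labels[key]
	if !ok {
		// A missing label still satisfies NotIn.
		return true
	}
	for _, v := range values {
		if v == got {
			return false
		}
	}
	return true
}

func main() {
	const key = "instance.hetzner.cloud/provided-by"

	fmt.Println(notIn(map[string]string{key: "cloud"}, key, "robot")) // prints: true
	fmt.Println(notIn(map[string]string{key: "robot"}, key, "robot")) // prints: false
	fmt.Println(notIn(map[string]string{}, key, "robot"))             // prints: true
}
```

Under this model, unlabeled nodes keep receiving the DaemonSet pod, so only nodes explicitly labeled `robot` are skipped.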
4 changes: 4 additions & 0 deletions deploy/kubernetes/hcloud-csi.yml

Generated file; diff not rendered by default.

39 changes: 37 additions & 2 deletions docs/kubernetes/README.md
@@ -106,7 +106,7 @@ metadata:
stringData:
encryption-passphrase: foobar
---
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
@@ -209,8 +209,43 @@ $ kubectl apply -f https://raw.githubusercontent.com/hetznercloud/csi-driver/v2.

## Integration with Root Servers

Root servers can be part of the cluster, but the CSI plugin doesn't work there. Taint the root server as follows to skip that node for the DaemonSet.
Root servers can be part of the cluster, but the CSI plugin doesn't work there and the current behaviour of the scheduler can cause Pods to be stuck in `Pending`.

In the Helm chart, you can set `allowedTopologyCloudServer` to `true` to prevent pods from being scheduled on nodes where Hetzner volumes are unavailable, specifically Robot servers. This value cannot be changed after the storage class has been created.

```yaml
storageClasses:
  - name: hcloud-volumes
    defaultStorageClass: true
    reclaimPolicy: Delete
    allowedTopologyCloudServer: true # <---
```
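
To illustrate what the resulting `allowedTopologies` restriction means, here is a minimal Go sketch of how a single topology term could be evaluated against a node's topology segments. It is a simplified model rather than the code Kubernetes or the provisioner runs; the type `topologyTerm` and its method `allows` are hypothetical names.

```go
package main

import "fmt"

// topologyTerm is a hypothetical, simplified stand-in for one
// matchLabelExpressions entry of a StorageClass allowedTopologies rule.
type topologyTerm struct {
	Key    string
	Values []string
}

// allows reports whether the segments satisfy the term: the key must be
// present and its value must be one of the listed values.
func (t topologyTerm) allows(segments map[string]string) bool {
	got, ok := segments[t.Key]
	if !ok {
		return false
	}
	for _, v := range t.Values {
		if v == got {
			return true
		}
	}
	return false
}

func main() {
	const key = "instance.hetzner.cloud/provided-by"
	term := topologyTerm{Key: key, Values: []string{"cloud"}}

	cloud := map[string]string{key: "cloud"}
	robot := map[string]string{key: "robot"}
	unlabeled := map[string]string{}

	// prints: true false false
	fmt.Println(term.allows(cloud), term.allows(robot), term.allows(unlabeled))
}
```

In this model a node without the key is rejected, which is why the label has to be present on cloud nodes for the restriction to work.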

To ensure proper topology evaluation, labels are needed to indicate whether a node is a cloud VM or a dedicated server from Robot. If you are using the `hcloud-cloud-controller-manager` version 1.21.0 or later, these labels are added automatically. Otherwise, you will need to label the nodes manually.

### Adding labels manually

**Cloud Servers**
```bash
kubectl label nodes <node name> instance.hetzner.cloud/provided-by=cloud
```

**Root Servers**
```bash
kubectl label nodes <node name> instance.hetzner.cloud/provided-by=robot
```


### DEPRECATED: Old Label

We prefer that you use our [new label](#adding-labels-manually). The label `instance.hetzner.cloud/is-root-server` will be deprecated in future releases.

**Cloud Servers**
```bash
kubectl label nodes <node name> instance.hetzner.cloud/is-root-server=false
```

**Root Servers**
```bash
kubectl label nodes <node name> instance.hetzner.cloud/is-root-server=true
```
1 change: 1 addition & 0 deletions internal/driver/driver.go
@@ -9,4 +9,5 @@ const (
DefaultVolumeSize = MinVolumeSize

TopologySegmentLocation = PluginName + "/location"
ProvidedByLabel = "instance.hetzner.cloud/provided-by"
)
1 change: 1 addition & 0 deletions internal/driver/node.go
@@ -178,6 +178,7 @@ func (s *NodeService) NodeGetInfo(_ context.Context, _ *proto.NodeGetInfoRequest
AccessibleTopology: &proto.Topology{
Segments: map[string]string{
TopologySegmentLocation: s.serverLocation,
ProvidedByLabel: "cloud",
},
},
}
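
As a standalone illustration of the change above, the following sketch builds the same two topology segments outside the driver. The value of `pluginName` is an assumption taken from the provisioner name in the StorageClass template, `nodeSegments` is a hypothetical helper, and `fsn1` is only an example location.

```go
package main

import "fmt"

const (
	// pluginName is assumed to match the provisioner name `csi.hetzner.cloud`.
	pluginName              = "csi.hetzner.cloud"
	topologySegmentLocation = pluginName + "/location"
	providedByLabel         = "instance.hetzner.cloud/provided-by"
)

// nodeSegments is a hypothetical helper that returns the segments a cloud
// node would report: its location plus the fixed provided-by value "cloud".
func nodeSegments(location string) map[string]string {
	return map[string]string{
		topologySegmentLocation: location,
		providedByLabel:         "cloud",
	}
}

func main() {
	// Map iteration order is not deterministic, so the line order may vary.
	for k, v := range nodeSegments("fsn1") {
		fmt.Printf("%s=%s\n", k, v)
	}
}
```

Because the driver only runs on cloud servers, hard-coding `"cloud"` here means Robot nodes never advertise a matching segment.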
