Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distributed provisioning: optimize PV informer #712

Closed
pohly opened this issue Mar 9, 2022 · 10 comments
Closed

distributed provisioning: optimize PV informer #712

pohly opened this issue Mar 9, 2022 · 10 comments
Assignees

Comments

@pohly
Copy link
Contributor

pohly commented Mar 9, 2022

While running clusterloader2 to evaluate performance of volume provisioning with storage capacity tracking (see https://github.com/kubernetes-csi/csi-driver-host-path/blob/master/docs/storage-capacity-tracking.md) I noticed that volume deletion was slow - much slower than creation of volumes. This occurred in particular for high node counts (100 in that case).

The initial suspect was overhead for the PV watches (every PV sent to every external-provisioner instance), but further experimentation didn't confirm that. Instead, it turned out that Pod deletion was slow.

Nonetheless, enhancing the PV watch seems doable:

  • filter PV in the apiserver (ideal solution, because filtering in the apiserver is cheaper than sending out updates to clients which don't need them)
  • filter out foreign PVs before adding them to the workqueue (tried, didn't help much)

/assign

@pohly
Copy link
Contributor Author

pohly commented Mar 9, 2022

Example: 10 nodes on Azure cloud, csi-driver-host-path master, with --v=3 and WaitForFirstConsumer disabled in deploy/kubernetes-distributed/hostpath/csi-hostpath-storageclass-fast.yaml.

go run cmd/clusterloader.go -v=2 --report-dir=/tmp/clusterloader2-with-volume-10-nodes-capacity --kubeconfig=/home/pohly/.kube/config --provider=local --nodes=10 --testconfig=testing/experimental/storage/pod-startup/config.yaml --testoverrides=testing/experimental/storage/pod-startup/volume-types/persistentvolume/override.yaml --testoverrides=with-volume.yaml

with-volume.yaml:

PODS_PER_NODE: 20
NODES_PER_NAMESPACE: 10
VOLUMES_PER_POD: 1
VOL_SIZE: 5Gi
STORAGE_CLASS: csi-hostpath-fast
GATHER_METRICS: false
POD_STARTUP_TIMEOUT: 45m

Result:

<?xml version="1.0" encoding="UTF-8"?>
  <testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="240.468">
      <testcase name="storage overall (testing/experimental/storage/pod-startup/config.yaml)" classname="ClusterLoaderV2" time="240.463977815"></testcase>
      <testcase name="storage: [step: 01] Provisioning volumes" classname="ClusterLoaderV2" time="20.126110888"></testcase>
      <testcase name="storage: [step: 02] Waiting for PVs to be bound" classname="ClusterLoaderV2" time="90.064074266"></testcase>
      <testcase name="storage: [step: 03] Starting measurement for waiting for deployments" classname="ClusterLoaderV2" time="0.200939786"></testcase>
      <testcase name="storage: [step: 04] Creating deployments" classname="ClusterLoaderV2" time="20.122103454"></testcase>
      <testcase name="storage: [step: 05] Waiting for deployments to be running" classname="ClusterLoaderV2" time="34.388961952"></testcase>
      <testcase name="storage: [step: 06] Deleting deployments" classname="ClusterLoaderV2" time="20.12327411"></testcase>
      <testcase name="storage: [step: 07] Deleting volumes" classname="ClusterLoaderV2" time="20.125988425"></testcase>
      <testcase name="storage: [step: 08] Waiting for volume deletion" classname="ClusterLoaderV2" time="10.103055025"></testcase>

@pohly
Copy link
Contributor Author

pohly commented Mar 9, 2022

Logs show:

I0309 08:54:27.365969       1 controller.go:1477] delete "pvc-59d26170-b93c-417b-ae65-4d03c0f495ac": volume deletion ignored: ignored because PV was not provisioned on this node

This is similar to issue #669, but not quite the same,

@pohly
Copy link
Contributor Author

pohly commented Mar 9, 2022

Server-side filtering is not possible at the moment. Here's what a PV looks like:

- apiVersion: v1
  kind: PersistentVolume
  metadata:
    annotations:
      pv.kubernetes.io/provisioned-by: hostpath.csi.k8s.io
    creationTimestamp: "2022-03-09T08:23:03Z"
    finalizers:
    - kubernetes.io/pv-protection
    managedFields:
      ...
    name: pvc-02b4a075-80b9-47c8-b646-e2ca6ac5386c
    resourceVersion: "14239"
    uid: 5f8572c7-19c7-4de1-a39b-1c8a05feabab
  spec:
    accessModes:
    - ReadWriteOnce
    capacity:
      storage: 5Gi
    claimRef:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: vol-138
      namespace: test-9j4yhq-1
      resourceVersion: "12755"
      uid: 02b4a075-80b9-47c8-b646-e2ca6ac5386c
    csi:
      driver: hostpath.csi.k8s.io
      volumeAttributes:
        kind: fast
        storage.kubernetes.io/csiProvisionerIdentity: 1646814048800-8081-hostpath.csi.k8s.io-aks-nodepool1-35404716-vmss000001
      volumeHandle: 233a60c8-9f82-11ec-8b2c-129d02cf3a9c
    nodeAffinity:
      required:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.hostpath.csi/node
            operator: In
            values:
            - aks-nodepool1-35404716-vmss000001
    persistentVolumeReclaimPolicy: Delete
    storageClassName: csi-hostpath-fast
    volumeMode: Filesystem
  status:
    phase: Bound

We would need to filter by nodeAffinity, but that is not supported by https://github.com/kubernetes/kubernetes/blob/9946b5364e8199ac832161e17155bd005c391fc5/pkg/registry/core/persistentvolume/strategy.go#L162-L179

external-provisioner could set a label. As there may be existing PVs which don't have it yet, we would have to extend external-provisioner (or some other component) so that it can be deployed in the control plane without a CSI driver and then adds the label to those PVs which don't have it. May also be necessary when users (accidentally?) remove the label. This feels complicated, perhaps too complicated to be worth it.

@pohly
Copy link
Contributor Author

pohly commented Mar 9, 2022

Implementing the pre-workqueue filtering is possible. For 10 nodes it doesn't make a big difference:

<?xml version="1.0" encoding="UTF-8"?>
  <testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="237.878">
      <testcase name="storage overall (testing/experimental/storage/pod-startup/config.yaml)" classname="ClusterLoaderV2" time="237.874646435"></testcase>
      <testcase name="storage: [step: 01] Provisioning volumes" classname="ClusterLoaderV2" time="20.122253052"></testcase>
      <testcase name="storage: [step: 02] Waiting for PVs to be bound" classname="ClusterLoaderV2" time="85.067009928"></testcase>
      <testcase name="storage: [step: 03] Starting measurement for waiting for deployments" classname="ClusterLoaderV2" time="0.100423552"></testcase>
      <testcase name="storage: [step: 04] Creating deployments" classname="ClusterLoaderV2" time="20.127212681"></testcase>
      <testcase name="storage: [step: 05] Waiting for deployments to be running" classname="ClusterLoaderV2" time="36.891739215"></testcase>
      <testcase name="storage: [step: 06] Deleting deployments" classname="ClusterLoaderV2" time="20.119977666"></testcase>
      <testcase name="storage: [step: 07] Deleting volumes" classname="ClusterLoaderV2" time="20.116018531"></testcase>
      <testcase name="storage: [step: 08] Waiting for volume deletion" classname="ClusterLoaderV2" time="10.1019071"></testcase>

@pohly
Copy link
Contributor Author

pohly commented Mar 9, 2022

No big difference for 10 volumes per node and 100 nodes either:

$ cat /tmp/clusterloader2-with-volume-100-nodes-capacity/junit.xml
<?xml version="1.0" encoding="UTF-8"?>
  <testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="662.946">
      <testcase name="storage overall (testing/experimental/storage/pod-startup/config.yaml)" classname="ClusterLoaderV2" time="662.941836976"></testcase>
      <testcase name="storage: [step: 01] Provisioning volumes" classname="ClusterLoaderV2" time="100.612544663"></testcase>
      <testcase name="storage: [step: 02] Waiting for PVs to be bound" classname="ClusterLoaderV2" time="50.166796847"></testcase>
      <testcase name="storage: [step: 03] Starting measurement for waiting for deployments" classname="ClusterLoaderV2" time="0.101135165"></testcase>
      <testcase name="storage: [step: 04] Creating deployments" classname="ClusterLoaderV2" time="100.613237416"></testcase>
      <testcase name="storage: [step: 05] Waiting for deployments to be running" classname="ClusterLoaderV2" time="119.493390436"></testcase>
      <testcase name="storage: [step: 06] Deleting deployments" classname="ClusterLoaderV2" time="100.638563713"></testcase>
      <testcase name="storage: [step: 07] Deleting volumes" classname="ClusterLoaderV2" time="100.610427034"></testcase>
      <testcase name="storage: [step: 08] Waiting for volume deletion" classname="ClusterLoaderV2" time="75.117069732"></testcase>
  </testsuite>

$ cat /tmp/clusterloader2-with-volume-100-nodes-capacity-filtering/junit.xml
<?xml version="1.0" encoding="UTF-8"?>
  <testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="663.363">
      <testcase name="storage overall (testing/experimental/storage/pod-startup/config.yaml)" classname="ClusterLoaderV2" time="663.358644866"></testcase>
      <testcase name="storage: [step: 01] Provisioning volumes" classname="ClusterLoaderV2" time="100.611435332"></testcase>
      <testcase name="storage: [step: 02] Waiting for PVs to be bound" classname="ClusterLoaderV2" time="50.167335721"></testcase>
      <testcase name="storage: [step: 03] Starting measurement for waiting for deployments" classname="ClusterLoaderV2" time="0.101097687"></testcase>
      <testcase name="storage: [step: 04] Creating deployments" classname="ClusterLoaderV2" time="100.593417364"></testcase>
      <testcase name="storage: [step: 05] Waiting for deployments to be running" classname="ClusterLoaderV2" time="115.017417672"></testcase>
      <testcase name="storage: [step: 06] Deleting deployments" classname="ClusterLoaderV2" time="100.602359492"></testcase>
      <testcase name="storage: [step: 07] Deleting volumes" classname="ClusterLoaderV2" time="100.621163905"></testcase>
      <testcase name="storage: [step: 08] Waiting for volume deletion" classname="ClusterLoaderV2" time="75.117128046"></testcase>

@pohly
Copy link
Contributor Author

pohly commented Mar 9, 2022

20 volumes per node, no big difference either:

$ cat /tmp/clusterloader2-with-volume-100-nodes-capacity/junit.xml
<?xml version="1.0" encoding="UTF-8"?>
  <testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="1343.606">
      <testcase name="storage overall (testing/experimental/storage/pod-startup/config.yaml)" classname="ClusterLoaderV2" time="1343.602389857"></testcase>
      <testcase name="storage: [step: 01] Starting measurement for waiting for deployments" classname="ClusterLoaderV2" time="0.1010827"></testcase>
      <testcase name="storage: [step: 02] Creating deployments" classname="ClusterLoaderV2" time="201.216153546"></testcase>
      <testcase name="storage: [step: 03] Waiting for deployments to be running" classname="ClusterLoaderV2" time="234.918855189"></testcase>
      <testcase name="storage: [step: 04] Deleting deployments" classname="ClusterLoaderV2" time="201.203148744"></testcase>
      <testcase name="storage: [step: 05] Waiting for volume deletion" classname="ClusterLoaderV2" time="680.596455696"></testcase>
  </testsuite>
  
$ cat /tmp/clusterloader2-with-volume-100-nodes-capacity-filtering/junit.xml
<?xml version="1.0" encoding="UTF-8"?>
  <testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="1353.62">
      <testcase name="storage overall (testing/experimental/storage/pod-startup/config.yaml)" classname="ClusterLoaderV2" time="1353.616891482"></testcase>
      <testcase name="storage: [step: 01] Starting measurement for waiting for deployments" classname="ClusterLoaderV2" time="0.101016257"></testcase>
      <testcase name="storage: [step: 02] Creating deployments" classname="ClusterLoaderV2" time="201.222271449"></testcase>
      <testcase name="storage: [step: 03] Waiting for deployments to be running" classname="ClusterLoaderV2" time="234.712012496"></testcase>
      <testcase name="storage: [step: 04] Deleting deployments" classname="ClusterLoaderV2" time="201.198080417"></testcase>
      <testcase name="storage: [step: 05] Waiting for volume deletion" classname="ClusterLoaderV2" time="680.621806619"></testcase>

@pohly
Copy link
Contributor Author

pohly commented Mar 9, 2022

Perhaps the problem is elsewhere: while testing with 50 pods per node, the test got all of them running and then deleted all deployments that own these pods, but the pods kept running. Not sure whether that is supposed to happen.

@pohly
Copy link
Contributor Author

pohly commented Mar 9, 2022

It looks like deleting Pods was slow. The Deployments and Replicasets were gone, but Pods lingered for an extended period of time.

@pohly pohly changed the title distributed provisioning: slow volume deletion distributed provisioning: optimize PV informer Mar 11, 2022
@pohly
Copy link
Contributor Author

pohly commented Apr 8, 2022

There's no clear need to optimize this. Let's close for now.

/close

@k8s-ci-robot
Copy link
Contributor

@pohly: Closing this issue.

In response to this:

There's no clear need to optimize this. Let's close for now.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants