
azuredisk-node-win fails to mount disk: requested access path is already in use #2690

Closed
ps610 opened this issue Dec 4, 2024 · 10 comments · Fixed by #2691 or #2699

ps610 commented Dec 4, 2024

What happened:
We have a cluster with several Windows nodes (2022), on which Windows pods are executed depending on the demand of our users. The Windows (application) pods are Microsoft Business Central Containers with at least one volume (PVC) containing the application's database. (see base/example helm chart)

In times of high demand, when many pods start in parallel, it sporadically happens that a pod's volume cannot be mounted; the pod then cannot start and remains in the “ContainerCreating” state. As a workaround, the stuck pod can be deleted manually, and mounting then works for the pod that is automatically recreated (based on the deployment).

The error first appeared under Kubernetes version 1.28.5. Yesterday we upgraded via 1.29.9 to 1.30.5 in the hope that this would fix the problem, but unfortunately it seems to occur even more frequently now: in the past roughly 2% of starting pods were affected, while this morning it was almost 10%.

Error from csi-azuredisk-node-win log:

I1204 07:21:16.053429    7932 nodeserver.go:157] NodeStageVolume: formatting 7 and mounting at \var\lib\kubelet\plugins\kubernetes.io\csi\disk.csi.azure.com\3b03e9b721efa805aa50589f1531a282237faef0f18d6d7d05f21d77c63faf9d\globalmount with mount options([])
I1204 07:21:20.790333    7932 disk.go:363] Disk 7 already initialized
I1204 07:21:22.120014    7932 disk.go:380] Disk 7 already partitioned
E1204 07:21:26.970044    7932 utils.go:110] GRPC error: rpc error: code = Internal desc = could not format 7(lun: 6), and mount it at \var\lib\kubelet\plugins\kubernetes.io\csi\disk.csi.azure.com\3b03e9b721efa805aa50589f1531a282237faef0f18d6d7d05f21d77c63faf9d\globalmount, failed with error mount volume to path. cmd: Get-Volume -UniqueId "$Env:volumeID" | Get-Partition | Add-PartitionAccessPath -AccessPath $Env:path, output: Add-PartitionAccessPath : The requested access path is already in use.
Activity ID: {4a3c6ee8-5e15-4807-9501-645caf6e96ef}
At line:1 char:56
+ ... meID" | Get-Partition | Add-PartitionAccessPath -AccessPath $Env:path
+                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidArgument: (StorageWMI:ROOT/Microsoft/.../MSFT_Partition) [Add-PartitionAccessPath 
   ], CimException
    + FullyQualifiedErrorId : StorageWMI 42002,Add-PartitionAccessPath
 
, error: exit status 1
I1204 07:25:42.746614    7932 utils.go:105] GRPC call: /csi.v1.Node/NodeStageVolume
I1204 07:25:42.746669    7932 utils.go:106] GRPC request: {"publish_context":{"LUN":"6"},"staging_target_path":"\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\disk.csi.azure.com\\6361b07d0c9f08393f72efed6acbfd87e35c6af82b85ee6fa4a7433d030eb191\\globalmount","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":7}},"volume_context":{"csi.storage.k8s.io/pv/name":"pvc-4d1158ff-928f-4bc4-92ec-e8fe6c153a35","csi.storage.k8s.io/pvc/name":"f053192c13d0-business-central-db","csi.storage.k8s.io/pvc/namespace":"cust-demue-gmbh","requestedsizegib":"15","skuname":"Premium_ZRS","storage.kubernetes.io/csiProvisionerIdentity":"1731396534581-6144-disk.csi.azure.com"},"volume_id":"/subscriptions/4abe427c-4d7a-47be-be27-6631e9a2d5ad/resourceGroups/mc_cosmo-alpaca-aks_cluster_westeurope/providers/Microsoft.Compute/disks/pvc-4d1158ff-928f-4bc4-92ec-e8fe6c153a35"}

What you expected to happen:
Mounting the volume always works and the pods are able to start.

How to reproduce it:
We can't provide a reproduction scenario, as it happens very sporadically and only in situations with high load.

Anything else we need to know?:
We're using autoscaling for our node pools. The application pods may automatically be removed at the end of the day (depending on the users' needs), and only the volume (with the database) is kept. The next day, a new pod can/will be created with the existing volume attached, which means that for the user it is the "same" environment.

Environment:

  • CSI Driver version: v1.30.5-windows-hp
  • Kubernetes version (use kubectl version): v1.30.5
  • OS (e.g. from /etc/os-release): Windows Server 2022
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@andyzhangx
Member

@ps610 What is your Windows VM SKU? Is it a Hyper-V Gen2 VM?

@ps610
Author

ps610 commented Dec 4, 2024

@ps610 What is your Windows VM SKU? Is it a Hyper-V Gen2 VM?

Hi @andyzhangx,

We are running on Standard_E8ds_v5 VMs.

@andyzhangx
Member

andyzhangx commented Dec 4, 2024

Is the same disk volume mounted and unmounted on the node frequently?

You could run kubectl exec -it -n kube-system csi-azuredisk-node-win-xxx -c azuredisk -- cmd and then check whether the requested access path is already in use by running the following commands in PowerShell:

(Get-Disk -Number 2 | Get-Partition | Get-Volume).UniqueId
\\?\Volume{c00607ef-8189-4e45-8e78-7b97c3d2d158}\

Get-Volume -UniqueId "\\?\Volume{c00607ef-8189-4e45-8e78-7b97c3d2d158}\" | Get-Partition
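The check above can be folded into an idempotent guard on the driver side: before calling Add-PartitionAccessPath, compare the partition's existing access paths against the requested staging path and skip the add when it is already present. A minimal Go sketch of that comparison logic (the helper name and path handling are illustrative assumptions, not the azuredisk driver's actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// needsAccessPath reports whether Add-PartitionAccessPath still has to be
// called: it returns false when target already appears among the partition's
// access paths, which is exactly the case that otherwise produces
// "The requested access path is already in use".
// Paths are compared case-insensitively and without trailing backslashes,
// since Windows paths are case-insensitive.
func needsAccessPath(existing []string, target string) bool {
	norm := func(p string) string {
		return strings.ToLower(strings.TrimRight(p, `\`))
	}
	for _, p := range existing {
		if norm(p) == norm(target) {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical access paths as Get-Partition would report them.
	existing := []string{
		`\\?\Volume{c00607ef-8189-4e45-8e78-7b97c3d2d158}\`,
		`\var\lib\kubelet\plugins\kubernetes.io\csi\disk.csi.azure.com\abc\globalmount\`,
	}
	// Already mounted at the staging path: the add must be skipped.
	fmt.Println(needsAccessPath(existing, `\var\lib\kubelet\plugins\kubernetes.io\csi\disk.csi.azure.com\abc\globalmount`))
	// A different staging path still needs the access path added.
	fmt.Println(needsAccessPath(existing, `\var\lib\kubelet\plugins\kubernetes.io\csi\disk.csi.azure.com\other\globalmount`))
}
```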

@andyzhangx
Member

andyzhangx commented Dec 4, 2024

Never mind, this PR should fix the issue: #2691. This is the testing image that contains the fix: mcr.microsoft.com/k8s/csi/azuredisk-csi:v1.32.0-windows-hp

@andyzhangx
Member

The root cause is that the first disk format process can take more than 2 minutes, so the call times out and another mount process is started; that second attempt then hits this error. I could sometimes reproduce it in the e2e tests.

I1205 08:54:47.781411    3328 utils.go:77] GRPC call: /csi.v1.Node/NodeStageVolume
I1205 08:54:47.781949    3328 utils.go:78] GRPC request: {"publish_context":{"LUN":"0"},"staging_target_path":"\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\disk.csi.azure.com\\09d7f61f6574d352182e825eb461d54e98aac258ad9005f617b2edef1bd57db2\\globalmount","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":7}},"volume_context":{"csi.storage.k8s.io/pv/name":"pvc-3d12ae9e-6da8-433b-8db5-aaf16147b078","csi.storage.k8s.io/pvc/name":"pvc-qgwvr","csi.storage.k8s.io/pvc/namespace":"azuredisk-655","requestedsizegib":"10","skuName":"StandardSSD_LRS","storage.kubernetes.io/csiProvisionerIdentity":"1733386384086-4819-disk.csi.azure.com"},"volume_id":"/subscriptions/46678f10-4bbb-447e-98e8-d2829589f2d8/resourceGroups/capz-63fp7f/providers/Microsoft.Compute/disks/pvc-3d12ae9e-6da8-433b-8db5-aaf16147b078"}

I1205 08:55:14.122188    3328 nodeserver.go:157] NodeStageVolume: formatting 2 and mounting at \var\lib\kubelet\plugins\kubernetes.io\csi\disk.csi.azure.com\09d7f61f6574d352182e825eb461d54e98aac258ad9005f617b2edef1bd57db2\globalmount with mount options([])

I1205 08:55:34.308557    3328 disk.go:356] Initializing disk 2
I1205 08:55:34.323338    3328 azure_disk_utils.go:863] Executing command: "C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command Get-Disk -Number 4 | Where partitionstyle -eq 'raw'"


I1205 08:55:55.165024    3328 disk.go:373] Creating basic partition on disk 2
I1205 08:55:55.173003    3328 azure_disk_utils.go:863] Executing command: "C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command Get-Partition | Where DiskNumber -eq 4 | Where Type -ne Reserved"

I1205 08:56:47.790793    3328 nodeserver.go:161] NodeStageVolume: format 2 and mounting at \var\lib\kubelet\plugins\kubernetes.io\csi\disk.csi.azure.com\09d7f61f6574d352182e825eb461d54e98aac258ad9005f617b2edef1bd57db2\globalmount successfully.


I1205 08:56:53.665838    3328 nodeserver.go:157] NodeStageVolume: formatting 2 and mounting at \var\lib\kubelet\plugins\kubernetes.io\csi\disk.csi.azure.com\09d7f61f6574d352182e825eb461d54e98aac258ad9005f617b2edef1bd57db2\globalmount with mount options([])


I1205 08:57:11.965688    3328 utils.go:77] GRPC call: /csi.v1.Node/NodeStageVolume
I1205 08:57:11.965688    3328 utils.go:78] GRPC request: {"publish_context":{"LUN":"0"},"staging_target_path":"\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\disk.csi.azure.com\\09d7f61f6574d352182e825eb461d54e98aac258ad9005f617b2edef1bd57db2\\globalmount","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":7}},"volume_context":{"csi.storage.k8s.io/pv/name":"pvc-3d12ae9e-6da8-433b-8db5-aaf16147b078","csi.storage.k8s.io/pvc/name":"pvc-qgwvr","csi.storage.k8s.io/pvc/namespace":"azuredisk-655","requestedsizegib":"10","skuName":"StandardSSD_LRS","storage.kubernetes.io/csiProvisionerIdentity":"1733386384086-4819-disk.csi.azure.com"},"volume_id":"/subscriptions/46678f10-4bbb-447e-98e8-d2829589f2d8/resourceGroups/capz-63fp7f/providers/Microsoft.Compute/disks/pvc-3d12ae9e-6da8-433b-8db5-aaf16147b078"}

E1205 08:57:10.860785    3328 utils.go:82] GRPC error: rpc error: code = Internal desc = could not format 2(lun: 0), and mount it at \var\lib\kubelet\plugins\kubernetes.io\csi\disk.csi.azure.com\09d7f61f6574d352182e825eb461d54e98aac258ad9005f617b2edef1bd57db2\globalmount, failed with error mount volume to path. cmd: Get-Volume -UniqueId "$Env:volumeID" | Get-Partition | Add-PartitionAccessPath -AccessPath $Env:path, output: Add-PartitionAccessPath : The requested access path is already in use.
Activity ID: {2c1daab0-451c-48eb-8168-03bdd0b97b01}
At line:1 char:56
+ ... meID" | Get-Partition | Add-PartitionAccessPath -AccessPath $Env:path
+                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidArgument: (StorageWMI:ROOT/Microsoft/.../MSFT_Partition) [Add-PartitionAccessPath 
   ], CimException
    + FullyQualifiedErrorId : StorageWMI 42002,Add-PartitionAccessPath
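The timeline above shows the race: the retried NodeStageVolume arrives while the first, slow format/mount is still running. A common way CSI node drivers guard against this is an in-flight lock keyed by volume ID, so a concurrent call for the same volume is rejected (typically with gRPC Aborted) instead of racing Add-PartitionAccessPath. A minimal sketch of that pattern, with illustrative names that are not the azuredisk driver's actual implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// volumeLocks serializes concurrent NodeStageVolume calls per volume ID:
// a retry that arrives while the first staging operation is still running
// is turned away instead of racing the format/mount in progress.
type volumeLocks struct {
	mu       sync.Mutex
	inFlight map[string]bool
}

func newVolumeLocks() *volumeLocks {
	return &volumeLocks{inFlight: make(map[string]bool)}
}

// TryAcquire returns false if an operation on volumeID is already in flight.
func (v *volumeLocks) TryAcquire(volumeID string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.inFlight[volumeID] {
		return false
	}
	v.inFlight[volumeID] = true
	return true
}

// Release marks the operation on volumeID as finished.
func (v *volumeLocks) Release(volumeID string) {
	v.mu.Lock()
	defer v.mu.Unlock()
	delete(v.inFlight, volumeID)
}

func main() {
	locks := newVolumeLocks()
	fmt.Println(locks.TryAcquire("pvc-1")) // true: first staging proceeds
	fmt.Println(locks.TryAcquire("pvc-1")) // false: retry rejected while in flight
	locks.Release("pvc-1")
	fmt.Println(locks.TryAcquire("pvc-1")) // true: allowed again after release
}
```

With this guard, the kubelet's retry simply fails fast and tries again later, by which time the first operation has either completed (and the staging call can succeed idempotently) or been released.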

@andyzhangx andyzhangx reopened this Dec 5, 2024
@lippertmarkus

Is that fixed by your PRs? Just asking because you reopened this issue here.

@andyzhangx
Member

Is that fixed by your PRs? Just asking because you reopened this issue here.

@lippertmarkus Not yet, but I have found how to fix it, stay tuned.

@sixeyed

sixeyed commented Dec 5, 2024

Adding that we have the same issue with Windows 2019 nodes on AKS 1.29. The pods are created from a KEDA ScaledJob and use an Azure Disk ephemeral volume. We are not re-using paths, but the node pool is using the deallocate scale-down mode.
@andyzhangx - happy to test a patched image, we can reproduce this easily.

@ps610
Author

ps610 commented Dec 9, 2024

Thank you, @andyzhangx.
We use managed AKS (currently on 1.30.5); to which version do we have to upgrade to receive your fix?

@andyzhangx
Member

Thank you, @andyzhangx. We use managed AKS (currently on 1.30.5); to which version do we have to upgrade to receive your fix?

@ps610 I will publish a new CSI driver version this week. Please email me your AKS cluster FQDN, and I will upgrade the CSI driver on your Windows nodes directly after the new version is released.
