This is the edge case that causes zk clients contending for a lock to enter a stalled state. It occurs under a fairly complex sequence of events in which a network partition plays the key role. Imagine two zk clients contending for a lock. One client requests the lock and the library successfully creates a znode for that request under the parent path; however, the response is lost because a network partition happens at exactly that point.
The client library then sees a connection-closed exception (the socket is closed) and is left with no trace of the lock. This happens while the library is waiting for Zookeeper's response listing the current children under the path; that response tells it which znodes (sequence numbers) are queued, and the request with the lowest sequence number acquires the lock. Because of the connection-closed error, the library returns abruptly without populating lock_path in the lock object. On the Zookeeper server, however, the lock is effectively held: the new znode was created successfully in the queue. So at this point a lock exists, but there is no trace of it on the client side, and that is the cause of the deadlock.
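To make the sequence concrete, here is a simplified sketch of the order of operations described above (this is not the library's actual source; the helper name tryLock and the package name are made up, but the zk calls used are from the samuel/go-zookeeper package): the sequential znode is created first, and the lock path would only be recorded after the Children round-trip completes.

```go
package zklockrepro

import (
	"fmt"

	"github.com/samuel/go-zookeeper/zk"
)

// tryLock sketches the sequence described above. If the connection drops
// between the create and the Children call, the function returns an error
// and the caller never learns the path of the znode that was created.
func tryLock(conn *zk.Conn, parent string, acl []zk.ACL) (lockPath string, err error) {
	// Step 1: create an ephemeral-sequential znode under the parent path.
	// The create can succeed on the server even though the reply is lost
	// in the network partition.
	prefix := fmt.Sprintf("%s/lock-", parent)
	created, err := conn.CreateProtectedEphemeralSequential(prefix, []byte{}, acl)
	if err != nil {
		return "", err
	}

	// Step 2: list the children to find our sequence number in the queue;
	// the lowest sequence number holds the lock. If the connection closes
	// here (zk.ErrConnectionClosed), the error is returned, `created` is
	// discarded, and lock_path in the lock object is never populated --
	// yet the znode still exists on the server.
	children, _, err := conn.Children(parent)
	if err != nil {
		return "", err
	}
	_ = children // (sequence comparison and watch on the predecessor omitted)

	return created, nil
}
```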
Subsequently, both clients keep trying to acquire the lock and end up deadlocked: from the Zookeeper cluster's perspective a lock exists, but neither client is aware of it. There is no explicit unlock and no session timeout, so the lock stays active. In the Go library the acquisition is a blocking wait on a channel, which is what leaves both clients stuck.
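Because the library blocks on that channel, the hang surfaces to callers as a Lock() call that never returns. Below is a minimal caller-side workaround sketch, assuming the samuel/go-zookeeper zk package; lockWithTimeout is our own wrapper, not part of the library, and it only unblocks the caller (the stray znode on the server still has to be cleaned up, e.g. by closing the session so the ephemeral node is removed).

```go
package zklockrepro

import (
	"errors"
	"time"

	"github.com/samuel/go-zookeeper/zk"
)

var errLockTimeout = errors.New("timed out waiting for zookeeper lock")

// lockWithTimeout runs the blocking Lock() call in a goroutine and gives up
// after the supplied duration, so a stalled acquisition does not hang the
// caller forever. Typical use:
//
//	l := zk.NewLock(conn, "/my-lock", zk.WorldACL(zk.PermAll))
//	err := lockWithTimeout(l, 30*time.Second)
func lockWithTimeout(l *zk.Lock, d time.Duration) error {
	done := make(chan error, 1)
	go func() { done <- l.Lock() }()

	select {
	case err := <-done:
		return err
	case <-time.After(d):
		// The goroutine above is still blocked inside Lock(); this only
		// frees the caller. The znode on the server remains until the
		// session that created it ends.
		return errLockTimeout
	}
}
```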