Skip to content
This repository has been archived by the owner on Oct 23, 2024. It is now read-only.

conn: fix racy access to result fields when connection is closing #4

Merged
merged 1 commit into from
Aug 22, 2018

Conversation

jdef
Copy link

@jdef jdef commented Aug 15, 2018

With the introduction of #3 there's now a potential race, at connection close time, between the "send handler" that mutates response objects, and the return statements of top-level API calls that read response fields. This PR checks for the race and returns a "connection closing" error in such cases.

@jdef jdef requested a review from gpaul August 15, 2018 23:25
@mhrabovcin
Copy link

As suggested online @jdef . Is there a possibility to put this single select statement into request method?

I can see the need to add special error handler here https://github.com/mesosphere/go-zookeeper/pull/4/files#diff-5f455e42d3680fdc56c7076dc0c182f3R1332 but the rest of the code just passes error to the caller.

@jdef
Copy link
Author

jdef commented Aug 21, 2018

@mhrabovcin thanks for the review. I double-checked the changes this morning. All of the funcs I changed actually DO need to be changed, somehow, in order to avoid the data race. This is because statements like return foo, bar, &res.Stat, err can blow up because of the &res.Stat access that might race with an access of the same field by the async request/response processing machinery. Such access happens because in queueRequest, there's the chance that a request will be queued and processed, even if shouldQuit has closed.

As an alternative to the repeated select blocks in each of the API funcs, I could (a) modify request as you've suggested, and ALSO then; (b) replace the newly added select blocks with an if err == ErrConnectionClosed check that performs an early return to avoid accessing res field members. Would you prefer this approach instead?

@mhrabovcin
Copy link

@jdef thanks for the detailed explanation. I don't have a strong preference here. From the code readability perspective I think it would be nice to handle this case with the single select statement and a comment that would explain why is this necessary. All the callers can then handle ErrConnectionClosed depending on data use.

I am happy to approve the code as it is!

@jdef
Copy link
Author

jdef commented Aug 22, 2018

@mhrabovcin PTAL

Copy link

@mhrabovcin mhrabovcin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@jdef
Copy link
Author

jdef commented Aug 22, 2018

squashed prior to merge

@jdef
Copy link
Author

jdef commented Aug 22, 2018

possibly racy test?

--- FAIL: TestDNSHostProviderReconnect (2.49s)
	cluster_test.go:15: [ZKERR] Picked up _JAVA_OPTIONS: -Xmx2048m -Xms512m
	cluster_test.go:15: [ZKERR] Picked up _JAVA_OPTIONS: -Xmx2048m -Xms512m
	cluster_test.go:15: [ZKERR] Picked up _JAVA_OPTIONS: -Xmx2048m -Xms512m
	dnshostprovider_test.go:130: Connected to "localhost:11657". Finding test server index…
	dnshostprovider_test.go:134: …trying "localhost:11657"
	dnshostprovider_test.go:137: …found at index 0
	dnshostprovider_test.go:151: Create returned error: zk: connection closed
	cluster_test.go:15: [ZKERR] Picked up _JAVA_OPTIONS: -Xmx2048m -Xms512m

[EDIT]
Looks like an unrelated test flake. code in zk/conn.go:loop() overwrites the value of err with ErrConnectionClosed upon disconnection. the test in question intentionally restarts a server, so my assumption is that the zk client code is processing a disconnect when the Create() call is attempted and it receives the reported error because flushRequests() is concurrently executing.

@jdef jdef merged commit 5daf934 into d2iq-archive:master Aug 22, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants