Commandeer #188: Update DNS Lookup Behavior #191
base: master
Conversation
zk/dnshostprovider_test.go (Outdated)
```go
	}
	hp.mu.Unlock()

	time.Sleep(time.Millisecond * 5)
```
jdef:
[EDIT] sleeps like this can lead to test flakes in busy CI envs. consider implementing a retry loop that fails after a longer period of non-success (like 30s)
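For illustration, a minimal sketch of the suggested pattern inside the test (hedged: `wantLen` is a hypothetical stand-in for whatever condition the test actually checks):

```go
// Poll under the lock until the servers list is updated, and fail the
// test only after 30s of non-success instead of after one fixed sleep.
start := time.Now()
for {
	hp.mu.Lock()
	updated := len(hp.servers) == wantLen // wantLen: hypothetical expected count
	hp.mu.Unlock()
	if updated {
		break
	}
	if time.Since(start) > 30*time.Second {
		t.Fatal("servers list not updated within 30s")
	}
}
```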
PR author:
Thanks for the review, @jdef. I've updated the commit.
zk/dnshostprovider_test.go (Outdated)
```go
	// Starts a 30s retry loop, waiting for the servers list to be updated.
	startRetryLoop := time.Now()
	for {
		hp.mu.Lock()
```
jdef:
I'd add back the sleep as the first step of this for loop, to avoid pegging the CPU. sleeps are useful in tests like these, they just need to be wrapped in a loop like this to avoid CI flakes :)
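That is, a sketch of the adjusted loop, sleeping at the top of each iteration so it polls instead of spinning:

```go
start := time.Now()
for {
	// Sleep first: the loop should poll, not peg a CPU core.
	time.Sleep(5 * time.Millisecond)
	hp.mu.Lock()
	updated := len(hp.servers) == wantLen // wantLen: hypothetical, as above
	hp.mu.Unlock()
	if updated {
		break
	}
	if time.Since(start) > 30*time.Second {
		t.Fatal("servers list not updated within 30s")
	}
}
```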
PR author:
Good idea. Fixed.
zk/dnshostprovider.go (Outdated)
```diff
 host, port, err := net.SplitHostPort(server)
 if err != nil {
-	return err
+	return false, err
 }
 addrs, err := lookupHost(host)
```
jdef:
one more suggestion: lookupHost may take a while. you might not want to hold the lock while it's executing. consider releasing it, and re-locking once it returns (if there's more work to do)
jdef (follow-up):
i guess you could just hold off on obtaining the lock until you start manipulating shared state (starting with `hp.servers = append ...`)
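In code, that shape might look like this sketch (the resolution loop paraphrases the diff above; `found` is a hypothetical local, and the enclosing function is assumed to return `(bool, error)`, per the `return false, err` in the diff):

```go
// All the potentially slow DNS work runs with the lock released;
// found is function-local, so nothing here needs synchronization.
var found []string
for _, server := range servers {
	host, port, err := net.SplitHostPort(server)
	if err != nil {
		return false, err
	}
	addrs, err := lookupHost(host)
	if err != nil {
		continue // leave unresolved; the background retry handles it
	}
	for _, addr := range addrs {
		found = append(found, net.JoinHostPort(addr, port))
	}
}

// Take the lock only while manipulating shared state.
hp.mu.Lock()
hp.servers = append(hp.servers, found...)
hp.mu.Unlock()
```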
PR author:
I think that's a very good suggestion @jdef. However, this optimization makes `unresolvedServers`, `sleep`, and `lookupHost` not thread-safe. That is acceptable, since they are only used during lookup.

I updated the commit based on your suggestion, and the mutex now only protects `servers`, `curr`, and `last`.

I also updated the test to use a channel to simulate a server that comes back online later. Previously I swapped `lookupHost` in flight, which would be a data race under the new implementation.
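For context, a hedged sketch of what a channel-gated fake lookup can look like (host names and addresses are illustrative, not the PR's actual test code):

```go
// recovered is closed by the test when the "dead" server should start
// resolving, so lookupHost never has to be swapped mid-flight.
recovered := make(chan struct{})
hp := &DNSHostProvider{lookupHost: func(host string) ([]string, error) {
	if host == "down.example.com" {
		select {
		case <-recovered: // host resolves from now on
		default:
			return nil, errors.New("lookup " + host + ": no such host")
		}
	}
	return []string{"192.0.2.1"}, nil
}}

if err := hp.Init([]string{"up.example.com:2181", "down.example.com:2181"}); err != nil {
	t.Fatal(err)
}
close(recovered) // simulate the server coming back online
// ... then poll hp.servers with the retry loop shown earlier ...
```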
Fixed all the comments. @jdef
This PR commandeers #188 and finishes the remaining work.

Previously, if some of the servers failed the DNS lookup (`net.LookupHost`), initialization errored out. Instead, we should treat it as a success as long as at least one server resolves, and start a separate goroutine that keeps retrying the lookup, adding each server to the list once it resolves.
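In outline, the described behavior might look like the following sketch (hedged: `resolveAll` and `dnsRetryInterval` are hypothetical names and the real change may be structured differently; the `mu`/`servers`/`curr`/`last`/`lookupHost` fields are the ones named in the review thread):

```go
// Init resolves what it can up front. Addresses whose DNS lookup fails
// are retried in the background; Init errors only if nothing resolved.
func (hp *DNSHostProvider) Init(servers []string) error {
	found, unresolved, err := hp.resolveAll(servers)
	if err != nil {
		return err
	}
	if len(found) == 0 {
		return fmt.Errorf("no hosts found for addresses %q", servers)
	}

	hp.mu.Lock()
	hp.servers = found
	hp.curr = -1
	hp.last = -1
	hp.mu.Unlock()

	if len(unresolved) > 0 {
		// As discussed above, unresolved is only touched by this one
		// goroutine after Init returns, so only hp.servers needs the lock.
		go func() {
			for len(unresolved) > 0 {
				time.Sleep(dnsRetryInterval) // hypothetical retry interval
				// Entries already passed SplitHostPort, so the error is nil here.
				resolved, still, _ := hp.resolveAll(unresolved)
				if len(resolved) > 0 {
					hp.mu.Lock()
					hp.servers = append(hp.servers, resolved...)
					hp.mu.Unlock()
				}
				unresolved = still
			}
		}()
	}
	return nil
}

// resolveAll (hypothetical helper) resolves each "host:port" entry,
// returning the resolved "addr:port" strings plus the servers whose
// lookup failed. A malformed address is a hard error, as in the diff.
func (hp *DNSHostProvider) resolveAll(servers []string) (found, unresolved []string, err error) {
	for _, server := range servers {
		host, port, err := net.SplitHostPort(server)
		if err != nil {
			return nil, nil, err
		}
		addrs, lerr := hp.lookupHost(host)
		if lerr != nil {
			unresolved = append(unresolved, server)
			continue
		}
		for _, addr := range addrs {
			found = append(found, net.JoinHostPort(addr, port))
		}
	}
	return found, unresolved, nil
}
```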