multi: restart LND Client on setup failure #694
Conversation
Force-pushed from 083a91e to fbfd1d8.
Force-pushed from fbfd1d8 to e73cfb6.
Currently, when we fail to set up the lnd client(s) and are therefore retrying to create them, the litcli getinfo and litcli stop commands are blocked and not allowed, as we currently require macaroon authentication for those calls. Should I update this PR to allow those calls by adding them to the LiT whitelist?
We cannot whitelist calls like litcli stop, since that means that anyone could come along and shut down our node without a macaroon.
I think the question is really: should we consider completely decoupling the Lit macaroon DB from LND's macaroon DB? Since then we can start it up without a dependence on LND.
I think this is worth considering since in future we might want LiT to connect to multiple LND nodes anyway and in that future we would definitely want LiT to have a decoupled macaroon DB.
EDIT: spoke to Oli about the above - for now, it's likely fine to just not allow the litcli stop & getinfo calls - users will still have access to litcli status and they will be able to use CTRL+C to trigger a shutdown.
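(For context, the "LiT whitelist" referred to above is the set of RPC URIs that are allowed through without macaroon authentication. A purely hypothetical sketch of such a check - the helper and the whitelist entry are illustrative, not LiT's actual implementation:)

// isWhitelistedURI reports whether the given gRPC URI may be called without
// presenting a macaroon. Hypothetical helper, for illustration only.
func isWhitelistedURI(uri string) bool {
	// Hypothetical whitelist. Adding a stop/shutdown URI here would let
	// anyone shut the node down without a macaroon, which is exactly the
	// concern raised above.
	whitelist := map[string]bool{
		"/litrpc.Status/SubServerStatus": true, // illustrative entry
	}

	return whitelist[uri]
}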
Nice PR! Left some suggestions.
status/manager.go
Outdated
@@ -19,6 +19,10 @@ type SubServerStatus struct {
	// been started.
	Running bool

	// Status will be a non-empty string that details the current status of
	// the sub-server.
	Status string
I think we should add detail to the comment regarding how this differs from Disabled/Running.
Also - do we need to have it duplicate info like "Registered", "Running", "Stopped" and "Errored", since all of those already have booleans?
I've updated the comment :)! The reason I currently have it duplicate info for "Registered", "Running", "Stopped" and "Errored" is that I thought it would otherwise be even more confusing for the user if the field were blank from time to time in the response to the litcli status cmd. If you think that I should remove it, I will though :). Otherwise, a perhaps better solution would have been to remove the booleans from the litrpc.SubServerStatus struct and only keep this new status field instead, but I'm not a huge fan of removing them for backwards-compatibility reasons.
Let me know what you'd prefer :)!
So I'd say that since it is called "manual status", it should be clear to the user that this is something special - so I'm thinking that we could probably just overwrite it to an empty string if any other status gets set.
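A rough sketch of that idea, with method and field names assumed for illustration rather than taken from the PR:

// setRunning is a hypothetical status setter: whenever a regular status is
// written, any manual status is wiped so the two can never conflict.
func (s *Manager) setRunning(name string, running bool) {
	s.mu.Lock()
	defer s.mu.Unlock()

	ss, ok := s.subServers[name]
	if !ok {
		return
	}

	ss.running = running

	// The manual status only describes an in-between state, so overwrite
	// it with an empty string as soon as any other status gets set.
	ss.manualStatus = ""
}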
subservers/manager.go
Outdated
@@ -52,7 +52,7 @@ func NewManager(permsMgr *perms.Manager,
 // AddServer adds a new subServer to the manager's set.
 func (s *Manager) AddServer(ss SubServer, enable bool) {
 	// Register all sub-servers with the status server.
-	s.statusServer.RegisterSubServer(ss.Name())
+	s.statusServer.RegisterSubServer(ss.Name(), nil)
How about, instead of changing the API here, we have functional options instead? That way we don't have to pass in nil for something that will be nil 90% of the time.
i.e., something like this:
type SubServerOption func(status *subServer)

// WithIsReadyOverride is a functional option that can be used to set a
// call-back function that is used to check if a system is ready _iff_ the
// system's running status is not yet true. The call-back will be passed the
// request URI along with any manual status that has been set for the
// subsystem.
func WithIsReadyOverride(fn func(string, string) (bool, bool)) SubServerOption {
	return func(status *subServer) {
		status.isReadyOverride = fn
	}
}

// RegisterSubServer will create a new sub-server entry for the Manager to
// keep track of.
func (s *Manager) RegisterSubServer(name string, opts ...SubServerOption) {
	s.mu.Lock()
	defer s.mu.Unlock()

	s.registerSubServerUnsafe(name, true, opts...)
}

// RegisterAndEnableSubServer will create a new sub-server entry for the
// Manager to keep track of and will set it as enabled.
func (s *Manager) RegisterAndEnableSubServer(name string,
	opts ...SubServerOption) {

	s.mu.Lock()
	defer s.mu.Unlock()

	s.registerSubServerUnsafe(name, false, opts...)
}

// registerSubServerUnsafe must be called while holding the manager's mutex.
func (s *Manager) registerSubServerUnsafe(name string, disabled bool,
	opts ...SubServerOption) {

	ss := newSubServerStatus(disabled)
	for _, o := range opts {
		o(ss)
	}

	s.subServers[name] = ss
}
Then, IsSystemReady can be changed to:
func (s *Manager) IsSystemReady(name, req string) (bool, bool, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()

	server, ok := s.subServers[name]
	if !ok {
		return false, false, fmt.Errorf("a sub-server with name %s "+
			"has not yet been registered", name)
	}

	if server.disabled {
		return false, true, nil
	}

	if server.running {
		return true, false, nil
	}

	if server.isReadyOverride == nil {
		return false, false, nil
	}

	// Let the override decide whether this particular request should be
	// treated as ready.
	isReady, handled := server.isReadyOverride(req, server.manualStatus)
	if handled {
		return isReady, false, nil
	}

	return false, false, nil
}
Then the code in terminal.go becomes:
lndOverride := func(uri, manualStatus string) (bool, bool) {
	if uri != "/lnrpc.State/GetState" {
		return false, false
	}

	return manualStatus == lndWalletReadyStatus, true
}

// Register LND, LiT and Accounts with the status manager.
g.statusMgr.RegisterAndEnableSubServer(
	subservers.LND, status.WithIsReadyOverride(lndOverride),
)
g.statusMgr.RegisterAndEnableSubServer(subservers.LIT)
g.statusMgr.RegisterSubServer(subservers.ACCOUNTS)
Thanks a lot for the feedback @ellemouton 🔥🚀!! I've updated the PR to implement the above, with some smaller modifications.
Definitely agree that the new version is much cleaner!
Force-pushed from 9e605f0 to 42c7916.
Thanks a lot for the feedback and for reviewing this so quickly @ellemouton! I've addressed your feedback with the latest push.
Nice work @ViktorTigerstrom ⭐
This is basically there 👍 The last major comment is regarding preventing the loops from spinning - once that is addressed, I'll do a final testing round.
status/manager.go
Outdated
	var isReady, handled bool

	if server.isReadyOverride != nil {
		isReady, handled = server.isReadyOverride(
			req, server.manualStatus,
		)
	}

	if !handled || server.Running {
		return server.Running, false, nil
	}

	return isReady, false, nil
}
I find this a bit hard to read.
How about:
if server.disabled {
	return false, true, nil
}

// If this system has an override check set, then we first check that
// to see if it overrides the "ready" status of the system.
if server.isReadyOverride != nil {
	isReady, handled := server.isReadyOverride(
		req, server.manualStatus,
	)
	if handled {
		return isReady, false, nil
	}
}

return server.running, false, nil
Thanks, I've updated to your version, with a slight modification to the first if check, as we also need to ensure that the system isn't marked as running when we let the override function define whether the request should be accepted or not. Hope that didn't ruin the readability too much :)!
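Presumably the modified check looks roughly like this (a sketch based on the snippet above, not necessarily the exact committed code):

	// Only consult the override while the sub-server has not yet been
	// marked as running.
	if server.isReadyOverride != nil && !server.running {
		isReady, handled := server.isReadyOverride(
			req, server.manualStatus,
		)
		if handled {
			return isReady, false, nil
		}
	}

	return server.running, false, nil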
Force-pushed from 42c7916 to 3c325b3.
Thanks again for reviewing this PR so quickly @ellemouton 🚀🔥!!! I've addressed your latest feedback with the latest push :)
Nice change, will be very helpful to users to know what's going on under the hood!
Nothing major from my end, just a couple of comments around the re-try loop itself.
	case <-interceptor.ShutdownChannel():
		return fmt.Errorf("received the shutdown signal")

	case <-time.After(defaultStartupTimeout):
If we want this to be the maximum time we wait in total, then we need to create the time.After() channel outside of the callback. Otherwise we wait this long for every re-try (since we create a new timer channel each time).
Or do we want to wait 5 seconds every time? Wasn't super clear to me from the comment above.
Or do we want to wait 5 seconds every time? Wasn't super clear to me from the comment above.
Yeah this was my intention, so that we don't spin as per @ellemouton's comment #694 (comment), but still try to reinitiate the lndclient(s) on failure to set them up.
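A sketch of that intended behaviour (setUpLNDClients and log are hypothetical placeholders; interceptor and defaultStartupTimeout are taken from the diff above): because the time.After channel is created inside the loop, each failed attempt waits defaultStartupTimeout before the next retry, rather than capping the total wait.

	for {
		err := setUpLNDClients()
		if err == nil {
			return nil
		}

		log.Errorf("Error setting up LND clients, retrying: %v", err)

		select {
		// Abort the retry loop if litd is shutting down.
		case <-interceptor.ShutdownChannel():
			return fmt.Errorf("received the shutdown signal")

		// A fresh timer per iteration: a per-attempt back-off, not an
		// overall startup deadline.
		case <-time.After(defaultStartupTimeout):
		}
	}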
	Insecure:              insecure,
	CustomMacaroonPath:    macPath,
	CustomMacaroonHex:     hex.EncodeToString(macData),
	BlockUntilChainSynced: true,
Since we are passing in BlockUntilChainSynced: true here, we don't actually need to re-try ourselves, as that will already initiate a re-try loop, as long as the ctxc context we pass in doesn't expire too early.
Same goes for BlockUntilUnlocked: true, I guess. So maybe we could also just try to initiate the full client first, as that already does all the re-trying we want and need, and then initiate the basic client afterwards, where we can be pretty certain that it should succeed?
I guess what we lose that way would be the extra info exposed to the user with SetErrored. So I'm fine with leaving the order the way it is since it gives the user more information. But I'm fairly certain we don't need the re-try for loop here as well.
Since we are passing in BlockUntilChainSynced: true here, we don't actually need to re-try ourselves, as that will already initiate a re-try loop, as long as the ctxc context we pass in doesn't expire too early.
So unfortunately this isn't really the case; logs we have seen from users such as @Liongrass have had the ctxc time out during the chain sync of lnd, when lnd took, for example, 2+ minutes to successfully sync. The GetInfo call that's made during the chain sync in the lndclient then times out, and therefore the setup of the full client fails. If we had retried to set up the full lndclient in those cases, the lnd sub-server in litd would eventually have started successfully.
One potential alternative solution could have been to increase the ctxc timeout, but as there's no specific context just for the chain sync, that would mean that any future RPC call made after the chain sync would also use the updated timeout, which wouldn't be ideal.
This is why we decided to take the approach of attempting to set up the lndclient indefinitely until it succeeds, as a temporary fix until we implement a more general solution that restarts litd sub-servers whenever a restart is required.
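To make the interplay concrete, here is a rough sketch of the client setup being discussed, using the config fields from the diff above (surrounding variables are illustrative). The blocking flags make lndclient poll lnd internally, but only for as long as the supplied context stays alive, which is why the outer retry loop is still needed:

	cfg := &lndclient.LndServicesConfig{
		Insecure:           insecure,
		CustomMacaroonPath: macPath,
		CustomMacaroonHex:  hex.EncodeToString(macData),

		// Block until lnd's wallet is unlocked and its chain backend
		// is synced. Both waits are bounded by the deadline of the
		// context (ctxc) that litd supplies, so a slow chain sync can
		// still push past that deadline and fail the setup.
		BlockUntilUnlocked:    true,
		BlockUntilChainSynced: true,
	}

	services, err := lndclient.NewLndServices(cfg)
	if err != nil {
		// With this PR, the error is reported via the status manager
		// and the setup is retried instead of giving up.
		return err
	}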
Ah right, good point. Makes sense.
@ellemouton: review reminder
Force-pushed from 3c325b3 to dfd190a.
Looks good to me! Contingent on what Oli thinks re the retry loops :)
Btw - think you need to update your lndclient test branch as the current redirect commit doesn't compile.
Notes from testing:
- Saw this getting logged:
  Setting the lit sub-server as errored: Error %d when creating LND Services client: %v
  Gotta make sure that the values are populated in the log. But maybe this is just a factor of the test? Also saw this:
  STAT: Setting the lit sub-server as errored: Error when setting up basic LND Client: %v
- This is probably not related to this PR, but I get this error for taproot-assets:
  "error": "unable to initialize RPC server: unable to start asset custodian: error listing existing addresses: unknown sqlite error: SQL logic error: no such column: asset_version (1)"
  Have you seen this before? Perhaps something we should report to the assets team?
Refactor Register & Enable functions to enable setting of SubServerOptions, i.e. functional options, when initiating a SubServer.
isReadyOverride is a callback that, when set and only if `running` is not yet true, will be used to determine if a system is ready for a call. We will pass the request URI to this method along with the `manualStatus`. The first returned boolean is true if the system should be seen as ready and the second is true if the override does handle the given request. If it does not, then we will fall back to our normal is-ready check.
Ensure that the sub-server error string is set to an empty string when marking a sub-server as running. This will be important for the next commit, which will retry to start the lnd sub-server even if we error when starting the lnd-client(s), and therefore may mark the lnd sub-server as running even after we've errored when starting the lnd-client(s).
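A minimal sketch of what that commit describes, with field and method names assumed for illustration (mirroring the manual-status idea discussed earlier):

// SetRunning marks the given sub-server as running and clears any error left
// over from a previous failed start attempt, so that litcli status no longer
// reports a stale error once a retry eventually succeeds.
func (s *Manager) SetRunning(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()

	ss, ok := s.subServers[name]
	if !ok {
		return
	}

	ss.running = true
	ss.err = ""
}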
Force-pushed from dfd190a to fcd22a7.
Thanks a lot for the review @ellemouton 🎉!! I've addressed your feedback with the latest push!
I've now updated the branch so that it builds and can be used for testing! One thing to note though, is that I couldn't get it to work with lndclient v0.17.0-4 for some reason.
Thanks! This was a pre-existing error that I've now addressed with 3b83d6e :)
I hadn't seen this, but as I started downgrading the lndclient and therefore the taproot-assets version to make the test branch compile, and then upgraded again and such, I actually managed to trigger this as well! Did you do something similar before the error occurred for you as well?
Code looks good to me, very nice work!
One thing to note though, Is that I couldn't get it to work with lndclient v0.17.0-4 for some reason.
Not exactly sure what this is referring to? Because the current version of the PR doesn't modify the lndclient version in the go.mod file, so it's currently at v0.17.0-4. What test branch of lndclient were you talking about?
I suspect it's some migration thing in taproot-assets going on.
Yes, that's very likely it. If the version of taproot assets was changed between versions of PRs, a downgrade would explain the error. So all good I think.
Thanks a lot for the review @guggero 🎉🔥!
Ah sorry, I should have been clearer about what I was referring to. This refers to the test branch for testing the restart functionality locally, which I mentioned in the original description of this PR above:
Ok great, sounds good 🚀!
This PR adds the functionality that, in case we fail to set up the LND client(s) when starting litd, we will continuously retry to connect the LND clients until we succeed in doing so. This partly addresses #683, but currently only restarts the LND clients when we fail to set them up, not when lnd itself fails to start. The lit sub-server will only be set to status Running after the lnd clients have successfully been set up.
This hopefully addresses most of the issues we've had from users that have experienced the lnd sub-server erroring on start up of litd, but likely not all of the errors.
A note to reviewers:
- Currently, when we fail to set up the lnd client(s) and are therefore retrying to create them, the litcli getinfo and litcli stop commands are blocked and not allowed, as we currently require macaroon authentication for those calls. Should I update this PR to allow those calls by adding them to the LiT whitelist?
- To test this locally, I've created a branch that mocks failures to set up the basic lnd client as well as 3 failures to set up the full lnd client. The full lnd client set-up attempts will also fail after 30 seconds, as that's what users experience when their set-up fails due to the lnrpc.GetInfo call context timing out. You can use that branch to test this PR locally: https://github.com/ViktorTigerstrom/lightning-terminal/tree/2023-12-restart-lndclient-on-failure-mocked-lndclient-fails
Happy to address any feedback!