
multi: restart LND Client on setup failure #694

Conversation

ViktorTigerstrom (Contributor)

This PR adds functionality so that if we fail to set up the LND client(s) when starting litd, we continuously retry connecting the LND clients until we succeed. This partly addresses #683: it currently only restarts the LND clients when they fail to be set up, not when lnd itself fails to start. The lit sub-server will only be set to status Running after the LND clients have been set up successfully.
This should address most of the issues reported by users who have seen the lnd sub-server error on litd startup, though likely not all of them.
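
For illustration, here is a minimal sketch of the retry shape described above (the setup callback, shutdown channel, and delay are assumptions for this sketch, not the PR's actual identifiers):

// connectWithRetry keeps attempting LND client setup until it succeeds
// or litd shuts down. All names here are illustrative only.
func connectWithRetry(setup func() error, shutdown <-chan struct{}) error {
	const retryDelay = 5 * time.Second // assumed delay between attempts

	for {
		err := setup()
		if err == nil {
			// Success: the lit sub-server can now be marked as
			// Running.
			return nil
		}

		select {
		case <-shutdown:
			// litd is shutting down; stop retrying.
			return err
		case <-time.After(retryDelay):
			// Pause so the loop doesn't spin, then retry.
		}
	}
}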

A note to reviewers:

  1. Currently, when we fail the setup and are therefore retrying to create the lnd client(s), the litcli getinfo and litcli stop commands are blocked and not allowed, as we currently require macaroon authentication for those calls. Should I update this PR to allow those calls by adding them to the LiT whitelist?
  2. To simplify local testing of this PR, I've created a branch that simply mocks 3 failures to set up the basic lnd client, as well as 3 failures to set up the full lnd client. The full lnd client setup attempts also fail after 30 seconds, since that's what users experience when their setup fails because the lnrpc.GetInfo call context times out. You can use that branch to test this PR locally: https://github.com/ViktorTigerstrom/lightning-terminal/tree/2023-12-restart-lndclient-on-failure-mocked-lndclient-fails

Happy to address any feedback!

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-11-restart-lndclient-on-failure branch 2 times, most recently from 083a91e to fbfd1d8 Compare December 12, 2023 02:30
@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-11-restart-lndclient-on-failure branch from fbfd1d8 to e73cfb6 Compare December 12, 2023 02:41
@ellemouton (Member) left a comment:

> Currently, when we fail the setup and are therefore retrying to create the lnd client(s), the litcli getinfo and litcli stop commands are blocked and not allowed, as we currently require macaroon authentication for those calls. Should I update this PR to allow those calls by adding them to the LiT whitelist?

We cannot whitelist calls like litcli stop, since that would mean anyone could come along and shut down our node without a macaroon.

I think the question is really: should we consider completely decoupling the Lit macaroon DB from LND's macaroon DB? Since then we can start it up without a dependence on LND.

I think this is worth considering, since in the future we might want LiT to connect to multiple LND nodes anyway, and in that future we would definitely want LiT to have a decoupled macaroon DB.

EDIT: spoke to Oli about the above - for now, it's likely fine to just not allow the litcli stop & getinfo calls - users will still have access to litcli status and they will be able to use CTRL+C to trigger a shutdown.

Nice PR! Left some suggestions.

@@ -19,6 +19,10 @@ type SubServerStatus struct {
	// been started.
	Running bool

	// Status will be a non-empty string that details the current status of
	// the sub-server.
	Status string
Member:

I think we should add detail to the comment regarding how this differs from Disabled/Running.

Member:

Also, do we need to have it duplicate info like "Registered", "Running", "Stopped" and "Errored", since all of those already have booleans?

Contributor (author):

I've updated the comment :)! The reason I currently have it duplicate info for "Registered", "Running", "Stopped" and "Errored" is that I thought it would probably be even more confusing for the user if the field were blank from time to time in the response to the litcli status command. If you think I should remove it, I will though :). Otherwise, a perhaps better solution would have been to remove the booleans from the litrpc.SubServerStatus struct and only keep this new status field, but I'm not a huge fan of removing them, for backwards-compatibility reasons.

Let me know what you'd prefer :)!

Member:

So I'd say that since it is called "manual status", it should be clear to the user that this is something special. So I'm thinking we could probably just overwrite it with an empty string if any other status gets set.
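
A tiny sketch of that idea (field names follow the snippets quoted later in this thread; setRunning and the err field are assumptions):

// setRunning marks the sub-server as running and overwrites the manual
// status (and error string) with empty strings, so stale text never
// lingers next to a newer status.
func (s *subServer) setRunning() {
	s.running = true
	s.err = ""          // assumed error-string field
	s.manualStatus = "" // cleared whenever another status is set
}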

Resolved review threads: status/manager.go (×3), terminal.go.
@@ -52,7 +52,7 @@ func NewManager(permsMgr *perms.Manager,
 // AddServer adds a new subServer to the manager's set.
 func (s *Manager) AddServer(ss SubServer, enable bool) {
 	// Register all sub-servers with the status server.
-	s.statusServer.RegisterSubServer(ss.Name())
+	s.statusServer.RegisterSubServer(ss.Name(), nil)
Member:

How about, instead of changing the API here, we have functional options? That way we don't have to pass in nil for something that will be nil 90% of the time.

Member:

i.e., something like this:

type SubServerOption func(status *subServer)

// WithIsReadyOverride is a functional option that can be used to set a call
// back function that is used to check if a system is ready _iff_ the system
// running status is not yet true. The call-back will be passed the request URI
// along with any manual status that has been set for the subsystem.
func WithIsReadyOverride(fn func(string, string) (bool, bool)) SubServerOption {
	return func(status *subServer) {
		status.isReadyOverride = fn
	}
}

// RegisterSubServer will create a new sub-server entry for the Manager to
// keep track of.
func (s *Manager) RegisterSubServer(name string, opts ...SubServerOption) {
	s.mu.Lock()
	defer s.mu.Unlock()

	s.registerSubServerUnsafe(name, true, opts...)
}

// RegisterAndEnableSubServer will create a new sub-server entry for the
// Manager to keep track of and will set it as enabled.
func (s *Manager) RegisterAndEnableSubServer(name string,
	opts ...SubServerOption) {

	s.mu.Lock()
	defer s.mu.Unlock()

	s.registerSubServerUnsafe(name, false, opts...)
}

func (s *Manager) registerSubServerUnsafe(name string, disabled bool,
	opts ...SubServerOption) {

	ss := newSubServerStatus(disabled)

	for _, o := range opts {
		o(ss)
	}

	s.subServers[name] = ss
}

Member:

Then, IsSystemReady can be changed to:

func (s *Manager) IsSystemReady(name, req string) (bool, bool, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()

	server, ok := s.subServers[name]
	if !ok {
		return false, false, fmt.Errorf("a sub-server with "+
			"name %s has not yet been registered", name)
	}

	if server.disabled {
		return false, true, nil
	}

	if server.running {
		return true, false, nil
	}

	if server.isReadyOverride == nil {
		return false, false, nil
	}

	isReady, handled := server.isReadyOverride(req, server.manualStatus)
	if handled {
		return false, false, nil
	}

	return isReady, false, nil
}

Member:

Then the code in terminal.go becomes:

	lndOverride := func(uri, manualStatus string) (bool, bool) {
		if uri != "/lnrpc.State/GetState" {
			return false, false
		}

		return manualStatus == lndWalletReadyStatus, true
	}

	// Register LND, LiT and Accounts with the status manager.
	g.statusMgr.RegisterAndEnableSubServer(
		subservers.LND, status.WithIsReadyOverride(lndOverride),
	)
	g.statusMgr.RegisterAndEnableSubServer(subservers.LIT)
	g.statusMgr.RegisterSubServer(subservers.ACCOUNTS)

Contributor (author):

Thanks a lot for the feedback @ellemouton 🔥🚀!! I've updated the PR to implement the above, with some small modifications.

Definitely agree that the new version is much cleaner!

Resolved review threads: status/manager.go (×2), terminal.go (×2).
@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-11-restart-lndclient-on-failure branch 5 times, most recently from 9e605f0 to 42c7916 Compare December 14, 2023 02:26
@ViktorTigerstrom (Contributor, author):

Thanks a lot for the feedback and for reviewing this so quickly @ellemouton! I've addressed your feedback with the latest push.

@ellemouton (Member) left a comment:

Nice work @ViktorTigerstrom!

This is basically there 👍. The last major comment is about preventing the loops from spinning; once that is addressed, I'll do a final testing round.

Resolved review threads: status/manager.go (×2), litrpc/lit-status.proto, terminal.go.
Comment on lines 95 to 113
	var isReady, handled bool

	if server.isReadyOverride != nil {
		isReady, handled = server.isReadyOverride(
			req, server.manualStatus,
		)
	}

	if !handled || server.Running {
		return server.Running, false, nil
	}

	return isReady, false, nil
}
Member:

I find this a bit hard to read.

How about:

	if server.disabled {
		return false, true, nil
	}

	// If this system has an override check set, then we first check that
	// to see if it overrides the "ready" status of the system.
	if server.isReadyOverride != nil {
		isReady, handled := server.isReadyOverride(
			req, server.manualStatus,
		)
		if handled {
			return isReady, false, nil
		}
	}

	return server.running, false, nil

Contributor (author):

Thanks, I've updated to your version, with a slight modification to the first if check: we also need to ensure that the system isn't marked as running when we let the override function decide whether the request should be accepted. Hope that didn't ruin the readability too much :)!
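
For reference, a reconstructed sketch of that modification, based on the description above (not necessarily the exact committed code):

	if server.disabled {
		return false, true, nil
	}

	// Only consult the override while the system is not yet marked as
	// running; once it is running, the normal is-ready check below is
	// authoritative.
	if !server.running && server.isReadyOverride != nil {
		isReady, handled := server.isReadyOverride(
			req, server.manualStatus,
		)
		if handled {
			return isReady, false, nil
		}
	}

	return server.running, false, nil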

Resolved review threads: terminal.go (×3), status/manager.go.
@ViktorTigerstrom (Contributor, author):

Thanks again for reviewing this PR so quickly @ellemouton 🚀🔥!!! I've addressed your latest feedback with the latest push :)

@levmi levmi requested a review from guggero December 15, 2023 17:31
@guggero (Member) left a comment:

Nice change; it will be very helpful for users to know what's going on under the hood!
Nothing major from my end, just a couple of comments around the re-try loop itself.

Resolved review threads: status/manager.go, litrpc/lit-status.proto.
	case <-interceptor.ShutdownChannel():
		return fmt.Errorf("received the shutdown signal")

	case <-time.After(defaultStartupTimeout):
Member:

If we want this to be the maximum time we wait in total, then we need to create the time.After() channel outside of the callback. Otherwise we wait this long for every re-try (since we create a new ticker channel each time).

Member:

Or do we want to wait 5 seconds every time? Wasn't super clear to me from the comment above.

Contributor (author):

> Or do we want to wait 5 seconds every time? Wasn't super clear to me from the comment above.

Yeah, this was my intention, so that we don't spin, as per @ellemouton's comment #694 (comment), but we still retry initiating the lndclient(s) when they fail to be set up.
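
To make the difference concrete, here is a sketch of the two timer placements (defaultStartupTimeout comes from the PR; the helper names and retryDelay are assumptions):

// Per-attempt delay (the behaviour intended here): time.After is called
// inside the loop, so every retry waits up to defaultStartupTimeout again.
func retryWithPerAttemptDelay(attempt func() error,
	shutdown <-chan struct{}) error {

	for {
		err := attempt()
		if err == nil {
			return nil
		}

		select {
		case <-shutdown:
			return err
		case <-time.After(defaultStartupTimeout):
			// A fresh channel per iteration: the wait restarts on
			// every attempt.
		}
	}
}

// Overall deadline: the channel is created once, before the loop, so
// defaultStartupTimeout caps the total time across all retries, with a
// shorter pause between attempts.
func retryWithOverallDeadline(attempt func() error,
	shutdown <-chan struct{}) error {

	const retryDelay = time.Second // assumed per-attempt pause

	deadline := time.After(defaultStartupTimeout)
	for {
		err := attempt()
		if err == nil {
			return nil
		}

		select {
		case <-shutdown:
			return err
		case <-deadline:
			return errors.New("startup timed out")
		case <-time.After(retryDelay):
			// Short pause, then retry until the deadline fires.
		}
	}
}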

	Insecure:              insecure,
	CustomMacaroonPath:    macPath,
	CustomMacaroonHex:     hex.EncodeToString(macData),
	BlockUntilChainSynced: true,
Member:

Since we are passing in BlockUntilChainSynced: true here, we don't actually need to re-try ourselves, as that will already initiate a re-try loop, as long as the ctxc context we pass in doesn't expire too early.

Member:

Same goes for BlockUntilUnlocked: true, I guess. So maybe we could just try to initiate the full client first, as that already does all the retrying we want and need, and then initiate the basic client afterwards, where we can be pretty certain it will succeed?

Member:

I guess what we'd lose that way is the extra info exposed to the user via the SetErrored field. So I'm fine with leaving the order the way it is, since it gives the user more information. But I'm fairly certain we don't need the retry for-loop here as well.

Contributor (author):

> Since we are passing in BlockUntilChainSynced: true here, we don't actually need to re-try ourselves, as that will already initiate a re-try loop, as long as the ctxc context we pass in doesn't expire too early.

So unfortunately this isn't really the case: logs we have seen from users, such as @Liongrass, show the ctxc timing out during lnd's chain sync, for example when lnd took 2+ minutes to sync successfully. The GetInfo call made during the chain sync in lndclient then times out, and therefore the setup of the full client fails. If we had retried setting up the full lndclient in those cases, the lnd sub-server in litd would eventually have started successfully.
One potential alternative would have been to increase the ctxc timeout, but as there's no specific context just for the chain sync, any future RPC call made after the chain sync would also use the increased timeout, which wouldn't be ideal.

This is why we decided to retry setting up the lndclient indefinitely until it succeeds, as a temporary fix until we implement a more general solution that restarts litd sub-servers when required.
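
A sketch of the chosen approach in terms of the lndclient API (the config fields are the ones quoted from the diff above; the function name, logger, and the shutdown/delay plumbing are assumptions):

// setUpFullClientWithRetry keeps building the full lndclient until it
// succeeds, so a GetInfo timeout during a long chain sync no longer
// permanently fails litd's lnd sub-server.
func setUpFullClientWithRetry(shutdown <-chan struct{}) (
	*lndclient.GrpcLndServices, error) {

	for {
		svcs, err := lndclient.NewLndServices(&lndclient.LndServicesConfig{
			Insecure:              insecure,
			CustomMacaroonPath:    macPath,
			CustomMacaroonHex:     hex.EncodeToString(macData),
			BlockUntilChainSynced: true,
			// Remaining config fields elided in this sketch.
		})
		if err == nil {
			return svcs, nil
		}

		// Log the failure (the PR also surfaces it via the status
		// manager), then wait and retry.
		log.Errorf("Error when creating LND Services client: %v", err)

		select {
		case <-shutdown:
			return nil, err
		case <-time.After(defaultStartupTimeout):
		}
	}
}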

Member:

Ah right, good point. Makes sense.

@lightninglabs-deploy:

@ellemouton: review reminder
@ViktorTigerstrom, remember to re-request review from reviewers when ready

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-11-restart-lndclient-on-failure branch from 3c325b3 to dfd190a Compare January 3, 2024 01:00
@ellemouton (Member) left a comment:

Looks good to me! Contingent on what Oli thinks re the retry loops :)

Btw, I think you need to update your lndclient test branch, as the current redirect commit doesn't compile.

Notes from testing:

  • Saw this getting logged: Setting the lit sub-server as errored: Error %d when creating LND Services client: %v - we have to make sure that the values are populated in the log. But maybe this is just a factor of the test? Also saw this:
    STAT: Setting the lit sub-server as errored: Error when setting up basic LND Client: %v
    (A likely shape of this bug is sketched after this list.)
  • This is probably not related to this PR, but I get this error for taproot-assets: "error": "unable to initialize RPC server: unable to start asset custodian: error listing existing addresses: unknown sqlite error: SQL logic error: no such column: asset_version (1)" - have you seen this before? Perhaps something we should report to the assets team?
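
A likely shape of that logging bug, as a reconstruction (the SetErrored signature here is assumed, not necessarily the actual one):

// Buggy (reconstruction): the message is stored without formatting, so
// the "%v" placeholder appears verbatim in the status output.
statusMgr.SetErrored(subservers.LIT, "Error when setting up basic LND Client: %v")

// Fixed: apply the arguments before handing the message over.
statusMgr.SetErrored(
	subservers.LIT,
	fmt.Sprintf("Error when setting up basic LND Client: %v", err),
)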

Resolved review threads: status/manager.go (×3), litrpc/lit-status.proto, terminal.go (×3), rpc_proxy.go.
@guggero guggero self-requested a review January 4, 2024 08:43
Refactor the Register & Enable functions to enable setting SubServerOptions, i.e. functional options, when initiating a SubServer.

isReadyOverride is a callback that, when set and only if `running` is not yet true, will be used to determine if a system is ready for a call. We pass the request URI to this method along with the `manualStatus`. The first returned boolean is true if the system should be seen as ready, and the second is true if the override does handle the given request. If it does not, then we fall back to our normal is-ready check.

Ensure that the sub-server error string is set to an empty string when marking a sub-server as running. This is important for the next commit, which will retry starting the lnd sub-server even if we error when starting the lnd client(s), and may therefore mark the lnd sub-server as running even after we've errored when starting the lnd client(s).
@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-11-restart-lndclient-on-failure branch from dfd190a to fcd22a7 Compare January 5, 2024 02:47
@ViktorTigerstrom (Contributor, author):

Thanks a lot for the review @ellemouton 🎉!!

I've addressed your feedback with the latest push!

> Btw, I think you need to update your lndclient test branch, as the current redirect commit doesn't compile.

I've now updated the branch so that it builds and can be used for testing! One thing to note, though, is that I couldn't get it to work with lndclient v0.17.0-4 for some reason. There seems to be some conflict with taproot-assets that I just couldn't figure out how to solve within a reasonable time. So in the test branch I've downgraded to v0.17.0-3, which subsequently downgraded taproot-assets as well. But it can be used to test the functionality of this PR!

> Saw this getting logged: Setting the lit sub-server as errored: Error %d when creating LND Services client: %v - we have to make sure that the values are populated in the log.

Thanks! This was a pre-existing error that I've now addressed with 3b83d6e :)

> This is probably not related to this PR, but I get this error for taproot-assets:

I hadn't seen this, but as I downgraded the lndclient (and therefore the taproot-assets version) to make the test branch compile, and then upgraded again and so on, I actually managed to trigger it as well! Did you do something similar before the error occurred for you?
However, after I removed the testnet folder (my configured network) in the taproot-assets application data and restarted litd, the error disappeared. I suspect it's some migration thing going on in taproot-assets. But as my way of triggering the error is probably nothing a normal user will do, I'm not sure this is something we should actually report. It would be really good to get @guggero's opinion on this as well, though!

@guggero (Member) left a comment:

Code looks good to me, very nice work!

> One thing to note, though, is that I couldn't get it to work with lndclient v0.17.0-4 for some reason.

I'm not exactly sure what this is referring to, because the current version of the PR doesn't modify the lndclient version in the go.mod file, so it's currently at v0.17.0-4. What test branch of lndclient were you talking about?

> I suspect it's some migration thing going on in taproot-assets.

Yes, that's very likely it. If the version of taproot assets was changed between versions of PRs, a downgrade would explain the error. So all good I think.

@ViktorTigerstrom (Contributor, author):

Thanks a lot for the review @guggero 🎉🔥!

> I'm not exactly sure what this is referring to, because the current version of the PR doesn't modify the lndclient version in the go.mod file, so it's currently at v0.17.0-4. What test branch of lndclient were you talking about?

Ah sorry, I should have been clearer about what I was referring to. This refers to the test branch for locally testing the restart functionality that I mentioned in the original description of this PR above:
https://github.com/ViktorTigerstrom/lightning-terminal/tree/2023-12-restart-lndclient-on-failure-mocked-lndclient-fails

> Yes, that's very likely it. If the version of taproot assets was changed between versions of PRs, a downgrade would explain the error. So all good I think.

Ok great, sounds good 🚀!

@ellemouton ellemouton merged commit ee69bbd into lightninglabs:master Jan 5, 2024
12 checks passed