
multi: restart LND Client on setup failure #694

Conversation

ViktorTigerstrom (Contributor)

This PR adds functionality so that if we fail to set up the LND client(s) when starting litd, we continuously retry connecting the LND clients until we succeed. This partly addresses #683: it currently only restarts the LND clients when they fail to be set up, not when lnd itself fails to start. The lit sub-server will only be set to status Running after the LND clients have been set up successfully.
This should address most of the issues reported by users who have seen the lnd sub-server error on litd startup, though likely not all of them.
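
For illustration, here is a minimal sketch of the retry shape described above (the setup callback, shutdown channel, and delay are assumptions for this sketch, not the PR's actual identifiers):

// connectWithRetry keeps attempting LND client setup until it succeeds
// or litd shuts down. All names here are illustrative only.
func connectWithRetry(setup func() error, shutdown <-chan struct{}) error {
	const retryDelay = 5 * time.Second // assumed delay between attempts

	for {
		err := setup()
		if err == nil {
			// Success: the lit sub-server can now be marked as
			// Running.
			return nil
		}

		select {
		case <-shutdown:
			// litd is shutting down; stop retrying.
			return err
		case <-time.After(retryDelay):
			// Pause so the loop doesn't spin, then retry.
		}
	}
}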

A note to reviewers:

  1. Currently, when we fail the setup and are therefore retrying to create the lnd client(s), the litcli getinfo and litcli stop commands are blocked and not allowed, as we currently require macaroon authentication for those calls. Should I update this PR to allow those calls by adding them to the LiT whitelist?
  2. To simplify local testing of this PR, I've created a branch that simply mocks 3 failures to set up the basic lnd client, as well as 3 failures to set up the full lnd client. The full lnd client setup attempts also fail after 30 seconds, since that's what users experience when their setup fails because the lnrpc.GetInfo call context times out. You can use that branch to test this PR locally: https://github.com/ViktorTigerstrom/lightning-terminal/tree/2023-12-restart-lndclient-on-failure-mocked-lndclient-fails

Happy to address any feedback!

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-11-restart-lndclient-on-failure branch 2 times, most recently from 083a91e to fbfd1d8 Compare December 12, 2023 02:30
@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-11-restart-lndclient-on-failure branch from fbfd1d8 to e73cfb6 Compare December 12, 2023 02:41
@ellemouton (Member) left a comment:

> Currently, when we fail the setup and are therefore retrying to create the lnd client(s), the litcli getinfo and litcli stop commands are blocked and not allowed, as we currently require macaroon authentication for those calls. Should I update this PR to allow those calls by adding them to the LiT whitelist?

We cannot whitelist calls like litcli stop, since that would mean anyone could come along and shut down our node without a macaroon.

I think the question is really: should we consider completely decoupling the Lit macaroon DB from LND's macaroon DB? Since then we can start it up without a dependence on LND.

I think this is worth considering, since in the future we might want LiT to connect to multiple LND nodes anyway, and in that future we would definitely want LiT to have a decoupled macaroon DB.

EDIT: spoke to Oli about the above - for now, it's likely fine to just not allow the litcli stop & getinfo calls - users will still have access to litcli status and they will be able to use CTRL+C to trigger a shutdown.

Nice PR! Left some suggestions.

@@ -19,6 +19,10 @@ type SubServerStatus struct {
	// been started.
	Running bool

	// Status will be a non-empty string that details the current status of
	// the sub-server.
	Status string
Member:

I think we should add detail to the comment regarding how this differs from Disabled/Running.

Member:

Also, do we need to have it duplicate info like "Registered", "Running", "Stopped" and "Errored", since all of those already have booleans?

Contributor (author):

I've updated the comment :)! The reason I currently have it duplicate info for "Registered", "Running", "Stopped" and "Errored" is that I thought it would probably be even more confusing for the user if the field were blank from time to time in the response to the litcli status command. If you think I should remove it, I will though :). Otherwise, a perhaps better solution would have been to remove the booleans from the litrpc.SubServerStatus struct and only keep this new status field, but I'm not a huge fan of removing them, for backwards-compatibility reasons.

Let me know what you'd prefer :)!

Member:

So I'd say that since it is called "manual status", it should be clear to the user that this is something special. So I'm thinking we could probably just overwrite it with an empty string if any other status gets set.
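
A tiny sketch of that idea (field names follow the snippets quoted later in this thread; setRunning and the err field are assumptions):

// setRunning marks the sub-server as running and overwrites the manual
// status (and error string) with empty strings, so stale text never
// lingers next to a newer status.
func (s *subServer) setRunning() {
	s.running = true
	s.err = ""          // assumed error-string field
	s.manualStatus = "" // cleared whenever another status is set
}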

Resolved review threads: status/manager.go (×3), terminal.go.
@@ -52,7 +52,7 @@ func NewManager(permsMgr *perms.Manager,
 // AddServer adds a new subServer to the manager's set.
 func (s *Manager) AddServer(ss SubServer, enable bool) {
 	// Register all sub-servers with the status server.
-	s.statusServer.RegisterSubServer(ss.Name())
+	s.statusServer.RegisterSubServer(ss.Name(), nil)
Member:

How about, instead of changing the API here, we have functional options? That way we don't have to pass in nil for something that will be nil 90% of the time.

Member:

i.e., something like this:

type SubServerOption func(status *subServer)

// WithIsReadyOverride is a functional option that can be used to set a call
// back function that is used to check if a system is ready _iff_ the system
// running status is not yet true. The call-back will be passed the request URI
// along with any manual status that has been set for the subsystem.
func WithIsReadyOverride(fn func(string, string) (bool, bool)) SubServerOption {
	return func(status *subServer) {
		status.isReadyOverride = fn
	}
}

// RegisterSubServer will create a new sub-server entry for the Manager to
// keep track of.
func (s *Manager) RegisterSubServer(name string, opts ...SubServerOption) {
	s.mu.Lock()
	defer s.mu.Unlock()

	s.registerSubServerUnsafe(name, true, opts...)
}

// RegisterAndEnableSubServer will create a new sub-server entry for the
// Manager to keep track of and will set it as enabled.
func (s *Manager) RegisterAndEnableSubServer(name string,
	opts ...SubServerOption) {

	s.mu.Lock()
	defer s.mu.Unlock()

	s.registerSubServerUnsafe(name, false, opts...)
}

func (s *Manager) registerSubServerUnsafe(name string, disabled bool,
	opts ...SubServerOption) {

	ss := newSubServerStatus(disabled)

	for _, o := range opts {
		o(ss)
	}

	s.subServers[name] = ss
}

Member:

Then, IsSystemReady can be changed to:

func (s *Manager) IsSystemReady(name, req string) (bool, bool, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()

	server, ok := s.subServers[name]
	if !ok {
		return false, false, fmt.Errorf("a sub-server with "+
			"name %s has not yet been registered", name)
	}

	if server.disabled {
		return false, true, nil
	}

	if server.running {
		return true, false, nil
	}

	if server.isReadyOverride == nil {
		return false, false, nil
	}

	isReady, handled := server.isReadyOverride(req, server.manualStatus)
	if handled {
		return false, false, nil
	}

	return isReady, false, nil
}

Member:

Then the code in terminal.go becomes:

	lndOverride := func(uri, manualStatus string) (bool, bool) {
		if uri != "/lnrpc.State/GetState" {
			return false, false
		}

		return manualStatus == lndWalletReadyStatus, true
	}

	// Register LND, LiT and Accounts with the status manager.
	g.statusMgr.RegisterAndEnableSubServer(
		subservers.LND, status.WithIsReadyOverride(lndOverride),
	)
	g.statusMgr.RegisterAndEnableSubServer(subservers.LIT)
	g.statusMgr.RegisterSubServer(subservers.ACCOUNTS)

Contributor (author):

Thanks a lot for the feedback @ellemouton 🔥🚀!! I've updated the PR to implement the above, with some small modifications.

Definitely agree that the new version is much cleaner!

Resolved review threads: status/manager.go (×2), terminal.go (×2).
@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-11-restart-lndclient-on-failure branch 5 times, most recently from 9e605f0 to 42c7916 Compare December 14, 2023 02:26
@ViktorTigerstrom (Contributor, author):

Thanks a lot for the feedback and for reviewing this so quickly @ellemouton! I've addressed your feedback with the latest push.

@ellemouton (Member) left a comment:

Nice work @ViktorTigerstrom!

This is basically there 👍. The last major comment is about preventing the loops from spinning; once that is addressed, I'll do a final testing round.

Resolved review threads: status/manager.go (×2), litrpc/lit-status.proto, terminal.go.
Comment on lines 95 to 113
	var isReady, handled bool

	if server.isReadyOverride != nil {
		isReady, handled = server.isReadyOverride(
			req, server.manualStatus,
		)
	}

	if !handled || server.Running {
		return server.Running, false, nil
	}

	return isReady, false, nil
}
Member:

I find this a bit hard to read.

How about:

	if server.disabled {
		return false, true, nil
	}

	// If this system has an override check set, then we first check that
	// to see if it overrides the "ready" status of the system.
	if server.isReadyOverride != nil {
		isReady, handled := server.isReadyOverride(
			req, server.manualStatus,
		)
		if handled {
			return isReady, false, nil
		}
	}

	return server.running, false, nil

Contributor (author):

Thanks, I've updated to your version, with a slight modification to the first if check: we also need to ensure that the system isn't marked as running when we let the override function decide whether the request should be accepted. Hope that didn't ruin the readability too much :)!
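
For reference, a reconstructed sketch of that modification, based on the description above (not necessarily the exact committed code):

	if server.disabled {
		return false, true, nil
	}

	// Only consult the override while the system is not yet marked as
	// running; once it is running, the normal is-ready check below is
	// authoritative.
	if !server.running && server.isReadyOverride != nil {
		isReady, handled := server.isReadyOverride(
			req, server.manualStatus,
		)
		if handled {
			return isReady, false, nil
		}
	}

	return server.running, false, nil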

Resolved review threads: terminal.go (×3), status/manager.go.
@ViktorTigerstrom (Contributor, author):

Thanks again for reviewing this PR so quickly @ellemouton 🚀🔥!!! I've addressed your latest feedback with the latest push :)

@levmi levmi requested a review from guggero December 15, 2023 17:31
@guggero (Member) left a comment:

Nice change; it will be very helpful for users to know what's going on under the hood!
Nothing major from my end, just a couple of comments around the re-try loop itself.

Resolved review threads: status/manager.go, litrpc/lit-status.proto.
	case <-interceptor.ShutdownChannel():
		return fmt.Errorf("received the shutdown signal")

	case <-time.After(defaultStartupTimeout):
Member:

If we want this to be the maximum time we wait in total, then we need to create the time.After() channel outside of the callback. Otherwise we wait this long for every re-try (since we create a new ticker channel each time).

Member:

Or do we want to wait 5 seconds every time? Wasn't super clear to me from the comment above.

Contributor (author):

> Or do we want to wait 5 seconds every time? Wasn't super clear to me from the comment above.

Yeah, this was my intention, so that we don't spin, as per @ellemouton's comment #694 (comment), but we still retry initiating the lndclient(s) when they fail to be set up.
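
To make the difference concrete, here is a sketch of the two timer placements (defaultStartupTimeout comes from the PR; the helper names and retryDelay are assumptions):

// Per-attempt delay (the behaviour intended here): time.After is called
// inside the loop, so every retry waits up to defaultStartupTimeout again.
func retryWithPerAttemptDelay(attempt func() error,
	shutdown <-chan struct{}) error {

	for {
		err := attempt()
		if err == nil {
			return nil
		}

		select {
		case <-shutdown:
			return err
		case <-time.After(defaultStartupTimeout):
			// A fresh channel per iteration: the wait restarts on
			// every attempt.
		}
	}
}

// Overall deadline: the channel is created once, before the loop, so
// defaultStartupTimeout caps the total time across all retries, with a
// shorter pause between attempts.
func retryWithOverallDeadline(attempt func() error,
	shutdown <-chan struct{}) error {

	const retryDelay = time.Second // assumed per-attempt pause

	deadline := time.After(defaultStartupTimeout)
	for {
		err := attempt()
		if err == nil {
			return nil
		}

		select {
		case <-shutdown:
			return err
		case <-deadline:
			return errors.New("startup timed out")
		case <-time.After(retryDelay):
			// Short pause, then retry until the deadline fires.
		}
	}
}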

	Insecure:              insecure,
	CustomMacaroonPath:    macPath,
	CustomMacaroonHex:     hex.EncodeToString(macData),
	BlockUntilChainSynced: true,
Member:

Since we are passing in BlockUntilChainSynced: true here, we don't actually need to re-try ourselves, as that will already initiate a re-try loop, as long as the ctxc context we pass in doesn't expire too early.

Member:

Same goes for BlockUntilUnlocked: true, I guess. So maybe we could just try to initiate the full client first, as that already does all the retrying we want and need, and then initiate the basic client afterwards, where we can be pretty certain it will succeed?

Member:

I guess what we'd lose that way is the extra info exposed to the user via the SetErrored field. So I'm fine with leaving the order the way it is, since it gives the user more information. But I'm fairly certain we don't need the retry for-loop here as well.

Contributor (author):

> Since we are passing in BlockUntilChainSynced: true here, we don't actually need to re-try ourselves, as that will already initiate a re-try loop, as long as the ctxc context we pass in doesn't expire too early.

So unfortunately this isn't really the case: logs we have seen from users, such as @Liongrass, show the ctxc timing out during lnd's chain sync, for example when lnd took 2+ minutes to sync successfully. The GetInfo call made during the chain sync in lndclient then times out, and therefore the setup of the full client fails. If we had retried setting up the full lndclient in those cases, the lnd sub-server in litd would eventually have started successfully.
One potential alternative would have been to increase the ctxc timeout, but as there's no specific context just for the chain sync, any future RPC call made after the chain sync would also use the increased timeout, which wouldn't be ideal.

This is why we decided to retry setting up the lndclient indefinitely until it succeeds, as a temporary fix until we implement a more general solution that restarts litd sub-servers when required.
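
A sketch of the chosen approach in terms of the lndclient API (the config fields are the ones quoted from the diff above; the function name, logger, and the shutdown/delay plumbing are assumptions):

// setUpFullClientWithRetry keeps building the full lndclient until it
// succeeds, so a GetInfo timeout during a long chain sync no longer
// permanently fails litd's lnd sub-server.
func setUpFullClientWithRetry(shutdown <-chan struct{}) (
	*lndclient.GrpcLndServices, error) {

	for {
		svcs, err := lndclient.NewLndServices(&lndclient.LndServicesConfig{
			Insecure:              insecure,
			CustomMacaroonPath:    macPath,
			CustomMacaroonHex:     hex.EncodeToString(macData),
			BlockUntilChainSynced: true,
			// Remaining config fields elided in this sketch.
		})
		if err == nil {
			return svcs, nil
		}

		// Log the failure (the PR also surfaces it via the status
		// manager), then wait and retry.
		log.Errorf("Error when creating LND Services client: %v", err)

		select {
		case <-shutdown:
			return nil, err
		case <-time.After(defaultStartupTimeout):
		}
	}
}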

Member:

Ah right, good point. Makes sense.

@lightninglabs-deploy:

@ellemouton: review reminder
@ViktorTigerstrom, remember to re-request review from reviewers when ready

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-11-restart-lndclient-on-failure branch from 3c325b3 to dfd190a Compare January 3, 2024 01:00
@ellemouton (Member) left a comment:

Looks good to me! Contingent on what Oli thinks re the retry loops :)

Btw, I think you need to update your lndclient test branch, as the current redirect commit doesn't compile.

Notes from testing:

  • Saw this getting logged: Setting the lit sub-server as errored: Error %d when creating LND Services client: %v - we have to make sure that the values are populated in the log. But maybe this is just a factor of the test? Also saw this:
    STAT: Setting the lit sub-server as errored: Error when setting up basic LND Client: %v
    (A likely shape of this bug is sketched after this list.)
  • This is probably not related to this PR, but I get this error for taproot-assets: "error": "unable to initialize RPC server: unable to start asset custodian: error listing existing addresses: unknown sqlite error: SQL logic error: no such column: asset_version (1)" - have you seen this before? Perhaps something we should report to the assets team?
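
A likely shape of that logging bug, as a reconstruction (the SetErrored signature here is assumed, not necessarily the actual one):

// Buggy (reconstruction): the message is stored without formatting, so
// the "%v" placeholder appears verbatim in the status output.
statusMgr.SetErrored(subservers.LIT, "Error when setting up basic LND Client: %v")

// Fixed: apply the arguments before handing the message over.
statusMgr.SetErrored(
	subservers.LIT,
	fmt.Sprintf("Error when setting up basic LND Client: %v", err),
)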

Resolved review threads: status/manager.go (×3), litrpc/lit-status.proto, terminal.go (×3), rpc_proxy.go.
@guggero guggero self-requested a review January 4, 2024 08:43
Refactor the Register & Enable functions to enable setting SubServerOptions, i.e. functional options, when initiating a SubServer.

isReadyOverride is a callback that, when set and only if `running` is not yet true, will be used to determine if a system is ready for a call. We pass the request URI to this method along with the `manualStatus`. The first returned boolean is true if the system should be seen as ready, and the second is true if the override does handle the given request. If it does not, then we fall back to our normal is-ready check.

Ensure that the sub-server error string is set to an empty string when marking a sub-server as running. This is important for the next commit, which will retry starting the lnd sub-server even if we error when starting the lnd client(s), and may therefore mark the lnd sub-server as running even after we've errored when starting the lnd client(s).
@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-11-restart-lndclient-on-failure branch from dfd190a to fcd22a7 Compare January 5, 2024 02:47
@ViktorTigerstrom (Contributor, author):

Thanks a lot for the review @ellemouton 🎉!!

I've addressed your feedback with the latest push!

> Btw, I think you need to update your lndclient test branch, as the current redirect commit doesn't compile.

I've now updated the branch so that it builds and can be used for testing! One thing to note, though, is that I couldn't get it to work with lndclient v0.17.0-4 for some reason. There seems to be some conflict with taproot-assets that I just couldn't figure out how to solve within a reasonable time. So in the test branch I've downgraded to v0.17.0-3, which subsequently downgraded taproot-assets as well. But it can be used to test the functionality of this PR!

> Saw this getting logged: Setting the lit sub-server as errored: Error %d when creating LND Services client: %v - we have to make sure that the values are populated in the log.

Thanks! This was a pre-existing error that I've now addressed with 3b83d6e :)

> This is probably not related to this PR, but I get this error for taproot-assets:

I hadn't seen this, but as I downgraded the lndclient (and therefore the taproot-assets version) to make the test branch compile, and then upgraded again and so on, I actually managed to trigger it as well! Did you do something similar before the error occurred for you?
However, after I removed the testnet folder (my configured network) in the taproot-assets application data and restarted litd, the error disappeared. I suspect it's some migration thing going on in taproot-assets. But as my way of triggering the error is probably nothing a normal user will do, I'm not sure this is something we should actually report. It would be really good to get @guggero's opinion on this as well, though!

@guggero (Member) left a comment:

Code looks good to me, very nice work!

> One thing to note, though, is that I couldn't get it to work with lndclient v0.17.0-4 for some reason.

I'm not exactly sure what this is referring to, because the current version of the PR doesn't modify the lndclient version in the go.mod file, so it's currently at v0.17.0-4. What test branch of lndclient were you talking about?

> I suspect it's some migration thing going on in taproot-assets.

Yes, that's very likely it. If the version of taproot assets was changed between versions of PRs, a downgrade would explain the error. So all good I think.

@ViktorTigerstrom (Contributor, author):

Thanks a lot for the review @guggero 🎉🔥!

> I'm not exactly sure what this is referring to, because the current version of the PR doesn't modify the lndclient version in the go.mod file, so it's currently at v0.17.0-4. What test branch of lndclient were you talking about?

Ah sorry, I should have been clearer about what I was referring to. This refers to the test branch for locally testing the restart functionality that I mentioned in the original description of this PR above:
https://github.com/ViktorTigerstrom/lightning-terminal/tree/2023-12-restart-lndclient-on-failure-mocked-lndclient-fails

> Yes, that's very likely it. If the version of taproot assets was changed between versions of PRs, a downgrade would explain the error. So all good I think.

Ok great, sounds good 🚀!

@ellemouton ellemouton merged commit ee69bbd into lightninglabs:master Jan 5, 2024
12 checks passed