[hardening]: synchronously handle on-chain events during startup #8166

Open
morehouse opened this issue Nov 10, 2023 · 6 comments
Labels
advanced (Issues suitable for very experienced developers), brainstorming (Long term ideas/discussion/requests for feedback), P2 (should be fixed if one has time), security (General label for issues/PRs related to the security of the software), server

Comments

@morehouse
Collaborator

When LND starts up, it should handle any immediately serviceable on-chain events (breach retribution, sweeping HTLCs, etc.) prior to starting the connection manager. Once the connection manager is running, LND's attack surface widens greatly, so it is best to handle critical on-chain things beforehand.

During startup, we already wait for the wallet to sync with the blockchain before doing anything else. But we don't do the same for ChainArbitrator in server.Start(). This creates a race where it is possible for a peer to connect to our node before on-chain events have been handled.

This could be a problem if an attacker has a fast way to crash our node after connecting (e.g., ping of death). They could crash our node and broadcast a commitment tx while we're offline. Then every time our node tries to restart, the attacker may be able to crash it again before it punishes revoked commitments or sweeps HTLCs.

@morehouse added the security and server labels on Nov 10, 2023
@saubyk added the P1 (MUST be fixed or reviewed) and brainstorming labels on Nov 11, 2023
@Roasbeef
Member

Today we block startup until the chain backend is synced with the wallet:

lnd/lnd.go, lines 648 to 672 at 64753b0:

```go
for {
	if !interceptor.Alive() {
		return nil
	}

	synced, ts, err := activeChainControl.Wallet.IsSynced()
	if err != nil {
		return mkErr("unable to determine if wallet is "+
			"synced: %v", err)
	}

	ltndLog.Debugf("Syncing to block timestamp: %v, is synced=%v",
		time.Unix(ts, 0), synced)

	if synced {
		break
	}

	time.Sleep(time.Second * 1)
}

_, bestHeight, err = activeChainControl.ChainIO.GetBestBlock()
if err != nil {
	return mkErr("unable to determine chain tip: %v", err)
}
```

The sub-system that handles on-chain claims is created a bit later in the life cycle here:

lnd/server.go, lines 1981 to 1985 at 64753b0:

```go
if err := s.chainArb.Start(); err != nil {
	startErr = err
	return
}
```

To accomplish this, we'll need to defer connecting to our set of persistent peers:

lnd/server.go, lines 2091 to 2148 at 64753b0:

```go
// If peers are specified as a config option, we'll add those
// peers first.
for _, peerAddrCfg := range s.cfg.AddPeers {
	parsedPubkey, parsedHost, err := lncfg.ParseLNAddressPubkey(
		peerAddrCfg,
	)
	if err != nil {
		startErr = fmt.Errorf("unable to parse peer "+
			"pubkey from config: %v", err)
		return
	}
	addr, err := parseAddr(parsedHost, s.cfg.net)
	if err != nil {
		startErr = fmt.Errorf("unable to parse peer "+
			"address provided as a config option: "+
			"%v", err)
		return
	}

	peerAddr := &lnwire.NetAddress{
		IdentityKey: parsedPubkey,
		Address:     addr,
		ChainNet:    s.cfg.ActiveNetParams.Net,
	}

	err = s.ConnectToPeer(
		peerAddr, true,
		s.cfg.ConnectionTimeout,
	)
	if err != nil {
		startErr = fmt.Errorf("unable to connect to "+
			"peer address provided as a config "+
			"option: %v", err)
		return
	}
}

// Subscribe to NodeAnnouncements that advertise new addresses
// our persistent peers.
if err := s.updatePersistentPeerAddrs(); err != nil {
	startErr = err
	return
}

// With all the relevant sub-systems started, we'll now attempt
// to establish persistent connections to our direct channel
// collaborators within the network. Before doing so however,
// we'll prune our set of link nodes found within the database
// to ensure we don't reconnect to any nodes we no longer have
// open channels with.
if err := s.chanStateDB.PruneLinkNodes(); err != nil {
	startErr = err
	return
}

if err := s.establishPersistentConnections(); err != nil {
	startErr = err
	return
}
```

Then we'd implement some async routine where the chain arb tells the server that all contracts have finished resolving. I think this should be restricted to breaches, plus commitment outputs or HTLCs that have fully expired and are ready to sweep; otherwise, we'd potentially stay in that mode for days.

@yyforyongyu added the advanced (Issues suitable for very experienced developers) label on Nov 21, 2023
@yyforyongyu
Member

The start/stop order of subsystems can be tricky; there's some relevant context and previous work here.

I think before attempting to fix this issue, we should refactor server.go first (#7283) so it's easier to reason about the startup process.

@saubyk added the P2 (should be fixed if one has time) label and removed the P1 (MUST be fixed or reviewed) label on Nov 21, 2023
@ziggie1984
Collaborator

By connection manager, do you mean the Brontide peer manager? Based on experiences from the past, I would even suggest going the other way around: we should first exchange channel reestablishment messages, which could tell us whether our channel state is out of date, and as a result skip any channel arbitrator actions for that specific channel.

For example, @xmrk-btc had a case where his node came online with an outdated channel.db. A couple of channels had HTLCs on them (in the old channel.db state) which had timed out at the current block height. So even before the channel was reestablished, which might have let the peer supply the correct state via SCB recovery, the channel arbitrator immediately went on-chain for the channel with the HTLC on it, causing funds to be lost. So I would suggest starting the channel arbitrator for a channel only if our peer does not cooperate with the reestablishment message.

So I think both the self-harm scenario and the attack scenario have to be considered together for the best outcome.

@morehouse
Collaborator Author

morehouse commented Mar 28, 2024

By connection manager, do you mean the Brontide peer manager?

There's a connection manager in server.go that manages the Brontide object for each peer.

Based on experiences from the past, I would even suggest going the other way around: we should first exchange channel reestablishment messages, which could tell us whether our channel state is out of date, and as a result skip any channel arbitrator actions for that specific channel.

How would we detect if the peer is lying in the channel reestablish?

The peer could falsely claim to have a newer state, causing our node to attempt DLP and not start the channel arbitrator. Then the peer could steal funds onchain without our node detecting it.

The other problem is the attack surface widens dramatically after peers are allowed to connect -- peers can send anything they want over the connection, potentially crashing our node if there's a bug. If we always reconnect before starting channel arbitrator, the peer always gets a chance to crash us before we can detect stealing attempts.

@ziggie1984
Collaborator

ziggie1984 commented Mar 28, 2024

How would we detect if the peer is lying in the channel reestablish?

From lightning/bolts#1049, I think you can tell that a peer is pretending to have a newer commitment state when it cannot supply the right your_last_per_commitment_secret for state N-1. We can still be given the false impression that we have the latest state when we don't, but at least the former check would prevent us from force-closing when our peer is honest and actually tells us that our state is outdated.

From the spec:

- if `next_revocation_number` is greater than expected above, AND
    `your_last_per_commitment_secret` is correct for that
    `next_revocation_number` minus 1:

The other problem is the attack surface widens dramatically after peers are allowed to connect -- peers can send anything they want over the connection, potentially crashing our node if there's a bug. If we always reconnect before starting channel arbitrator, the peer always gets a chance to crash us before we can detect stealing attempts.

I agree with that. I am also inclined to always think adversarially, though I would like to see some measures in place that would prevent us from harming ourselves. For example, if our peer was honest, it would have signaled us via the channel reestablishment message not to go on-chain with an older state.

@morehouse
Collaborator Author

From lightning/bolts#1049, I think you can tell that a peer is pretending to have a newer commitment state when it cannot supply the right your_last_per_commitment_secret for state N-1. We can still be given the false impression that we have the latest state when we don't, but at least the former check would prevent us from force-closing when our peer is honest and actually tells us that our state is outdated.

Cool, thanks for explaining.

I agree with that. I am also inclined to always think adversarially, though I would like to see some measures in place that would prevent us from harming ourselves. For example, if our peer was honest, it would have signaled us via the channel reestablishment message not to go on-chain with an older state.

I think "safe mode" would be great for this, especially if we can enable it automatically when the node has been offline for a long time.


There's been a bit of discussion about this issue over at #8148 (comment), with different ideas about what should be done.

@ProofOfKeags mentioned isolating the contractcourt and sweeper entirely from the rest of LND, as a separate process. I think this is the holy grail for protecting funds under DoS, but it's a big project. It would be significantly more work than I intended to propose in this issue.

@Roasbeef suggested waiting for commitments and HTLCs to confirm on chain before allowing peer connections, providing similar isolation to running a separate process, but during startup only. The major downside of this is the 10+ minute delay (potentially hours) it would add to startup times before peers could connect.

I think we could get some quick incremental improvement over the current state without too much work or downside. Rather than waiting for transactions to confirm we could simply ensure they are broadcast before allowing peer connections. Concretely, we would do something like this:

  • Modify BreachArbitrator.start() to only return once initial justice transactions have been broadcast for any breaches loaded from the retribution store.
  • Modify ChainArbitrator.Start() to wait for all chainWatcher.Start() calls to finish before starting the ChannelArbitrators.
  • Modify chainWatcher.Start() to do a nonblocking check for spend notifications. If the funding outpoint was spent, handle it immediately without spawning a goroutine to wait for notifications.
  • Modify all ContractResolvers as follows:
    • Introduce a new InitResolve() method to do any nonblocking steps of Resolve() (e.g., publishing transactions, checking preimage store).
    • Reduce the scope of Resolve() to handling any blocking steps (e.g., waiting for transactions to confirm, waiting for new blocks).
  • Modify ChannelArbitrator.Start() to do a nonblocking check for chain events. If one is found, handle it immediately before spawning the channelAttendant goroutine. Any resolvers created should have InitResolve called and handled before spawning goroutines to do the Resolve() loops.
  • Modify server.Start() to call a new UtxoSweeper.ForceSweep() method after the UtxoSweeper, BreachArbitrator, and ChainArbitrator have started. ForceSweep will return after all pending sweeps have been broadcast.

With these changes, any transactions that need to be published at the current block should be published during startup. Goroutines are only spawned for future events, not things that can be handled immediately.

We'd want to do some benchmarking of startup times, but I suspect the effect will be relatively small.
