[hardening]: synchronously handle on-chain events during startup #8166

Open
morehouse opened this issue Nov 10, 2023 · 6 comments
Labels
advanced (Issues suitable for very experienced developers), brainstorming (Long term ideas/discussion/requests for feedback), P2 (should be fixed if one has time), security (General label for issues/PRs related to the security of the software), server

Comments

@morehouse
Collaborator

When LND starts up, it should handle any immediately serviceable on-chain events (breach retribution, sweeping HTLCs, etc.) prior to starting the connection manager. Once the connection manager is running, LND's attack surface widens greatly, so it is best to handle critical on-chain things beforehand.

During startup, we already wait for the wallet to sync with the blockchain before doing anything else. But we don't do the same for ChainArbitrator in server.Start(). This creates a race where it is possible for a peer to connect to our node before on-chain events have been handled.

This could be a problem if an attacker has a fast way to crash our node after connecting (e.g., ping of death). They could crash our node and broadcast a commitment tx while we're offline. Then every time our node tries to restart, the attacker may be able to crash it again before it punishes revoked commitments or sweeps HTLCs.

@morehouse added the security and server labels on Nov 10, 2023
@saubyk added the P1 (MUST be fixed or reviewed) and brainstorming labels on Nov 11, 2023
@Roasbeef
Member

Today we block startup until the chain backend is synced with the wallet:

lnd/lnd.go, lines 648 to 672 at 64753b0:

```go
for {
	if !interceptor.Alive() {
		return nil
	}

	synced, ts, err := activeChainControl.Wallet.IsSynced()
	if err != nil {
		return mkErr("unable to determine if wallet is "+
			"synced: %v", err)
	}

	ltndLog.Debugf("Syncing to block timestamp: %v, is synced=%v",
		time.Unix(ts, 0), synced)

	if synced {
		break
	}

	time.Sleep(time.Second * 1)
}

_, bestHeight, err = activeChainControl.ChainIO.GetBestBlock()
if err != nil {
	return mkErr("unable to determine chain tip: %v", err)
}
```

The sub-system that handles on-chain claims is created a bit later in the life cycle here:

lnd/server.go, lines 1981 to 1985 at 64753b0:

```go
if err := s.chainArb.Start(); err != nil {
	startErr = err
	return
}
```

To accomplish this, we'll need to defer connecting to our set of persistent peers:

lnd/server.go, lines 2091 to 2148 at 64753b0:

```go
// If peers are specified as a config option, we'll add those
// peers first.
for _, peerAddrCfg := range s.cfg.AddPeers {
	parsedPubkey, parsedHost, err := lncfg.ParseLNAddressPubkey(
		peerAddrCfg,
	)
	if err != nil {
		startErr = fmt.Errorf("unable to parse peer "+
			"pubkey from config: %v", err)
		return
	}
	addr, err := parseAddr(parsedHost, s.cfg.net)
	if err != nil {
		startErr = fmt.Errorf("unable to parse peer "+
			"address provided as a config option: "+
			"%v", err)
		return
	}

	peerAddr := &lnwire.NetAddress{
		IdentityKey: parsedPubkey,
		Address:     addr,
		ChainNet:    s.cfg.ActiveNetParams.Net,
	}

	err = s.ConnectToPeer(
		peerAddr, true,
		s.cfg.ConnectionTimeout,
	)
	if err != nil {
		startErr = fmt.Errorf("unable to connect to "+
			"peer address provided as a config "+
			"option: %v", err)
		return
	}
}

// Subscribe to NodeAnnouncements that advertise new addresses
// our persistent peers.
if err := s.updatePersistentPeerAddrs(); err != nil {
	startErr = err
	return
}

// With all the relevant sub-systems started, we'll now attempt
// to establish persistent connections to our direct channel
// collaborators within the network. Before doing so however,
// we'll prune our set of link nodes found within the database
// to ensure we don't reconnect to any nodes we no longer have
// open channels with.
if err := s.chanStateDB.PruneLinkNodes(); err != nil {
	startErr = err
	return
}

if err := s.establishPersistentConnections(); err != nil {
	startErr = err
	return
}
```

Then we'd implement some async routine where the chain arb tells the server that all contracts have finished resolving. I think this should be restricted to breaches, plus commitment outputs or HTLCs that have fully expired and are ready to sweep; otherwise, we'd potentially stay in that mode for days.

@yyforyongyu added the advanced (Issues suitable for very experienced developers) label on Nov 21, 2023
@yyforyongyu
Member

The start/stop order of subsystems can be tricky; there's some relevant context and previous work here.

I think before attempting to fix this issue, we should refactor server.go first (#7283) so it's easier to reason about the startup process.

@saubyk added the P2 (should be fixed if one has time) label and removed the P1 (MUST be fixed or reviewed) label on Nov 21, 2023
@ziggie1984
Collaborator

By connection manager, do you mean the Brontide peer manager? Based on experiences from the past, I would even suggest going the other way around: we should first exchange channel reestablishment messages, which could tell us whether our channel state is out of date, and as a result skip any channel arbitrator actions for that specific channel.

For example, @xmrk-btc had a case where his node came online with an outdated channel.db. A couple of channels had HTLCs on them (in the old channel.db state) which had timed out at the current block height. So even before the channel was reestablished, which might have let the peer supply the correct state via SCB recovery, the channel arbitrator immediately went on-chain for the channel with the HTLC on it, causing funds to be lost. So I would suggest starting the channel arbitrator for a channel only if our peer does not cooperate with the reestablishment message.

So I think both the self-harm scenario and the attack scenario have to be considered together for the best outcome.

@morehouse
Collaborator Author

morehouse commented Mar 28, 2024

By connection manager, do you mean the Brontide peer manager?

There's a connection manager in server.go that manages the Brontide object for each peer.

Based on experiences from the past, I would even suggest going the other way around: we should first exchange channel reestablishment messages, which could tell us whether our channel state is out of date, and as a result skip any channel arbitrator actions for that specific channel.

How would we detect if the peer is lying in the channel reestablish?

The peer could falsely claim to have a newer state, causing our node to attempt DLP and not start the channel arbitrator. Then the peer could steal funds onchain without our node detecting it.

The other problem is the attack surface widens dramatically after peers are allowed to connect -- peers can send anything they want over the connection, potentially crashing our node if there's a bug. If we always reconnect before starting channel arbitrator, the peer always gets a chance to crash us before we can detect stealing attempts.

@ziggie1984
Collaborator

ziggie1984 commented Mar 28, 2024

How would we detect if the peer is lying in the channel reestablish?

From lightning/bolts#1049, I think you can tell that a peer is pretending to have a newer commitment state when it cannot supply the right your_last_per_commitment_secret for state N-1. We can still be given the false impression that we have the latest state when we don't, but at least the former check would prevent us from force-closing when our peer is honest and actually tells us that our state is outdated.

From the spec:

- if `next_revocation_number` is greater than expected above, AND
    `your_last_per_commitment_secret` is correct for that
    `next_revocation_number` minus 1:

The other problem is the attack surface widens dramatically after peers are allowed to connect -- peers can send anything they want over the connection, potentially crashing our node if there's a bug. If we always reconnect before starting channel arbitrator, the peer always gets a chance to crash us before we can detect stealing attempts.

I agree with that. I am also inclined to always think adversarially, though I would like to see some measures in place that would prevent us from harming ourselves. For example, if our peer was honest, it would have signaled us via the channel reestablishment message not to go on-chain with an older state.

@morehouse
Collaborator Author

From lightning/bolts#1049, I think you can tell that a peer is pretending to have a newer commitment state when it cannot supply the right your_last_per_commitment_secret for state N-1. We can still be given the false impression that we have the latest state when we don't, but at least the former check would prevent us from force-closing when our peer is honest and actually tells us that our state is outdated.

Cool, thanks for explaining.

I agree with that. I am also inclined to always think adversarially, though I would like to see some measures in place that would prevent us from harming ourselves. For example, if our peer was honest, it would have signaled us via the channel reestablishment message not to go on-chain with an older state.

I think "safe mode" would be great for this, especially if we can enable it automatically when the node has been offline for a long time.


There's been a bit of discussion about this issue over at #8148 (comment), with different ideas about what should be done.

@ProofOfKeags mentioned isolating the contractcourt and sweeper entirely from the rest of LND, as a separate process. I think this is the holy grail for protecting funds under DoS, but it's a big project. It would be significantly more work than I intended to propose in this issue.

@Roasbeef suggested waiting for commitments and HTLCs to confirm on chain before allowing peer connections, providing similar isolation to running a separate process, but during startup only. The major downside of this is the 10+ minute delay (potentially hours) it would add to startup times before peers could connect.

I think we could get some quick incremental improvement over the current state without too much work or downside. Rather than waiting for transactions to confirm we could simply ensure they are broadcast before allowing peer connections. Concretely, we would do something like this:

  • Modify BreachArbitrator.start() to only return once initial justice transactions have been broadcast for any breaches loaded from the retribution store.
  • Modify ChainArbitrator.Start() to wait for all chainWatcher.Start() calls to finish before starting the ChannelArbitrators.
  • Modify chainWatcher.Start() to do a nonblocking check for spend notifications. If the funding outpoint was spent, handle it immediately without spawning a goroutine to wait for notifications.
  • Modify all ContractResolvers as follows:
    • Introduce a new InitResolve() method to do any nonblocking steps of Resolve() (e.g., publishing transactions, checking preimage store).
    • Reduce the scope of Resolve() to handling any blocking steps (e.g., waiting for transactions to confirm, waiting for new blocks).
  • Modify ChannelArbitrator.Start() to do a nonblocking check for chain events. If one is found, handle it immediately before spawning the channelAttendant goroutine. Any resolvers created should have InitResolve called and handled before spawning goroutines to do the Resolve() loops.
  • Modify server.Start() to call a new UtxoSweeper.ForceSweep() method after the UtxoSweeper, BreachArbitrator, and ChainArbitrator have started. ForceSweep will return after all pending sweeps have been broadcast.

With these changes, any transactions that need to be published at the current block should be published during startup. Goroutines are only spawned for future events, not things that can be handled immediately.

We'd want to do some benchmarking of startup times, but I suspect the effect will be relatively small.
