-
I wonder if we should just do this for the async event handler, at least that would be a relatively trivial change - we already get all the pending events ready and then just blindly loop over them, so it'd be pretty easy to spawn all the futures and then await all of them.
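A minimal sketch of what that could look like, assuming a handler of type `Fn(Event) -> Future` and the `futures` crate; `handle_all_concurrently` and the generic `E` are illustrative stand-ins, not LDK internals:

```rust
use futures::future::join_all; // assumes the `futures` crate is available

/// Illustrative only: instead of `for event in events { handler(event).await }`,
/// build every future up front and await them together. `E` stands in for
/// LDK's `Event` type.
async fn handle_all_concurrently<E, F, Fut>(events: Vec<E>, handler: F)
where
    F: Fn(E) -> Fut,
    Fut: std::future::Future<Output = ()>,
{
    // All handler futures now make progress concurrently within this task.
    join_all(events.into_iter().map(handler)).await;
}
```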
-
Right, this is definitely a mess, but it's also unclear what exactly to do about it. For most events, if they aren't handled we have to replay them, but when do we replay? And in general, if the application can't connect to its disk to write information about a received payment, should we really keep running, or should we just panic and crash?
-
Just discussed this with @phlip9. We were under the impression that there might be some ordering constraints between the handling of certain event kinds, but after closer inspection we couldn't come up with any examples. The only wrinkle I could find was the …
Right, this is tricky. Suppose that we modify the async event handler so that users are required to return an `EventHandleError`:

```rust
enum EventHandleError {
    /// LDK will replay the event at the next call to `process_events_async`.
    // Maybe this enum variant isn't actually needed, since the user can just
    // handle retries themselves?
    ReplayImmediately { err: String },
    /// LDK will replay the event the next time the `ChannelManager` is deserialized.
    // In other words, some in-memory bit is set to indicate that a particular
    // event has been surfaced to the user, which is cleared when the program
    // shuts down. The event will not be replayed at the next call to
    // `process_events_async`.
    RetryAtNextBoot { err: String },
    /// There was an error but it was acceptable; LDK will log the error and
    /// consider the event as handled.
    // Alternatively, the user can log the error themselves and just return `Ok`.
    NoRetry { err: String },
}
```

This then allows the user to make decisions about running or shutting down the node themselves, including using the graceful shutdown mechanisms that they are likely to have implemented, without having to …

I should also point out that I do think we should try to avoid panics if possible, because many users will want to use their graceful shutdown mechanisms.
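To make that concrete, here's a rough sketch of a handler under this proposal; `Event`, `persist_payment`, and the exact signature are assumptions for illustration, not an existing LDK API:

```rust
// Hypothetical handler under the proposed contract; `Event` and
// `persist_payment` are illustrative stand-ins, not real LDK items.
enum Event { PaymentClaimable { amount_msat: u64 }, Other }

async fn persist_payment(_event: &Event) -> Result<(), std::io::Error> {
    Ok(()) // imagine a database write here
}

async fn handle_event(event: Event) -> Result<(), EventHandleError> {
    match &event {
        Event::PaymentClaimable { .. } => {
            // Disk unreachable? Ask LDK to hold the event until restart so
            // we can begin our own graceful shutdown in the meantime.
            persist_payment(&event).await.map_err(|e| {
                EventHandleError::RetryAtNextBoot { err: e.to_string() }
            })
        }
        // Events we merely log; a failure here would map to `NoRetry`.
        Event::Other => Ok(()),
    }
}
```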
-
Background
Right now, the `EventsProvider` trait states: "handlers MUST fully handle `Event`s and persist any relevant changes to disk before returning". However, this effectively prevents LDK users from implementing any form of concurrency that would allow them to reduce latency / response times when under high load and generating lots of `Event`s. If 10 pending events are emitted in a call to `process_events[_async]`, and all of these need to persist data, e.g. by making a request to a database, all 10 calls have to be executed in serial, when in practice the `Event`s may have little to no relation to each other and LDK users can handle these events more efficiently by processing them in parallel.

This property is important for us because we expect to have a single (or very few) LSPs handle and route payments for a large number of user nodes. It's not a big deal if user nodes have to process events serially, but our LSP needs to be able to process events concurrently, otherwise our users will experience latency spikes when the LSP is under high load.
Possible solutions
One solution that would work for us (let's call it the "events batch" approach) is to allow us to pass in an (optionally async) event handler that handles one batch of events at a time, e.g. by taking a `Vec<Event>` as input instead of just a single `Event`. Then, we could serialize and persist the batch of events in one call to our own DB, and safely return from the `handle_event_batch` function along with our promise that we'll handle the events eventually (possibly when the node restarts at a later point in time). LDK can then consider the events handled and repersist the channel manager / chain monitor with the pending events queue drained, while our own code works through the persisted events. If there is e.g. a network outage that prevents an event from being handled, we can just shut down our node and continue trying to handle the event at a later point in time.
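A rough sketch of what this interface might look like; the trait and method names are hypothetical, not an existing LDK API:

```rust
// Hypothetical shape of the "events batch" approach; neither the trait nor
// `handle_event_batch` exists in LDK today. `Event` stands in for LDK's
// event enum.
struct Event;

#[async_trait::async_trait] // assumes the `async-trait` crate for the async method
trait BatchEventHandler {
    /// Serialize and persist the whole batch in a single DB round trip, then
    /// return. After this returns `Ok`, LDK may drain its pending-events
    /// queue and repersist the manager / monitor; the application replays
    /// the events from its own storage, even across restarts.
    async fn handle_event_batch(&self, events: Vec<Event>) -> Result<(), String>;
}
```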
An alternative approach (let's call it the "callback" approach) that would also work for us is like the channel monitor async persistence API: associate each `Event` with an id, then call back into the channel manager / chain monitor whenever we have finished processing any event. This would allow us to spawn off the handling of `Event`s to a dedicated task immediately after they come in. When the channel manager / chain monitor are notified that an event has been handled, they are free to remove the event from their internal pending queues and trigger a repersist. If the node crashes after the LDK user's event handling tasks have handled the event but before the manager / monitor have had a chance to repersist, the events are re-emitted upon startup and handled again, which is fine because the handlers are required to be idempotent.
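And a correspondingly rough sketch of the "callback" approach; `EventId`, `event_handled`, and the surrounding types are invented here for illustration:

```rust
use std::sync::Arc;

// Invented sketch of the "callback" approach; LDK has no `EventId` or
// `event_handled` today.
#[derive(Clone, Copy)]
struct EventId(u64);
struct Event;

struct Manager;
impl Manager {
    /// On this call, the manager would drop the event from its pending
    /// queue and trigger a repersist.
    fn event_handled(&self, _id: EventId) { /* ... */ }
}

async fn handle_one(_event: Event) { /* application-specific work */ }

fn dispatch(events: Vec<(EventId, Event)>, manager: Arc<Manager>) {
    for (id, event) in events {
        let manager = Arc::clone(&manager);
        // Handling is spawned immediately; the event stays pending inside
        // LDK until we call back, so a crash before that point simply
        // re-emits it at startup (handlers must be idempotent).
        tokio::spawn(async move {
            handle_one(event).await;
            manager.event_handled(id);
        });
    }
}
```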
More comments
Side note: Purposefully preventing `handle_event` from returning is unorthodox and probably not the best approach
Btw, in the current event handler contract, if the node determines that an event cannot be handled, e.g. due to a temporary lack of network access, LDK users have to use some strange mechanism like `std::thread::park()` or `std::future::pending().await` to ensure the call to `handle_events[_async]` never returns. Meanwhile, they may need to do some intricate coordination to gracefully shut down their node while keeping the event handling task / thread suspended, so that the manager / monitor never repersist and thus lose the unhandled event. The alternative approach of simply letting a handler retry its failing job indefinitely doesn't seem ideal either. Both of the solutions outlined above allow the user to avoid this.
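For reference, the kind of workaround the current contract forces looks roughly like the following; `Event` and `try_handle` are stand-ins for the application's real types and handling logic:

```rust
// Sketch of the "never return" workaround under the current contract;
// `Event` and `try_handle` are illustrative stand-ins.
struct Event;
async fn try_handle(_event: &Event) -> Result<(), String> { Ok(()) }

async fn handle_event(event: Event) {
    if let Err(err) = try_handle(&event).await {
        eprintln!("event handling failed, suspending forever: {err}");
        // Never return: LDK must not consider the event handled, so this
        // future parks forever and graceful shutdown now has to be
        // coordinated around a permanently suspended task.
        std::future::pending::<()>().await;
    }
}
```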
Latency and complexity considerations
The 'events batch' approach seems suboptimal from a latency standpoint: LDK generates an event, waits for a manager / monitor persist to ensure the event isn't lost, emits the batch of events to the user, the user persists the events batch, and only then does actual event handling begin. The response time to any event already has a lower bound of how long it takes for the channel manager to repersist; exposing an events batch adds yet another persistence round trip to this lower bound.
The 'callback' approach has better latency, but does seem slightly more complicated from an implementation standpoint. It also raises the question of when a pending `Event` should be re-emitted by LDK - at the next call to `process_pending_events[_async]()`, or after program memory has been cleared due to a node restart?

I'm open to all approaches here - just wanted to start a discussion, share some thoughts and observations, and communicate the constraints that matter most to us.