-
I wonder if we should just do this for the async event handler, at least that would be a relatively trivial change - we already get all the pending events ready and then just blindly loop over them, so it'd be pretty easy to spawn all the futures and then await all of them.
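A minimal sketch of what that could look like, assuming a handler of type `Fn(Event) -> Future` and the `futures` crate; `handle_all_concurrently` and the generic `E` are illustrative stand-ins, not LDK internals:

```rust
use futures::future::join_all; // assumes the `futures` crate is available

/// Illustrative only: instead of `for event in events { handler(event).await }`,
/// build every future up front and await them together. `E` stands in for
/// LDK's `Event` type.
async fn handle_all_concurrently<E, F, Fut>(events: Vec<E>, handler: F)
where
    F: Fn(E) -> Fut,
    Fut: std::future::Future<Output = ()>,
{
    // All handler futures now make progress concurrently within this task.
    join_all(events.into_iter().map(handler)).await;
}
```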
-
Right, this is definitely a mess, but it's also unclear what exactly to do about it. For most events, if they aren't handled we have to replay them, but when do we replay? And in general, if the application can't connect to its disk to write information about a received payment, should we really keep running, or should we just panic and crash?
-
Just discussed this with @phlip9. We were under the impression that there might be some ordering constraints between the handling of certain event kinds, but after closer inspection we couldn't come up with any examples. The only wrinkle I could find was the …
Right, this is tricky. Suppose that we modify the async event handler so that users are required to return an `EventHandleError`:

```rust
enum EventHandleError {
    /// LDK will replay the event at the next call to `process_events_async`.
    // Maybe this enum variant isn't actually needed, since the user can just
    // handle retries themselves?
    ReplayImmediately { err: String },
    /// LDK will replay the event the next time the `ChannelManager` is deserialized.
    // In other words, some in-memory bit is set to indicate that a particular
    // event has been surfaced to the user, which is cleared when the program
    // shuts down. The event will not be replayed at the next call to
    // `process_events_async`.
    RetryAtNextBoot { err: String },
    /// There was an error but it was acceptable; LDK will log the error and
    /// consider the event as handled.
    // Alternatively, the user can log the error themselves and just return `Ok`.
    NoRetry { err: String },
}
```

This then allows the user to make decisions about running or shutting down the node themselves, including using the graceful shutdown mechanisms that they are likely to have implemented, without having to …

I should also point out that I do think we should try to avoid panics if possible, because many users will want to use their graceful shutdown mechanisms.
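To make that concrete, here's a rough sketch of a handler under this proposal; `Event`, `persist_payment`, and the exact signature are assumptions for illustration, not an existing LDK API:

```rust
// Hypothetical handler under the proposed contract; `Event` and
// `persist_payment` are illustrative stand-ins, not real LDK items.
enum Event { PaymentClaimable { amount_msat: u64 }, Other }

async fn persist_payment(_event: &Event) -> Result<(), std::io::Error> {
    Ok(()) // imagine a database write here
}

async fn handle_event(event: Event) -> Result<(), EventHandleError> {
    match &event {
        Event::PaymentClaimable { .. } => {
            // Disk unreachable? Ask LDK to hold the event until restart so
            // we can begin our own graceful shutdown in the meantime.
            persist_payment(&event).await.map_err(|e| {
                EventHandleError::RetryAtNextBoot { err: e.to_string() }
            })
        }
        // Events we merely log; a failure here would map to `NoRetry`.
        Event::Other => Ok(()),
    }
}
```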
-
Background
Right now, the `EventsProvider` trait states: "handlers MUST fully handle `Event`s and persist any relevant changes to disk before returning". However, this effectively prevents LDK users from implementing any form of concurrency that would allow them to reduce latency / response times when under high load and generating lots of `Event`s. If 10 pending events are emitted in a call to `process_events[_async]`, and all of these need to persist data, e.g. by making a request to a database, all 10 calls have to be executed in serial, when in practice the `Event`s may have little to no relation to each other and LDK users can handle these events more efficiently by processing them in parallel.

This property is important for us because we expect to have a single (or very few) LSPs handle and route payments for a large number of user nodes. It's not a big deal if user nodes have to process events serially, but our LSP needs to be able to process events concurrently, otherwise our users will experience latency spikes when the LSP is under high load.
Possible solutions
One solution that would work for us (let's call it the "events batch" approach) is to allow us to pass in an (optionally async) event handler that handles one batch of events at a time, e.g. by taking a `Vec<Event>` as input instead of just a single `Event`. Then, we could serialize and persist the batch of events in one call to our own DB, and safely return from the `handle_event_batch` function along with our promise that we'll handle the events eventually (possibly when the node restarts at a later point in time). LDK can then consider the events handled and repersist the channel manager / chain monitor with the pending events queue drained, while our own code works through the persisted events. If there is e.g. a network outage that prevents an event from being handled, we can just shut down our node and continue trying to handle the event at a later point in time.
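A rough sketch of what this interface might look like; the trait and method names are hypothetical, not an existing LDK API:

```rust
// Hypothetical shape of the "events batch" approach; neither the trait nor
// `handle_event_batch` exists in LDK today. `Event` stands in for LDK's
// event enum.
struct Event;

#[async_trait::async_trait] // assumes the `async-trait` crate for the async method
trait BatchEventHandler {
    /// Serialize and persist the whole batch in a single DB round trip, then
    /// return. After this returns `Ok`, LDK may drain its pending-events
    /// queue and repersist the manager / monitor; the application replays
    /// the events from its own storage, even across restarts.
    async fn handle_event_batch(&self, events: Vec<Event>) -> Result<(), String>;
}
```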
An alternative approach (let's call it the "callback" approach) that would also work for us is like the channel monitor async persistence API: associate each `Event` with an id, then call back into the channel manager / chain monitor whenever we have finished processing any event. This would allow us to spawn off the handling of `Event`s to a dedicated task immediately after they come in. When the channel manager / chain monitor are notified that an event has been handled, they are free to remove the event from their internal pending queues and trigger a repersist. If the node crashes after the LDK user's event handling tasks have handled the event but before the manager / monitor have had a chance to repersist, the events are re-emitted upon startup and handled again, which is fine because the handlers are required to be idempotent.
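And a correspondingly rough sketch of the "callback" approach; `EventId`, `event_handled`, and the surrounding types are invented here for illustration:

```rust
use std::sync::Arc;

// Invented sketch of the "callback" approach; LDK has no `EventId` or
// `event_handled` today.
#[derive(Clone, Copy)]
struct EventId(u64);
struct Event;

struct Manager;
impl Manager {
    /// On this call, the manager would drop the event from its pending
    /// queue and trigger a repersist.
    fn event_handled(&self, _id: EventId) { /* ... */ }
}

async fn handle_one(_event: Event) { /* application-specific work */ }

fn dispatch(events: Vec<(EventId, Event)>, manager: Arc<Manager>) {
    for (id, event) in events {
        let manager = Arc::clone(&manager);
        // Handling is spawned immediately; the event stays pending inside
        // LDK until we call back, so a crash before that point simply
        // re-emits it at startup (handlers must be idempotent).
        tokio::spawn(async move {
            handle_one(event).await;
            manager.event_handled(id);
        });
    }
}
```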
More comments
Side note: Purposefully preventing `handle_event` from returning is unorthodox and probably not the best approach
Btw, in the current event handler contract, if the node determines that an event cannot be handled, e.g. due to a temporary lack of network access, LDK users have to use some strange mechanism like `std::thread::park()` or `std::future::pending().await` to ensure the call to `handle_events[_async]` never returns. Meanwhile, they may need to do some intricate coordination to gracefully shut down their node while keeping the event handling task / thread suspended, so that the manager / monitor never repersist and thus lose the unhandled event. The alternative approach of simply letting a handler retry its failing job indefinitely doesn't seem ideal either. Both of the solutions outlined above allow the user to avoid this.
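For reference, the kind of workaround the current contract forces looks roughly like the following; `Event` and `try_handle` are stand-ins for the application's real types and handling logic:

```rust
// Sketch of the "never return" workaround under the current contract;
// `Event` and `try_handle` are illustrative stand-ins.
struct Event;
async fn try_handle(_event: &Event) -> Result<(), String> { Ok(()) }

async fn handle_event(event: Event) {
    if let Err(err) = try_handle(&event).await {
        eprintln!("event handling failed, suspending forever: {err}");
        // Never return: LDK must not consider the event handled, so this
        // future parks forever and graceful shutdown now has to be
        // coordinated around a permanently suspended task.
        std::future::pending::<()>().await;
    }
}
```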
Latency and complexity considerations
The 'events batch' approach seems suboptimal from a latency standpoint: LDK generates an event, waits for a manager / monitor persist to ensure the event isn't lost, emits the batch of events to the user, the user persists the events batch, and only then does actual event handling begin. The response time to any event already has a lower bound of how long it takes for the channel manager to repersist; exposing an events batch adds yet another persistence round trip to this lower bound.
The 'callback' approach has better latency, but does seem slightly more complicated from an implementation standpoint. It also raises the question of when a pending `Event` should be re-emitted by LDK - at the next call to `process_pending_events[_async]()`, or after program memory has been cleared due to a node restart?

I'm open to all approaches here - just wanted to start a discussion, share some thoughts and observations, and communicate the constraints that matter most to us.