Homeserver event stream #17

Open
Nuhvi opened this issue Jun 5, 2024 · 5 comments

Labels
datastore The data storage, replication, and gc in the homeserver

Comments

Nuhvi commented Jun 5, 2024

Goals

To decouple Indexers from Homeservers, and for general subscription/notification use cases, we need an event stream interface that notifies any subscriber that a resource has been updated.

Subscriptions need to be a bit more dynamic than what we had in the web-relay (where you could only subscribe to an exact file), but they need to be simple enough that the Homeserver doesn't have to do too much work to match data events to their subscriptions.

Users should be able to:

  1. Subscribe to all events from a certain user.
  2. Subscribe to all events in a named folder (or prefix of a path) from a certain user.
  3. Subscribe to all events in a named folder (or prefix of a path) from all users.
  4. Subscribe to all public events from everyone.

Design

Request

The API for subscriptions is suggested to be a concatenation of <user pubky><prefix>:

  1. User pubky: either Exact or Wildcard.
  2. Prefix: any entry whose path shares that prefix is matched.

Wildcard example:

*/pubky.app/

Exact example:

3oyftkf7yc154znnri8u888wfhww7zb54mirr9gi1ricxrfck7po/pubky.app/
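
For illustration, the homeserver could parse this string into the two cases with something like the following Rust sketch (UserFilter, Query, and parse_query are hypothetical names, not from the codebase):

enum UserFilter {
    Wildcard,
    Exact(String), // the user's pubky, as encoded in URLs
}

struct Query {
    user: UserFilter,
    prefix: String, // path prefix, including the leading `/`
}

fn parse_query(q: &str) -> Option<Query> {
    // Split "<user>/<rest of prefix>" at the first slash.
    let (user, rest) = q.split_once('/')?;
    let user = if user == "*" {
        UserFilter::Wildcard
    } else {
        UserFilter::Exact(user.to_string())
    };
    Some(Query { user, prefix: format!("/{rest}") })
}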

Servers then store subscriptions in a key-value store, separated into two logical tables (wildcards and exact IDs), for example:

"*/foo/bar" => HashSet<Subscription>,
"3oyftkf7yc154znnri8u888wfhww7zb54mirr9gi1ricxrfck7po/foo/bar" => HashSet<Subscription>,

Once a new operation is performed (PUT/DELETE), the server iterates over both the wildcard and exact tables, until the prefixes no longer match the current event's path.

This way, the server only iterates over subscriptions that match the new event, instead of linearly checking all active subscriptions. More importantly, these checks are lexicographic, so they are easy to reason about and compute.
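
A minimal sketch of that matching, assuming BTreeMap tables with ASCII keys; Subscription here is a stand-in for whatever per-subscriber handle the server keeps:

use std::collections::{BTreeMap, HashSet};

// Stand-in for a handle to a subscriber's connection.
type Subscription = u64;

struct Tables {
    wildcard: BTreeMap<String, HashSet<Subscription>>, // keys like "*/foo/bar"
    exact: BTreeMap<String, HashSet<Subscription>>,    // keys like "<pubky>/foo/bar"
}

impl Tables {
    /// Collect every subscription whose key is a prefix of the event's
    /// `<user-id><path>`. Only prefixes of the event key are probed, so
    /// the cost is bounded by the key's length, not by the total number
    /// of active subscriptions.
    fn matching(&self, user_id: &str, path: &str) -> HashSet<Subscription> {
        let exact_key = format!("{user_id}{path}");
        let wildcard_key = format!("*{path}");
        let mut out = HashSet::new();
        for (table, key) in [(&self.exact, exact_key), (&self.wildcard, wildcard_key)] {
            for end in 1..=key.len() {
                // Keys are ASCII, so byte-index slicing is safe.
                if let Some(subs) = table.get(&key[..end]) {
                    out.extend(subs);
                }
            }
        }
        out
    }
}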

Event

An event is encoded as a Server-Sent Event as follows:

id: <event-id>
data: <event type> <user-id>/<path> [<future data>]
  • event-id: (u64 encoded as base32) the event's sequence number internal to the server, which allows subscribers to resume after a disconnection from the last Seq they saw, provided the server hasn't discarded those old events.
  • event type: PUT or DEL
  • user-id: Pubky
  • path: for example /pubky.app/posts/XXXX
  • future data: once we implement a copy-on-write Merkle Treap, we can also share the Seq of the treap as a version of the file, and maybe the hash of the root or the hash of the value of the resource!
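
As a sketch, framing one event could look like this; the Crockford alphabet is an assumption, since the issue only says "base32":

// Crockford base32 alphabet (an assumption; any base32 variant works).
const ALPHABET: &[u8] = b"0123456789ABCDEFGHJKMNPQRSTVWXYZ";

fn base32_encode(mut n: u64) -> String {
    if n == 0 {
        return "0".to_string();
    }
    let mut out = Vec::new();
    while n > 0 {
        out.push(ALPHABET[(n & 0x1f) as usize]);
        n >>= 5;
    }
    out.reverse();
    String::from_utf8(out).unwrap()
}

/// Frame one event as SSE: an `id:` line, a `data:` line, and a blank
/// line to terminate the event.
fn encode_event(seq: u64, kind: &str, user_id: &str, path: &str) -> String {
    format!("id: {}\ndata: {kind} {user_id}{path}\n\n", base32_encode(seq))
}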

Example

Subscribe to all events from a certain user.

Request

GET /stream?q=3oyftkf7yc154znnri8u888wfhww7zb54mirr9gi1ricxrfck7po/  HTTP/2
Last-Event-Id: 7

Response

id: 8
data: PUT 3oyftkf7yc154znnri8u888wfhww7zb54mirr9gi1ricxrfck7po/foo

id: 10
data: PUT 3oyftkf7yc154znnri8u888wfhww7zb54mirr9gi1ricxrfck7po/bar

Subscribe to all events in a named folder (or prefix of a path) from a certain user.

Request

GET /stream?q=3oyftkf7yc154znnri8u888wfhww7zb54mirr9gi1ricxrfck7po/fo  HTTP/2
Last-Event-Id: 7

Response

id: 8
data: PUT 3oyftkf7yc154znnri8u888wfhww7zb54mirr9gi1ricxrfck7po/foo

Subscribe to all events in a named folder (or prefix of a path) from all users.

Request

GET /stream?q=*/fo  HTTP/2
Last-Event-Id: 7

Response

id: 8
data: PUT 3oyftkf7yc154znnri8u888wfhww7zb54mirr9gi1ricxrfck7po/foo

id: 11
data: PUT wfhww7zb54mirr9gi1ricxrfck7po3oyftkf7yc154znnri8u888/foo

Subscribe to all public events from everyone.

Request

GET /stream?q=*/  HTTP/2
Last-Event-Id: 7

Response

id: 8
data: PUT 3oyftkf7yc154znnri8u888wfhww7zb54mirr9gi1ricxrfck7po/foo

id: 10
data: PUT 3oyftkf7yc154znnri8u888wfhww7zb54mirr9gi1ricxrfck7po/bar

id: 11
data: PUT wfhww7zb54mirr9gi1ricxrfck7po3oyftkf7yc154znnri8u888/foo

id: 12
data: PUT wfhww7zb54mirr9gi1ricxrfck7po3oyftkf7yc154znnri8u888/bar
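
On the receiving side, a subscriber splits each data payload back into its parts; a minimal sketch (the function name is illustrative):

/// Parse a payload of the form `<type> <user-id><path>` into
/// (event type, user id, path), keeping the path's leading slash.
fn parse_data(payload: &str) -> Option<(&str, &str, &str)> {
    let (kind, rest) = payload.split_once(' ')?;
    let slash = rest.find('/')?;
    Some((kind, &rest[..slash], &rest[slash..]))
}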

Nuhvi commented Jun 5, 2024

The event-id needs to be monotonically increasing and sortable; both can be achieved by using the Snapshot's timestamp of the operation, which would probably also save on the future data part (unless we need to add the hash).

But it also needs to be unique, because a relay might be aggregating these events from multiple servers, and while the Timestamp tries to be globally unique, that is never guaranteed nor to be trusted ... some homeserver might deliberately choose clock-ids already chosen by others to confuse the relay, even if the clock-id bits are large enough.

So maybe we should use the user ID / pubky as a part of the event id:

id: <base32 encoded Timestamp of the snapshot>:<user id / pubky>
data: <event-type> <path>
id: 2ZB0PB2KVV700:wfhww7zb54mirr9gi1ricxrfck7po3oyftkf7yc154znnri8u888
data: PUT /bar

The downside of this is just more parsing, unless we keep the user ID in the data line and just rely on compression.

A compromise might be to only add as much of the user-id as is needed to break collisions if/when they happen, on the relay side, for example:

id: 2ZB0PB2KVV700
data: PUT pk:wfhoyftkf7yc154znnri8u888ww7zb54mirr9gi1ricxrfck7po3/foo

id: 2ZB0PB2KVV700:w
data: PUT pk:wfhoyftk7zb54mirr9gi1ricxrfck7po3f7yc154znnri8u888ww/bar

id: 2ZB0PB2KVV700:wf
data: PUT pk:wfhww7zb54mirr9gi1ricxrfck7po3oyftkf7yc154znnri8u888/zar

This way we only add as few characters as needed for uniqueness, and the homeserver itself could use u64s as keys without any uniqueness bytes or base32 encoding, although that will cost more encoding on the fly.

By adding pk:, the data becomes a URI that is immediately ready to store in a database and parse later.

The downside of this is that you need to parse the event-id to get the snapshot timestamp, but since you don't need it for most operations, that should be fine.
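
A sketch of that relay-side rule, assuming the relay tracks which pubkys it has already emitted at each timestamp:

use std::collections::HashMap;

/// Append just enough leading characters of `pubky` to distinguish the
/// id from every event already emitted at the same timestamp. (Sketch
/// only; assumes a pubky appears at most once per timestamp.)
fn unique_event_id(seen: &mut HashMap<String, Vec<String>>, ts: &str, pubky: &str) -> String {
    let entries = seen.entry(ts.to_string()).or_default();
    // One character more than the longest prefix shared with any
    // previously seen pubky at this timestamp; 0 if none were seen.
    let needed = entries
        .iter()
        .map(|other| {
            pubky
                .bytes()
                .zip(other.bytes())
                .take_while(|(a, b)| a == b)
                .count()
                + 1
        })
        .max()
        .unwrap_or(0)
        .min(pubky.len());
    entries.push(pubky.to_string());
    if needed == 0 {
        ts.to_string()
    } else {
        format!("{ts}:{}", &pubky[..needed])
    }
}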

Nuhvi commented Jun 5, 2024

Also, a relay could calculate the longest matching leading bytes across all its pubkys, and use that to determine how many unique characters it needs to add after the timestamp in the event-id.
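
Since the longest shared prefix in a set of strings always occurs between lexicographic neighbors, one pass over the sorted pubkys is enough (a sketch):

/// Longest leading run shared by any two pubkys in the sorted slice;
/// the relay would add one character more than this after the timestamp.
fn max_common_prefix(sorted_pubkys: &[String]) -> usize {
    sorted_pubkys
        .windows(2)
        .map(|pair| {
            pair[0]
                .bytes()
                .zip(pair[1].bytes())
                .take_while(|(a, b)| a == b)
                .count()
        })
        .max()
        .unwrap_or(0)
}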

Nuhvi commented Jun 5, 2024

Another way to make the event-id globally unique is to format it as <snapshot timestamp>:<hash(URI || content [|| metadata])>, where the content is empty if the event is a DEL.

This encoding could also serve as a strong link that can be appended to URIs to point to an exact post at a certain verifiable version, for example:

reply = {
    parent: "pk:wfhww7zb54mirr9gi1ricxrfck7po3oyftkf7yc154znnri8u888/pubky.app/posts/2ZB0PB2KVV700?etag=2ZB0PB2KVV700_<hash of the parent in hex or base32>"
}
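
Computing such an id is cheap; a sketch assuming BLAKE3 as the hash (the issue doesn't name one) and hex output:

/// Build `<snapshot timestamp>:<hash(URI || content)>`. For DEL events,
/// pass an empty `content` slice. (The hash choice is an assumption.)
fn verifiable_event_id(timestamp: &str, uri: &str, content: &[u8]) -> String {
    let mut hasher = blake3::Hasher::new();
    hasher.update(uri.as_bytes());
    hasher.update(content);
    format!("{timestamp}:{}", hasher.finalize().to_hex())
}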

Nuhvi commented Aug 26, 2024

A rudimentary version is implemented in #27; there is no realtime subscription API yet, but you can list URLs by their timestamp and paginate through them.

Nuhvi commented Nov 4, 2024

Implemented a firehose subscription in #54 ... no plans to add granular subscriptions yet.
