While the most common read pattern for clients will be to read from the end of the log, from time to time new clients will show up that want to sync over the full history of the stream.
As I explained in #40, the issue with open fragments is that the maximum size of a fragment has an exponential impact on the bandwidth and processing required on the client. This argues for creating smaller fragments; however, that also has downsides, as smaller fragments mean more requests to the server.
This is why it would be ideal to rewrite historical fragments (say, older than a day and therefore immutable) into larger fragments.
Fetching bigger files is much more efficient than fetching many smaller ones (less connection setup, higher compression ratios, ...), especially if the HTTP headers indicate the relations so the client can add concurrency. But it has a drawback in the current form: no streaming tree:Node parser exists, so essentially the entire graph of a page has to be parsed in memory.
When compacting historical fragments into these larger graphs, this could be an issue.
This is why I would suggest a default way of structuring the data in a page, such that a stream-aware parser can parse the document incrementally and emit members as they are processed.
This would significantly reduce the memory requirements in the case of large fragments.
The layout of a page (say, using Turtle serialization, as it offers the best compression) could look something like this:
1. First, the stream membership statements, required to find the tree members.
2. Then the members, one by one, ordered first by object id, then by timestamp path. (This allows member skipping if the client is interested in the latest state only, reducing the number of upserts on the database the stream is projected into.)
3. Last, the relation pointing to the next page.
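To make this concrete, here is a minimal sketch of such a page, assuming invented example.org identifiers and predicates (a real stream would declare its own version-of and timestamp paths). The Turtle is kept in a TypeScript constant so the parsing sketch further down can reuse it:

```ts
// page-layout.ts - hypothetical example page; all example.org terms are invented for illustration.
export const examplePage = `
@prefix ldes: <https://w3id.org/ldes#> .
@prefix tree: <https://w3id.org/tree#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .

# 1. Stream membership statements first, so a streaming parser knows
#    which subjects are members before their triples arrive.
ex:stream a ldes:EventStream ;
  tree:member ex:obs1-v1, ex:obs1-v2, ex:obs2-v1 .

# 2. The members, one by one: grouped per member, ordered by object id
#    and then by timestamp path.
ex:obs1-v1 ex:isVersionOf ex:obs1 ;
  ex:generatedAt "2024-01-01T10:00:00Z"^^xsd:dateTime ;
  ex:value 1 .

ex:obs1-v2 ex:isVersionOf ex:obs1 ;
  ex:generatedAt "2024-01-02T10:00:00Z"^^xsd:dateTime ;
  ex:value 2 .

ex:obs2-v1 ex:isVersionOf ex:obs2 ;
  ex:generatedAt "2024-01-01T12:00:00Z"^^xsd:dateTime ;
  ex:value 5 .

# 3. The relation pointing to the next page comes last.
ex:page-42 a tree:Node ;
  tree:relation [ a tree:Relation ; tree:node ex:page-43 ] .
`;
```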
As all triples of a member are grouped together, the parser can read one member at a time.
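A sketch of what such stream-aware member extraction could look like on the client side, assuming the page layout above and the streaming parser from the n3 npm package (the grouping-by-subject heuristic, module path, and callback shape are illustrative choices, not an existing API of an LDES client):

```ts
// member-streaming.ts - sketch of a stream-aware member extractor.
// Assumes the page layout above: membership statements first, member
// triples grouped per member, and the relation to the next page last.
import { StreamParser } from "n3";
import type { Quad } from "n3";
import { examplePage } from "./page-layout"; // Turtle constant from the previous sketch (hypothetical module)

const TREE_MEMBER = "https://w3id.org/tree#member";

export function extractMembers(
  turtle: string,
  onMember: (memberId: string, quads: Quad[]) => void,
): Promise<void> {
  return new Promise((resolve, reject) => {
    const memberIds = new Set<string>(); // filled by the leading tree:member statements
    let currentSubject: string | undefined;
    let currentQuads: Quad[] = [];

    // Emit the buffered quads if they describe a member, then start a new group.
    const flush = () => {
      if (currentSubject !== undefined && memberIds.has(currentSubject)) {
        onMember(currentSubject, currentQuads);
      }
      currentQuads = [];
    };

    const parser = new StreamParser({ format: "text/turtle" });
    parser.on("data", (quad: Quad) => {
      if (quad.predicate.value === TREE_MEMBER) {
        memberIds.add(quad.object.value);
        return;
      }
      // Because all triples of a member are grouped, a change of subject
      // closes the previous member. (Members containing nested blank nodes
      // would need a smarter way of detecting their bounds.)
      if (quad.subject.value !== currentSubject) {
        flush();
        currentSubject = quad.subject.value;
      }
      currentQuads.push(quad);
    });
    parser.on("end", () => { flush(); resolve(); });
    parser.on("error", reject);

    parser.write(turtle);
    parser.end();
  });
}

// Only one member is buffered at a time, regardless of the page size.
extractMembers(examplePage, (id, quads) =>
  console.log(`member ${id}: ${quads.length} triples`),
).catch(console.error);
```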
As the document would be a normal RDF file, and the only semantics added are there to support the streaming behavior, this should be completely backwards compatible for clients that don't support streaming tree parsing.
The capability could be indicated by a statement on the view.
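For example, a client could look for such a statement in the view metadata and only switch to the streaming code path when it is present. The predicate and object below are purely illustrative, not an agreed-upon vocabulary:

```ts
import type { Quad } from "n3";

// Both terms are invented for illustration; the actual vocabulary would
// have to be agreed upon in the TREE/LDES specifications.
const LAYOUT_PREDICATE = "http://example.org/memberLayout";
const STREAMING_LAYOUT = "http://example.org/GroupedStreamingLayout";

export function supportsStreamingLayout(viewQuads: Quad[]): boolean {
  return viewQuads.some(
    (q) => q.predicate.value === LAYOUT_PREDICATE && q.object.value === STREAMING_LAYOUT,
  );
}
```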
Valid point - the biggest problem at the moment is the member extraction algorithm, which treats the full HTTP response as the bounds within which other quads of a member may still be found. We’d need to extend existing serializations to indicate the bounds of a member in order to support streaming.
A protobuf schema for LDES based on the SHACL shape would indeed be interesting. I’ll see whether we can find a master’s thesis on this.