feat: deterministic longturtle serialisation using RDF canonicalization + n-triples sort #3008
This PR improves upon #2997 by replacing the bespoke object blank node sorting technique with sorted n-triples lines, produced after applying the RGDA1 graph canonicalisation algorithm. Fixes #1890.
It's necessary to read in the sorted n-triples lines with skolemize=True to preserve the blank node identifiers from the canonicalisation algorithm.
Now that we can sort reliably by the blank node identifiers, this implementation works for all blank node positions in an RDF statement, whether the blank node is in the subject or the object position. It even works for blank nodes at the top level.
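To illustrate why sorting the n-triples lines gives a deterministic order, here is a minimal standard-library sketch. It is not the code in this PR: it assumes the blank node labels have already been made canonical (the `_:cb0` label and the `example.org` IRIs are made up for illustration), and `sorted_ntriples` is a hypothetical helper name.

```python
def sorted_ntriples(nt_text: str) -> list[str]:
    """Return the non-empty, de-duplicated N-Triples lines in sorted order."""
    lines = {line.strip() for line in nt_text.splitlines() if line.strip()}
    return sorted(lines)


# Two dumps of the same graph whose statements arrive in a different order.
# The blank node label is assumed to be canonical already (hypothetical value).
dump_a = """\
_:cb0 <http://example.org/name> "Alice" .
<http://example.org/s> <http://example.org/knows> _:cb0 .
"""
dump_b = """\
<http://example.org/s> <http://example.org/knows> _:cb0 .
_:cb0 <http://example.org/name> "Alice" .
"""

# Once the labels are stable, a plain string sort yields the same line order
# for both dumps, whether the blank node is a subject or an object.
print(sorted_ntriples(dump_a) == sorted_ntriples(dump_b))
```

The sort only helps because the labels are stable. If each run produced fresh blank node identifiers, the same statements would sort differently, which is why the identifiers from canonicalisation have to survive the re-parse.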
@ajnelson-nist, I've added your blank node test from Sort Turtle output (#1978) and it's now passing, yay!
Update: this also fixes the bug where semicolons were doubled up when the subject is a top-level blank node. See 412fb5d.