feat: deterministic longturtle serialisation using RDF canonicalization + n-triples sort #3008

edmondchuc · 2024-12-13T11:29:41Z

This PR builds on #2997. It replaces the bespoke object blank node sorting technique with a lexical sort of the n-triples lines produced after applying the RGDA1 graph canonicalisation algorithm. Fixes #1890.
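To illustrate only the second step (the lexical sort), here is a minimal standard-library sketch. It assumes canonicalisation has already assigned stable blank node labels. The `cb0` label, the `example.org` IRIs and the `sorted_nt` helper are hypothetical and not taken from this PR's code.

```python
def sorted_nt(nt_text: str) -> str:
    """Return N-Triples text with its non-empty lines sorted lexically."""
    lines = [line.strip() for line in nt_text.splitlines() if line.strip()]
    return "\n".join(sorted(lines)) + "\n"


# Two serialisations of the same graph, emitted in different orders.
# The blank node label is assumed to be canonical already (hypothetical "cb0").
run_a = """\
_:cb0 <http://example.org/name> "Alice" .
<http://example.org/doc> <http://example.org/author> _:cb0 .
"""
run_b = """\
<http://example.org/doc> <http://example.org/author> _:cb0 .
_:cb0 <http://example.org/name> "Alice" .
"""

# Once labels are stable, sorting the lines removes any dependence on input order.
assert sorted_nt(run_a) == sorted_nt(run_b)
print(sorted_nt(run_a))
```

The sort is only deterministic because the labels are: if `cb0` were a randomly generated identifier, the same graph could sort differently on each run.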

It's necessary to read in the sorted n-triples lines with skolemize=True to preserve the blank node identifiers from the canonicalisation algorithm.
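For readers unfamiliar with skolemisation, the idea can be sketched with the standard library alone. Blank node labels are local to a document, so a parser is generally free to assign fresh identifiers on read; rewriting each label as an IRI first lets it survive the round trip. This is a concept sketch, not RDFLib's implementation: the `SKOLEM_PREFIX` value and both helper functions are hypothetical, and the regex is deliberately naive (it would also match `_:` inside a literal).

```python
import re

# Hypothetical skolem IRI prefix, chosen for illustration only.
SKOLEM_PREFIX = "https://example.org/.well-known/genid/"

# Naive blank node label pattern; does not skip "_:" inside string literals.
BNODE = re.compile(r"_:([A-Za-z0-9]+)")
SKOLEM_IRI = re.compile("<" + re.escape(SKOLEM_PREFIX) + r"([A-Za-z0-9]+)>")


def skolemize_line(line: str) -> str:
    """Rewrite each blank node label in an N-Triples line as a skolem IRI."""
    return BNODE.sub(lambda m: f"<{SKOLEM_PREFIX}{m.group(1)}>", line)


def de_skolemize_line(line: str) -> str:
    """Turn skolem IRIs back into blank node labels, keeping the label text."""
    return SKOLEM_IRI.sub(lambda m: f"_:{m.group(1)}", line)


original = "_:cb0 <http://example.org/knows> _:cb1 ."
skolemized = skolemize_line(original)

# The canonical labels are now carried inside ordinary IRIs...
assert "_:" not in skolemized
# ...and can be recovered unchanged afterwards.
assert de_skolemize_line(skolemized) == original
```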

Now that we can sort reliably by the blank node identifiers, this implementation works for blank nodes in any position of an RDF statement, whether subject or object. It even works for blank nodes at the top level.

@ajnelson-nist, I've added your blank node test from Sort Turtle output (#1978) and it's now passing, yay!

Update: this also fixes the bug where semicolons were doubled up when the subject is a top-level blank node. See 412fb5d.

Checklist

Checked that there aren't other open pull requests for the same change.
Checked that all tests and type checking passes.
If the change adds new features or changes the RDFLib public API:
- Created an issue to discuss the change and get in-principle agreement.
- Considered adding an example in ./examples.
If the change has a potential impact on users of this project:
- Added or updated tests that fail without the change.
- Updated relevant documentation to avoid inaccuracies.
- Considered adding additional documentation.
Considered granting push permissions to the PR branch, so maintainers can fix minor issues and keep your PR up to date.

coveralls · 2024-12-13T11:37:18Z

coverage: 90.274% (-0.002%) from 90.276%
when pulling 412fb5d on edmond/longturtle
into e8f61d4 on main.

test/test_serializers/test_serializer_longturtle_sort.py

ajnelson-nist · 2024-12-13T15:09:39Z

@edmondchuc Thank you for porting my test!

I did notice, though, that there were some spots where semicolons got doubled up.

edmondchuc · 2024-12-13T16:09:51Z

@ajnelson-nist thanks for catching the double-up of the semicolons! I think it's fixed now.

edmondchuc added 2 commits December 13, 2024 22:15

feat: use the RGDA1 canonicalization algorithm + lexical n-triples sort to produce deterministic longturtle serialisation

ceea737

chore: normalise usage of format

e4845da

edmondchuc mentioned this pull request Dec 13, 2024

Sort Turtle output #1978

Draft

9 tasks

edmondchuc requested review from ashleysommer, nicholascar and recalcitrantsupplant December 13, 2024 11:30

chore: apply black

7405e32

sneakers-the-rat mentioned this pull request Dec 13, 2024

Possibility for deterministic order of multivalue slots in RDF output? linkml/linkml#1943

Open

ajnelson-nist reviewed Dec 13, 2024

test/test_serializers/test_serializer_longturtle_sort.py (outdated, resolved)

fix: double up of semicolons when subject is a blank node

412fb5d

edmondchuc requested a review from ajnelson-nist December 13, 2024 16:09
