A dumps-to-feed processor for the RDF Connect framework. It can be run as part of a pipeline using the js-runner, or as a standalone CLI tool.
This processor is used to convert a dump of RDF data to a feed of RDF data.
As input, it takes a dump of RDF data and a SHACL shape that describes the members of the feed.
It will perform the member extraction algorithm using CBD and the SHACL shape to extract the members from the dump.
The extracted members are then compared to the members of the previous version of the dump to determine which members are new, updated, or deleted.
To compare the members, the processor first normalizes the members using the RDF Dataset Canonicalization (RDFC-1.0) algorithm, and then hashes the normalized members using the MD5 algorithm.
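The hashing half of this comparison step can be sketched as follows. This is a minimal illustration, not the processor's actual code: `hashMember` is a hypothetical helper, and its input is assumed to be the N-Quads string that RDFC-1.0 canonicalization has already produced for one member (the canonicalization itself needs a dedicated library and is not shown). Only Node.js's built-in `crypto` module is used.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: hash one member that has ALREADY been canonicalized.
// `canonicalNQuads` is assumed to be the RDFC-1.0 output for that member.
function hashMember(canonicalNQuads: string): string {
  return createHash("md5").update(canonicalNQuads, "utf8").digest("hex");
}

// Two illustrative versions of the same member (example.org IRIs are made up).
const v1 = '<http://example.org/d1> <http://purl.org/dc/terms/title> "Old title" .\n';
const v2 = '<http://example.org/d1> <http://purl.org/dc/terms/title> "New title" .\n';

console.log(hashMember(v1) === hashMember(v1)); // true: unchanged member
console.log(hashMember(v1) === hashMember(v2)); // false: member was updated
```

Because canonicalization gives equivalent RDF the same serialization regardless of blank node labels or statement order, comparing the hashes is enough to tell whether a member changed between two dumps.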
For a new member, the processor will add the member to the feed as an as:Create activity.
For an updated member, the processor will add the member to the feed as an as:Update activity.
A deleted member will be added to the feed as an as:Delete activity.
The ActivityStreams 2.0 ontology (https://www.w3.org/ns/activitystreams#) is used to describe the activities in the feed.
Under the hood, a file-based LevelDB database is used to store the members of the previous version of the dump. This database is used to compare the members of the new dump with the members of the previous dump.
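The comparison between the previous and the new dump can be sketched as below. This is a simplified illustration under stated assumptions: an in-memory `Map` from member IRI to hash stands in for the LevelDB database, and `diffMembers`, `MemberHashes`, and `FeedChanges` are hypothetical names that do not come from the processor's source.

```typescript
// Hypothetical stand-in for the stored state: member IRI -> hash of its canonical form.
type MemberHashes = Map<string, string>;

interface FeedChanges {
  created: string[]; // would be emitted as as:Create
  updated: string[]; // would be emitted as as:Update
  deleted: string[]; // would be emitted as as:Delete
}

function diffMembers(previous: MemberHashes, current: MemberHashes): FeedChanges {
  const changes: FeedChanges = { created: [], updated: [], deleted: [] };

  // Members in the new dump are either new, changed, or unchanged.
  current.forEach((hash, iri) => {
    const oldHash = previous.get(iri);
    if (oldHash === undefined) {
      changes.created.push(iri);
    } else if (oldHash !== hash) {
      changes.updated.push(iri);
    }
  });

  // Members only present in the previous dump have been deleted.
  previous.forEach((_hash, iri) => {
    if (!current.has(iri)) {
      changes.deleted.push(iri);
    }
  });

  return changes;
}

const previousDump: MemberHashes = new Map([["a", "1"], ["b", "2"], ["c", "3"]]);
const currentDump: MemberHashes = new Map([["a", "1"], ["b", "9"], ["d", "4"]]);

console.log(JSON.stringify(diffMembers(previousDump, currentDump)));
// {"created":["d"],"updated":["b"],"deleted":["c"]}
```

After such a run, the stored state would be replaced by the hashes of the current dump so that the next run compares against it.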
git clone git@github.com:rdf-connect/dumps-to-feed-processor-ts.git
cd dumps-to-feed-processor-ts
npm install
npm run build
npm install @rdfc/dumps-to-feed-processor-ts
node . sweden https://admin.dataportal.se/all.rdf https://semiceu.github.io/LDES-DCAT-AP-feeds/shape.ttl\#ActivityShape -o feed.ttl
An example pipeline configuration is provided in the example folder. You can run it with the following command:
npx js-runner example/pipeline.ttl
The processor can be configured using the following parameters:
writer: A writer to write the output feed to.
feedname: The name of the feed. Used internally to store the previous version of the feed such that you can use the processor for multiple feeds.
flush: Whether to flush the previous version of the feed. If set to true, the processor will start with an empty feed and add all members from the dump as as:Create activities.
dump: A filename, URL, or serialized quads containing the dump of RDF data.
dumpContentType: The content type of the dump. Use 'identifier' in case of a filename or URL to be dereferenced.
focusNodesStrategy: extract, sparql, or iris. Use extract in case of automatic extraction (we will use a SPARQL query to find and extract all nodes of one of the DCAT-AP Feeds standalone entity types), sparql in case of a provided SPARQL query, and iris in case of comma-separated IRIs (NamedNode values).
nodeShapeIri: The IRI of the SHACL shape that describes the members of the feed.
nodeShape: The serialized SHACL shape in text/turtle format that describes the members of the feed. Optional.
focusNodes: Comma-separated list of IRIs of the NamedNodes as subjects that should be extracted, or a SPARQL query resolving into a list of entities to be used as focus nodes. The exact value depends on the value of focusNodesStrategy. Optional.
dbDir: The directory where the LevelDB database will be stored. Default is "./".
The SHACL definition of the processor can be found in processor.ttl.
An example pipeline configuration is provided in the example folder: example/pipeline.ttl.
A full example of the processor in action for the Swedish DCAT-AP dump can be found here. This pipeline also contains the other processors set up to provide the dumps-to-feed-processor with the necessary data, and the processors to then write and publish the feed as a Linked Data Event Stream.