Harvesting related works
The DMP system maintains a set of Harvesters that run weekly to collect information about new related works and grant awards from external systems. The Harvesters then attempt to determine whether any of those works or grants are related to any known DMPs.
We currently harvest DOIs from:
- DataCite
- Crossref
- OpenAlex
The algorithm that determines whether a work is a potential match is fairly complex. It performs the following checks:
- Check for matching ORCIDs, ROR IDs, Crossref Funder IDs and Grant IDs (when they are available)
- Check for contributor names and affiliation names (when RORs and ORCIDs aren't available)
- Check for repository URLs or re3data repository IDs (when available)
- Use an NLP library to compare the DMP title and abstract to the related work's title and abstract
We weight and score the results of the checks above to produce a confidence score. We then present these findings to organizational administrators on the site so that they can confirm whether a potential work is indeed related to the DMP.
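The weighting step can be sketched roughly as follows. This is a minimal illustration, not the harvesters' actual logic: the signal names, the `WEIGHTS` values, the `THRESHOLDS` cut-offs and the `score_match` function are all hypothetical. The weight for "contributor names matched" is set to 2 only so that the sketch agrees with the sample record on this page (score 2, confidence "Low").

```python
# Hypothetical weights per matching signal (the real values are not documented here).
WEIGHTS = {
    "grant IDs matched": 5,
    "ORCIDs matched": 3,
    "ROR IDs matched": 2,
    "funder IDs matched": 2,
    "contributor names matched": 2,
    "affiliation names matched": 1,
    "repositories matched": 1,
    "title/abstract similarity": 1,
}

# Hypothetical cut-offs, checked from highest to lowest.
THRESHOLDS = [(8, "High"), (4, "Medium"), (0, "Low")]


def score_match(signals):
    """Return (score, confidence, logic) for the set of signals that fired."""
    # Keep the signals in a stable order so the "logic" list is reproducible.
    logic = [name for name in WEIGHTS if name in signals]
    score = sum(WEIGHTS[name] for name in logic)
    # Pick the first label whose minimum score is met.
    confidence = next(label for minimum, label in THRESHOLDS if score >= minimum)
    return score, confidence, logic
```

Under these assumed weights, a work that matches only on contributor names gets a score of 2 and a confidence of "Low", while a work that matches on both a grant ID and an ORCID reaches "High".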
Once a related work has been found and scored, we add a record to the DynamoDB table that stores our DMP ID JSON:
{
  "PK": "DMP#doi.org/10.48321/D114471AC3",
  "SK": "HARVESTER_MODS",
  "related_works": {
    "https://doi.org/10.98765/0abc-9876": {
      "citation": "The 2021 Conference on Empirical Methods in Illuminated Scripting 2021, Jane Doe. 2021. “Narrative Theory for Illuminated Narrative Appreciation.” [Article]. Journal of Medieval Manuscripts <a href=\"http://doi.org/10.98765/0abc-9876\" target=\"_blank\">http://doi.org/10.98765/0abc-9876</a>.",
      "confidence": "Low",
      "descriptor": "references",
      "discovered_at": "2024-07-07T05:01:58Z",
      "identifier": "10.98765/0abc-9876",
      "logic": [
        "contributor names matched"
      ],
      "provenance": "Datacite via Journal of Medieval Manuscripts",
      "score": 2,
      "secondary_works": [],
      "status": "pending",
      "type": "doi",
      "work_type": "audiovisual"
    },
    "https://doi.org/10.1234/zenodo.f8g348g35g": {
      "citation": "Doe, Jane and Smith, John. 2019. “Estimating Unobserved Cats within Medieval Manuscripts.” [Article]. <a href=\"http://doi.org/10.1234/zenodo.f8g348g35g\" target=\"_blank\">http://doi.org/10.1234/zenodo.f8g348g35g </a>.",
      "confidence": "Low",
      "descriptor": "references",
      "discovered_at": "2024-07-07T05:02:01Z",
      "identifier": "10.1234/zenodo.f8g348g35g",
      "logic": [
        "contributor names matched"
      ],
      "provenance": "Datacite via Zenodo",
      "score": 2,
      "secondary_works": [
        {
          "descriptor": "is_version_of",
          "identifier": "10.1234/zenodo.d52335d2",
          "type": "DOI"
        },
        {
          "descriptor": "is_part_of",
          "identifier": "https://zenodo.org/communities/illuminated-manuscripts",
          "type": "URL"
        }
      ],
      "status": "pending",
      "type": "doi",
      "work_type": "text"
    }
  },
  "tstamp": "2024-05-06T23:01:36Z"
}
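The key layout in the record above suggests how a harvester might address and extend it. The sketch below is an assumption based only on that sample: `dmp_key` and `add_related_work` are hypothetical helpers, and the rule that an existing entry is never overwritten is a guess at sensible behaviour, not documented logic.

```python
def dmp_key(dmp_id):
    """Build the DynamoDB key for a DMP's harvester record."""
    # Strip the URL scheme so "https://doi.org/10.x/y" becomes "doi.org/10.x/y",
    # matching the PK format shown in the sample record.
    bare = dmp_id.split("://", 1)[-1]
    return {"PK": f"DMP#{bare}", "SK": "HARVESTER_MODS"}


def add_related_work(record, doi, entry):
    """Add a harvested work to the record unless it is already present."""
    works = record.setdefault("related_works", {})
    url = f"https://doi.org/{doi}"
    # Assumption: keep an existing entry so an earlier review decision survives
    # a later harvest of the same work.
    if url not in works:
        works[url] = {"identifier": doi, "type": "doi", "status": "pending", **entry}
    return record
```

For example, `dmp_key("https://doi.org/10.48321/D114471AC3")` produces the `PK` and `SK` values shown in the sample, and every newly added work starts with a `status` of `pending`.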
The code for the harvesters can be found in src/lambdas/harvesters. The harvestable_dmps lambda runs and queries the DynamoDB table. For the pilot, it looks up only the hard-coded list of our pilot project DMPs rather than searching for any DMPs that might have related works appearing out in the ecosystem.
The new DMPTool Pilot Project UI pages display a DMP's related works to the administrator, who can review each one and either "accept" or "reject" it.
When an augmentation is "accepted" within the UI, it is copied to the dmproadmap_related_identifiers section of the DMP ID record. In all cases, a potential related work remains on the above HARVESTER_MODS record. In subsequent loads of the UI page, only those entries marked as pending are displayed to the curator.
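The review flow described above can be sketched in a few lines. This is an illustration under stated assumptions, not the UI's actual code: the function names, the "accepted" and "rejected" status values, and the shape of the entries written to dmproadmap_related_identifiers are all hypothetical. Only the "pending" status and the field names read from the HARVESTER_MODS record come from the sample on this page.

```python
def pending_works(mods_record):
    """Return only the related works that still await a decision."""
    works = mods_record.get("related_works", {})
    return {url: w for url, w in works.items() if w.get("status") == "pending"}


def review_work(dmp_record, mods_record, work_url, accepted):
    """Record a curator's decision for one related work."""
    work = mods_record["related_works"][work_url]
    # The work always stays on the HARVESTER_MODS record; only its status changes.
    # "accepted" and "rejected" are assumed status values.
    work["status"] = "accepted" if accepted else "rejected"
    if accepted:
        # Assumed entry shape for the DMP ID record's related identifiers.
        related = dmp_record.setdefault("dmproadmap_related_identifiers", [])
        related.append({
            "descriptor": work["descriptor"],
            "identifier": work_url,
            "type": work["type"],
            "work_type": work["work_type"],
        })
    return dmp_record, mods_record
```

After a decision is recorded, `pending_works` no longer returns that entry, which mirrors how reviewed works drop out of the UI page on the next load.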