Updates documentation to clarify StandardIdentifiers are an implementation of PatternDetection
JPrevost committed Nov 17, 2023
1 parent 306c72e commit d6a51b2
Showing 2 changed files with 22 additions and 15 deletions.
25 changes: 15 additions & 10 deletions app/models/standard_identifiers.rb
@@ -1,5 +1,7 @@
# frozen_string_literal: true

# StandardIdentifiers is a PatternDetector implementation that detects the identifiers DOI, ISBN, ISSN, PMID.
# See /docs/reference/pattern_detection_and_enhancement.md for details.
class StandardIdentifiers
  attr_reader :identifiers

@@ -42,25 +44,28 @@ def strip_invalid_issns
    @identifiers[:issn] = nil unless validate_issn(@identifiers[:issn])
  end

  # validate_issn is only called when the regex for an ISSN has indicated an ISSN
  # of sufficient format is present - but the regex does not attempt to
  # validate that the check digit in the ISSN spec is correct. This method
  # does that calculation, so we do not return falsely detected ISSNs,
  # like "2015-2019".
  #
  # The algorithm is defined at
  # https://datatracker.ietf.org/doc/html/rfc3044#section-2.2
  # An example calculation is shared at
  # https://en.wikipedia.org/wiki/International_Standard_Serial_Number#Code_format
  def validate_issn(candidate)
    # This method is only called when the regex for an ISSN has indicated an ISSN
    # of sufficient format is present - but the regex does not attempt to
    # validate that the check digit in the ISSN spec is correct. This method
    # does that calculation, so we do not return falsely detected ISSNs,
    # like "2015-2019".
    #
    # The algorithm is defined at
    # https://datatracker.ietf.org/doc/html/rfc3044#section-2.2
    # An example calculation is shared at
    # https://en.wikipedia.org/wiki/International_Standard_Serial_Number#Code_format
    digits = candidate.gsub('-', '').chars[..6]
    check_digit = candidate.last.downcase
    sum = 0

    digits.each_with_index do |digit, idx|
      sum += digit.to_i * (8 - idx.to_i)
    end

    actual_digit = 11 - sum.modulo(11)
    actual_digit = 'x' if actual_digit == 10

    return true if actual_digit.to_s == check_digit.to_s

    false
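
The check-digit calculation described in the comment above can be sketched as a standalone, plain-Ruby helper. This is only an illustration, not the application's method: the name `issn_check_digit_valid?` is hypothetical, and it avoids Rails-only helpers such as `String#last`. One assumption worth flagging: when the weighted sum is an exact multiple of 11, the ISSN rule gives a check digit of 0, so the sketch reduces modulo 11 a second time instead of leaving the intermediate value at 11.

```ruby
# Hypothetical standalone helper (not the method from this commit).
# Returns true when the final character of an ISSN-shaped string matches
# the check digit computed from the first seven digits.
def issn_check_digit_valid?(candidate)
  chars = candidate.delete('-').chars
  return false unless chars.length == 8

  digits = chars[0, 7]
  return false unless digits.all? { |c| c.match?(/\d/) }

  # Weights run from 8 down to 2 across the first seven digits.
  sum = digits.each_with_index.sum { |digit, idx| digit.to_i * (8 - idx) }

  # The outer modulo maps a remainder of 0 to check digit 0 rather than 11.
  expected = (11 - (sum % 11)) % 11
  expected_char = expected == 10 ? 'X' : expected.to_s

  expected_char == chars.last.upcase
end
```

For example, `issn_check_digit_valid?('0028-0836')` computes a weighted sum of 82, a remainder of 5, and an expected check digit of 6, which matches the final character.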
@@ -1,10 +1,10 @@
## Pattern detection and metadata enhancement

The PatternDetector is responsible for identifying specific patterns within the input, such as using regular expressions to detect ISSN, ISBN, DOI, and PMID. Techniques other than regular expressions may also occur here, such as phrase matching to identify known scientific journals, or fingerprint matching to identify librarian-curated responses.
A PatternDetector is responsible for identifying specific patterns within the input, such as using regular expressions to detect ISSN, ISBN, DOI, and PMID (implemented in our StandardIdentifiers class). Techniques other than regular expressions may also be implemented as PatternDetectors, such as phrase matching to identify known scientific journals, or fingerprint matching to identify librarian-curated responses.

PatternDetector is only run when the incoming request asks for this type of information. This takes the form of requesting specific fields in the GraphQL that require PatternDetector to populate.
A PatternDetector is only run when the incoming request asks for this type of information. This takes the form of requesting specific fields via GraphQL that require a PatternDetector to populate.

PatternDetector will hand off to the Enhancer if detailed metadata is requested via GraphQL. This will allow the slowest portion of this data flow -- the external data lookups -- to only be run if the caller has specifically asked for that data. Some users may only be interested in knowing that patterns were found and what they were, whereas others are willing to wait longer for more detailed information.
An appropriate Enhancer for the specific Pattern will add more detailed metadata if requested via GraphQL. This will allow the slowest portion of this data flow -- the external data lookups (Enhancers) -- to only be run if the caller has specifically asked for that data. Some users may only be interested in knowing that patterns were found and what they were, whereas others are willing to wait longer for more detailed information. And others still won't be interested in either. **The incoming GraphQL will be the driver of which algorithms we run, and which external data we request.**
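
The request-driven flow described above can be sketched roughly as follows. Every name here (`detect_patterns`, `enhance`, `process`, the `:patterns` and `:metadata` field symbols, and the `lookup` callable standing in for an external data source) is hypothetical and chosen only for illustration; the real application wires this through GraphQL field resolution, which is not shown.

```ruby
# A minimal sketch with hypothetical names; not the application's code.
# ISSN-shaped strings only; the check digit is not validated here.
ISSN_PATTERN = /\b\d{4}-\d{3}[\dXx]\b/

# Detection is cheap: a regular expression over the input.
def detect_patterns(input)
  found = {}
  issn = input[ISSN_PATTERN]
  found[:issn] = issn if issn
  found
end

# Enhancement is the slow step: one external lookup per detected pattern.
def enhance(patterns, lookup)
  patterns.map { |kind, value| [kind, lookup.call(kind, value)] }.to_h
end

# Only do the work that the requested fields actually need.
def process(input, requested_fields, lookup:)
  wants_patterns = requested_fields.include?(:patterns)
  wants_metadata = requested_fields.include?(:metadata)
  result = {}
  return result unless wants_patterns || wants_metadata

  patterns = detect_patterns(input)
  result[:patterns] = patterns if wants_patterns
  result[:metadata] = enhance(patterns, lookup) if wants_metadata
  result
end
```

A caller that requests neither field triggers no detection and no lookups, a caller that requests only `:patterns` never invokes `lookup`, and only a caller that requests `:metadata` pays for the external calls.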

```mermaid
---
@@ -23,6 +23,7 @@ flowchart LR
metadata{metadata found?}
annotate[[annotate]]
output
enhance --> output
subgraph PatternDetector
direction TB
@@ -31,11 +32,12 @@ flowchart LR
detect --isbn----> found
detect --journal title--> found
detect --pmid--> found
annotate
end
subgraph DataLookup
subgraph Enhancer
lookup --> metadata
metadata -- yes --> enhance
metadata -- yes --> enhance[[enhance]]
enhance
end