Engx242 identifierpatterns #13

Merged: 3 commits, Nov 17, 2023
9 changes: 9 additions & 0 deletions app/graphql/types/query_type.rb
@@ -32,5 +32,14 @@ def log_search_event(search_term:, source_system:)
term = Term.create_or_find_by!(phrase: search_term)
term.search_events.create!(source: source_system)
end

field :lookup_term, TermType, null: true,
description: 'Lookup a term to return information about it (bypasses logging)' do
argument :search_term, String, required: true
end

def lookup_term(search_term:)
Term.find_by(phrase: search_term)
end
end
end
14 changes: 14 additions & 0 deletions app/graphql/types/search_event_type.rb
@@ -7,5 +7,19 @@ class SearchEventType < Types::BaseObject
field :source, String
field :created_at, GraphQL::Types::ISO8601DateTime, null: false
field :updated_at, GraphQL::Types::ISO8601DateTime, null: false
field :phrase, String
field :standard_identifiers, [StandardIdentifiersType]

def phrase
@object.term.phrase
end

def standard_identifiers
StandardIdentifiers.new(@object.term.phrase).identifiers.map do |kind, value|
{ kind: kind, value: value }
end
end
end
end
8 changes: 8 additions & 0 deletions app/graphql/types/standard_identifiers_type.rb
@@ -0,0 +1,8 @@
# frozen_string_literal: true

module Types
class StandardIdentifiersType < Types::BaseObject
field :kind, String, null: false
field :value, String, null: false
end
end
25 changes: 25 additions & 0 deletions app/graphql/types/term_type.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
# frozen_string_literal: true

module Types
class TermType < Types::BaseObject
field :id, ID, null: false
field :created_at, GraphQL::Types::ISO8601DateTime, null: false
field :updated_at, GraphQL::Types::ISO8601DateTime, null: false
field :phrase, String, null: false
field :occurence_count, Integer
field :search_events, [SearchEventType], null: false
field :standard_identifiers, [StandardIdentifiersType]

def occurence_count
@object.search_events.count
end

def standard_identifiers
StandardIdentifiers.new(@object.phrase).identifiers.map do |kind, value|
{ kind: kind, value: value }
end
end
end
end
73 changes: 73 additions & 0 deletions app/models/standard_identifiers.rb
@@ -0,0 +1,73 @@
# frozen_string_literal: true

# StandardIdentifiers is a PatternDetector implementation that detects the identifiers DOI, ISBN, ISSN, and PMID.
# See /docs/reference/pattern_detection_and_enhancement.md for details.
class StandardIdentifiers
attr_reader :identifiers

def initialize(term)
@identifiers = {}
term_pattern_checker(term)
strip_invalid_issns
end

private

def term_pattern_checker(term)
term_patterns.each_pair do |type, pattern|
matched = match(pattern, term)
@identifiers[type.to_sym] = matched if matched.present?
end
end

# Note on the limitations of this implementation:
# We only detect the first match of each pattern, so a search for "1234-5678 5678-1234" will not return two ISSNs, as
# might be expected, but just "1234-5678". Using Ruby's String#scan(pattern) may be worthwhile if we want to detect
# all possible matches instead of just the first. That may require a larger refactor, though: an initial attempt at
# that change produced unintended results, so it was backed out for now.
def match(pattern, term)
pattern.match(term).to_s.strip
end

# term_patterns are regex patterns to be applied to the basic search box input
def term_patterns
{
isbn: /\b(ISBN-*(1[03])* *(: ){0,1})*(([0-9Xx][- ]*){13}|([0-9Xx][- ]*){10})\b/,
issn: /\b[0-9]{4}-[0-9]{3}[0-9xX]\b/,
pmid: /\b((pmid|PMID): (\d{7,8}))\b/,
doi: %r{\b10\.(\d+\.*)+/(([^\s.])+\.*)+\b}
}
end

def strip_invalid_issns
return unless @identifiers[:issn]

@identifiers.delete(:issn) unless validate_issn(@identifiers[:issn])
end

# validate_issn is only called when the ISSN regex has matched a candidate
# of the right format. The regex does not attempt to validate that the
# check digit defined in the ISSN spec is correct. This method does that
# calculation, so we do not return falsely detected ISSNs, like "2019-2020".
# Note that some year ranges, such as "2015-2019", happen to carry a valid
# check digit and will still be reported.
#
# The algorithm is defined at
# https://datatracker.ietf.org/doc/html/rfc3044#section-2.2
# An example calculation is shared at
# https://en.wikipedia.org/wiki/International_Standard_Serial_Number#Code_format
def validate_issn(candidate)
digits = candidate.delete('-').chars[..6]
check_digit = candidate[-1].downcase

# Weight the first seven digits 8 down to 2 and sum the products.
sum = digits.each_with_index.sum { |digit, idx| digit.to_i * (8 - idx) }

# The outer modulo maps a remainder of 0 to the check digit 0 instead of 11.
actual_digit = (11 - sum.modulo(11)) % 11
actual_digit = 'x' if actual_digit == 10

actual_digit.to_s == check_digit
end
end
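
The limitation noted in the comment above `match` (only the first match per pattern is kept) and the check digit rule can be illustrated together. The following is a minimal standalone sketch, not code from this PR: `ISSN_PATTERN`, `issn_check_digit_valid?`, and `all_valid_issns` are hypothetical names, and it assumes plain Ruby without ActiveSupport. It uses `String#scan` to collect every ISSN-shaped candidate and keeps only those whose check digit is correct.

```ruby
# frozen_string_literal: true

# Hypothetical constant: same shape as the issn pattern in term_patterns.
ISSN_PATTERN = /\b[0-9]{4}-[0-9]{3}[0-9xX]\b/

# Hypothetical helper: true when the final character equals the check digit
# computed from the first seven digits (weights 8 down to 2, modulo 11).
def issn_check_digit_valid?(candidate)
  chars = candidate.delete('-').chars
  sum = chars.first(7).each_with_index.sum { |digit, idx| digit.to_i * (8 - idx) }
  expected = (11 - (sum % 11)) % 11
  expected = expected == 10 ? 'x' : expected.to_s
  expected == chars.last.downcase
end

# Hypothetical helper: scan returns every non-overlapping match. Because
# ISSN_PATTERN has no capture groups, each element is the full matched string.
def all_valid_issns(term)
  term.scan(ISSN_PATTERN).select { |candidate| issn_check_digit_valid?(candidate) }
end
```

`scan` returns full matched strings here only because the ISSN pattern has no capture groups. For patterns with groups, such as the ISBN, PMID, and DOI patterns in this file, `scan` returns arrays of group captures instead, which may be one source of the unintended results the comment mentions.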
60 changes: 60 additions & 0 deletions docs/reference/pattern_detection_and_enhancement.md
@@ -0,0 +1,60 @@
## Pattern detection and metadata enhancement

A Pattern Detector identifies specific patterns within the input. One example uses regular expressions to detect ISSN, ISBN, DOI, and PMID (implemented in our StandardIdentifiers class). Pattern Detectors may also use techniques other than regular expressions, such as phrase matching to identify known scientific journals, or fingerprint matching to identify librarian-curated responses.

A Pattern Detector runs only when the incoming request asks for the information it produces. In practice, this means the GraphQL query requests specific fields that a Pattern Detector must populate.

An appropriate Enhancer for a given pattern adds more detailed metadata when requested via GraphQL. This means the slowest portion of the data flow, the external data lookups (Enhancers), only runs if the caller has specifically asked for that data. Some users may only want to know which patterns were found, others are willing to wait longer for more detailed information, and others still want neither. **The incoming GraphQL will be the driver of which algorithms we run, and which external data we request.**

```mermaid
---
title: "Pattern Detector: detecting known patterns and selectively enhancing the output"
---
flowchart LR
accTitle: "Pattern Detector: detecting known patterns and selectively enhancing the output"
accDescr: A flow chart showing how input is analyzed for patterns and decisions are made based on what was found. The workflow is described fully in the paragraphs of text following this diagram.
input(input)
detect[PatternDetector]
lookup[(DataLookup)]
enhance(enhance)
found{found?}
details{details requested?}
metadata{metadata found?}
annotate[[annotate]]
output
enhance --> output
subgraph PatternDetector
direction TB
detect --doi--> found
detect --issn--> found
detect --isbn----> found
detect --journal title--> found
detect --pmid--> found
annotate
end
subgraph Enhancer
lookup --> metadata
metadata -- yes --> enhance[[enhance]]
enhance
end
input --> PatternDetector
metadata -- no --> output
found -- no --> output
found -- yes --> annotate
annotate --> details
details -- no --> output
details -- yes --> lookup
output
```

When we receive an input, we first detect known patterns such as DOI, ISSN, ISBN, PMID, or journal titles.

If we do not find any, we exit the flow with an empty output.

If we find one or more patterns, we annotate the eventual response with what we found. If the original input did not request details for found patterns, we return the annotated response as is.

If the original input did request details for found patterns, we look up additional information. If we do not find any, we return the annotated output. If we do, we enhance the annotation with the metadata we found and return that in the output.
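
The flow described above can be sketched in plain Ruby. This is an illustrative sketch under stated assumptions, not the application's implementation: `DETECTORS`, `detect`, `analyze`, and the `lookup` callable are hypothetical names, only two of the patterns are included, and the external lookup is stubbed as any callable that returns a metadata hash or nil.

```ruby
# frozen_string_literal: true

# Hypothetical registry of detectors; a subset of the patterns, for brevity.
DETECTORS = {
  issn: /\b[0-9]{4}-[0-9]{3}[0-9xX]\b/,
  pmid: /\b(?:pmid|PMID): \d{7,8}\b/
}.freeze

# Detect: returns a hash of kind => first matched value (empty if none).
def detect(input)
  DETECTORS.each_with_object({}) do |(kind, pattern), found|
    matched = input[pattern]
    found[kind] = matched if matched
  end
end

# Annotate, then enhance only when details were requested.
# `lookup` stands in for an Enhancer: any callable taking (kind, value)
# and returning a hash of metadata, or nil when nothing was found.
def analyze(input, details_requested: false, lookup: ->(_kind, _value) {})
  found = detect(input)
  return [] if found.empty?

  found.map do |kind, value|
    annotation = { kind: kind, value: value }
    next annotation unless details_requested

    metadata = lookup.call(kind, value)
    metadata ? annotation.merge(metadata: metadata) : annotation
  end
end
```

In this sketch the lookup is never invoked unless `details_requested` is true, which mirrors the intent that the slow external calls run only when the caller asks for them.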
17 changes: 17 additions & 0 deletions test/controllers/graphql_controller_test.rb
@@ -1,3 +1,5 @@
# frozen_string_literal: true

require 'test_helper'

class GraphqlControllerTest < ActionDispatch::IntegrationTest
@@ -48,4 +50,19 @@ class GraphqlControllerTest < ActionDispatch::IntegrationTest
assert_equal(200, response.status)
assert_equal Term.count, initial_term_count
end

test 'search event query can return detected standard identifiers' do
post '/graphql', params: { query: '{
logSearchEvent(sourceSystem: "timdex", searchTerm: "10.1038/nphys1170") {
standardIdentifiers {
kind
value
}
}
}' }

json = JSON.parse(response.body)
assert_equal('doi', json['data']['logSearchEvent']['standardIdentifiers'].first['kind'])
assert_equal('10.1038/nphys1170', json['data']['logSearchEvent']['standardIdentifiers'].first['value'])
end
end