Adds standard identifiers detection

Why are these changes being introduced: Detection of patterns will eventually help us with categorizing the incoming search terms, which is the ultimate goal of this application. Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/ENGX-242 * https://mitlibraries.atlassian.net/browse/ENGX-241 The StandardIdentifier patterns and tests were originally developed as part of our TIMDEX UI application and have been extracted and modified for use in this application. https://github.com/MITLibraries/timdex-ui co-authored-by: Matt Bernhardt <mjbernha@mit.edu>
MITLibraries · Nov 17, 2023 · f14464e · f14464e
1 parent f301976
commit f14464e
Show file tree

Hide file tree

Showing 7 changed files with 353 additions and 0 deletions.
diff --git a/app/graphql/types/search_event_type.rb b/app/graphql/types/search_event_type.rb
@@ -8,8 +8,18 @@ class SearchEventType < Types::BaseObject
     field :created_at, GraphQL::Types::ISO8601DateTime, null: false
     field :updated_at, GraphQL::Types::ISO8601DateTime, null: false
     field :phrase, String
+    field :standard_identifiers, [StandardIdentifiersType]
+
     def phrase
       @object.term.phrase
     end
+
+    def standard_identifiers
+      ids = []
+      StandardIdentifiers.new(@object.term.phrase).identifiers.each do |identifier|
+        ids << { kind: identifier.first, value: identifier.last }
+      end
+      ids
+    end
   end
 end
diff --git a/app/graphql/types/standard_identifiers_type.rb b/app/graphql/types/standard_identifiers_type.rb
@@ -0,0 +1,8 @@
+# frozen_string_literal: true
+
+module Types
+  class StandardIdentifiersType < Types::BaseObject
+    field :kind, String, null: false
+    field :value, String, null: false
+  end
+end
diff --git a/app/graphql/types/term_type.rb b/app/graphql/types/term_type.rb
@@ -8,9 +8,18 @@ class TermType < Types::BaseObject
     field :phrase, String, null: false
     field :occurence_count, Integer
     field :search_events, [SearchEventType], null: false
+    field :standard_identifiers, [StandardIdentifiersType]
 
     def occurence_count
       @object.search_events.count
     end
+
+    def standard_identifiers
+      ids = []
+      StandardIdentifiers.new(@object.phrase).identifiers.each do |identifier|
+        ids << { kind: identifier.first, value: identifier.last }
+      end
+      ids
+    end
   end
 end
diff --git a/app/models/standard_identifiers.rb b/app/models/standard_identifiers.rb
@@ -0,0 +1,73 @@
+# frozen_string_literal: true
+
+# StandardIdentifiers is a PatternDectector implementation that detects the identifiers DOI, ISBN, ISSN, PMID.
+# See /docs/reference/pattern_detection_and_enhancement.md for details.
+class StandardIdentifiers
+  attr_reader :identifiers
+
+  def initialize(term)
+    @identifiers = {}
+    term_pattern_checker(term)
+    strip_invalid_issns
+  end
+
+  private
+
+  def term_pattern_checker(term)
+    term_patterns.each_pair do |type, pattern|
+      @identifiers[type.to_sym] = match(pattern, term) if match(pattern, term).present?
+    end
+  end
+
+  # Note on the limitations of this implementation
+  # We only detect the first match of each pattern, so a search of "1234-5678 5678-1234" will not return two ISSNs as
+  # might be expected, but just "1234-5678". Using ruby's string.scan(pattern) may be worthwhile if we want to detect
+  # all possible matches instead of just the first. That may require a larger refactor though as initial tests of doing
+  # that change did result in unintended results so it was backed out for now.
+  def match(pattern, term)
+    pattern.match(term).to_s.strip
+  end
+
+  # term_patterns are regex patterns to be applied to the basic search box input
+  def term_patterns
+    {
+      isbn: /\b(ISBN-*(1[03])* *(: ){0,1})*(([0-9Xx][- ]*){13}|([0-9Xx][- ]*){10})\b/,
+      issn: /\b[0-9]{4}-[0-9]{3}[0-9xX]\b/,
+      pmid: /\b((pmid|PMID): (\d{7,8}))\b/,
+      doi: %r{\b10\.(\d+\.*)+/(([^\s.])+\.*)+\b}
+    }
+  end
+
+  def strip_invalid_issns
+    return unless @identifiers[:issn]
+
+    @identifiers[:issn] = nil unless validate_issn(@identifiers[:issn])
+  end
+
+  # validate_issn is only called when the regex for an ISSN has indicated an ISSN
+  # of sufficient format is present - but the regex does not attempt to
+  # validate that the check digit in the ISSN spec is correct. This method
+  # does that calculation, so we do not returned falsely detected ISSNs,
+  # like "2015-2019".
+  #
+  # The algorithm is defined at
+  # https://datatracker.ietf.org/doc/html/rfc3044#section-2.2
+  # An example calculation is shared at
+  # https://en.wikipedia.org/wiki/International_Standard_Serial_Number#Code_format
+  def validate_issn(candidate)
+    digits = candidate.gsub('-', '').chars[..6]
+    check_digit = candidate.last.downcase
+    sum = 0
+
+    digits.each_with_index do |digit, idx|
+      sum += digit.to_i * (8 - idx.to_i)
+    end
+
+    actual_digit = 11 - sum.modulo(11)
+    actual_digit = 'x' if actual_digit == 10
+
+    return true if actual_digit.to_s == check_digit.to_s
+
+    false
+  end
+end
diff --git a/docs/reference/pattern_detection_and_enhancement.md b/docs/reference/pattern_detection_and_enhancement.md
@@ -0,0 +1,60 @@
+## Pattern detection and metadata enhancement
+
+A Pattern Detector is responsible for identifying specific patterns within the input, such as using regular expressions to detect ISSN, ISBN, DOI, and PMID (implemented in our StandardIdentifiers Class). Other techniques than regular expressions may also occur as Pattern Detectors, such as doing phrase matching to identify known scientific journals, or fingerprint matching to identify librarian curated responses.
+
+A Pattern Detector is only run when the incoming data has requested this type of information to be returned. This will take the form of requesting specific fields to be returned via GraphQL that require using Pattern Detector to populate.
+
+An appropriate Enhancer for the specific Pattern will add more detailed metadata if requested via GraphQL. This will allow the slowest portion of this data flow -- the external data lookups (Enhancers) -- to only be run if the caller has specifically asked for that data. Some users may only be interested in knowing that patterns were found and what they were, whereas others are willing to wait longer for more detailed information. And others still won't be interested in either. **The incoming GraphQL will be the driver of which algorithms we run, and which external data we request.**
+
+```mermaid
+---
+title: "Pattern Detector: detecting known patterns and selectively enhancing the output"
+---
+flowchart LR
+  accTitle: "Pattern Detector: detecting known patterns and selectively enhancing the output"
+  accDescr: A flow chart showing how input is analyzed for patterns and decisions are made based on what was found. The workflow is described fully in the paragraphs of text following this diagram.
+
+  input(input)
+  detect[PatternDetector]
+  lookup[(DataLookup)]
+  enhance(enhance)
+  found{found?}
+  details{details requested?}
+  metadata{metadata found?}
+  annotate[[annotate]]
+  output
+  enhance --> output
+
+  subgraph PatternDetector
+    direction TB
+      detect --doi--> found
+      detect --issn--> found
+      detect --isbn----> found
+      detect --journal title--> found
+      detect --pmid--> found
+      annotate
+  end
+
+  subgraph Enhancer
+    lookup --> metadata
+    metadata -- yes --> enhance[[enhance]]
+    enhance
+  end
+
+  input --> PatternDetector
+  metadata -- no --> output
+  found -- no --> output
+  found -- yes --> annotate
+  annotate --> details
+  details -- no --> output
+  details -- yes --> lookup
+  output
+```
+
+When receiving an input, first we detect known patterns such as DOI, ISSN, ISBN, PMID, or Journal Titles.
+
+If we do not find any, we exit the flow with an empty output.
+
+If we find one more more patterns, we annotate the eventual response with what we found. If the original input did not request details for found patterns, we return the annotated response with what we found.
+
+If the original input did request details for found patterns, we lookup information. If we do not find additional information, we return the annotated output. If we do find additional information, we enhance the annotation with the metadata we have found and return that in the output.
diff --git a/test/controllers/graphql_controller_test.rb b/test/controllers/graphql_controller_test.rb
@@ -1,3 +1,5 @@
+# frozen_string_literal: true
+
 require 'test_helper'
 
 class GraphqlControllerTest < ActionDispatch::IntegrationTest
@@ -48,4 +50,19 @@ class GraphqlControllerTest < ActionDispatch::IntegrationTest
     assert_equal(200, response.status)
     assert_equal Term.count, initial_term_count
   end
+
+  test 'search event query can return detected standard identifiers' do
+    post '/graphql', params: { query: '{
+                                 logSearchEvent(sourceSystem: "timdex", searchTerm: "10.1038/nphys1170") {
+                                  standardIdentifiers {
+                                        kind
+                                        value
+                                  }
+                                 }
+                               }' }
+
+    json = JSON.parse(response.body)
+    assert_equal('doi', json['data']['logSearchEvent']['standardIdentifiers'].first['kind'])
+    assert_equal('10.1038/nphys1170', json['data']['logSearchEvent']['standardIdentifiers'].first['value'])
+  end
 end
diff --git a/test/models/standard_identifiers_test.rb b/test/models/standard_identifiers_test.rb
@@ -0,0 +1,176 @@
+# frozen_string_literal: true
+
+require 'test_helper'
+
+class StandardIdentifiersTest < ActiveSupport::TestCase
+  test 'ISBN detected in a string' do
+    actual = StandardIdentifiers.new('test 978-3-16-148410-0 test').identifiers
+
+    assert_equal('978-3-16-148410-0', actual[:isbn])
+  end
+
+  test 'ISBN-10 examples' do
+    # from wikipedia
+    samples = ['99921-58-10-7', '9971-5-0210-0', '960-425-059-0', '80-902734-1-6', '85-359-0277-5',
+               '1-84356-028-3', '0-684-84328-5', '0-8044-2957-X', '0-85131-041-9', '93-86954-21-4', '0-943396-04-2',
+               '0-9752298-0-X']
+
+    samples.each do |isbn|
+      actual = StandardIdentifiers.new(isbn).identifiers
+      assert_equal(isbn, actual[:isbn])
+    end
+  end
+
+  test 'ISBN-13 examples' do
+    samples = ['978-99921-58-10-7', '979-9971-5-0210-0', '978-960-425-059-0', '979-80-902734-1-6', '978-85-359-0277-5',
+               '979-1-84356-028-3', '978-0-684-84328-5', '979-0-8044-2957-X', '978-0-85131-041-9', '979-93-86954-21-4',
+               '978-0-943396-04-2', '979-0-9752298-0-X']
+
+    samples.each do |isbn|
+      actual = StandardIdentifiers.new(isbn).identifiers
+      assert_equal(isbn, actual[:isbn])
+    end
+  end
+
+  test 'not ISBNs' do
+    samples = ['orange cats like popcorn', '1234-6798', 'another ISBN not found here']
+
+    samples.each do |notisbn|
+      actual = StandardIdentifiers.new(notisbn).identifiers
+      assert_nil(actual[:isbn])
+    end
+  end
+
+  test 'ISBNs need boundaries' do
+    samples = ['990026671500206761', '979-0-9752298-0-XYZ']
+    # note, there is a theoretical case of `asdf979-0-9752298-0-X` returning as an ISBN 10 even though it doesn't
+    # have a word boundary because the `-` is treated as a boundary so `0-9752298-0-X` would be an ISBN10. We can
+    # consider whether we care in the future as we look for incorrect real-world matches.
+
+    samples.each do |notisbn|
+      actual = StandardIdentifiers.new(notisbn).identifiers
+      assert_nil(actual[:isbn])
+    end
+  end
+
+  test 'ISSNs detected in a string' do
+    actual = StandardIdentifiers.new('test 0250-6335 test').identifiers
+    assert_equal('0250-6335', actual[:issn])
+  end
+
+  test 'ISSN examples' do
+    samples = %w[0250-6335 0000-0019 1864-0761 1877-959X 0973-7758 1877-5683 1440-172X 1040-5631]
+
+    samples.each do |issn|
+      actual = StandardIdentifiers.new(issn).identifiers
+      assert_equal(issn, actual[:issn])
+    end
+  end
+
+  test 'not ISSN examples' do
+    samples = ['orange cats like popcorn', '12346798', 'another ISSN not found here', '99921-58-10-7']
+
+    samples.each do |notissn|
+      actual = StandardIdentifiers.new(notissn).identifiers
+      assert_nil(actual[:issn])
+    end
+  end
+
+  test 'ISSNs need boundaries' do
+    actual = StandardIdentifiers.new('12345-5678 1234-56789').identifiers
+    assert_nil(actual[:issn])
+  end
+
+  test 'ISSN validate rejects ISSNs with wrong check digit' do
+    samples = %w[
+      1234-5678
+      2015-2016
+      1460-2441
+      1460-2442
+      1460-2443
+      1460-2444
+      1460-2445
+      1460-2446
+      1460-2447
+      1460-2448
+      1460-2449
+      1460-2440
+      0250-6331
+      0250-6332
+      0250-6333
+      0250-6334
+      0250-6336
+      0250-6337
+      0250-6338
+      0250-6339
+      0250-6330
+      0250-633x
+      0250-633X
+    ]
+    samples.each do |notissn|
+      actual = StandardIdentifiers.new(notissn).identifiers
+      assert_nil(actual[:issn])
+    end
+  end
+
+  test 'ISSN validate method accepts ISSNs with correct check digit' do
+    samples = %w[
+      1460-244X
+      2015-223x
+      0250-6335
+      0973-7758
+    ]
+    samples.each do |issn|
+      actual = StandardIdentifiers.new(issn).identifiers
+      assert_equal(issn, actual[:issn])
+    end
+  end
+
+  test 'doi detected in string' do
+    actual = StandardIdentifiers.new('"Quantum tomography: Measured measurement", Markus Aspelmeyer, nature physics "\
+                                     "January 2009, Volume 5, No 1, pp11-12; [ doi:10.1038/nphys1170 ]').identifiers
+    assert_equal('10.1038/nphys1170', actual[:doi])
+  end
+
+  test 'doi examples' do
+    samples = %w[10.1038/nphys1170 10.1002/0470841559.ch1 10.1594/PANGAEA.726855 10.1594/GFZ.GEOFON.gfz2009kciu
+                 10.1594/PANGAEA.667386 10.3207/2959859860 10.3866/PKU.WHXB201112303 10.1430/8105 10.1392/BC1.0]
+
+    samples.each do |doi|
+      actual = StandardIdentifiers.new(doi).identifiers
+      assert_equal(doi, actual[:doi])
+    end
+  end
+
+  test 'not doi examples' do
+    samples = ['orange cats like popcorn', '10.1234 almost doi', 'another doi not found here', '99921-58-10-7']
+
+    samples.each do |notdoi|
+      actual = StandardIdentifiers.new(notdoi).identifiers
+      assert_nil(actual[:notdoi])
+    end
+  end
+
+  test 'pmid detected in string' do
+    actual = StandardIdentifiers.new('Citation and stuff PMID: 35648703 more stuff.').identifiers
+    assert_equal('PMID: 35648703', actual[:pmid])
+  end
+
+  test 'pmid examples' do
+    samples = ['PMID: 35648703', 'pmid: 1234567']
+
+    samples.each do |pmid|
+      actual = StandardIdentifiers.new(pmid).identifiers
+      assert_equal(pmid, actual[:pmid])
+    end
+  end
+
+  test 'not pmid examples' do
+    samples = ['orange cats like popcorn', 'pmid:almost', 'PMID: asdf', '99921-58-10-7']
+
+    samples.each do |notpmid|
+      actual = StandardIdentifiers.new(notpmid).identifiers
+      assert_nil(actual[:pmid])
+    end
+  end
+end