Adds standard identifiers detection
Why are these changes being introduced:

Detection of patterns will eventually help us with categorizing the
incoming search terms, which is the ultimate goal of this application.

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/ENGX-242
* https://mitlibraries.atlassian.net/browse/ENGX-241

The StandardIdentifier patterns and tests were originally developed as
part of our TIMDEX UI application and have been extracted and modified
for use in this application.

https://github.com/MITLibraries/timdex-ui

co-authored-by: Matt Bernhardt <mjbernha@mit.edu>
JPrevost and matt-bernhardt committed Nov 17, 2023
1 parent f301976 commit f14464e
Showing 7 changed files with 353 additions and 0 deletions.
10 changes: 10 additions & 0 deletions app/graphql/types/search_event_type.rb
@@ -8,8 +8,18 @@ class SearchEventType < Types::BaseObject
field :created_at, GraphQL::Types::ISO8601DateTime, null: false
field :updated_at, GraphQL::Types::ISO8601DateTime, null: false
field :phrase, String
field :standard_identifiers, [StandardIdentifiersType]

def phrase
@object.term.phrase
end

def standard_identifiers
ids = []
StandardIdentifiers.new(@object.term.phrase).identifiers.each do |identifier|
ids << { kind: identifier.first, value: identifier.last }
end
ids
end
end
end
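
For instance (an illustrative console sketch, not part of the diff), a search event whose term phrase contains a DOI resolves to a one-element list, because `StandardIdentifiers#identifiers` returns a hash and each key/value pair becomes a `kind`/`value` entry:

```ruby
# Based on the resolver above; the sample value comes from the controller test later in this commit.
StandardIdentifiers.new('10.1038/nphys1170').identifiers
# => { doi: "10.1038/nphys1170" }
# which the standard_identifiers field serializes as:
# [{ kind: :doi, value: "10.1038/nphys1170" }]
```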
8 changes: 8 additions & 0 deletions app/graphql/types/standard_identifiers_type.rb
@@ -0,0 +1,8 @@
# frozen_string_literal: true

module Types
class StandardIdentifiersType < Types::BaseObject
field :kind, String, null: false
field :value, String, null: false
end
end
9 changes: 9 additions & 0 deletions app/graphql/types/term_type.rb
@@ -8,9 +8,18 @@ class TermType < Types::BaseObject
field :phrase, String, null: false
field :occurence_count, Integer
field :search_events, [SearchEventType], null: false
field :standard_identifiers, [StandardIdentifiersType]

def occurence_count
@object.search_events.count
end

def standard_identifiers
ids = []
StandardIdentifiers.new(@object.phrase).identifiers.each do |identifier|
ids << { kind: identifier.first, value: identifier.last }
end
ids
end
end
end
73 changes: 73 additions & 0 deletions app/models/standard_identifiers.rb
@@ -0,0 +1,73 @@
# frozen_string_literal: true

# StandardIdentifiers is a PatternDetector implementation that detects the identifiers DOI, ISBN, ISSN, and PMID.
# See /docs/reference/pattern_detection_and_enhancement.md for details.
class StandardIdentifiers
attr_reader :identifiers

def initialize(term)
@identifiers = {}
term_pattern_checker(term)
strip_invalid_issns
end

private

def term_pattern_checker(term)
term_patterns.each_pair do |type, pattern|
@identifiers[type.to_sym] = match(pattern, term) if match(pattern, term).present?
end
end

# Note on the limitations of this implementation:
# We only detect the first match of each pattern, so a search of "1234-5678 5678-1234" will not return two ISSNs as
# might be expected, but just "1234-5678". Using Ruby's String#scan(pattern) may be worthwhile if we want to detect
# all possible matches instead of just the first. That may require a larger refactor, though: an initial attempt at
# that change produced unintended results, so it was backed out for now.
def match(pattern, term)
pattern.match(term).to_s.strip
end
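
As the comment above notes, a `String#scan`-based variant could capture every match rather than just the first; one wrinkle, likely behind the unintended results mentioned, is that `scan` returns capture groups rather than the full match when the pattern contains groups. A possible sketch (not part of this commit), assuming we still want the full matched text:

```ruby
# Hypothetical alternative, not in this commit: collect every full match of a pattern.
# Driving scan through an enumerator lets us read Regexp.last_match for the whole match,
# even when the pattern contains capture groups.
def match_all(pattern, term)
  term.to_enum(:scan, pattern).map { Regexp.last_match.to_s.strip }
end
```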

# term_patterns are regex patterns to be applied to the basic search box input
def term_patterns
{
isbn: /\b(ISBN-*(1[03])* *(: ){0,1})*(([0-9Xx][- ]*){13}|([0-9Xx][- ]*){10})\b/,
issn: /\b[0-9]{4}-[0-9]{3}[0-9xX]\b/,
pmid: /\b((pmid|PMID): (\d{7,8}))\b/,
doi: %r{\b10\.(\d+\.*)+/(([^\s.])+\.*)+\b}
}
end

def strip_invalid_issns
return unless @identifiers[:issn]

@identifiers[:issn] = nil unless validate_issn(@identifiers[:issn])
end

# validate_issn is only called when the ISSN regex has indicated that a candidate of
# the right format is present - but the regex does not attempt to validate that the
# check digit defined in the ISSN spec is correct. This method does that calculation,
# so we do not return falsely detected ISSNs, like "2015-2019".
#
# The algorithm is defined at
# https://datatracker.ietf.org/doc/html/rfc3044#section-2.2
# An example calculation is shared at
# https://en.wikipedia.org/wiki/International_Standard_Serial_Number#Code_format
def validate_issn(candidate)
digits = candidate.gsub('-', '').chars[..6]
check_digit = candidate.last.downcase
sum = 0

digits.each_with_index do |digit, idx|
sum += digit.to_i * (8 - idx.to_i)
end

actual_digit = 11 - sum.modulo(11)
actual_digit = 'x' if actual_digit == 10

return true if actual_digit.to_s == check_digit.to_s

false
end
end
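
As a quick worked example of the check-digit math (illustrative only, using the valid ISSN `0250-6335` from the tests below):

```ruby
# Illustrative only: the same weighted-sum calculation validate_issn performs.
digits = '0250633'.chars.map(&:to_i)                      # first seven digits, hyphen removed
sum = digits.each_with_index.sum { |d, i| d * (8 - i) }   # 0*8 + 2*7 + 5*6 + 0*5 + 6*4 + 3*3 + 3*2 = 83
11 - sum.modulo(11)                                       # => 5, matching the final digit, so the ISSN is accepted
```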
60 changes: 60 additions & 0 deletions docs/reference/pattern_detection_and_enhancement.md
@@ -0,0 +1,60 @@
## Pattern detection and metadata enhancement

A Pattern Detector is responsible for identifying specific patterns within the input, such as using regular expressions to detect ISSN, ISBN, DOI, and PMID (implemented in our StandardIdentifiers class). Techniques other than regular expressions may also be used as Pattern Detectors, such as phrase matching to identify known scientific journals, or fingerprint matching to identify librarian-curated responses.

A Pattern Detector is only run when the incoming request has asked for this type of information to be returned. In practice, this takes the form of requesting, via GraphQL, specific fields that require a Pattern Detector to populate.

An appropriate Enhancer for the specific Pattern will add more detailed metadata if requested via GraphQL. This allows the slowest portion of this data flow -- the external data lookups (Enhancers) -- to run only if the caller has specifically asked for that data. Some users may only be interested in knowing that patterns were found and what they were, whereas others are willing to wait longer for more detailed information. And others still won't be interested in either. **The incoming GraphQL will be the driver of which algorithms we run, and which external data we request.**
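
As a concrete illustration (mirroring the controller test added in this commit), a caller that only wants to know which patterns were detected can ask for just the `standardIdentifiers` field, which the detection step alone can answer:

```ruby
# Illustrative only: the same query shape exercised in graphql_controller_test.rb.
query = <<~GRAPHQL
  {
    logSearchEvent(sourceSystem: "timdex", searchTerm: "10.1038/nphys1170") {
      standardIdentifiers {
        kind
        value
      }
    }
  }
GRAPHQL
```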

```mermaid
---
title: "Pattern Detector: detecting known patterns and selectively enhancing the output"
---
flowchart LR
accTitle: "Pattern Detector: detecting known patterns and selectively enhancing the output"
accDescr: A flow chart showing how input is analyzed for patterns and decisions are made based on what was found. The workflow is described fully in the paragraphs of text following this diagram.
input(input)
detect[PatternDetector]
lookup[(DataLookup)]
enhance(enhance)
found{found?}
details{details requested?}
metadata{metadata found?}
annotate[[annotate]]
output
enhance --> output
subgraph PatternDetector
direction TB
detect --doi--> found
detect --issn--> found
detect --isbn----> found
detect --journal title--> found
detect --pmid--> found
annotate
end
subgraph Enhancer
lookup --> metadata
metadata -- yes --> enhance[[enhance]]
enhance
end
input --> PatternDetector
metadata -- no --> output
found -- no --> output
found -- yes --> annotate
annotate --> details
details -- no --> output
details -- yes --> lookup
output
```

When receiving an input, first we detect known patterns such as DOI, ISSN, ISBN, PMID, or Journal Titles.

If we do not find any, we exit the flow with an empty output.

If we find one or more patterns, we annotate the eventual response with what we found. If the original request did not ask for details about the found patterns, we return the annotated response as is.

If the original request did ask for details about the found patterns, we look up additional information. If we do not find any, we return the annotated output as is. If we do, we enhance the annotation with the metadata we found and return that in the output.
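
A minimal Ruby sketch of that decision flow, assuming a hypothetical `Enhancer.lookup` interface (no Enhancer exists yet in this commit; `StandardIdentifiers` is the only detector implemented so far):

```ruby
# Hypothetical sketch only: Enhancer and details_requested are assumptions, not part of
# this commit. StandardIdentifiers is the real detector added above.
def annotated_response(phrase, details_requested: false)
  detected = StandardIdentifiers.new(phrase).identifiers.compact
  return {} if detected.empty?                    # found? -> no: empty output

  response = { standard_identifiers: detected }   # annotate with what we found
  return response unless details_requested        # details requested? -> no

  metadata = Enhancer.lookup(detected)            # the slow, external lookup
  metadata ? response.merge(metadata: metadata) : response
end
```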
17 changes: 17 additions & 0 deletions test/controllers/graphql_controller_test.rb
@@ -1,3 +1,5 @@
# frozen_string_literal: true

require 'test_helper'

class GraphqlControllerTest < ActionDispatch::IntegrationTest
@@ -48,4 +50,19 @@ class GraphqlControllerTest < ActionDispatch::IntegrationTest
assert_equal(200, response.status)
assert_equal Term.count, initial_term_count
end

test 'search event query can return detected standard identifiers' do
post '/graphql', params: { query: '{
logSearchEvent(sourceSystem: "timdex", searchTerm: "10.1038/nphys1170") {
standardIdentifiers {
kind
value
}
}
}' }

json = JSON.parse(response.body)
assert_equal('doi', json['data']['logSearchEvent']['standardIdentifiers'].first['kind'])
assert_equal('10.1038/nphys1170', json['data']['logSearchEvent']['standardIdentifiers'].first['value'])
end
end
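
For reference, the slice of the JSON response those assertions examine would take roughly this shape (illustrative only, inferred from the assertions above rather than a captured payload):

```ruby
# Inferred from the assertions above, not an actual captured response.
expected_shape = {
  'data' => {
    'logSearchEvent' => {
      'standardIdentifiers' => [
        { 'kind' => 'doi', 'value' => '10.1038/nphys1170' }
      ]
    }
  }
}
```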
176 changes: 176 additions & 0 deletions test/models/standard_identifiers_test.rb
@@ -0,0 +1,176 @@
# frozen_string_literal: true

require 'test_helper'

class StandardIdentifiersTest < ActiveSupport::TestCase
test 'ISBN detected in a string' do
actual = StandardIdentifiers.new('test 978-3-16-148410-0 test').identifiers

assert_equal('978-3-16-148410-0', actual[:isbn])
end

test 'ISBN-10 examples' do
# from wikipedia
samples = ['99921-58-10-7', '9971-5-0210-0', '960-425-059-0', '80-902734-1-6', '85-359-0277-5',
'1-84356-028-3', '0-684-84328-5', '0-8044-2957-X', '0-85131-041-9', '93-86954-21-4', '0-943396-04-2',
'0-9752298-0-X']

samples.each do |isbn|
actual = StandardIdentifiers.new(isbn).identifiers
assert_equal(isbn, actual[:isbn])
end
end

test 'ISBN-13 examples' do
samples = ['978-99921-58-10-7', '979-9971-5-0210-0', '978-960-425-059-0', '979-80-902734-1-6', '978-85-359-0277-5',
'979-1-84356-028-3', '978-0-684-84328-5', '979-0-8044-2957-X', '978-0-85131-041-9', '979-93-86954-21-4',
'978-0-943396-04-2', '979-0-9752298-0-X']

samples.each do |isbn|
actual = StandardIdentifiers.new(isbn).identifiers
assert_equal(isbn, actual[:isbn])
end
end

test 'not ISBNs' do
samples = ['orange cats like popcorn', '1234-6798', 'another ISBN not found here']

samples.each do |notisbn|
actual = StandardIdentifiers.new(notisbn).identifiers
assert_nil(actual[:isbn])
end
end

test 'ISBNs need boundaries' do
samples = ['990026671500206761', '979-0-9752298-0-XYZ']
# Note: there is a theoretical case where `asdf979-0-9752298-0-X` would still match as an ISBN-10 despite lacking
# a leading word boundary, because the `-` is treated as a boundary, so `0-9752298-0-X` would be detected. We can
# decide whether that matters in the future as we look for incorrect real-world matches.

samples.each do |notisbn|
actual = StandardIdentifiers.new(notisbn).identifiers
assert_nil(actual[:isbn])
end
end

test 'ISSNs detected in a string' do
actual = StandardIdentifiers.new('test 0250-6335 test').identifiers
assert_equal('0250-6335', actual[:issn])
end

test 'ISSN examples' do
samples = %w[0250-6335 0000-0019 1864-0761 1877-959X 0973-7758 1877-5683 1440-172X 1040-5631]

samples.each do |issn|
actual = StandardIdentifiers.new(issn).identifiers
assert_equal(issn, actual[:issn])
end
end

test 'not ISSN examples' do
samples = ['orange cats like popcorn', '12346798', 'another ISSN not found here', '99921-58-10-7']

samples.each do |notissn|
actual = StandardIdentifiers.new(notissn).identifiers
assert_nil(actual[:issn])
end
end

test 'ISSNs need boundaries' do
actual = StandardIdentifiers.new('12345-5678 1234-56789').identifiers
assert_nil(actual[:issn])
end

test 'ISSN validate rejects ISSNs with wrong check digit' do
samples = %w[
1234-5678
2015-2016
1460-2441
1460-2442
1460-2443
1460-2444
1460-2445
1460-2446
1460-2447
1460-2448
1460-2449
1460-2440
0250-6331
0250-6332
0250-6333
0250-6334
0250-6336
0250-6337
0250-6338
0250-6339
0250-6330
0250-633x
0250-633X
]
samples.each do |notissn|
actual = StandardIdentifiers.new(notissn).identifiers
assert_nil(actual[:issn])
end
end

test 'ISSN validate method accepts ISSNs with correct check digit' do
samples = %w[
1460-244X
2015-223x
0250-6335
0973-7758
]
samples.each do |issn|
actual = StandardIdentifiers.new(issn).identifiers
assert_equal(issn, actual[:issn])
end
end

test 'doi detected in string' do
actual = StandardIdentifiers.new('"Quantum tomography: Measured measurement", Markus Aspelmeyer, nature physics "\
"January 2009, Volume 5, No 1, pp11-12; [ doi:10.1038/nphys1170 ]').identifiers
assert_equal('10.1038/nphys1170', actual[:doi])
end

test 'doi examples' do
samples = %w[10.1038/nphys1170 10.1002/0470841559.ch1 10.1594/PANGAEA.726855 10.1594/GFZ.GEOFON.gfz2009kciu
10.1594/PANGAEA.667386 10.3207/2959859860 10.3866/PKU.WHXB201112303 10.1430/8105 10.1392/BC1.0]

samples.each do |doi|
actual = StandardIdentifiers.new(doi).identifiers
assert_equal(doi, actual[:doi])
end
end

test 'not doi examples' do
samples = ['orange cats like popcorn', '10.1234 almost doi', 'another doi not found here', '99921-58-10-7']

samples.each do |notdoi|
actual = StandardIdentifiers.new(notdoi).identifiers
assert_nil(actual[:doi])
end
end

test 'pmid detected in string' do
actual = StandardIdentifiers.new('Citation and stuff PMID: 35648703 more stuff.').identifiers
assert_equal('PMID: 35648703', actual[:pmid])
end

test 'pmid examples' do
samples = ['PMID: 35648703', 'pmid: 1234567']

samples.each do |pmid|
actual = StandardIdentifiers.new(pmid).identifiers
assert_equal(pmid, actual[:pmid])
end
end

test 'not pmid examples' do
samples = ['orange cats like popcorn', 'pmid:almost', 'PMID: asdf', '99921-58-10-7']

samples.each do |notpmid|
actual = StandardIdentifiers.new(notpmid).identifiers
assert_nil(actual[:pmid])
end
end
end
