Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Engx242 identifierpatterns #13

Merged
merged 3 commits into from
Nov 17, 2023
Merged

Engx242 identifierpatterns #13

merged 3 commits into from
Nov 17, 2023

Conversation

JPrevost
Copy link
Member

@JPrevost JPrevost commented Nov 8, 2023

Why these changes are being introduced

Adds detection and return of detected ISSN, ISBN, DOI, PMID if requested via GraphQL

If the field is not requested via GraphQL, we do not do the detection.

Not included at this time is external lookups to return additional information, but we may want to discuss how that would work as part of this to ensure that need fits in cleanly with this GraphQL design

Side effects of this change

  • Adds support for lookupTerm to be able to query without logging an event
  • Adds phrase to SearchEventType even though it is technically part of the TermType for convenience

Relevant ticket(s)

@JPrevost JPrevost marked this pull request as draft November 8, 2023 19:32
@JPrevost JPrevost marked this pull request as ready for review November 15, 2023 20:55
@jazairi jazairi self-assigned this Nov 15, 2023
Copy link
Contributor

@jazairi jazairi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has my tentative approval. Before merging, I think we should make a naming decision vis-à-vis StandardIdentifiers and PatternDetector (see my comment in the reference doc).

Also, I'm not sure how much rebasing you plan to do, but I agree with your Slack comment about adding Matt as coauthor, and possibly adding a note that some of these algorithms came from our TIMDEX-UI experiment. That seems like it may be useful to our future selves.

end

def validate_issn(candidate)
# This model is only called when the regex for an ISSN has indicated an ISSN
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo, I think: "model" should be "method" here.

@@ -0,0 +1,58 @@
## Pattern detection and metadata enhancement

The PatternDetector is responsible for identifying specific patterns within the input, such as using regular expressions to detect ISSN, ISBN, DOI, and PMID. Other techniques than regular expressions may also occur here, such as doing phrase matching to identify known scientific journals, or fingerprint matching to identify librarian curated responses.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Pattern Detector is a more descriptive name for what the model and GraphQL type are trying to do, but less descriptive for the GraphQL field than StandardIdentifiers. Would it be confusing for the field to have a different name than the model and type? Is that even possible?

This documentation also mentions an Enhancer, which is not yet implemented in this PR. (I assume it will be soon, though, based on the experiment you shared last week.) If that model/type is not named Enhancer, then an alternative approach could be to clarify which models/types are an example of Pattern Detector, and which are an example of Enhancer. In other words, Pattern Detector and Enhancer are how we're describing the features at a high level, whereas StandardIdentifiers and RecordsFound (or whatever) are the actual implementations of those features. In that case, we should make that clarification in this reference doc.

I do think we should make a decision on this before merging, because I suspect it will be confusing if the naming conventions in the docs don't match what's in the code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is super helpful feedback and I agree. I'll give it some thought, likely ask for a synchronous chat, then make appropriate changes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm leaning towards having PatternDetectors and Enhancers be the high level concept for this type of work, and StandardIdentifiers being our first implementation of a PatternDetector and coming soon, LookupDoi (etc) being implementations of Enhancers.

I'll be adding a commit that reflects this and then reaching out for further discussion.

app/models/standard_identifiers.rb Show resolved Hide resolved
Copy link
Contributor

@jazairi jazairi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This documentation reads much clearer to me, and I think this approach gives us the best of both naming conventions. Thanks for incorporating that feedback. 👍

@@ -1,10 +1,10 @@
## Pattern detection and metadata enhancement

The PatternDetector is responsible for identifying specific patterns within the input, such as using regular expressions to detect ISSN, ISBN, DOI, and PMID. Other techniques than regular expressions may also occur here, such as doing phrase matching to identify known scientific journals, or fingerprint matching to identify librarian curated responses.
A PatternDetector is responsible for identifying specific patterns within the input, such as using regular expressions to detect ISSN, ISBN, DOI, and PMID (implemented in our StandardIdentifiers Class). Other techniques than regular expressions may also occur as PatternDetectors, such as doing phrase matching to identify known scientific journals, or fingerprint matching to identify librarian curated responses.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is non-blocking: it may be useful to write this as Pattern Detector rather than PatternDetector, so it's clear that we're talking about a concept and not a model/module.

Why are these changes being introduced:

Detection of patterns will eventually help us with categorizing the
incoming search terms, which is the ultimate goal of this application.

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/ENGX-242
* https://mitlibraries.atlassian.net/browse/ENGX-241

The StandardIdentifier patterns and tests were originally developed as
part of our TIMDEX UI application and have been extracted and modified
for use in this application.

https://github.com/MITLibraries/timdex-ui

co-authored-by: Matt Bernhardt <mjbernha@mit.edu>
@JPrevost JPrevost merged commit 0503811 into main Nov 17, 2023
2 checks passed
@JPrevost JPrevost deleted the engx242-identifierpatterns branch November 17, 2023 14:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants