
Tco17 historical snapshots aggregations #49

Merged
merged 7 commits into main from tco17-historical-snapshots-aggregations on Jul 10, 2024

Conversation

JPrevost
Member

@JPrevost JPrevost commented Jun 26, 2024

There are multiple commits as I wanted to keep a couple of small changes isolated so it would be easy to see what exactly was changing.

I recommend reading the commit messages for each change to get the full picture of this pull request.

Overall, this provides two new Classes and one Module that they share, allowing us to capture metrics on how well TACOS is doing at detecting search intent -- its primary purpose. These initial metrics are intended to be run monthly and keep track of how frequently each StandardIdentifier algorithm detects a match for the Term associated with a SearchEvent. When we add Hints (and every other detection in the future), we'll add their matches to these metrics as well.

This implements 2 of the 3 metric types from this Architecture Decision Record.

This is the first commit in this repository after we agreed to start experimenting with better Class and Method documentation. A separate PR will be added that describes in the README how to render the documentation during development, but if you are curious to test it now, the gist is:

  • `gem install yard`
  • create a `.yardopts` file in the root of the repo with the contents `--no-private --protected app/**/*.rb docs/**/*.md - README.md LICENSE`
  • `yard server --reload`
  • Note: the yard server cache is weird when deleting methods, but otherwise that's a nice "live-ish" rendering of the codebase documentation

The primary work on this is under:
https://mitlibraries.atlassian.net/browse/TCO-17

@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-49 June 26, 2024 12:51
@JPrevost JPrevost force-pushed the tco17-historical-snapshots-aggregations branch from d4972fa to 4c4f08e on June 26, 2024 13:06
@JPrevost JPrevost temporarily deployed to tacos-api-pipeline-pr-49 June 26, 2024 13:06
@jazairi jazairi self-assigned this Jun 26, 2024
Contributor

@jazairi jazairi left a comment

Great work on this! I think I'm at a point where I feel comfortable approving it, but first I'd like to remove that extra period in the scope (probably harmless, but it would make me feel better).

I'd also like your thoughts on where the module should live. I think it's more idiomatic to put it in app/lib, or at least that's what we've done in ETD. I'm open to alternatives if you feel it makes more sense in app/models, though.

EDIT: I should also note I love the detailed documentation here. I haven't tried generating the YARD docs yet to confirm everything looks as expected, but will do so once #50 lands.

app/models/search_event.rb
@@ -0,0 +1,27 @@
# frozen_string_literal: true
Contributor

Should this module go in app/lib rather than app/models?

Member Author

Great question, and my answer is "maybe" :)

I think there is a convention to put modules in lib, but I tend to look at lib as more generic code that could be useful in multiple applications, whereas app is for this application.

https://www.codewithjason.com/put-rails-modules/ notes a similar approach.

Member Author

I'm going to look at Concerns or a modules directory in app/models to address my desire to keep the app code together while also showing how it's different than other models.

old_month_pmid:
  term: pmid_38908367
  source: test
  created_at: <%= 1.year.ago %>
Contributor

This is clever!

Contributor

@jazairi jazairi left a comment

Given the expectation that we'll eventually need metrics for multiple features, I think the namespacing and the new directory structure make sense. This gets a :shipit: from me, and thanks for the careful consideration of that review comment. I didn't intend for it to prompt such a substantial change, but I think the codebase is better for it. 🙂

Member

@matt-bernhardt matt-bernhardt left a comment

Thanks for the time to look over this PR. I'm fine with this merging, and the specific question about namespaces makes sense to me. It is different from what we've done in the past, but I can see logic in handling things this way and I'm ready to give it a shot.

The comments I've made at various places are all non-blocking, just me noting a few things that stick out to me as I went through the changeset. Nothing needs to be changed, but there are some topics in the back of my mind that might be good to revisit down the road to see if adjustments might make sense.

  phrase: '10.1016/j.physio.2010.12.004'

isbn_9781319145446:
  phrase: 'Sadava, D. E., D. M. Hillis, et al. Life: The Science of Biology. 11th ed. W. H. Freeman, 2016. ISBN: 9781319145446'
Member

This is a non-blocking comment, but I'm curious to know how prevalent search terms that might trip multiple detectors are. If we see enough traffic like that, I think it would be worth setting up fixtures and tests to confirm that those conditions are handled appropriately.

Until we've verified that this sort of traffic is being received, however, I don't think we need to include it.

Member Author

We definitely get traffic like that and anecdotally it's fine, but I agree it might be worth creating explicit tests for this behavior to make sure we don't unintentionally break it in the future.


# with a month supplied), ideally at the beginning of the following month to ensure as
# accurate as possible statistics. Running further from the month in question will work, but matches will use the
# current versions of all algorithms which may not match the algorithm in place during the month the SearchEvent
# occurred.
Member

I agree that this is a concern (waiting too long to run a monthly recap will allow code to potentially change, and poison the totals).

Also, I kind of wonder whether it might be a decent alternative to run each summary prior to a new deploy, and key the stats based on the deployed Heroku version rather than (or in addition to) the month?

Member Author

Interesting. I do like the idea of storing a "code version" when running metrics. I'll open a ticket for this enhancement.


doi_expected_count = rand(1...100)
doi_expected_count.times do
  SearchEvent.create(term: terms(:doi), source: 'test')
end
Member

I'm not sure where to make this comment, but it's been gnawing at me and this was one of the places where it jumps into my brain.

I don't think we need to change anything - but I am noting that we've chosen to weight these totals by search frequency, rather than weighting every term in the collection equally. I'm not sure whether that decision is the right one - it seems fine to start with, and having reviewed the ADR for this I don't think we specifically discussed this question.

I'm commenting here mostly to get this thought out of my head, and see what either of you think about this. I'm fine with merging as is now and returning to this question down the road if needed.

Member Author

I think the behavior here is correct, assuming the question is "how often does TACOS understand a given search event?"

If the question were instead "what percentage of search terms does TACOS understand?" then I could see flipping it. However, as TACOS is designed to work at the SearchEvent level, I believe this is the correct question and the correct metric. I'm not opposed to also adding a variant metric that tracks Terms instead of Events.
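To make the distinction concrete with made-up numbers: if one Term is searched 100 times and matches a detector, while nine other Terms are each searched once and match nothing, the SearchEvent-level metric reports 100 of 109 events understood (roughly 92%), whereas a Term-level metric would report 1 of 10 Terms understood (10%).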

Please let me know if I didn't understand your question.

Why are these changes being introduced:

* The ISSN detection was returning a key set to `nil` when an ISSN-like
  pattern (probably a year span) was detected but failed the ISSN
  validation checksum. This differed from the other standard_identifiers
  and caused issues when using the standard_identifiers detection for
  our monthly statistics. This change deletes the key instead of setting
  it to `nil` when the checksum fails (sketched below).
* Our development logging wasn't showing a useful log level for
  development. I set it to debug here, but we may want to allow it to
  be controlled by ENV and also consider using lograge like we do on
  other apps.
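A rough sketch of the shape of that ISSN change, for illustration only (the method names, regex, and surrounding detector structure are assumptions rather than the PR's actual code):

```ruby
# Illustrative sketch, not the actual StandardIdentifiers code. On a failed
# checksum the :issn key is simply never set (equivalent to deleting it),
# rather than being assigned nil as it was before.
def issn_checksum_valid?(issn)
  digits = issn.delete('-').upcase.chars
  check = digits.pop
  sum = digits.each_with_index.sum { |digit, i| digit.to_i * (8 - i) }
  expected = (11 - sum % 11) % 11
  check == (expected == 10 ? 'X' : expected.to_s)
end

def detect_issn(phrase)
  detections = {}
  candidate = phrase[/\b\d{4}-\d{3}[\dXx]\b/]
  return detections unless candidate

  detections[:issn] = candidate if issn_checksum_valid?(candidate)
  # previously: detections[:issn] = nil when the checksum failed
  detections
end
```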
Why are these changes being introduced:

* The ability to filter to a specific month is a feature needed for the
  monthly aggregation of algorithm matches.

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TCO-17

How does this address that need:

* Creates a scope on the model
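A minimal sketch of what such a scope might look like (the `single_month` name and its signature are assumptions for illustration, not necessarily what the commit implements):

```ruby
class SearchEvent < ApplicationRecord
  belongs_to :term

  # Limits SearchEvents to those created during the month containing `month`,
  # e.g. SearchEvent.single_month(1.month.ago) for last month's events.
  scope :single_month, lambda { |month|
    where(created_at: month.beginning_of_month..month.end_of_month)
  }
end
```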
Why are these changes being introduced:

* Implement data models for counting algorithm matches for the current
  month's Terms.

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TCO-17

See also:

* https://github.com/MITLibraries/tacos/blob/main/docs/architecture-decisions/0005-use-multiple-minimal-historical-analytics-models.md

How does this address that need:

* Creates a new model `MontlyMatch`
* Adds methods to run each (current) StandardIdentifier algorithm on
  each Term (via the SearchEvents)

Document any side effects to this change:

* A schedulable job to run this automatically is out of scope and will
  be added under a separate ticket
Why are these changes being introduced:

* We'll hit 50 tests in the next commit, which is the threshold at which
  Rails runs tests in parallel by default.

How does this address that need:

* This updates the coverage config to work with parallel tests.
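A sketch of the kind of test_helper wiring this refers to, assuming SimpleCov is the coverage tool in use (the PR's actual configuration may differ):

```ruby
# test/test_helper.rb (sketch)
require 'simplecov'
SimpleCov.start 'rails'

ENV['RAILS_ENV'] ||= 'test'
require_relative '../config/environment'
require 'rails/test_help'

module ActiveSupport
  class TestCase
    # Rails parallelizes the suite once it crosses the default 50-test threshold.
    parallelize(workers: :number_of_processors)

    # Give each parallel worker its own SimpleCov command name so the
    # per-worker results are merged instead of overwriting each other.
    parallelize_setup do |worker|
      SimpleCov.command_name "#{SimpleCov.command_name}-#{worker}"
    end

    parallelize_teardown do |_worker|
      SimpleCov.result
    end

    fixtures :all
  end
end
```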
Why are these changes being introduced:

* Implement data models for counting algorithm matches for all Terms

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TCO-17

See also:

* https://github.com/MITLibraries/tacos/blob/main/docs/architecture-decisions/0005-use-multiple-minimal-historical-analytics-models.md

How does this address that need:

* Creates a new model `AggregateMatch`
* Adds methods to run each (current) StandardIdentifier algorithm on
  each Term (via the SearchEvents)
* Adjusts the `MontlyMatch` counting algorithm to be useful for both cases
  and extracts it to a module that is included in both classes (sketched
  below, after this commit message)

Document any side effects to this change:

* A schedulable job to run this automatically is out of scope and will
  be added under a separate ticket
* The tests are identical between this and `MontlyMatch`. There may be a
  way to avoid the duplication and thus ensure both get relevant updates
  but it was not clear to me how to do that in an obvious way at the
  time of this work.
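A rough sketch of that shared-module arrangement as of this commit, before the later namespace refactor (the module, class, method, and column names are illustrative rather than the PR's actual code):

```ruby
# Shared counting logic mixed into both metric models.
module MatchCounter
  # Tallies, per algorithm, how many of the given SearchEvents have a Term
  # that the current StandardIdentifiers detection matches.
  def count_matches(events)
    counts = Hash.new(0)
    events.includes(:term).find_each do |event|
      StandardIdentifiers.new(event.term.phrase).identifiers.each_key do |kind|
        counts[kind] += 1
      end
    end
    counts
  end
end

class MonthlyMatch < ApplicationRecord
  extend MatchCounter

  # One row of per-algorithm counts for a single month of SearchEvents.
  def self.generate(month)
    create!({ month: month.beginning_of_month }
              .merge(count_matches(SearchEvent.single_month(month))))
  end
end

class AggregateMatch < ApplicationRecord
  extend MatchCounter

  # One row of per-algorithm counts across all SearchEvents to date.
  def self.generate
    create!(count_matches(SearchEvent.all))
  end
end
```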
Moves both types of metrics on algorithms to use a single class and
associated database table.

Creating the namespace `Metrics` should allow us to better organize code
associated with various metrics into a common space that is both
intuitive and useful.

Relevant ticket(s):
https://mitlibraries.atlassian.net/browse/TCO-17
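A sketch of where that refactor could land (the class name, table name, and the idea of a nullable `month` column separating monthly rows from the all-time aggregate are assumptions for illustration):

```ruby
module Metrics
  class AlgorithmMatch < ApplicationRecord
    self.table_name = 'metrics_algorithm_matches'

    # Rows with `month` set summarize a single month of SearchEvents;
    # a row with `month: nil` holds the all-time aggregate counts.
    scope :monthly,   -> { where.not(month: nil) }
    scope :aggregate, -> { where(month: nil) }
  end
end
```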
@JPrevost JPrevost force-pushed the tco17-historical-snapshots-aggregations branch from e8e093d to 044b1a0 on July 10, 2024 18:22
@JPrevost JPrevost merged commit 69fe1f1 into main Jul 10, 2024
1 check passed
@JPrevost JPrevost deleted the tco17-historical-snapshots-aggregations branch July 10, 2024 18:24