Tco17 historical snapshots aggregations #49
Conversation
Force-pushed from d4972fa to 4c4f08e.
Great work on this! I think I'm at a point where I feel comfortable approving it, but first I'd like to remove that extra period in the scope (probably harmless, but it would make me feel better).
I'd also like your thoughts on where the module should live. I think it's more idiomatic to put it in app/lib, or at least that's what we've done in ETD. I'm open to alternatives if you feel it makes more sense in app/models, though.
EDIT: I should also note I love the detailed documentation here. I haven't tried generating the YARD docs yet to confirm everything looks as expected, but will do so once #50 lands.
app/models/match_counter.rb (Outdated)
@@ -0,0 +1,27 @@
# frozen_string_literal: true
Should this module go in app/lib rather than app/models?
Great question, and my answer is "maybe" :)
I think there is a convention to put modules in lib, but I tend to look at lib as more generic code that could be useful in multiple applications, whereas app is for this application.
https://www.codewithjason.com/put-rails-modules/ notes a similar approach.
I'm going to look at Concerns or a modules directory in app/models to address my desire to keep the app code together while also showing how it's different from other models.
old_month_pmid:
  term: pmid_38908367
  source: test
  created_at: <%= 1.year.ago %>
This is clever!
Force-pushed from 4c4f08e to 26cafd0.
Given the expectation that we'll eventually need metrics for multiple features, I think the namespacing and the new directory structure make sense. This gets a 👍 from me, and thanks for the careful consideration of that review comment. I didn't intend for it to prompt such a substantial change, but I think the codebase is better for it. 🙂
Thanks for taking the time to look over this PR. I'm fine with this merging, and the specific question about namespaces makes sense to me. It is different from what we've done in the past, but I can see the logic in handling things this way and I'm ready to give it a shot.
The comments I've made in various places are all non-blocking, just me noting a few things that stuck out to me as I went through the changeset. Nothing needs to be changed, but there are some topics in the back of my mind that might be good to revisit down the road to see if adjustments make sense.
  phrase: '10.1016/j.physio.2010.12.004'

isbn_9781319145446:
  phrase: 'Sadava, D. E., D. M. Hillis, et al. Life: The Science of Biology. 11th ed. W. H. Freeman, 2016. ISBN: 9781319145446'
This is a non-blocking comment, but I'm curious how prevalent search terms that might trip multiple detectors are. If we see enough traffic like that, I think it would be worth setting up fixtures and tests to confirm that those conditions are handled appropriately.
Until we've verified that this sort of traffic is being received, however, I don't think we need to include it.
We definitely get traffic like that and anecdotally it's fine, but I agree it might be worth creating explicit tests for this behavior to make sure we don't unintentionally break it in the future.
# with a month supplied), ideally at the beginning of the following month to ensure as
# accurate as possible statistics. Running further from the month in question will work, but matches will use the
# current versions of all algorithms which may not match the algorithm in place during the month the SearchEvent
# occurred.
I agree that this is a concern (waiting too long to run a monthly recap allows the code to change and poison the totals).
Also, I wonder whether it might be a decent alternative to run each summary prior to a new deploy, and key the stats on the deployed Heroku version rather than (or in addition to) the month?
Interesting. I do like the idea of storing a "code version" when running metrics. I'll open a ticket for this enhancement.
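A minimal sketch of what "storing a code version" could look like. This is an assumption, not code from the PR: Heroku exposes `HEROKU_RELEASE_VERSION` (e.g. "v123") when the dyno metadata feature is enabled, and the fallback keeps local runs working. The `Metrics::Algorithms` call in the comment uses assumed model and column names.

```ruby
# Hypothetical helper: read the deployed Heroku release version, falling
# back to a placeholder when running outside Heroku (local dev, CI).
def current_code_version
  ENV.fetch('HEROKU_RELEASE_VERSION', 'development')
end

# The metric row could then record it alongside the month, e.g. (assumed names):
# Metrics::Algorithms.create(month: month, doi: doi_count, code_version: current_code_version)
```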
doi_expected_count = rand(1...100)
doi_expected_count.times do
  SearchEvent.create(term: terms(:doi), source: 'test')
end
I'm not sure where to make this comment, but it's been gnawing at me and this was one of the places where it jumps into my brain.
I don't think we need to change anything - but I am noting that we've chosen to weight these totals by search frequency, rather than weighting every term in the collection equally. I'm not sure whether that decision is the right one - it seems fine to start with, and having reviewed the ADR for this I don't think we specifically discussed this question.
I'm commenting here mostly to get this thought out of my head, and see what either of you think about this. I'm fine with merging as is now and returning to this question down the road if needed.
I think the behavior here is correct, assuming the question is "how often does TACOS understand a given search event?"
If the question was instead "what percentage of search terms does TACOS understand?" then I could see flipping it. However, as TACOS is designed to work at the SearchEvent level, I believe this is the correct question and the correct metric. I'm not opposed to also adding a variant metric that tracks Terms instead of Events.
Please let me know if I didn't understand your question.
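To make the distinction between the two questions concrete, here is a small sketch with invented data showing how the event-level and term-level rates can diverge when a popular term repeats:

```ruby
# Invented search events: one DOI term searched twice, one unmatched term.
events = [
  { term: '10.1016/j.physio.2010.12.004', match: true },
  { term: '10.1016/j.physio.2010.12.004', match: true },
  { term: 'generic cat videos',           match: false }
]

# Event-level rate: how often does TACOS understand a given SearchEvent?
event_rate = events.count { |e| e[:match] }.fdiv(events.size)        # 2 of 3

# Term-level rate: what share of distinct Terms does TACOS understand?
unique_terms = events.uniq { |e| e[:term] }
term_rate = unique_terms.count { |e| e[:match] }.fdiv(unique_terms.size) # 1 of 2
```

Weighting by search frequency (the event-level rate) rewards matching the terms people actually type most often, which fits TACOS's SearchEvent-centric design.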
Why are these changes being introduced:
* The ISSN detection was returning a key set to `nil` when an ISSN-like pattern (probably a year span) was detected but failed the ISSN validation checksum. This differed from other standard_identifiers and caused issues when using the standard_identifiers detection for our monthly statistics. This change deletes the key instead of setting it to `nil` when the checksum fails.
* Our development logging wasn't showing a useful log level. I set it to debug here, but we may want to allow it to be controlled by ENV, and also consider using lograge as we do on other apps.
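A rough sketch of the nil-versus-delete fix described in this commit message. The method and hash key names are assumptions for illustration; the checksum shown is the standard ISSN check-digit algorithm (weights 8 down to 2, mod 11, with X for a remainder of 10), not the project's actual implementation.

```ruby
# Validate an ISSN candidate via its check digit (e.g. '0378-5955' is valid).
def issn_checksum_valid?(candidate)
  digits = candidate.delete('-')
  return false unless digits.length == 8

  sum = digits.chars[0..6].each_with_index.sum { |c, i| c.to_i * (8 - i) }
  check = (11 - sum % 11) % 11
  digits[7].upcase == (check == 10 ? 'X' : check.to_s)
end

# Hypothetical detector step: on a failed checksum, remove the key entirely
# rather than leaving detections[:issn] = nil, so callers see a uniform
# "no key means no match" shape across standard identifier types.
def detect_issn(detections, candidate)
  if issn_checksum_valid?(candidate)
    detections[:issn] = candidate
  else
    detections.delete(:issn) # previously the bug left a nil value here
  end
  detections
end
```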
Why are these changes being introduced:
* The ability to filter to a specific month is a feature needed for the monthly aggregation of algorithm matches

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TCO-17

How does this address that need:
* Creates a scope on the model
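A plain-Ruby sketch of the month range such a scope filters on. `month_range` is a hypothetical helper name, not code from this PR; the commented scope shows how ActiveSupport's `Date#all_month` could make it a one-liner in the model.

```ruby
require 'date'

# Return the inclusive date range covering the month containing `date`.
def month_range(date)
  start  = Date.new(date.year, date.month, 1)
  finish = Date.new(date.year, date.month, -1) # -1 selects the last day
  start..finish
end

# In the model this could become something like (assumed scope name):
# scope :single_month, ->(date) { where(created_at: date.all_month) }
```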
Why are these changes being introduced:
* Implement data models for counting algorithm matches for the current month's Terms

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TCO-17

See also:
* https://github.com/MITLibraries/tacos/blob/main/docs/architecture-decisions/0005-use-multiple-minimal-historical-analytics-models.md

How does this address that need:
* Creates a new model `MontlyMatch`
* Adds methods to run each (current) StandardIdentifier algorithm on each Term (via the SearchEvents)

Document any side effects to this change:
* A schedulable job to run this automatically is out of scope and will be added under a separate ticket
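A framework-free sketch of the counting approach this commit describes: run each detector over each search event's term and tally the matches per algorithm. Detector names and regexes here are simplified stand-ins, not the project's StandardIdentifier patterns.

```ruby
# Simplified stand-in detectors; each returns true when its pattern matches.
DETECTORS = {
  doi:  ->(term) { term.match?(%r{\b10\.\d{4,9}/\S+}) },
  isbn: ->(term) { term.match?(/\b97[89]\d{10}\b/) }
}.freeze

# Count, per detector, how many search-event terms matched.
def count_matches(terms)
  counts = Hash.new(0)
  terms.each do |term|
    DETECTORS.each do |name, detector|
      counts[name] += 1 if detector.call(term)
    end
  end
  counts
end
```

The monthly variant would feed this the terms from a single month's SearchEvents, while the aggregate variant would feed it everything.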
Why are these changes being introduced:
* We'll hit 50 tests in the next commit, which is the Rails default threshold for running tests in parallel

How does this address that need:
* This updates the coverage config to work with parallel tests
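For context, a sketch of the SimpleCov setup commonly documented for Rails parallel tests (this is an assumption about what the coverage config change looks like, not the actual diff): each worker records results under a distinct command name and results are collated at teardown.

```ruby
# test/test_helper.rb (sketch) -- SimpleCov with Rails parallel tests
class ActiveSupport::TestCase
  parallelize(workers: :number_of_processors)

  # Give each worker a unique command name so results don't collide.
  parallelize_setup do |worker|
    SimpleCov.command_name "#{SimpleCov.command_name}-#{worker}"
  end

  # Collate coverage results when each worker finishes.
  parallelize_teardown do |worker|
    SimpleCov.result
  end
end
```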
Why are these changes being introduced:
* Implement data models for counting algorithm matches for all Terms

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TCO-17

See also:
* https://github.com/MITLibraries/tacos/blob/main/docs/architecture-decisions/0005-use-multiple-minimal-historical-analytics-models.md

How does this address that need:
* Creates a new model `AggregateMatch`
* Adds methods to run each (current) StandardIdentifier algorithm on each Term (via the SearchEvents)
* Adjusts the `MontlyMatch` counting algorithm to be useful for both cases and extracts it to a module which is imported into both classes

Document any side effects to this change:
* A schedulable job to run this automatically is out of scope and will be added under a separate ticket
* The tests are identical between this and `MontlyMatch`. There may be a way to avoid the duplication and thus ensure both get relevant updates, but it was not clear to me how to do that in an obvious way at the time of this work.
Moves both types of metrics on algorithms to use a single class and associated database table. Creating the namespace `Metrics` should allow us to better organize code associated with various metrics into a common space that is both intuitive and useful. Relevant ticket(s): https://mitlibraries.atlassian.net/browse/TCO-17
Force-pushed from e8e093d to 044b1a0.
There are multiple commits as I wanted to keep a couple of small changes isolated so it would be easy to see what exactly was changing.
I recommend reading the commit messages for each change to get the full picture of this pull request.
Overall, this provides two new classes and one shared module that allow us to capture metrics on how well TACOS is doing at detecting search intent -- its primary purpose. These initial metrics are intended to run monthly and track how frequently each StandardIdentifier algorithm detects a match for the Term associated with a SearchEvent. As we add Hints (and any other detections in the future), we'll add their matches to these metrics as well.
This implements 2 of the 3 metric types from this Architecture Decision Record.
This is the first commit in this repository after we agreed to start experimenting with better Class and Method documentation. A separate PR will describe in the README how to render the documentation during development, but if you are curious to test that now, the gist is:
1. gem install yard
2. Create a .yardopts file in the root of the repo with the contents: --no-private --protected app/**/*.rb docs/**/*.md - README.md LICENSE
3. yard server --reload
The primary work on this is under:
https://mitlibraries.atlassian.net/browse/TCO-17