
Tco17 historical snapshots aggregations #49

Merged
merged 7 commits into main from tco17-historical-snapshots-aggregations on Jul 10, 2024

Conversation

JPrevost
Member

@JPrevost JPrevost commented Jun 26, 2024

There are multiple commits as I wanted to keep a couple of small changes isolated so it would be easy to see what exactly was changing.

I recommend reading the commit messages for each change to get the full picture of this pull request.

Overall, this provides two new Classes and one Module that they share, allowing us to capture metrics on how well TACOS is doing at detecting search intent -- its primary purpose. These initial metrics are intended to be run monthly and keep track of how frequently each StandardIdentifier algorithm detects a match for the Term associated with a SearchEvent. When we add Hints (and every other detection in the future), we'll add their matches to these metrics as well.

This implements 2 of the 3 metric types from this Architecture Decision Record.

This is the first commit in this repository after we agreed to start experimenting with better Class and Method documentation. A separate PR will be added that describes in the README how to render the documentation during development, but if you are curious to test it now, the gist is:

  • `gem install yard`
  • create a `.yardopts` file in the root of the repo with the contents `--no-private --protected app/**/*.rb docs/**/*.md - README.md LICENSE`
  • `yard server --reload`
  • Note: the yard server cache is weird when deleting methods, but otherwise that's a nice "live-ish" rendering of the codebase documentation

The primary work on this is under:
https://mitlibraries.atlassian.net/browse/TCO-17

@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-49 June 26, 2024 12:51
@JPrevost JPrevost force-pushed the tco17-historical-snapshots-aggregations branch from d4972fa to 4c4f08e on June 26, 2024 13:06
@JPrevost JPrevost temporarily deployed to tacos-api-pipeline-pr-49 June 26, 2024 13:06
@jazairi jazairi self-assigned this Jun 26, 2024
Contributor

@jazairi jazairi left a comment

Great work on this! I think I'm at a point where I feel comfortable approving it, but first I'd like to remove that extra period in the scope (probably harmless, but it would make me feel better).

I'd also like your thoughts on where the module should live. I think it's more idiomatic to put it in app/lib, or at least that's what we've done in ETD. I'm open to alternatives if you feel it makes more sense in app/models, though.

EDIT: I should also note I love the detailed documentation here. I haven't tried generating the YARD docs yet to confirm everything looks as expected, but will do so once #50 lands.

app/models/search_event.rb
@@ -0,0 +1,27 @@
# frozen_string_literal: true
Contributor

Should this module go in app/lib rather than app/models?

Member Author

Great question, and my answer is "maybe" :)

I think there is a convention to put modules in lib, but I tend to look at lib as more generic code that could be useful in multiple applications, whereas app is for this application.

https://www.codewithjason.com/put-rails-modules/ notes a similar approach.

Member Author

I'm going to look at Concerns or a modules directory in app/models to address my desire to keep the app code together while also showing how it's different than other models.

old_month_pmid:
  term: pmid_38908367
  source: test
  created_at: <%= 1.year.ago %>
Contributor

This is clever!

Contributor

@jazairi jazairi left a comment

Given the expectation that we'll eventually need metrics for multiple features, I think the namespacing and the new directory structure make sense. This gets a :shipit: from me, and thanks for the careful consideration of that review comment. I didn't intend for it to prompt such a substantial change, but I think the codebase is better for it. 🙂

Member

@matt-bernhardt matt-bernhardt left a comment

Thanks for the time to look over this PR. I'm fine with this merging, and the specific question about namespaces makes sense to me. It is different from what we've done in the past, but I can see logic in handling things this way and I'm ready to give it a shot.

The comments I've made at various places are all non-blocking, just me noting a few things that stick out to me as I went through the changeset. Nothing needs to be changed, but there are some topics in the back of my mind that might be good to revisit down the road to see if adjustments might make sense.

  phrase: '10.1016/j.physio.2010.12.004'

isbn_9781319145446:
  phrase: 'Sadava, D. E., D. M. Hillis, et al. Life: The Science of Biology. 11th ed. W. H. Freeman, 2016. ISBN: 9781319145446'
Member

This is a non-blocking comment, but I'm curious to know how prevalent search terms that might trip multiple detectors are. If we see enough traffic like that, I think it would be worth setting up fixtures and tests to confirm that those conditions are handled appropriately.

Until we've verified that this sort of traffic is being received, however, I don't think we need to include it.

Member Author

We definitely get traffic like that and anecdotally it's fine, but I agree it might be worth creating explicit tests for this behavior to make sure we don't unintentionally break it in the future.


# with a month supplied), ideally at the beginning of the following month to ensure as
# accurate as possible statistics. Running further from the month in question will work, but matches will use the
# current versions of all algorithms which may not match the algorithm in place during the month the SearchEvent
# occurred.
Member

I agree that this is a concern (waiting too long to run a monthly recap will allow code to potentially change, and poison the totals).

Also, I kind of wonder whether it might be a decent alternative to run each summary prior to a new deploy, and key the stats based on the deployed Heroku version rather than (or in addition to) the month?

Member Author

Interesting. I do like the idea of storing a "code version" when running metrics. I'll open a ticket for this enhancement.


doi_expected_count = rand(1...100)
doi_expected_count.times do
  SearchEvent.create(term: terms(:doi), source: 'test')
end
Member

I'm not sure where to make this comment, but it's been gnawing at me and this was one of the places where it jumps into my brain.

I don't think we need to change anything - but I am noting that we've chosen to weight these totals by search frequency, rather than weighting every term in the collection equally. I'm not sure whether that decision is the right one - it seems fine to start with, and having reviewed the ADR for this I don't think we specifically discussed this question.

I'm commenting here mostly to get this thought out of my head, and see what either of you think about this. I'm fine with merging as is now and returning to this question down the road if needed.

Member Author

I think the behavior here is correct, assuming the question is "how often does TACOS understand a given search event?"

If the question were instead "what percentage of search terms does TACOS understand?" then I could see flipping it. However, as TACOS is designed to work at the SearchEvent level, I believe this is the correct question and the correct metric. I'm not opposed to also adding a variant metric that tracks Terms instead of Events.
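To make the distinction concrete with made-up numbers: if one Term is searched 100 times and matches a detector, while nine other Terms are each searched once and match nothing, the SearchEvent-level metric reports 100 of 109 events understood (roughly 92%), whereas a Term-level metric would report 1 of 10 Terms understood (10%).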

Please let me know if I didn't understand your question.

Why are these changes being introduced:

* The ISSN detection was returning a key set to `nil` when an ISSN-like
  pattern (probably a year span) was detected but failed the ISSN
  validation checksum. This differed from the other standard_identifiers
  and caused issues when using the standard_identifiers detection for
  our monthly statistics. This change deletes the key instead of setting
  it to `nil` when the checksum fails (sketched below).
* Our development logging wasn't showing a useful log level for
  development. I set it to debug here, but we may want to allow it to
  be controlled by ENV and also consider using lograge like we do on
  other apps.
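A rough sketch of the shape of that ISSN change, for illustration only (the method names, regex, and surrounding detector structure are assumptions rather than the PR's actual code):

```ruby
# Illustrative sketch, not the actual StandardIdentifiers code. On a failed
# checksum the :issn key is simply never set (equivalent to deleting it),
# rather than being assigned nil as it was before.
def issn_checksum_valid?(issn)
  digits = issn.delete('-').upcase.chars
  check = digits.pop
  sum = digits.each_with_index.sum { |digit, i| digit.to_i * (8 - i) }
  expected = (11 - sum % 11) % 11
  check == (expected == 10 ? 'X' : expected.to_s)
end

def detect_issn(phrase)
  detections = {}
  candidate = phrase[/\b\d{4}-\d{3}[\dXx]\b/]
  return detections unless candidate

  detections[:issn] = candidate if issn_checksum_valid?(candidate)
  # previously: detections[:issn] = nil when the checksum failed
  detections
end
```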
Why are these changes being introduced:

* The ability to filter to a specific month is a feature needed for the
  monthly aggregation of algorithm matches.

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TCO-17

How does this address that need:

* Creates a scope on the model
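A minimal sketch of what such a scope might look like (the `single_month` name and its signature are assumptions for illustration, not necessarily what the commit implements):

```ruby
class SearchEvent < ApplicationRecord
  belongs_to :term

  # Limits SearchEvents to those created during the month containing `month`,
  # e.g. SearchEvent.single_month(1.month.ago) for last month's events.
  scope :single_month, lambda { |month|
    where(created_at: month.beginning_of_month..month.end_of_month)
  }
end
```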
Why are these changes being introduced:

* Implement data models for counting algorithm matches for the current
  month's Terms.

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TCO-17

See also:

* https://github.com/MITLibraries/tacos/blob/main/docs/architecture-decisions/0005-use-multiple-minimal-historical-analytics-models.md

How does this address that need:

* Creates a new model `MontlyMatch`
* Adds methods to run each (current) StandardIdentifier algorithm on
  each Term (via the SearchEvents)

Document any side effects to this change:

* A schedulable job to run this automatically is out of scope and will
  be added under a separate ticket
Why are these changes being introduced:

* We'll hit 50 tests in the next commit, which is the threshold at which
  Rails runs tests in parallel by default.

How does this address that need:

* This updates the coverage config to work with parallel tests.
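A sketch of the kind of test_helper wiring this refers to, assuming SimpleCov is the coverage tool in use (the PR's actual configuration may differ):

```ruby
# test/test_helper.rb (sketch)
require 'simplecov'
SimpleCov.start 'rails'

ENV['RAILS_ENV'] ||= 'test'
require_relative '../config/environment'
require 'rails/test_help'

module ActiveSupport
  class TestCase
    # Rails parallelizes the suite once it crosses the default 50-test threshold.
    parallelize(workers: :number_of_processors)

    # Give each parallel worker its own SimpleCov command name so the
    # per-worker results are merged instead of overwriting each other.
    parallelize_setup do |worker|
      SimpleCov.command_name "#{SimpleCov.command_name}-#{worker}"
    end

    parallelize_teardown do |_worker|
      SimpleCov.result
    end

    fixtures :all
  end
end
```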
Why are these changes being introduced:

* Implement data models for counting algorithm matches for all Terms

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TCO-17

See also:

* https://github.com/MITLibraries/tacos/blob/main/docs/architecture-decisions/0005-use-multiple-minimal-historical-analytics-models.md

How does this address that need:

* Creates a new model `AggregateMatch`
* Adds methods to run each (current) StandardIdentifier algorithm on
  each Term (via the SearchEvents)
* Adjusts the `MontlyMatch` counting algorithm to be useful for both cases
  and extracts it to a module that is included in both classes (sketched
  below, after this commit message)

Document any side effects to this change:

* A schedulable job to run this automatically is out of scope and will
  be added under a separate ticket
* The tests are identical between this and `MontlyMatch`. There may be a
  way to avoid the duplication and thus ensure both get relevant updates
  but it was not clear to me how to do that in an obvious way at the
  time of this work.
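A rough sketch of that shared-module arrangement as of this commit, before the later namespace refactor (the module, class, method, and column names are illustrative rather than the PR's actual code):

```ruby
# Shared counting logic mixed into both metric models.
module MatchCounter
  # Tallies, per algorithm, how many of the given SearchEvents have a Term
  # that the current StandardIdentifiers detection matches.
  def count_matches(events)
    counts = Hash.new(0)
    events.includes(:term).find_each do |event|
      StandardIdentifiers.new(event.term.phrase).identifiers.each_key do |kind|
        counts[kind] += 1
      end
    end
    counts
  end
end

class MonthlyMatch < ApplicationRecord
  extend MatchCounter

  # One row of per-algorithm counts for a single month of SearchEvents.
  def self.generate(month)
    create!({ month: month.beginning_of_month }
              .merge(count_matches(SearchEvent.single_month(month))))
  end
end

class AggregateMatch < ApplicationRecord
  extend MatchCounter

  # One row of per-algorithm counts across all SearchEvents to date.
  def self.generate
    create!(count_matches(SearchEvent.all))
  end
end
```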
Moves both types of metrics on algorithms to use a single class and
associated database table.

Creating the namespace `Metrics` should allow us to better organize code
associated with various metrics into a common space that is both
intuitive and useful.

Relevant ticket(s):
https://mitlibraries.atlassian.net/browse/TCO-17
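A sketch of where that refactor could land (the class name, table name, and the idea of a nullable `month` column separating monthly rows from the all-time aggregate are assumptions for illustration):

```ruby
module Metrics
  class AlgorithmMatch < ApplicationRecord
    self.table_name = 'metrics_algorithm_matches'

    # Rows with `month` set summarize a single month of SearchEvents;
    # a row with `month: nil` holds the all-time aggregate counts.
    scope :monthly,   -> { where.not(month: nil) }
    scope :aggregate, -> { where(month: nil) }
  end
end
```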
@JPrevost JPrevost force-pushed the tco17-historical-snapshots-aggregations branch from e8e093d to 044b1a0 on July 10, 2024 18:22
@JPrevost JPrevost merged commit 69fe1f1 into main Jul 10, 2024
1 check passed
@JPrevost JPrevost deleted the tco17-historical-snapshots-aggregations branch July 10, 2024 18:24