
Adds ADR for historical analytics approach #47

Merged
JPrevost merged 1 commit into main from tco16-historical-snapshot-adr on Jun 21, 2024

Conversation

JPrevost (Member):

Why are these changes being introduced:

  • There are multiple ways we can address this problem
  • Using an ADR to provide context and options, gather feedback on the proposed solution, and then merge the decision allows us to see what else we considered in addition to what we eventually implement.

Relevant ticket(s):

https://mitlibraries.atlassian.net/browse/TCO-16

@mitlib temporarily deployed to tacos-api-pipeline-pr-47 on June 20, 2024 16:10 (Inactive)
@matt-bernhardt (Member) left a comment:

I'm going to tentatively approve this ADR, subject to some of the questions I raise above. My strongest question is whether you're proposing these exact columns (including journal_title and staff_categorized), or if those are only illustrative.

My other comment is me raising some alternative structures, and working through why I think they don't actually work. I'm not sure whether this means that we should list them in the ADR text, but - unless you see something in here that I don't - I don't think they change the decision you're proposing.

Comment on lines +33 to +45
```mermaid
classDiagram
class TotalMatchesByAlgorithm
TotalMatchesByAlgorithm: +Integer id
TotalMatchesByAlgorithm: +Date month
TotalMatchesByAlgorithm: +Integer doi
TotalMatchesByAlgorithm: +Integer issn
TotalMatchesByAlgorithm: +Integer isbn
TotalMatchesByAlgorithm: +Integer pmid
TotalMatchesByAlgorithm: +Integer journal_title
TotalMatchesByAlgorithm: +Integer staff_categorized
TotalMatchesByAlgorithm: +Integer no_matches
```
matt-bernhardt (Member):

A few questions about this...

  1. Am I correct that both TotalMatchesByAlgorithm and MonthlyMatchesByAlgorithm have identical structures, so the difference is entirely in the code that populates them?
  2. At the moment, we have application code that would populate the doi, issn, isbn, and pmid columns - but the journal_title and staff_categorized columns feel unneeded at this moment. Would we create them now because we've mostly decided that we want to have them? Or would we wait until we can actually put something in them before creating the columns?
  3. More philosophically, the care and feeding of this application will require us to keep this data model in sync with the detectors we've implemented - so this feels like a good candidate for inclusion in the PR template for this app?

jazairi (Contributor):

Regarding question 2, my preference would be to add those columns when we're ready to use them. However, I like including them in the class diagram as an indication of future state.

JPrevost (Member, Author):

  1. Yes.
  2. Yes, we'd only add fields as we need them, not predict what we might need later (sketched below).
  3. Good point. Adding an "if this PR adds a new algorithm, please make sure analytics account for it" kind of item makes sense.
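
For instance, the journal_title column would only arrive in the PR that ships that detector, via a migration along these lines (a sketch, assuming a Rails app; the table and column names are illustrative, not a final schema):

```ruby
# Hypothetical migration, created only once a journal_title detector actually
# exists, per the "add fields as we need them" approach above.
class AddJournalTitleToMonthlyMatchesByAlgorithm < ActiveRecord::Migration[7.1]
  def change
    add_column :monthly_matches_by_algorithm, :journal_title, :integer,
               default: 0, null: false
  end
end
```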

Comment on lines +146 to +155
```mermaid
class SearchDetections
SearchDetections: +Integer id
SearchDetections: +Integer searchevent_id
SearchDetections: +Integer DOI NULL
SearchDetections: +Integer ISBN NULL
SearchDetections: +Integer ISSN NULL
SearchDetections: +Integer PMID NULL
SearchDetections: +Integer Hint NULL
SearchDetections: +Timestamp created_at
```
matt-bernhardt (Member):

I'm continuing to wonder about the Detections table structure, and wanted to mention two alternatives.

First, a more normalized approach. The need to create a column for every detector is picking at my data normalization instincts, and I'm tempted to float an arrangement like this:

```mermaid
classDiagram

  Term "1" --> "*" SearchEvent : has many
  SearchEvent "1" --> "*" LinkSearchDetectors : has many
  Detectors "1" --> "*" LinkSearchDetectors : has many

  class Term
    Term: +Integer id
    Term: +String phrase

  class SearchEvent
    SearchEvent: +Integer id
    SearchEvent: +Integer term_id
    SearchEvent: +Timestamp created_at

  class LinkSearchDetectors
    LinkSearchDetectors: +Integer id
    LinkSearchDetectors: +Integer searchevent_id
    LinkSearchDetectors: +Integer detector_id
    LinkSearchDetectors: +Integer result

  class Detectors
    Detectors: +Integer id
    Detectors: +String name
```

However, I don't think we should do this. While this design is more normalized, it would make maintenance of the application harder. Creating a new detector would no longer be a purely-code exercise, accomplishable with a PR and related data migration. We would still need the PR for code change, and a migration for any data model changes, but then we would also need to create new records in the Detectors table after the migrations are complete.
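
Concretely, shipping a hypothetical journal_title detector under this design would mean a third step after the code PR and the migration (a sketch; the model and attribute names are assumptions):

```ruby
# Extra operational step the normalized design implies: seed a Detectors row
# (e.g. in db/seeds.rb or a data migration) so that LinkSearchDetectors
# records can reference the new detector. Names here are hypothetical.
Detector.find_or_create_by!(name: "journal_title")
```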


The second option would be to combine the SearchEvent and SearchDetections tables:

```mermaid
classDiagram

  Term "1" --> "*" SearchEvent : has many

  class Term
    Term: +Integer id
    Term: +String phrase

  class SearchEvent
    SearchEvent: +Integer id
    SearchEvent: +Integer term_id
    SearchEvent: +Integer DOI NULL
    SearchEvent: +Integer ISBN NULL
    SearchEvent: +Integer ISSN NULL
    SearchEvent: +Integer PMID NULL
    SearchEvent: +Integer Hint NULL
    SearchEvent: +Timestamp created_at
```

I don't think this makes sense either. While it does work for naturally-received repeat searches over time (every search event gets its own entry, which will trigger its own detections), this model doesn't allow us to retroactively run the sorts of monthly reports that you're envisioning - not without creating extraneous SearchEvent records for every monthly report, which will make counts inaccurate.

What I'm not sure about, though, is how to handle commonly-seen searches during these monthly reports. These reports will be run once for every Term, not for every SearchEvent - because we don't need to run "web of science" 72 times in every monthly report. We need to run it once. Every monthly report would end up generating a new SearchDetections record, and I'm assuming that repeat runs would be latched onto the first received SearchEvent record?

The data might look like this:

| Term ID | Term phrase |
| --- | --- |
| 1 | web of science |
| 2 | some custom term only seen once |

| SearchEvent ID | Term ID | Created at |
| --- | --- | --- |
| 1 | 1 | June 14, 2024 8:03 am |
| 2 | 2 | June 21, 2024 10:14 am |
| 3 | 1 | July 1, 2024 4:17 pm |
| 4 | 2 | July 1, 2024 8:45 pm |

| SearchDetection ID | SearchEvent ID | DOI | ISBN | ISSN | PMID | Hint | Created at |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | June 14, 2024 8:03 am |
| 2 | 2 | 0 | 0 | 0 | 0 | 0 | June 21, 2024 10:14 am |
| 3 | 3 | 0 | 0 | 0 | 0 | 0 | July 1, 2024 4:17 pm |
| 4 | 4 | 0 | 0 | 0 | 0 | 1 | July 1, 2024 8:45 pm |
| 5 | 1 | 0 | 0 | 0 | 0 | 1 | August 1, 2024 12:00 am |
| 6 | 2 | 0 | 0 | 0 | 0 | 0 | August 1, 2024 12:00 am |
| 7 | 1 | 0 | 0 | 0 | 0 | 1 | September 1, 2024 12:00 am |
| 8 | 2 | 0 | 0 | 0 | 0 | 0 | September 1, 2024 12:00 am |

The SearchDetections records here indicate four naturally received search events and their detections (rows 1-4 in the detections table). There are also two rounds of monthly batch reports (rows 5-8), including every term, hanging off the first observation of each search term.
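
In code, I'd expect that batch to work roughly like this (a sketch only; the model names follow the diagrams above, but the Detector::*.match? API is made up):

```ruby
# Monthly batch sketch: evaluate each Term once, and latch the resulting
# SearchDetection onto that term's earliest SearchEvent rather than creating
# extraneous SearchEvent records.
Term.find_each do |term|
  first_event = SearchEvent.where(term_id: term.id).order(:created_at).first
  next unless first_event

  SearchDetection.create!(
    searchevent_id: first_event.id,
    doi:  Detector::Doi.match?(term.phrase)  ? 1 : 0,
    isbn: Detector::Isbn.match?(term.phrase) ? 1 : 0,
    issn: Detector::Issn.match?(term.phrase) ? 1 : 0,
    pmid: Detector::Pmid.match?(term.phrase) ? 1 : 0,
    hint: Detector::Hint.match?(term.phrase) ? 1 : 0
  )
end
```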

This feels workable to me, but I want to make sure I'm following.

@jazairi (Contributor) left a comment:

I agree with the decision here. The more verbose detections table is intriguing, but it seems to me that the extra data would not serve the intended purpose here of validating how often our algorithms are finding matches.

An approach like this, which generates not only counts but also the details of what is detected / returned, would provide greater visibility into application performance - but at the cost of needing to be recorded at query time, and not retroactively as a batch.

This stores a lot more data, and it isn't clear what we would use it for. This approach should likely be implemented in the future only if we find we have a specific question that can only be answered with this level of detail. We would lose historical data by waiting to implement it, but that risk is outweighed by avoiding storing a bunch of data without a clear purpose.
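
By contrast, the count-only tables can be populated retroactively from the stored terms. A sketch of that rollup follows (assuming a Rails app; the detector and model names are assumptions, not the app's actual API):

```ruby
# Retroactive, count-only rollup: nothing has to be recorded at query time,
# because counts are derived from Terms already in the database.
counts = Hash.new(0)
detectors = { doi: Detector::Doi, isbn: Detector::Isbn,
              issn: Detector::Issn, pmid: Detector::Pmid }

Term.find_each do |term|
  matched = detectors.select { |_, detector| detector.match?(term.phrase) }
  matched.each_key { |column| counts[column] += 1 }
  counts[:no_matches] += 1 if matched.empty?
end

MonthlyMatchesByAlgorithm.create!(month: Date.current.beginning_of_month, **counts)
```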
jazairi (Contributor):

Alternatively, we could store the data for now and drop the table if it becomes clear that there's no purpose for it. The downside of that (aside from storing a lot of data in the interim) is knowing when to decide whether it's useful or not.

matt-bernhardt (Member):

I concur - knowing when to get rid of that data feels like a thornier process than just not collecting it until we have a clearer understanding of how we'll handle it and when to delete it.

@JPrevost force-pushed the tco16-historical-snapshot-adr branch from 4ad2f10 to df559a9 on June 21, 2024 19:20
JPrevost (Member, Author):

Noting that we discussed this in Zoom and decided to merge this PR as-is, with the understanding that the database structures here are examples; we'll confirm the exact structures to be used during implementation PRs.

@JPrevost merged commit 6c84afd into main on Jun 21, 2024 (1 check passed)
@JPrevost deleted the tco16-historical-snapshot-adr branch on June 21, 2024 19:21