Adds ADR for historical analytics approach #47
Conversation
I'm going to tentatively approve this ADR, subject to some of the questions I raise above. My strongest question is whether you're proposing these exact columns (including journal_title and staff_categorized), or if those are only illustrative.
My other comment is me raising some alternative structures, and working through why I think they don't actually work. I'm not sure whether this means that we should list them in the ADR text, but - unless you see something in here that I don't - I don't think they change the decision you're proposing.
```mermaid
classDiagram
    class TotalMatchesByAlgorithm
    TotalMatchesByAlgorithm: +Integer id
    TotalMatchesByAlgorithm: +Date month
    TotalMatchesByAlgorithm: +Integer doi
    TotalMatchesByAlgorithm: +Integer issn
    TotalMatchesByAlgorithm: +Integer isbn
    TotalMatchesByAlgorithm: +Integer pmid
    TotalMatchesByAlgorithm: +Integer journal_title
    TotalMatchesByAlgorithm: +Integer staff_categorized
    TotalMatchesByAlgorithm: +Integer no_matches
```
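Treating the columns above as illustrative rather than final, here's a rough plain-Ruby sketch (all record shapes and field names are hypothetical, not the app's actual models) of how per-search detection rows could be rolled up into one TotalMatchesByAlgorithm-style row per month:

```ruby
require "date"

# Hypothetical per-search detection records: one row per SearchEvent,
# with a 0/1 flag per detector (mirroring the SearchDetections table).
detections = [
  { month: Date.new(2024, 6, 1), doi: 1, issn: 0, isbn: 0, pmid: 0 },
  { month: Date.new(2024, 6, 1), doi: 0, issn: 0, isbn: 0, pmid: 0 },
  { month: Date.new(2024, 7, 1), doi: 0, issn: 1, isbn: 0, pmid: 1 },
]

ALGORITHMS = %i[doi issn isbn pmid]

# One aggregate row per month: sum each detector's flags, and count the
# rows where no detector matched at all.
rollup = detections.group_by { |d| d[:month] }.map do |month, rows|
  counts = ALGORITHMS.to_h { |a| [a, rows.sum { |r| r[a] }] }
  counts[:no_matches] = rows.count { |r| ALGORITHMS.all? { |a| r[a].zero? } }
  { month: month }.merge(counts)
end
```

The same aggregation would presumably run once for the all-time table and once per month for the monthly table, which is consistent with the two tables sharing a structure.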
A few questions about this...
- Am I correct that both the TotalMatchesByAlgorithm and MonthlyMatchesByAlgorithm tables have identical structures, so the difference is entirely in the code that populates them?
- At the moment, we have application code that would populate the doi, issn, isbn, and pmid columns - but the journal_title and staff_categorized columns feel unneeded at this moment. Would we create them now because we've mostly decided that we want to have them? Or would we wait until we can actually put something in them before creating the columns?
- More philosophically, the care and feeding of this application would require us to keep this data model in sync with the detectors that we've implemented - so it feels like this would be a good candidate for inclusion in the PR template for this app?
Regarding question 2, my preference would be to add those columns when we're ready to use them. However, I like including them in the class diagram as an indication of future state.
- Yes.
- Yes - we'd only add fields as we need them, rather than predicting what we might need later.
- Good point. Adding an "if this PR adds a new algorithm, please make sure analytics account for it" kind of item to the template makes sense.
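For concreteness, a checklist item along these lines (wording entirely hypothetical) could be added to the repository's pull request template:

```markdown
## Developer checklist

- [ ] If this PR adds or changes a detection algorithm, confirm that the
      analytics tables (e.g. MonthlyMatchesByAlgorithm) account for it.
```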
```mermaid
classDiagram
    class SearchDetections
    SearchDetections: +Integer id
    SearchDetections: +Integer searchevent_id
    SearchDetections: +Integer DOI NULL
    SearchDetections: +Integer ISBN NULL
    SearchDetections: +Integer ISSN NULL
    SearchDetections: +Integer PMID NULL
    SearchDetections: +Integer Hint NULL
    SearchDetections: +Timestamp created_at
```
I'm continuing to wonder about the Detections table structure, and wanted to mention two alternatives.
First, a more normalized approach. The need to create a column for every detector is picking at my data normalization instincts, and I'm tempted to float an arrangement like this:
```mermaid
classDiagram
    Term --< SearchEvent : has many
    LinkSearchDetectors --< SearchEvent : has many
    LinkSearchDetectors --< Detectors : has many
    class Term
    Term: +Integer id
    Term: +String phrase
    class SearchEvent
    SearchEvent: +Integer id
    SearchEvent: +Integer term_id
    SearchEvent: +Timestamp created_at
    class LinkSearchDetectors
    LinkSearchDetectors: +Integer id
    LinkSearchDetectors: +Integer searchevent_id
    LinkSearchDetectors: +Integer detector_id
    LinkSearchDetectors: +Integer result
    class Detectors
    Detectors: +Integer id
    Detectors: +String name
```
However, I don't think we should do this. While this design is more normalized, it would make maintenance of the application harder. Creating a new detector would no longer be a purely-code exercise, accomplishable with a PR and related data migration. We would still need the PR for code change, and a migration for any data model changes, but then we would also need to create new records in the Detectors table after the migrations are complete.
The second option would be to combine the SearchEvent and SearchDetections table:
```mermaid
classDiagram
    Term --< SearchEvent : has many
    class Term
    Term: +Integer id
    Term: +String phrase
    class SearchEvent
    SearchEvent: +Integer id
    SearchEvent: +Integer term_id
    SearchEvent: +Integer DOI NULL
    SearchEvent: +Integer ISBN NULL
    SearchEvent: +Integer ISSN NULL
    SearchEvent: +Integer PMID NULL
    SearchEvent: +Integer Hint NULL
    SearchEvent: +Timestamp created_at
```
I don't think this makes sense either. While it does work for naturally-received repeat searches over time (every search event gets its own entry, which will trigger its own detections), this model doesn't allow us to retroactively run the sorts of monthly reports that you're envisioning - not without creating extraneous SearchEvent records for every monthly report, which will make counts inaccurate.
What I'm not sure about, though, is how to handle commonly-seen searches during these monthly reports. These reports will be run once for every Term, not for every SearchEvent - because we don't need to run `web of science` 72 times in every monthly report; we need to run it once. Every monthly report would end up generating a new SearchDetections record, and I'm assuming that repeat runs would be latched onto the first received SearchEvent record?
The data might look like this:
| Term ID | Term phrase |
| --- | --- |
| 1 | web of science |
| 2 | some custom term only seen once |

| SearchEvent ID | Term ID | Created at |
| --- | --- | --- |
| 1 | 1 | June 14, 2024 8:03 am |
| 2 | 2 | June 21, 2024 10:14 am |
| 3 | 1 | July 1, 2024 4:17 pm |
| 4 | 4 | July 1, 2024 8:45 pm |

| SearchDetection ID | SearchEvent ID | DOI | ISBN | ISSN | PMID | Hint | Created at |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | June 14, 2024 8:03 am |
| 2 | 2 | 0 | 0 | 0 | 0 | 0 | June 21, 2024 10:14 am |
| 3 | 3 | 0 | 0 | 0 | 0 | 0 | July 1, 2024 4:17 pm |
| 4 | 4 | 0 | 0 | 0 | 0 | 1 | July 4, 2024 8:45 pm |
| 5 | 1 | 0 | 0 | 0 | 0 | 1 | August 1, 2024 12:00 am |
| 6 | 2 | 0 | 0 | 0 | 0 | 0 | August 1, 2024 12:00 am |
| 7 | 1 | 0 | 0 | 0 | 0 | 1 | September 1, 2024 12:00 am |
| 8 | 2 | 0 | 0 | 0 | 0 | 0 | September 1, 2024 12:00 am |
The SearchDetections records here indicate four naturally received search events and their detections (rows 1-4 in the detections table). There are also two rounds of monthly batch reports (rows 5-8), including every term, hanging off the first observation of each search term.
This feels workable to me, but I want to make sure I'm following.
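If I'm reading the proposal right, the monthly batch could be sketched roughly like the following plain-Ruby fragment (every name here is hypothetical; the real app's models and detectors will differ): run the detectors once per Term, and latch the resulting detection row onto that term's earliest SearchEvent.

```ruby
require "date"

Term = Struct.new(:id, :phrase)
SearchEvent = Struct.new(:id, :term_id, :created_at)

terms = [
  Term.new(1, "web of science"),
  Term.new(2, "some custom term only seen once"),
]
search_events = [
  SearchEvent.new(1, 1, Date.new(2024, 6, 14)),
  SearchEvent.new(2, 2, Date.new(2024, 6, 21)),
  SearchEvent.new(3, 1, Date.new(2024, 7, 1)),
]

# Stand-in for the real detectors: returns a 0/1 flag per algorithm.
def run_detectors(phrase)
  { doi: 0, isbn: 0, issn: 0, pmid: 0, hint: phrase.include?("science") ? 1 : 0 }
end

# One detection row per Term (not per SearchEvent), attached to the
# earliest SearchEvent for that term and timestamped with the report date.
report_date = Date.new(2024, 8, 1)
monthly_rows = terms.map do |term|
  first_event = search_events.select { |e| e.term_id == term.id }
                             .min_by(&:created_at)
  run_detectors(term.phrase)
    .merge(searchevent_id: first_event.id, created_at: report_date)
end
```

This matches the sample data above: each monthly run produces exactly one row per Term, and repeated terms always hang off their first-observed SearchEvent.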
I agree with the decision here. The more verbose detections table is intriguing, but it seems to me that the extra data would not serve the intended purpose here of validating how often our algorithms are finding matches.
An approach like this, which generates not only counts but also the details of what is detected / returned, would provide greater visibility into application performance - but at the cost of needing to be recorded at query time, rather than retroactively as a batch.
This stores a lot more data, and it isn't clear what we would use it for. This approach should likely be implemented in the future if we find we have a specific question that can only be answered with this level of detail. We would lose historical data if we wait to implement this, but that risk is outweighed by avoiding storing a bunch of data without a clear purpose.
Alternatively, we could store the data for now and drop the table if it becomes clear that there's no purpose for it. The downside of that (aside from storing a lot of data in the interim) is knowing when to decide whether it's useful or not.
I concur - knowing when to get rid of that data feels like a thornier process than simply not collecting it until we have a clearer understanding of how we'd handle it and when to delete it.
Why are these changes being introduced:

* There are multiple ways we can address this problem.
* Using an ADR to provide context and options, gathering feedback on the proposed solution, and then merging the decision allows us to see what else we considered in addition to what we eventually implement.

Relevant ticket(s): https://mitlibraries.atlassian.net/browse/TCO-16
Noting that we discussed this in Zoom and decided to merge this PR as-is, with the understanding that the database structures here are examples; we'll confirm the exact structures during the PRs that implement them.