Implement Categorization #110

matt-bernhardt · 2024-09-24T19:21:24Z

This finishes the implementation of the Categorization work:

Builds the method to calculate categorization scores based on which Detections have already been created
Records those scores in the database for later processing or analysis
Updates the logSearchEvent GraphQL endpoint to perform the Detection and Categorization work.
Updates both GraphQL endpoints to return the resulting Categorization information

Developer

Ticket(s)

https://mitlibraries.atlassian.net/browse/TCO-83

Accessibility

ANDI or Wave has been run in accordance to our guide and
all issues introduced by these changes have been resolved or opened
as new issues (link to those issues in the Pull Request details above)
There are no accessibility implications to this change

Documentation

Project documentation has been updated, and yard output previewed
No documentation changes are needed

ENV

All new ENV is documented in README.
All new ENV has been added to Heroku Pipeline, Staging and Prod.
ENV has not changed.

Stakeholders

Stakeholder approval has been confirmed
Stakeholder approval is not needed

Dependencies and migrations

NO dependencies are updated

YES migrations are included

Reviewer

Code

I have confirmed that the code works as intended.
Any CodeClimate issues have been fixed or confirmed as
added technical debt.

Documentation

The commit message is clear and follows our guidelines
(not just this pull request message).
The documentation has been updated or is unnecessary.
New dependencies are appropriate or there were no changes.

Testing

There are appropriate tests covering any new functionality.
No additional test coverage is required.

** Why are these changes being introduced: In order to perform the Categorization work, the Detections model needs to be able to return information about related category and confidence values. This method will be queried by the method doing the actual categorization work. ** Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/tco-83 ** How does this address that need: * This reaches from a given Detection record, through the related Detector and into all the linked DetectorCategory records. The method combines the confidence value (on the DetectorCategory record) with the Category name, and builds a hash. The method returns an array of hashes, like [ { 1 => 0.4 } ] if the Detector is joined only to one category, or like [ { 1 => 0.4 }, { 2 => 0.95 } ] if the Detector is linked with more than one Category. * There are tests for this method to confirm the structure of the response - because our fixtures don't naturally have any detector that is linked with multiple categories, that test creates one. ** Document any side effects to this change: Hopefully none.

** Why are these changes being introduced: The current class model calls for a Categorization model to store the the result of categorization work, which would join Term and Category, storing the calculated categorization score. ** Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/tco-83 ** How does this address that need: This creates the Categorization model, with a few features that are worth noting: * The Categorization model copies the Detection model in incorporating the DETECTOR_VERSION environment variable. Having this type of versioning for both Detection can Categorization feels justifiable, but see the side effects section below. * The model also copies the :current scope from the Detection model, but not the :for_* scopes. I'm not sure whether those will end up being needed. * There are no actual methods on the Categorization model right now, because the model in charge of this process is going to be the Term model (see next commit). * The has_many relationships on the Term and Category models are also added. * Tests for all of the above are added, along with a fixture to help the tests. ** Document any side effects to this change: * It is likely that the DETECTOR_VERSION variable may be better named ALGORITHM_VERSION or something more inclusive of both Detection and Categorization. For now, though, I'm choosing to leave it alone.

** Why are these changes being introduced: With the Detection model now returning associated categories and confidences, and the Categorization model ready to receive the outcome of categorization, something needs to actually do the work. ** Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/tco-83 ** How does this address that need: This declares that the Term model will be responsible for performing the Categorization work, pulling information from the Detections model, shaping the response into something it can use, and overseeing the creation of any needed Categorization records. While each Detection record is returning its associated categories and confidences, some Terms will end up tripping multiple Detectors - and some of the associated Categories will end up being duplicated. Getting all this information into a compact and processable structure is the work of retrieve_detection_scores - which is private because nothing should need to call that method except the calculate_categorization method that manages all of this. Contrarily, the work of combining all the individual detection confidence values feels like something that needs to be testable - so I've chosen to make that a public method for testing. Nothing else should be calling the method, but I want to be transparent about what caclulations we're doing. I've tried to leave code comments in these methods to describe what the data looks like at each step. This has been a pain to work through, and I'm not sure I've done it right - so hopefully the comments will help future-us. I've tried to write the tests that seem warranted for this - but I'm not sure that I've succeeded. ** Document any side effects to this change: * I'm not 100% convinced that the public/private method choices I've made are the correct ones, and there may be aspects of this work that need tests as guardrails which I haven't written yet.

** Why are these changes being introduced: Now that there are methods to kick off both the Detection and Categorization workflows, something needs to call them in order for stuff to actually happen. ** Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/tco-83 ** How does this address that need: While both the record_detections and calculate_categorizations methods are public, and could be called independently, it makes a bit more sense to make sure that categorization would itself make sure that the detections is relies on are current. So now calculate_categorizations starts by calling record_detections. Additionally, while it was possible to call these methods via lifecycle hooks like after_find or after_create, I'm not yet sure what other processes will be touching these records - but I know that the GraphQL endpoint will need to make sure these steps are happening. They should happen even when the system receives a repeated term. For these reasons, I've added a call to calculate_categorizations to the logSearchEvent query in GraphQL. ** Document any side effects to this change: * Is it a side effect that there are no tests for this change? All of the methods being called already have tests. Maybe an integration test is needed.

** Why are these changes being introduced: Now that the logSearchEvent query is going to be initiating the Detection and Categorization workflows, we can actually return the results in the graphQL response for both lookupTerm and logSearchEvent. ** Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/tco-83 ** How does this address that need: This defines a Categories object type to the GraphQL vocabulary, and includes an array of these objects to both the Term and SearchEvent objects. The content of this Categories object spans both the category name (from the Categories model) and the overall confidence value that this term belongs to that category (the confidence value is part of the Categorization model). ** Document any side effects to this change: * This may be the first test that we've written for the lookupTerm function?

** Why are these changes being introduced: There are some changes needed to the classes diagram, to keep things current as we've built out this categorization process. ** Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/tco-83 ** How does this address that need: This updates the class diagram with the new methods, and new method names, which have been added to the application. There were also some misnamed classes ("DetectionCategory" instead of "DetectorCategory"), which this fixes. Additionally, this tries out namespaces in Mermaid to account for class names like Detector::Journal, which haven't been included in the diagram yet. ** Document any side effects to this change: I may be overthinking this - but I'm finding this diagram to be very helpful as we build out the application, and I'd like to try and keep it as current as we can.

mitlib temporarily deployed to tacos-api-pipeline-pr-110 September 24, 2024 19:23 Inactive

matt-bernhardt force-pushed the tco-83-categorization branch from 5ea9085 to 112a4e9 Compare September 24, 2024 23:16

mitlib temporarily deployed to tacos-api-pipeline-pr-110 September 24, 2024 23:16 Inactive

matt-bernhardt force-pushed the tco-83-categorization branch from 112a4e9 to 912599e Compare September 24, 2024 23:50

mitlib temporarily deployed to tacos-api-pipeline-pr-110 September 24, 2024 23:50 Inactive

matt-bernhardt force-pushed the tco-83-categorization branch from 912599e to 9728953 Compare September 25, 2024 00:10

mitlib temporarily deployed to tacos-api-pipeline-pr-110 September 25, 2024 00:10 Inactive

matt-bernhardt force-pushed the tco-83-categorization branch from 9728953 to f9e34c3 Compare September 25, 2024 18:41

mitlib temporarily deployed to tacos-api-pipeline-pr-110 September 25, 2024 18:42 Inactive

matt-bernhardt force-pushed the tco-83-categorization branch from f9e34c3 to cad80fb Compare September 25, 2024 21:13

mitlib temporarily deployed to tacos-api-pipeline-pr-110 September 25, 2024 21:13 Inactive

matt-bernhardt force-pushed the tco-83-categorization branch from cad80fb to 10e4160 Compare September 26, 2024 13:50

mitlib temporarily deployed to tacos-api-pipeline-pr-110 September 26, 2024 13:51 Inactive

matt-bernhardt force-pushed the tco-83-categorization branch from 10e4160 to 02bc065 Compare September 26, 2024 15:48

mitlib temporarily deployed to tacos-api-pipeline-pr-110 September 26, 2024 15:48 Inactive

matt-bernhardt changed the title ~~Tco 83 categorization~~ Implement Categorization Sep 27, 2024

matt-bernhardt force-pushed the tco-83-categorization branch from 02bc065 to dc08f5f Compare September 27, 2024 16:05

mitlib temporarily deployed to tacos-api-pipeline-pr-110 September 27, 2024 16:06 Inactive

matt-bernhardt force-pushed the tco-83-categorization branch from dc08f5f to 1a724e0 Compare September 27, 2024 19:29

mitlib temporarily deployed to tacos-api-pipeline-pr-110 September 27, 2024 19:29 Inactive

matt-bernhardt force-pushed the tco-83-categorization branch from 1a724e0 to e23eca6 Compare September 27, 2024 20:18

mitlib temporarily deployed to tacos-api-pipeline-pr-110 September 27, 2024 20:18 Inactive

matt-bernhardt marked this pull request as ready for review September 27, 2024 20:22

matt-bernhardt requested review from JPrevost and jazairi September 27, 2024 20:22

mitlib temporarily deployed to tacos-api-pipeline-pr-110 September 27, 2024 20:51 Inactive

matt-bernhardt added 4 commits October 1, 2024 11:16

matt-bernhardt added 2 commits October 1, 2024 11:16

matt-bernhardt force-pushed the tco-83-categorization branch from e23eca6 to 590988a Compare October 1, 2024 15:17

mitlib temporarily deployed to tacos-api-pipeline-pr-110 October 1, 2024 15:17 Inactive

JPrevost self-assigned this Oct 1, 2024

JPrevost approved these changes Oct 1, 2024

View reviewed changes

matt-bernhardt merged commit 862a979 into main Oct 1, 2024
3 checks passed

matt-bernhardt deleted the tco-83-categorization branch October 1, 2024 20:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement Categorization #110

Implement Categorization #110

matt-bernhardt commented Sep 24, 2024 •

edited by JPrevost

Loading

Implement Categorization #110

Implement Categorization #110

Conversation

matt-bernhardt commented Sep 24, 2024 • edited by JPrevost Loading

Developer

Ticket(s)

Accessibility

Documentation

ENV

Stakeholders

Dependencies and migrations

Reviewer

Code

Documentation

Testing

matt-bernhardt commented Sep 24, 2024 •

edited by JPrevost

Loading