-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Categorization #110
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
matt-bernhardt
force-pushed
the
tco-83-categorization
branch
from
September 24, 2024 23:16
5ea9085
to
112a4e9
Compare
matt-bernhardt
force-pushed
the
tco-83-categorization
branch
from
September 24, 2024 23:50
112a4e9
to
912599e
Compare
matt-bernhardt
force-pushed
the
tco-83-categorization
branch
from
September 25, 2024 00:10
912599e
to
9728953
Compare
matt-bernhardt
force-pushed
the
tco-83-categorization
branch
from
September 25, 2024 18:41
9728953
to
f9e34c3
Compare
matt-bernhardt
force-pushed
the
tco-83-categorization
branch
from
September 25, 2024 21:13
f9e34c3
to
cad80fb
Compare
matt-bernhardt
force-pushed
the
tco-83-categorization
branch
from
September 26, 2024 13:50
cad80fb
to
10e4160
Compare
matt-bernhardt
force-pushed
the
tco-83-categorization
branch
from
September 26, 2024 15:48
10e4160
to
02bc065
Compare
matt-bernhardt
force-pushed
the
tco-83-categorization
branch
from
September 27, 2024 16:05
02bc065
to
dc08f5f
Compare
matt-bernhardt
force-pushed
the
tco-83-categorization
branch
from
September 27, 2024 19:29
dc08f5f
to
1a724e0
Compare
matt-bernhardt
force-pushed
the
tco-83-categorization
branch
from
September 27, 2024 20:18
1a724e0
to
e23eca6
Compare
** Why are these changes being introduced: In order to perform the Categorization work, the Detections model needs to be able to return information about related category and confidence values. This method will be queried by the method doing the actual categorization work. ** Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/tco-83 ** How does this address that need: * This reaches from a given Detection record, through the related Detector and into all the linked DetectorCategory records. The method combines the confidence value (on the DetectorCategory record) with the Category name, and builds a hash. The method returns an array of hashes, like [ { 1 => 0.4 } ] if the Detector is joined only to one category, or like [ { 1 => 0.4 }, { 2 => 0.95 } ] if the Detector is linked with more than one Category. * There are tests for this method to confirm the structure of the response - because our fixtures don't naturally have any detector that is linked with multiple categories, that test creates one. ** Document any side effects to this change: Hopefully none.
** Why are these changes being introduced: The current class model calls for a Categorization model to store the the result of categorization work, which would join Term and Category, storing the calculated categorization score. ** Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/tco-83 ** How does this address that need: This creates the Categorization model, with a few features that are worth noting: * The Categorization model copies the Detection model in incorporating the DETECTOR_VERSION environment variable. Having this type of versioning for both Detection can Categorization feels justifiable, but see the side effects section below. * The model also copies the :current scope from the Detection model, but not the :for_* scopes. I'm not sure whether those will end up being needed. * There are no actual methods on the Categorization model right now, because the model in charge of this process is going to be the Term model (see next commit). * The has_many relationships on the Term and Category models are also added. * Tests for all of the above are added, along with a fixture to help the tests. ** Document any side effects to this change: * It is likely that the DETECTOR_VERSION variable may be better named ALGORITHM_VERSION or something more inclusive of both Detection and Categorization. For now, though, I'm choosing to leave it alone.
** Why are these changes being introduced: With the Detection model now returning associated categories and confidences, and the Categorization model ready to receive the outcome of categorization, something needs to actually do the work. ** Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/tco-83 ** How does this address that need: This declares that the Term model will be responsible for performing the Categorization work, pulling information from the Detections model, shaping the response into something it can use, and overseeing the creation of any needed Categorization records. While each Detection record is returning its associated categories and confidences, some Terms will end up tripping multiple Detectors - and some of the associated Categories will end up being duplicated. Getting all this information into a compact and processable structure is the work of retrieve_detection_scores - which is private because nothing should need to call that method except the calculate_categorization method that manages all of this. Contrarily, the work of combining all the individual detection confidence values feels like something that needs to be testable - so I've chosen to make that a public method for testing. Nothing else should be calling the method, but I want to be transparent about what caclulations we're doing. I've tried to leave code comments in these methods to describe what the data looks like at each step. This has been a pain to work through, and I'm not sure I've done it right - so hopefully the comments will help future-us. I've tried to write the tests that seem warranted for this - but I'm not sure that I've succeeded. ** Document any side effects to this change: * I'm not 100% convinced that the public/private method choices I've made are the correct ones, and there may be aspects of this work that need tests as guardrails which I haven't written yet.
** Why are these changes being introduced: Now that there are methods to kick off both the Detection and Categorization workflows, something needs to call them in order for stuff to actually happen. ** Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/tco-83 ** How does this address that need: While both the record_detections and calculate_categorizations methods are public, and could be called independently, it makes a bit more sense to make sure that categorization would itself make sure that the detections is relies on are current. So now calculate_categorizations starts by calling record_detections. Additionally, while it was possible to call these methods via lifecycle hooks like after_find or after_create, I'm not yet sure what other processes will be touching these records - but I know that the GraphQL endpoint will need to make sure these steps are happening. They should happen even when the system receives a repeated term. For these reasons, I've added a call to calculate_categorizations to the logSearchEvent query in GraphQL. ** Document any side effects to this change: * Is it a side effect that there are no tests for this change? All of the methods being called already have tests. Maybe an integration test is needed.
** Why are these changes being introduced: Now that the logSearchEvent query is going to be initiating the Detection and Categorization workflows, we can actually return the results in the graphQL response for both lookupTerm and logSearchEvent. ** Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/tco-83 ** How does this address that need: This defines a Categories object type to the GraphQL vocabulary, and includes an array of these objects to both the Term and SearchEvent objects. The content of this Categories object spans both the category name (from the Categories model) and the overall confidence value that this term belongs to that category (the confidence value is part of the Categorization model). ** Document any side effects to this change: * This may be the first test that we've written for the lookupTerm function?
** Why are these changes being introduced: There are some changes needed to the classes diagram, to keep things current as we've built out this categorization process. ** Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/tco-83 ** How does this address that need: This updates the class diagram with the new methods, and new method names, which have been added to the application. There were also some misnamed classes ("DetectionCategory" instead of "DetectorCategory"), which this fixes. Additionally, this tries out namespaces in Mermaid to account for class names like Detector::Journal, which haven't been included in the diagram yet. ** Document any side effects to this change: I may be overthinking this - but I'm finding this diagram to be very helpful as we build out the application, and I'd like to try and keep it as current as we can.
matt-bernhardt
force-pushed
the
tco-83-categorization
branch
from
October 1, 2024 15:17
e23eca6
to
590988a
Compare
JPrevost
approved these changes
Oct 1, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This finishes the implementation of the Categorization work:
Developer
Ticket(s)
https://mitlibraries.atlassian.net/browse/TCO-83
Accessibility
all issues introduced by these changes have been resolved or opened
as new issues (link to those issues in the Pull Request details above)
Documentation
ENV
Stakeholders
Dependencies and migrations
NO dependencies are updated
YES migrations are included
Reviewer
Code
added technical debt.
Documentation
(not just this pull request message).
Testing