Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Categorization #110

Merged
merged 6 commits into from
Oct 1, 2024
Merged

Implement Categorization #110

merged 6 commits into from
Oct 1, 2024

Conversation

matt-bernhardt
Copy link
Member

@matt-bernhardt matt-bernhardt commented Sep 24, 2024

This finishes the implementation of the Categorization work:

  • Builds the method to calculate categorization scores based on which Detections have already been created
  • Records those scores in the database for later processing or analysis
  • Updates the logSearchEvent GraphQL endpoint to perform the Detection and Categorization work.
  • Updates both GraphQL endpoints to return the resulting Categorization information

Developer

Ticket(s)

https://mitlibraries.atlassian.net/browse/TCO-83

Accessibility

  • ANDI or Wave has been run in accordance to our guide and
    all issues introduced by these changes have been resolved or opened
    as new issues (link to those issues in the Pull Request details above)
  • There are no accessibility implications to this change

Documentation

  • Project documentation has been updated, and yard output previewed
  • No documentation changes are needed

ENV

  • All new ENV is documented in README.
  • All new ENV has been added to Heroku Pipeline, Staging and Prod.
  • ENV has not changed.

Stakeholders

  • Stakeholder approval has been confirmed
  • Stakeholder approval is not needed

Dependencies and migrations

NO dependencies are updated

YES migrations are included

Reviewer

Code

  • I have confirmed that the code works as intended.
  • Any CodeClimate issues have been fixed or confirmed as
    added technical debt.

Documentation

  • The commit message is clear and follows our guidelines
    (not just this pull request message).
  • The documentation has been updated or is unnecessary.
  • New dependencies are appropriate or there were no changes.

Testing

  • There are appropriate tests covering any new functionality.
  • No additional test coverage is required.

@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-110 September 24, 2024 19:23 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-110 September 24, 2024 23:16 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-110 September 24, 2024 23:50 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-110 September 25, 2024 00:10 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-110 September 25, 2024 18:42 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-110 September 25, 2024 21:13 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-110 September 26, 2024 13:51 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-110 September 26, 2024 15:48 Inactive
@matt-bernhardt matt-bernhardt changed the title Tco 83 categorization Implement Categorization Sep 27, 2024
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-110 September 27, 2024 16:06 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-110 September 27, 2024 19:29 Inactive
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-110 September 27, 2024 20:18 Inactive
@matt-bernhardt matt-bernhardt marked this pull request as ready for review September 27, 2024 20:22
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-110 September 27, 2024 20:51 Inactive
** Why are these changes being introduced:

In order to perform the Categorization work, the Detections model needs
to be able to return information about related category and confidence
values. This method will be queried by the method doing the
actual categorization work.

** Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/tco-83

** How does this address that need:

* This reaches from a given Detection record, through the related
  Detector and into all the linked DetectorCategory records. The method
  combines the confidence value (on the DetectorCategory record) with
  the Category name, and builds a hash.

  The method returns an array of hashes, like [ { 1 => 0.4 } ] if the
  Detector is joined only to one category, or like
  [ { 1 => 0.4 }, { 2 => 0.95 } ] if the Detector is linked with more
  than one Category.

* There are tests for this method to confirm the structure of the
  response - because our fixtures don't naturally have any detector that
  is linked with multiple categories, that test creates one.

** Document any side effects to this change:

Hopefully none.
** Why are these changes being introduced:

The current class model calls for a Categorization model to store the
the result of categorization work, which would join Term and Category,
storing the calculated categorization score.

** Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/tco-83

** How does this address that need:

This creates the Categorization model, with a few features that are
worth noting:

* The Categorization model copies the Detection model in incorporating
the DETECTOR_VERSION environment variable. Having this type of
versioning for both Detection can Categorization feels justifiable, but
see the side effects section below.

* The model also copies the :current scope from the Detection model, but
  not the :for_* scopes. I'm not sure whether those will end up being
  needed.

* There are no actual methods on the Categorization model right now,
  because the model in charge of this process is going to be the Term
  model (see next commit).

* The has_many relationships on the Term and Category models are also
  added.

* Tests for all of the above are added, along with a fixture to help the
  tests.

** Document any side effects to this change:

* It is likely that the DETECTOR_VERSION variable may be better named
  ALGORITHM_VERSION or something more inclusive of both Detection and
  Categorization. For now, though, I'm choosing to leave it alone.
** Why are these changes being introduced:

With the Detection model now returning associated categories and
confidences, and the Categorization model ready to receive the outcome
of categorization, something needs to actually do the work.

** Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/tco-83

** How does this address that need:

This declares that the Term model will be responsible for performing the
Categorization work, pulling information from the Detections model,
shaping the response into something it can use, and overseeing the
creation of any needed Categorization records.

While each Detection record is returning its associated categories and
confidences, some Terms will end up tripping multiple Detectors - and
some of the associated Categories will end up being duplicated. Getting
all this information into a compact and processable structure is the
work of retrieve_detection_scores - which is private because nothing
should need to call that method except the calculate_categorization
method that manages all of this.

Contrarily, the work of combining all the individual detection
confidence values feels like something that needs to be testable - so
I've chosen to make that a public method for testing. Nothing else
should be calling the method, but I want to be transparent about what
caclulations we're doing.

I've tried to leave code comments in these methods to describe what the
data looks like at each step. This has been a pain to work through, and
I'm not sure I've done it right - so hopefully the comments will help
future-us.

I've tried to write the tests that seem warranted for this - but I'm
not sure that I've succeeded.

** Document any side effects to this change:

* I'm not 100% convinced that the public/private method choices I've
  made are the correct ones, and there may be aspects of this work that
  need tests as guardrails which I haven't written yet.
** Why are these changes being introduced:

Now that there are methods to kick off both the Detection and
Categorization workflows, something needs to call them in order for
stuff to actually happen.

** Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/tco-83

** How does this address that need:

While both the record_detections and calculate_categorizations methods
are public, and could be called independently, it makes a bit more sense
to make sure that categorization would itself make sure that the
detections is relies on are current. So now calculate_categorizations
starts by calling record_detections.

Additionally, while it was possible to call these methods via lifecycle
hooks like after_find or after_create, I'm not yet sure what other
processes will be touching these records - but I know that the GraphQL
endpoint will need to make sure these steps are happening. They should
happen even when the system receives a repeated term.

For these reasons, I've added a call to calculate_categorizations to the
logSearchEvent query in GraphQL.

** Document any side effects to this change:

* Is it a side effect that there are no tests for this change? All of
  the methods being called already have tests. Maybe an integration
  test is needed.
** Why are these changes being introduced:

Now that the logSearchEvent query is going to be initiating the
Detection and Categorization workflows, we can actually return the
results in the graphQL response for both lookupTerm and logSearchEvent.

** Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/tco-83

** How does this address that need:

This defines a Categories object type to the GraphQL vocabulary, and
includes an array of these objects to both the Term and SearchEvent
objects. The content of this Categories object spans both the category
name (from the Categories model) and the overall confidence value that
this term belongs to that category (the confidence value is part of the
Categorization model).

** Document any side effects to this change:

* This may be the first test that we've written for the lookupTerm
  function?
** Why are these changes being introduced:

There are some changes needed to the classes diagram, to keep things
current as we've built out this categorization process.

** Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/tco-83

** How does this address that need:

This updates the class diagram with the new methods, and new method
names, which have been added to the application. There were also some
misnamed classes ("DetectionCategory" instead of "DetectorCategory"),
which this fixes.

Additionally, this tries out namespaces in Mermaid to account for
class names like Detector::Journal, which haven't been included in the
diagram yet.

** Document any side effects to this change:

I may be overthinking this - but I'm finding this diagram to be very
helpful as we build out the application, and I'd like to try and keep
it as current as we can.
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-110 October 1, 2024 15:17 Inactive
@JPrevost JPrevost self-assigned this Oct 1, 2024
@matt-bernhardt matt-bernhardt merged commit 862a979 into main Oct 1, 2024
3 checks passed
@matt-bernhardt matt-bernhardt deleted the tco-83-categorization branch October 1, 2024 20:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants