Total aggregate matches
Why are these changes being introduced:

* Implement data models for counting algorithm matches for all Terms

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TCO-17

See also:

* https://github.com/MITLibraries/tacos/blob/main/docs/architecture-decisions/0005-use-multiple-minimal-historical-analytics-models.md

How does this address that need:

* Creates a new model `AggregateMatch`
* Adds methods to run each (current) StandardIdentifier algorithm on
each Term (via the SearchEvents)
* Adjusts `MonthlyMatch` counting algorithm to be useful for both cases
  and extracts it to a module which is included in both classes

Document any side effects to this change:

* A schedulable job to run this automatically is out of scope and will
  be added under a separate ticket
* The tests are identical between this and `MonthlyMatch`. There may be a
  way to avoid the duplication and thus ensure both get relevant updates,
  but it was not clear to me how to do that in an obvious way at the
  time of this work.
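One possible way to avoid the duplicated tests mentioned above is a shared module of test methods that each test class includes. The following is only a sketch under stated assumptions: `FakeMonthlyMatch`, `FakeAggregateMatch`, `Counts`, `SharedMatchCountTests`, and the `generated` hook are hypothetical names that are not part of this commit, and the fakes return fixed counts so the example runs without Rails or fixtures.

```ruby
require 'minitest'

# Hypothetical stand-ins for the two models so the sketch runs without Rails.
# Each returns a fixed set of counts from `generate`.
Counts = Struct.new(:doi, :issn, :isbn, :pmid, :unmatched, keyword_init: true)

class FakeMonthlyMatch
  def generate(_month)
    Counts.new(doi: 1, issn: 1, isbn: 1, pmid: 1, unmatched: 2)
  end
end

class FakeAggregateMatch
  def generate
    Counts.new(doi: 1, issn: 1, isbn: 1, pmid: 1, unmatched: 2)
  end
end

# Shared assertions. An including test class only has to define `generated`,
# which returns the object under test.
module SharedMatchCountTests
  def test_doi_counts_are_included
    assert_equal 1, generated.doi
  end

  def test_unmatched_counts_are_included
    assert_equal 2, generated.unmatched
  end
end

class MonthlyMatchSketchTest < Minitest::Test
  include SharedMatchCountTests

  def generated
    FakeMonthlyMatch.new.generate(Time.now)
  end
end

class AggregateMatchSketchTest < Minitest::Test
  include SharedMatchCountTests

  def generated
    FakeAggregateMatch.new.generate
  end
end
```

With this shape, a new assertion added to the shared module would apply to both models, and each test class differs only in how it calls `generate`.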
JPrevost committed Jun 26, 2024
1 parent 1da6b07 commit d4972fa
Showing 8 changed files with 198 additions and 42 deletions.
36 changes: 36 additions & 0 deletions app/models/aggregate_match.rb
@@ -0,0 +1,36 @@
# frozen_string_literal: true

# == Schema Information
#
# Table name: aggregate_matches
#
#  id         :integer          not null, primary key
#  doi        :integer
#  issn       :integer
#  isbn       :integer
#  pmid       :integer
#  unmatched  :integer
#  created_at :datetime         not null
#  updated_at :datetime         not null
#

# AggregateMatch aggregates statistics for matches for all SearchEvents
#
# @see MonthlyMatch
class AggregateMatch < ApplicationRecord
  include MatchCounter

  # generate data for all SearchEvents
  #
  # @note This is expected to only be run once per month, ideally at the beginning of the following month to ensure as
  #   accurate as possible statistics. Running further from the month in question will work, but matches will use the
  #   current versions of all algorithms which may not allow for tracking algorithm performance
  #   over time as accurately as intended.
  # @todo Prevent running more than once by checking if we have data and then erroring?
  # @return [AggregateMatch] The created AggregateMatch object.
  def generate
    matches = count_matches(SearchEvent.all)
    AggregateMatch.create(doi: matches[:doi], issn: matches[:issn], isbn: matches[:isbn],
                          pmid: matches[:pmid], unmatched: matches[:unmatched])
  end
end
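The `generate` method above copies one count per identifier type from the counts hash into a new row. A minimal plain-Ruby sketch of that mapping follows, assuming a Struct in place of the ActiveRecord model. `AggregateRow`, `COLUMNS`, and `aggregate_attributes` are hypothetical names used only for illustration and are not part of this commit.

```ruby
# Stand-in for the ActiveRecord model; a Struct is enough to show the mapping.
AggregateRow = Struct.new(:doi, :issn, :isbn, :pmid, :unmatched, keyword_init: true)

COLUMNS = %i[doi issn isbn pmid unmatched].freeze

# Builds the attributes for one row from a counts hash.
# Reading each key through `matches[column]` keeps the hash's default value
# (0 when built with Hash.new(0)) for identifier types that were never seen.
def aggregate_attributes(matches)
  COLUMNS.to_h { |column| [column, matches[column]] }
end

counts = Hash.new(0)
counts[:doi] += 3
counts[:unmatched] += 5

row = AggregateRow.new(**aggregate_attributes(counts))
puts row.doi   # prints 3
puts row.issn  # prints 0, because the key was never set and the default applies
```

The per-key lookup is a deliberate choice in this sketch: `Hash#slice` would drop keys that were never incremented, which would leave those columns as `nil` rather than `0`.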
27 changes: 27 additions & 0 deletions app/models/match_counter.rb
@@ -0,0 +1,27 @@
# frozen_string_literal: true

# Counts matches for the supplied events
module MatchCounter
  # Counts matches for the supplied events
  #
  # @note We currently only have StandardIdentifiers to match. As we add new algorithms, this method will need to
  #   expand to handle additional match types.
  # @param events [Enumerable<SearchEvent>] SearchEvents to check for matches (an Array or an ActiveRecord relation).
  # @return [Hash] A Hash with keys for each known standard identifier and the count of matched search events.
  def count_matches(events)
    matches = Hash.new(0)
    known_ids = %i[unmatched pmid isbn issn doi]

    events.each do |event|
      ids = StandardIdentifiers.new(event.term.phrase)

      matches[:unmatched] += 1 if ids.identifiers.blank?

      known_ids.each do |id|
        matches[id] += 1 if ids.identifiers[id].present?
      end
    end

    matches
  end
end
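The counting loop above can be exercised outside Rails with a stand-in for `StandardIdentifiers`. In this sketch, `FakeStandardIdentifiers` and `count_phrase_matches` are hypothetical, and the two regular expressions are deliberately naive placeholders rather than the application's real detection rules. The sketch works on plain strings instead of SearchEvents.

```ruby
# Hypothetical stand-in for the application's StandardIdentifiers class.
# The patterns are naive and are NOT the real detection rules.
class FakeStandardIdentifiers
  PATTERNS = {
    doi: %r{\b10\.\d{4,9}/\S+},
    pmid: /\bPMID:\s*\d+/i
  }.freeze

  attr_reader :identifiers

  def initialize(phrase)
    # Collect only the identifier types that were found in the phrase.
    @identifiers = PATTERNS.each_with_object({}) do |(type, pattern), found|
      match = phrase[pattern]
      found[type] = match if match
    end
  end
end

# Same shape as MatchCounter#count_matches, but over plain strings.
def count_phrase_matches(phrases)
  matches = Hash.new(0)

  phrases.each do |phrase|
    ids = FakeStandardIdentifiers.new(phrase).identifiers
    matches[:unmatched] += 1 if ids.empty?
    ids.each_key { |type| matches[type] += 1 }
  end

  matches
end

phrases = ['10.1000/xyz123', 'PMID: 38908367', 'hi', 'hello world',
           'see 10.1000/abc and PMID: 123']
counts = count_phrase_matches(phrases)

puts counts[:doi]        # prints 2
puts counts[:pmid]       # prints 2
puts counts[:unmatched]  # prints 2
puts counts[:issn]       # prints 0 (default value, no ISSN pattern in this sketch)
```

Note that the last phrase counts toward both `doi` and `pmid`, so the per-type totals can add up to more than the number of phrases. The module above behaves the same way for a term that contains more than one identifier.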
39 changes: 4 additions & 35 deletions app/models/monthly_match.rb
@@ -19,6 +19,8 @@
 #
 # @see AggregateMatch
 class MonthlyMatch < ApplicationRecord
+  include MatchCounter
+
   # generate data for a provided month
   #
   # @note This is expected to only be run once per month, ideally at the beginning of the following month to ensure as
@@ -28,42 +30,9 @@ class MonthlyMatch < ApplicationRecord
   # @todo Prevent running more than once by checking if we have data and then erroring.
   # @param month [DateTime] A DateTime object within the `month` to be generated.
   # @return [MonthlyMatch] The created MonthlyMatch object.
-  def generate_monthly(month)
-    matches = count_matches(month)
+  def generate(month)
+    matches = count_matches(SearchEvent.single_month(month))
     MonthlyMatch.create(month:, doi: matches[:doi], issn: matches[:issn], isbn: matches[:isbn],
                         pmid: matches[:pmid], unmatched: matches[:unmatched])
   end
-
-  # Counts matches for the given month
-  #
-  # @note We currently only have StandardIdentifiers to match. As we add new algorithms, this method will need to
-  #   expand to handle additional match types.
-  # @param month [DateTime] A DateTime object within the `month` to be generated.
-  # @return [Hash] A Hash with keys for each known standard identifier and the count of matched search events.
-  def count_matches(month)
-    matches = Hash.new(0)
-    known_ids = %i[unmatched pmid isbn issn doi]
-
-    SearchEvent.single_month(month).each do |event|
-      ids = StandardIdentifiers.new(event.term.phrase)
-
-      matches[:unmatched] += 1 if ids.identifiers.blank?
-
-      known_ids.each do |id|
-        matches[id] += 1 if standard_identifier_match?(id, ids)
-      end
-    end
-
-    matches
-  end
-
-  # Returns true if the provided identifier type was matched in this SearchEvent
-  #
-  # @param identifier [symbol,string] A specific StandardIdentifier type to look for in the SearchEvent, such as `pmid`
-  #   or `doi`. We use symbols, but it supports strings as well.
-  # @param ids [StandardIdentifiers, Hash] A Hash with matches for know standard identifiers.
-  # @return [Hash] A Hash with keys for each known standard identifier and the count of matched search events.
-  def standard_identifier_match?(identifier, ids)
-    true if ids.identifiers[identifier].present?
-  end
 end
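The `@todo` in `MonthlyMatch` asks for a guard against generating the same month twice. One way this could look is sketched below with an in-memory store rather than the database. `MonthlyStore` and `AlreadyGeneratedError` are hypothetical names, and the month is keyed by its first day purely as an assumption for this example.

```ruby
require 'date'

# Raised when a month already has stored counts.
class AlreadyGeneratedError < StandardError; end

# In-memory stand-in for the monthly_matches table.
class MonthlyStore
  def initialize
    @rows = {}
  end

  # Stores counts for the month containing `date`, refusing duplicates.
  def generate(date, counts)
    key = Date.new(date.year, date.month, 1)
    raise AlreadyGeneratedError, "data already exists for #{key.strftime('%Y-%m')}" if @rows.key?(key)

    @rows[key] = counts
  end
end

store = MonthlyStore.new
store.generate(Date.new(2024, 6, 26), { doi: 1 })

begin
  # Same month, different day: rejected by the guard.
  store.generate(Date.new(2024, 6, 1), { doi: 2 })
rescue AlreadyGeneratedError => e
  puts e.message # prints "data already exists for 2024-06"
end
```

In the Rails model itself, a uniqueness validation or a unique index on the `month` column might serve the same purpose. That would be a design decision for the follow-up work and is not something this commit implements.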
12 changes: 12 additions & 0 deletions db/migrate/20240621132150_create_aggregate_matches.rb
@@ -0,0 +1,12 @@
class CreateAggregateMatches < ActiveRecord::Migration[7.1]
  def change
    create_table :aggregate_matches do |t|
      t.integer :doi
      t.integer :issn
      t.integer :isbn
      t.integer :pmid
      t.integer :unmatched
      t.timestamps
    end
  end
end
12 changes: 11 additions & 1 deletion db/schema.rb

Some generated files are not rendered by default.

23 changes: 23 additions & 0 deletions test/fixtures/aggregate_matches.yml
@@ -0,0 +1,23 @@
# == Schema Information
#
# Table name: aggregate_matches
#
#  id         :integer          not null, primary key
#  doi        :integer
#  issn       :integer
#  isbn       :integer
#  pmid       :integer
#  unmatched  :integer
#  created_at :datetime         not null
#  updated_at :datetime         not null
#

# This model initially had no columns defined. If you add columns to the
# model remove the "{}" from the fixture names and add the columns immediately
# below each fixture, per the syntax in the comments below
#
one: {}
# column: value
#
two: {}
# column: value
79 changes: 79 additions & 0 deletions test/models/aggregate_match_test.rb
@@ -0,0 +1,79 @@
# == Schema Information
#
# Table name: aggregate_matches
#
#  id         :integer          not null, primary key
#  doi        :integer
#  issn       :integer
#  isbn       :integer
#  pmid       :integer
#  unmatched  :integer
#  created_at :datetime         not null
#  updated_at :datetime         not null
#
require 'test_helper'

class AggregateMatchTest < ActiveSupport::TestCase
  test 'dois counts are included in aggregation' do
    aggregate = AggregateMatch.new.generate
    assert aggregate.doi == 1
  end

  test 'issns counts are included in aggregation' do
    aggregate = AggregateMatch.new.generate
    assert aggregate.issn == 1
  end

  test 'isbns counts are included in aggregation' do
    aggregate = AggregateMatch.new.generate
    assert aggregate.isbn == 1
  end

  test 'pmids counts are included in aggregation' do
    aggregate = AggregateMatch.new.generate
    assert aggregate.pmid == 1
  end

  test 'unmatched counts are included in aggregation' do
    aggregate = AggregateMatch.new.generate
    assert aggregate.unmatched == 2
  end

  test 'creating lots of searchevents leads to correct data' do
    # drop all searchevents to make math easier and minimize fragility over time as more fixtures are created
    SearchEvent.delete_all

    doi_expected_count = rand(1...100)
    doi_expected_count.times do
      SearchEvent.create(term: terms(:doi), source: 'test')
    end

    issn_expected_count = rand(1...100)
    issn_expected_count.times do
      SearchEvent.create(term: terms(:issn_1075_8623), source: 'test')
    end

    isbn_expected_count = rand(1...100)
    isbn_expected_count.times do
      SearchEvent.create(term: terms(:isbn_9781319145446), source: 'test')
    end

    pmid_expected_count = rand(1...100)
    pmid_expected_count.times do
      SearchEvent.create(term: terms(:pmid_38908367), source: 'test')
    end

    unmatched_expected_count = rand(1...100)
    unmatched_expected_count.times do
      SearchEvent.create(term: terms(:hi), source: 'test')
    end

    # Exercise AggregateMatch itself; the scraped listing called MonthlyMatch here.
    aggregate = AggregateMatch.new.generate

    assert doi_expected_count == aggregate.doi
    assert issn_expected_count == aggregate.issn
    assert isbn_expected_count == aggregate.isbn
    assert pmid_expected_count == aggregate.pmid
    assert unmatched_expected_count == aggregate.unmatched
  end
end
12 changes: 6 additions & 6 deletions test/models/monthly_match_test.rb
@@ -16,27 +16,27 @@

 class MonthlyMatchTest < ActiveSupport::TestCase
   test 'dois counts are included in aggregation' do
-    aggregate = MonthlyMatch.new.generate_monthly(DateTime.now)
+    aggregate = MonthlyMatch.new.generate(DateTime.now)
     assert aggregate.doi == 1
   end

   test 'issns counts are included in aggregation' do
-    aggregate = MonthlyMatch.new.generate_monthly(DateTime.now)
+    aggregate = MonthlyMatch.new.generate(DateTime.now)
     assert aggregate.issn == 1
   end

   test 'isbns counts are included in aggregation' do
-    aggregate = MonthlyMatch.new.generate_monthly(DateTime.now)
+    aggregate = MonthlyMatch.new.generate(DateTime.now)
     assert aggregate.isbn == 1
   end

   test 'pmids counts are included in aggregation' do
-    aggregate = MonthlyMatch.new.generate_monthly(DateTime.now)
+    aggregate = MonthlyMatch.new.generate(DateTime.now)
     assert aggregate.pmid == 1
   end

   test 'unmatched counts are included in aggregation' do
-    aggregate = MonthlyMatch.new.generate_monthly(DateTime.now)
+    aggregate = MonthlyMatch.new.generate(DateTime.now)
     assert aggregate.unmatched == 2
   end

@@ -69,7 +69,7 @@ class MonthlyMatchTest < ActiveSupport::TestCase
       SearchEvent.create(term: terms(:hi), source: 'test')
     end

-    aggregate = MonthlyMatch.new.generate_monthly(DateTime.now)
+    aggregate = MonthlyMatch.new.generate(DateTime.now)

     assert doi_expected_count == aggregate.doi
     assert issn_expected_count == aggregate.issn
