BoW metric implementation #10

Open · bertsky opened this issue Mar 1, 2023 · 7 comments
Labels: bug (Something isn't working)

bertsky commented Mar 1, 2023

The numerator of the metrics called BoWs and BagOfWords only counts the false negatives:

```python
def bag_of_tokens(reference_tokens: List[str], candidate_tokens: List[str]) -> int:
    """Calculate intersection/difference
    between reference and candidate token list
    """
    return len(_diff(reference_tokens, candidate_tokens))


def _diff(gt_tokens, cd_tokens) -> List[str]:
    return list((Counter(gt_tokens) - Counter(cd_tokens)).elements())
```

This is then combined with a denominator that counts the length of the reference:

```python
def accuracy_for(the_obj) -> float:
    """Calculate accuracy as ratio of
    correct items, with correct items
    being expected items minus
    number of differences.

    Respect following corner cases:
    * if less correct items than differences => 0
    * if both correct items and differences eq zero => 1
      means: nothing to find and it did detect nothing
      (i.e. no false-positives)

    Args:
        the_obj (object): object containing information
            about reference data and difference

    Returns:
        float: accuracy in range 0.0 - 1.0
    """
    _inspect_calculation_object(the_obj)
    diffs = the_obj.diff
    n_refs = len(the_obj._data_reference)
    if (n_refs - diffs) < 0:
        return 0
    if n_refs == 0 and diffs == 0:
        return 1.0
    elif n_refs > 0:
        return (n_refs - diffs) / n_refs
```

Together, this yields a pure recall rate calculation.
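
For illustration (made-up token lists, not from the test suite), any extra tokens in the candidate are simply ignored by the current calculation:

```python
from collections import Counter

reference = ["the", "quick", "fox"]
candidate = ["the", "quick", "fox", "spam", "spam"]  # two false positives

# current numerator: only tokens missing from the candidate (false negatives)
diff = list((Counter(reference) - Counter(candidate)).elements())
print(diff)                                           # [] -> diff count is 0
print((len(reference) - len(diff)) / len(reference))  # 1.0, despite two spurious tokens
```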

But for recall there is already an equivalent calculation via NLTK's metrics. So I guess this should really be a calculation for BoW accuracy, and therefore can be considered a bug. To get the correct numerator for accuracy/error, just add the inverse diff, i.e. the counts of the false positives.

Also, the function names accuracy_for and error_for are misleading: not only are these unnormalised rates, artificially clipped to the [0,1] interval; more importantly, they should use the sum of both lengths (reference and candidate) as the denominator.
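
A minimal sketch of the suggested change (hypothetical helper, not the project's actual API), counting false negatives and false positives in the numerator and normalising by the sum of both lengths:

```python
from collections import Counter
from typing import List


def bow_error_rate(reference_tokens: List[str], candidate_tokens: List[str]) -> float:
    """Symmetric bag-of-words error rate:
    (false negatives + false positives) / (len(reference) + len(candidate))."""
    refs = Counter(reference_tokens)
    cands = Counter(candidate_tokens)
    false_negatives = sum((refs - cands).values())  # reference tokens missing from the candidate
    false_positives = sum((cands - refs).values())  # candidate tokens not in the reference
    total = len(reference_tokens) + len(candidate_tokens)
    if total == 0:
        return 0.0  # nothing to find and nothing detected
    return (false_negatives + false_positives) / total
```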

M3ssman added the bug label Mar 14, 2023
M3ssman self-assigned this Mar 14, 2023

M3ssman commented Mar 14, 2023

Thanks for your suggestions!

Regarding your first remark on BoW, I just did some more tests and found that you're right.
If the candidate is not just blank but contains additional words (false positives), these are not taken into account.
This is not intended and therefore indeed buggy.

Further, the calculation for this metric will be completely redone to match what's written in the current OCR-D evaluation specification, using a multiset rather than the current Counter set under the hood.


bertsky commented Mar 14, 2023

> Further, the calculation for this metric will be completely redone to match what's written in the current OCR-D evaluation specification, using a multiset rather than the current Counter set under the hood.

I do think your Counter set method is correct. It perfectly reflects the OCR-D eval spec (which is itself based on the count-based success measure by PRImA, as opposed to the index-based one you often see in information retrieval contexts).
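
For what it's worth, Python's Counter already acts as a multiset here, so the subtraction in _diff is effectively the bag difference the spec describes. A tiny made-up example:

```python
from collections import Counter

gt = Counter(["der", "der", "die", "das"])
ocr = Counter(["der", "die", "die", "dass"])

print(list((gt - ocr).elements()))  # ['der', 'das']  -> tokens missing from the OCR output
print(list((ocr - gt).elements()))  # ['die', 'dass'] -> spurious tokens in the OCR output
```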


M3ssman commented Mar 15, 2023

@bertsky
Where do the formulas in the OCR-D eval spec originate from?
Is a reference implementation for the BoW Error publicly available?

What's the purpose of the wikipage for evaluation?
Looks like this eval tool is not listed there 🙂


bertsky commented Mar 21, 2023

> Where do the formulas in the OCR-D eval spec originate from?

They have some references at the end, plus my suggestions in OCR-D/spec#240.

> Is a reference implementation for the BoW Error publicly available?

Not really, as far as I could find, which surprised me too. See the discussion in my reviews of the ocrd_eval spec.

> What's the purpose of the wikipage for evaluation?

Not sure where this fits within the ocrd-website and spec. But it states quite clearly…

> Problem statement
>
> Which data and tools can we use to objectively measure quality and compare results of both complete workflows and individual steps (beyond final text Character Error Rate) on a non-representative sample?

> Looks like this eval tool is not listed there

Indeed. But note that the page was last edited on Aug 20, 2020.

So edit along!

einspunktnull self-assigned this May 24, 2023
einspunktnull commented

@bertsky
I changed the BoW metric according to your request. See the latest commit in the issue branch https://github.com/ulb-sachsen-anhalt/digital-eval/tree/bow_metric_impl_%2310.
Take a look at the tests in tests/test_ocr_metrics.py, where I used the example values from the BoW Error Rate section of the OCR-D eval spec.
Please review and let me know what you think.

M3ssman closed this as completed Jun 7, 2023

M3ssman commented Jun 7, 2023

Re-opening, since it's up to @bertsky whether to close or not.

M3ssman reopened this Jun 7, 2023

bertsky commented Jun 7, 2023

Sorry, I forgot about this in the meantime. I will revisit and give my two cents.
