meeteval ignores false alarms due to issue in md_eval_22 #97

Closed
SamuelBroughton opened this issue Nov 29, 2024 · 6 comments

@SamuelBroughton

This issue is similar to (or the same as) nryant/dscore#9.

There is an issue in the md_eval_22 Perl script: when it is run without a UEM file, the script tries to generate one from the reference RTTM file only. This can lead to the evaluation not taking into account any false alarms that occurred before the first segment of labeled speech in the reference RTTM.

For example, if I have:

# sys.rttm (hypothesis)
SPEAKER rec 1 0.00 10.00 <NA> <NA> spk01 <NA>
SPEAKER rec 1 10.00 10.00 <NA> <NA> spk00 <NA>

and

# ref.rttm (ground truth)
SPEAKER rec 1 5.00 5.00 <NA> <NA> spk01 <NA>
SPEAKER rec 1 10.00 10.00 <NA> <NA> spk00 <NA>

I would expect a DER of 33.3%, because the total scored speech is 15 seconds and there are 5 seconds of false alarm predicted by the system at the beginning of the recording.

In reality, md_eval_22 outputs 0% DER, because the UEM generated on the fly covers only the reference start and end of speech, [5.00, 20.00], so the 5 seconds of false alarm speech the system predicted are never scored.
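
To spell out the arithmetic (using the standard definition of DER as the sum of errors divided by the scored reference speech):

DER = (missed + false alarm + speaker error) / scored reference speech

With a UEM covering the whole recording, [0.00, 20.00]:
  scored reference speech = 5 s + 10 s = 15 s
  false alarm             = 5 s   (system speech in [0.00, 5.00] with no reference speech)
  DER                     = 5 / 15 = 33.3%

With the auto-generated UEM [5.00, 20.00]:
  the system segment [0.00, 10.00] is trimmed to [5.00, 10.00], which matches the reference
  false alarm             = 0 s
  DER                     = 0 / 15 = 0%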

The dscore GitHub repo fixed this by generating a UEM file covering the entire recording before passing the reference RTTM, system RTTM and UEM file to md_eval_22 for evaluation, meaning md_eval_22 does not need to generate a UEM on the fly.
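
For illustration, here is roughly what that workaround looks like: collect the union of the reference and system speech extents and write the UEM up front (a sketch only, not dscore's actual code; the file names are taken from the example above and the channel is hard-coded):

from collections import defaultdict

def add_rttm_extents(path, extents):
    """Extend the per-recording (start, end) extents with the segments of an RTTM file."""
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields or fields[0] != 'SPEAKER':
                continue
            rec, start, dur = fields[1], float(fields[3]), float(fields[4])
            lo, hi = extents[rec]
            extents[rec] = (min(lo, start), max(hi, start + dur))

# Union of reference *and* system speech extents, so false alarms before the
# first reference segment stay inside the scored region.
extents = defaultdict(lambda: (float('inf'), 0.0))
add_rttm_extents('ref.rttm', extents)
add_rttm_extents('sys.rttm', extents)

# Write the UEM before scoring, so md-eval-22.pl does not derive one from the
# reference alone.
with open('all.uem', 'w') as f:
    for rec, (lo, hi) in sorted(extents.items()):
        f.write(f'{rec} 1 {lo:.2f} {hi:.2f}\n')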

meeteval-der md_eval_22 does not generate this UEM before running md_eval_22, so we get the incorrect behaviour of not scoring these false alarms.

In the same example, with STM-style files (which meeteval requires):

# sys.stm (hypothesis)
rec 1 spk01 0.00 10.00
rec 1 spk00 10.00 20.00
# ref.stm (ground truth)
rec 1 spk01 5.00 10.00
rec 1 spk00 10.00 20.00

If I run meeteval-der md_eval_22 -r ref.stm -h sys.stm, I get the output:

INFO %DER: 0.00% [ 0.00s / 15.00s, 0.00s missed, 0.00s falarm, 0.00s spk error ]
@boeddeker
Member

Hmm. Good point.

The idea was to have a thin wrapper for md_eval_22, as an alternative to calling that script directly.
So we wanted no magic or manipulations (except file conversions).

So, when you call it with a UEM, you get the desired output:

$ cat all.uem 
rec 1 0 20
$ meeteval-der md_eval_22 -r ref.rttm -h sys.rttm --uem all.uem 
INFO Wrote: /home/cbj/python/meeteval/tmp/mdeval/sys_md_eval_22_per_reco.json
INFO Wrote: /home/cbj/python/meeteval/tmp/mdeval/sys_md_eval_22.json
INFO %DER: 33.33% [ 5.00s / 15.00s, 0.00s missed, 5.00s falarm, 0.00s spk error ]
$ meeteval-der md_eval_22 -r ref.rttm -h sys.rttm
INFO Wrote: /home/cbj/python/meeteval/tmp/mdeval/sys_md_eval_22_per_reco.json
INFO Wrote: /home/cbj/python/meeteval/tmp/mdeval/sys_md_eval_22.json
INFO %DER: 0.00% [ 0.00s / 15.00s, 0.00s missed, 0.00s falarm, 0.00s spk error ]

Do you have an idea how to change the command line to make it obvious that it is different from calling the Perl script directly?
E.g. something in the direction of meeteval-der md_eval_22_fix -r ref.rttm -h sys.rttm (I don't like this, hence my question).

I think we should at least add a warning when someone calls md_eval_22 without a UEM.

@boeddeker
Member

boeddeker commented Nov 29, 2024

I checked the dscore tool and, if I understand it correctly, they call md-eval-22.pl to get the DER; the only modification is that they write a UEM file if it doesn't exist.

In #98, I added some code to calculate the DER in the same way as dscore.

$ meeteval-der md_eval_22 -r ref.rttm -h sys.rttm
INFO Wrote: /home/cbj/python/meeteval/tmp/mdeval/sys_md_eval_22_per_reco.json
INFO Wrote: /home/cbj/python/meeteval/tmp/mdeval/sys_md_eval_22.json
INFO %DER: 0.00% [ 0.00s / 15.00s, 0.00s missed, 0.00s falarm, 0.00s spk error ]
$ meeteval-der dscore -r ref.rttm -h sys.rttm
Loading speaker turns from reference RTTMs...
Loading speaker turns from system RTTMs...
WARNING: No universal evaluation map specified. Approximating from reference and speaker turn extents...
Trimming reference speaker turns to UEM scoring regions...
Trimming system speaker turns to UEM scoring regions...
Checking for overlapping reference speaker turns...
Checking for overlapping system speaker turns...
Scoring...
INFO Wrote: /home/cbj/python/meeteval/tmp/mdeval/sys_dscore_per_reco.json
INFO Wrote: /home/cbj/python/meeteval/tmp/mdeval/sys_dscore.json
INFO %DER: 33.33% [ 5.00s / 15.00s, 0.00s missed, 5.00s falarm, 0.00s spk error ]

@SamuelBroughton
Author

Yeah, cool, thanks for looking at this and implementing the dscore way of doing things @boeddeker. As you say, I think users should still be warned when they call md_eval_22 without a UEM, and I think this should happen at the API level.

Because this is more of an issue with the Perl script, I wouldn't mind seeing a parameter such as auto_generate_uem at the md_eval_22 API level, which could default to True and behave similarly to how dscore does things: write the UEM first, if and only if the user has not supplied their own UEM. Then, when it is False, you get the warning.

I understand you wanted to keep it as just a wrapper, but I feel this is an important fix to have.
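
Roughly what I have in mind, as a sketch only (the signature and the simplified {recording: [(start, end), ...]} inputs are made up for illustration; this is not meeteval's actual API):

import warnings

def md_eval_22(reference, hypothesis, uem=None, auto_generate_uem=True):
    # Sketch of the proposed behaviour; reference and hypothesis are
    # simplified here to {recording: [(start, end), ...]}.
    if uem is None:
        if auto_generate_uem:
            # dscore-style default: the UEM spans all reference *and*
            # hypothesis segments, so leading false alarms stay scored.
            uem = {}
            for segments in (reference, hypothesis):
                for rec, turns in segments.items():
                    starts = [start for start, _ in turns]
                    ends = [end for _, end in turns]
                    lo, hi = uem.get(rec, (min(starts), max(ends)))
                    uem[rec] = (min(lo, *starts), max(hi, *ends))
        else:
            warnings.warn(
                'md_eval_22 called without a UEM: md-eval-22.pl will derive '
                'one from the reference only and ignore leading false alarms.'
            )
    # ... then run md-eval-22.pl with reference, hypothesis and uem as before ...
    return uem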

@boeddeker
Member

As you say, I think users should still be warned when they call md_eval_22 without a UEM, and I think this should happen at the API level.

The code now emits a warning when md_eval_22 is called without a UEM file.

Because this is more of an issue with the Perl script, I wouldn't mind seeing a parameter such as auto_generate_uem at the md_eval_22 API level, which could default to True and behave similarly to how dscore does things: write the UEM first, if and only if the user has not supplied their own UEM. Then, when it is False, you get the warning.
I understand you wanted to keep it as just a wrapper, but I feel this is an important fix to have.

I am not a fan of changing the metric values of a reference implementation, even if there are good reasons.
When it is changed, you have to mention which particular implementation was used when you report metric values.

When users call der md_eval_22, they should get md-eval-22.pl with all its pros and cons.
Since dscore is commonly used for DER calculation, I simply added a new DER command: der dscore.
So now it is easy to switch between the defaults of md-eval-22.pl and dscore.

Additionally, dscore is listed first, and the CLI help text points out that it was used in challenges.

@SamuelBroughton
Author

SamuelBroughton commented Dec 18, 2024

@boeddeker I tried to use your updated code and noticed an issue.

The CLI works fine, but when you try to call dscore using the high-level API, it only runs the first time:

In [1]: import meeteval

In [2]: ers = meeteval.der.dscore(
   ...:     reference=meeteval.io.STM.parse('''
   ...:         recordingA 1 Alice 0 1 The quick brown fox jumps over the lazy dog
   ...:         recordingB 1 Bob 0 1 The quick brown fox jumps over the lazy dog
   ...:     '''),
   ...:     hypothesis=meeteval.io.STM.parse('''
   ...:         recordingA 1 spk-1 0 1 The kwick brown fox jump over lazy
   ...:         recordingB 1 spk-1 0 1 The kwick brown fox jump over lazy
   ...:     '''),
   ...:     collar=0,
   ...: )

In [3]: ers = meeteval.der.dscore(
   ...:     reference=meeteval.io.STM.parse('''
   ...:         recordingA 1 Alice 0 1 The quick brown fox jumps over the lazy dog
   ...:         recordingB 1 Bob 0 1 The quick brown fox jumps over the lazy dog
   ...:     '''),
   ...:     hypothesis=meeteval.io.STM.parse('''
   ...:         recordingA 1 spk-1 0 1 The kwick brown fox jump over lazy
   ...:         recordingB 1 spk-1 0 1 The kwick brown fox jump over lazy
   ...:     '''),
   ...:     collar=0,
   ...: )
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 ers = meeteval.der.dscore(
      2     reference=meeteval.io.STM.parse('''
      3         recordingA 1 Alice 0 1 The quick brown fox jumps over the lazy dog
      4         recordingB 1 Bob 0 1 The quick brown fox jumps over the lazy dog
      5     '''),
      6     hypothesis=meeteval.io.STM.parse('''
      7         recordingA 1 spk-1 0 1 The kwick brown fox jump over lazy
      8         recordingB 1 spk-1 0 1 The kwick brown fox jump over lazy
      9     '''),
     10     collar=0,
     11 )

TypeError: 'module' object is not callable

I believe this is because meeteval.der.dscore is the function dscore from meeteval.der.api, which then imports from the actual dscore module like so: from meeteval.der.dscore import dscore_multifile. That import re-binds meeteval.der.dscore to the module:

In [4]: meeteval.der.dscore.__file__
Out[4]: '/PATH/TO/YOUR/PYTHON/lib/python3.11/site-packages/meeteval/der/dscore.py'

I guess this is why you made the md_eval_22 function import from the module md_eval, so you don't have this import issue.
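
A minimal reproduction of that shadowing pattern outside of meeteval (the package layout below is invented purely to show the mechanism):

# pkg/__init__.py
from pkg.api import dscore           # binds pkg.dscore to the *function*

# pkg/api.py
def dscore(*args, **kwargs):
    # Importing the submodule that shares the function's name re-binds the
    # attribute pkg.dscore to the *module* pkg/dscore.py.
    from pkg.dscore import dscore_multifile
    return dscore_multifile(*args, **kwargs)

# pkg/dscore.py
def dscore_multifile(*args, **kwargs):
    return 'ok'

# >>> import pkg
# >>> pkg.dscore()   # first call works, but re-binds pkg.dscore to the module
# 'ok'
# >>> pkg.dscore()   # second call fails
# TypeError: 'module' object is not callable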

EDIT:
I renamed the dscore.py file to something else to test, and it works, so I can confirm this is the issue. I'm not sure how you will want to rename your files or modules, so I will leave another PR to you. Thanks!

@boeddeker
Member

Thanks for reporting this bug. I frequently forget that a function name and its file name should differ.

I'm not sure how you will want to rename your files or modules, so I will leave another PR to you.

We have no established pattern for such name collisions. I changed the file name to nryant_dscore.py
to give the author, nryant, credit for this change to md-eval.

boeddeker added a commit that referenced this issue Dec 18, 2024
rename dscore to nryant_dscore to avoid name collision (#97)