meeteval ignores false alarms due to issue in md_eval_22 #97

Closed
SamuelBroughton opened this issue Nov 29, 2024 · 6 comments

@SamuelBroughton

This issue is similar to (or the same as) nryant/dscore#9.

There is an issue in the md_eval_22 Perl script: when it is run without a UEM file, the script tries to generate one from the reference RTTM file only. This can lead to the evaluation not taking into account any false alarms that occurred before the first segment of labeled speech in the reference RTTM.

For example, if I have:

# sys.rttm (hypothesis)
SPEAKER rec 1 0.00 10.00 <NA> <NA> spk01 <NA>
SPEAKER rec 1 10.00 10.00 <NA> <NA> spk00 <NA>

and

# ref.rttm (ground truth)
SPEAKER rec 1 5.00 5.00 <NA> <NA> spk01 <NA>
SPEAKER rec 1 10.00 10.00 <NA> <NA> spk00 <NA>

I would expect a DER of 33.3%, because the total scored speech is 15 seconds and there are 5 seconds of false alarm predicted by the system at the beginning of the recording.

In reality, md_eval_22 outputs 0% DER, because the UEM generated on the fly covers only the reference start and end of speech, [5.00, 20.00], so the 5 seconds of false alarm speech the system predicted are never scored.
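
To spell out the arithmetic (using the standard definition of DER as the sum of errors divided by the scored reference speech):

DER = (missed + false alarm + speaker error) / scored reference speech

With a UEM covering the whole recording, [0.00, 20.00]:
  scored reference speech = 5 s + 10 s = 15 s
  false alarm             = 5 s   (system speech in [0.00, 5.00] with no reference speech)
  DER                     = 5 / 15 = 33.3%

With the auto-generated UEM [5.00, 20.00]:
  the system segment [0.00, 10.00] is trimmed to [5.00, 10.00], which matches the reference
  false alarm             = 0 s
  DER                     = 0 / 15 = 0%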

The dscore GitHub repo fixed this by generating a UEM file covering the entire recording before passing the reference RTTM, system RTTM and UEM file to md_eval_22 for evaluation, meaning md_eval_22 does not need to generate a UEM on the fly.
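
For illustration, here is roughly what that workaround looks like: collect the union of the reference and system speech extents and write the UEM up front (a sketch only, not dscore's actual code; the file names are taken from the example above and the channel is hard-coded):

from collections import defaultdict

def add_rttm_extents(path, extents):
    """Extend the per-recording (start, end) extents with the segments of an RTTM file."""
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields or fields[0] != 'SPEAKER':
                continue
            rec, start, dur = fields[1], float(fields[3]), float(fields[4])
            lo, hi = extents[rec]
            extents[rec] = (min(lo, start), max(hi, start + dur))

# Union of reference *and* system speech extents, so false alarms before the
# first reference segment stay inside the scored region.
extents = defaultdict(lambda: (float('inf'), 0.0))
add_rttm_extents('ref.rttm', extents)
add_rttm_extents('sys.rttm', extents)

# Write the UEM before scoring, so md-eval-22.pl does not derive one from the
# reference alone.
with open('all.uem', 'w') as f:
    for rec, (lo, hi) in sorted(extents.items()):
        f.write(f'{rec} 1 {lo:.2f} {hi:.2f}\n')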

meeteval-der md_eval_22 does not generate this UEM before running md_eval_22, so we get the incorrect behaviour of not scoring these false alarms.

In the same example, with STM-style files (which meeteval requires):

# sys.stm (hypothesis)
rec 1 spk01 0.00 10.00
rec 1 spk00 10.00 20.00
# ref.stm (ground truth)
rec 1 spk01 5.00 10.00
rec 1 spk00 10.00 20.00

If I run meeteval-der md_eval_22 -r ref.stm -h sys.stm, I get the output:

INFO %DER: 0.00% [ 0.00s / 15.00s, 0.00s missed, 0.00s falarm, 0.00s spk error ]
@boeddeker
Member

Hmm. Good point.

The idea was to have a thin wrapper for md_eval_22, as an alternative to calling that script directly.
So we wanted no magic or manipulations (except file conversions).

So, when you call it with a UEM, you get the desired output:

$ cat all.uem 
rec 1 0 20
$ meeteval-der md_eval_22 -r ref.rttm -h sys.rttm --uem all.uem 
INFO Wrote: /home/cbj/python/meeteval/tmp/mdeval/sys_md_eval_22_per_reco.json
INFO Wrote: /home/cbj/python/meeteval/tmp/mdeval/sys_md_eval_22.json
INFO %DER: 33.33% [ 5.00s / 15.00s, 0.00s missed, 5.00s falarm, 0.00s spk error ]
$ meeteval-der md_eval_22 -r ref.rttm -h sys.rttm
INFO Wrote: /home/cbj/python/meeteval/tmp/mdeval/sys_md_eval_22_per_reco.json
INFO Wrote: /home/cbj/python/meeteval/tmp/mdeval/sys_md_eval_22.json
INFO %DER: 0.00% [ 0.00s / 15.00s, 0.00s missed, 0.00s falarm, 0.00s spk error ]

Do you have an idea how to change the command line to make it obvious that it is different from calling the Perl script directly?
E.g. something in the direction of meeteval-der md_eval_22_fix -r ref.rttm -h sys.rttm (I don't like this, hence my question).

I think we should at least add a warning when someone calls md_eval_22 without a UEM.

@boeddeker
Member

boeddeker commented Nov 29, 2024

I checked the dscore tool and, if I understand it correctly, they call md-eval-22.pl to get the DER; the only modification is that they write a UEM file if it doesn't exist.

In #98, I added some code to calculate the DER in the same way as dscore.

$ meeteval-der md_eval_22 -r ref.rttm -h sys.rttm
INFO Wrote: /home/cbj/python/meeteval/tmp/mdeval/sys_md_eval_22_per_reco.json
INFO Wrote: /home/cbj/python/meeteval/tmp/mdeval/sys_md_eval_22.json
INFO %DER: 0.00% [ 0.00s / 15.00s, 0.00s missed, 0.00s falarm, 0.00s spk error ]
$ meeteval-der dscore -r ref.rttm -h sys.rttm
Loading speaker turns from reference RTTMs...
Loading speaker turns from system RTTMs...
WARNING: No universal evaluation map specified. Approximating from reference and speaker turn extents...
Trimming reference speaker turns to UEM scoring regions...
Trimming system speaker turns to UEM scoring regions...
Checking for overlapping reference speaker turns...
Checking for overlapping system speaker turns...
Scoring...
INFO Wrote: /home/cbj/python/meeteval/tmp/mdeval/sys_dscore_per_reco.json
INFO Wrote: /home/cbj/python/meeteval/tmp/mdeval/sys_dscore.json
INFO %DER: 33.33% [ 5.00s / 15.00s, 0.00s missed, 5.00s falarm, 0.00s spk error ]

@SamuelBroughton
Author

Yeah, cool, thanks for looking at this and implementing the dscore way of doing things @boeddeker. As you say, I think users should still be warned when they call md_eval_22 without a UEM, and I think this should happen at the API level.

Because this is more of an issue with the Perl script, I wouldn't mind seeing a parameter such as auto_generate_uem at the md_eval_22 API level, which could default to True and behave similarly to how dscore does things: write the UEM first, if and only if the user has not supplied their own UEM. Then, when it is False, you get the warning.

I understand you wanted to keep it as just a wrapper, but I feel this is an important fix to have.
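
Roughly what I have in mind, as a sketch only (the signature and the simplified {recording: [(start, end), ...]} inputs are made up for illustration; this is not meeteval's actual API):

import warnings

def md_eval_22(reference, hypothesis, uem=None, auto_generate_uem=True):
    # Sketch of the proposed behaviour; reference and hypothesis are
    # simplified here to {recording: [(start, end), ...]}.
    if uem is None:
        if auto_generate_uem:
            # dscore-style default: the UEM spans all reference *and*
            # hypothesis segments, so leading false alarms stay scored.
            uem = {}
            for segments in (reference, hypothesis):
                for rec, turns in segments.items():
                    starts = [start for start, _ in turns]
                    ends = [end for _, end in turns]
                    lo, hi = uem.get(rec, (min(starts), max(ends)))
                    uem[rec] = (min(lo, *starts), max(hi, *ends))
        else:
            warnings.warn(
                'md_eval_22 called without a UEM: md-eval-22.pl will derive '
                'one from the reference only and ignore leading false alarms.'
            )
    # ... then run md-eval-22.pl with reference, hypothesis and uem as before ...
    return uem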

@boeddeker
Member

As you say, I think users should still be warned when they call md_eval_22 without a UEM, and I think this should happen at the API level.

The code now emits a warning when md_eval_22 is called without a UEM file.

Because this is more of an issue with the Perl script, I wouldn't mind seeing a parameter such as auto_generate_uem at the md_eval_22 API level, which could default to True and behave similarly to how dscore does things: write the UEM first, if and only if the user has not supplied their own UEM. Then, when it is False, you get the warning.
I understand you wanted to keep it as just a wrapper, but I feel this is an important fix to have.

I am not a fan of changing the metric values of a reference implementation, even if there are good reasons.
When it is changed, you have to mention which particular implementation was used when you report metric values.

When users call der md_eval_22, they should get md-eval-22.pl with all its pros and cons.
Since dscore is commonly used for DER calculation, I simply added a new DER command: der dscore.
So now it is easy to switch between the defaults of md-eval-22.pl and dscore.

Additionally, dscore is listed first, and the CLI help text points out that it was used in challenges.

@SamuelBroughton
Author

SamuelBroughton commented Dec 18, 2024

@boeddeker I tried to use your updated code and noticed an issue.

The CLI works fine, but when you try to call dscore using the high-level API, it only runs the first time:

In [1]: import meeteval

In [2]: ers = meeteval.der.dscore(
   ...:     reference=meeteval.io.STM.parse('''
   ...:         recordingA 1 Alice 0 1 The quick brown fox jumps over the lazy dog
   ...:         recordingB 1 Bob 0 1 The quick brown fox jumps over the lazy dog
   ...:     '''),
   ...:     hypothesis=meeteval.io.STM.parse('''
   ...:         recordingA 1 spk-1 0 1 The kwick brown fox jump over lazy
   ...:         recordingB 1 spk-1 0 1 The kwick brown fox jump over lazy
   ...:     '''),
   ...:     collar=0,
   ...: )

In [3]: ers = meeteval.der.dscore(
   ...:     reference=meeteval.io.STM.parse('''
   ...:         recordingA 1 Alice 0 1 The quick brown fox jumps over the lazy dog
   ...:         recordingB 1 Bob 0 1 The quick brown fox jumps over the lazy dog
   ...:     '''),
   ...:     hypothesis=meeteval.io.STM.parse('''
   ...:         recordingA 1 spk-1 0 1 The kwick brown fox jump over lazy
   ...:         recordingB 1 spk-1 0 1 The kwick brown fox jump over lazy
   ...:     '''),
   ...:     collar=0,
   ...: )
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 ers = meeteval.der.dscore(
      2     reference=meeteval.io.STM.parse('''
      3         recordingA 1 Alice 0 1 The quick brown fox jumps over the lazy dog
      4         recordingB 1 Bob 0 1 The quick brown fox jumps over the lazy dog
      5     '''),
      6     hypothesis=meeteval.io.STM.parse('''
      7         recordingA 1 spk-1 0 1 The kwick brown fox jump over lazy
      8         recordingB 1 spk-1 0 1 The kwick brown fox jump over lazy
      9     '''),
     10     collar=0,
     11 )

TypeError: 'module' object is not callable

I believe this is because meeteval.der.dscore is the function dscore from meeteval.der.api, which then imports from the actual dscore module like so: from meeteval.der.dscore import dscore_multifile. That import re-binds meeteval.der.dscore to the module:

In [4]: meeteval.der.dscore.__file__
Out[4]: '/PATH/TO/YOUR/PYTHON/lib/python3.11/site-packages/meeteval/der/dscore.py'

I guess this is why you made the md_eval_22 function import from the module md_eval, so you don't have this import issue.
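
A minimal reproduction of that shadowing pattern outside of meeteval (the package layout below is invented purely to show the mechanism):

# pkg/__init__.py
from pkg.api import dscore           # binds pkg.dscore to the *function*

# pkg/api.py
def dscore(*args, **kwargs):
    # Importing the submodule that shares the function's name re-binds the
    # attribute pkg.dscore to the *module* pkg/dscore.py.
    from pkg.dscore import dscore_multifile
    return dscore_multifile(*args, **kwargs)

# pkg/dscore.py
def dscore_multifile(*args, **kwargs):
    return 'ok'

# >>> import pkg
# >>> pkg.dscore()   # first call works, but re-binds pkg.dscore to the module
# 'ok'
# >>> pkg.dscore()   # second call fails
# TypeError: 'module' object is not callable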

EDIT:
I renamed the dscore.py file to something else to test, and it works, so I can confirm this is the issue. I'm not sure how you will want to rename your files or modules, so I will leave another PR to you. Thanks!

@boeddeker
Member

Thanks for reporting this bug. I frequently forget that a function name and its file name should differ.

I'm not sure how you will want to rename your files or modules, so I will leave another PR to you.

We have no established pattern for such name collisions. I changed the file name to nryant_dscore.py
to give the author, nryant, credit for this change to md-eval.

boeddeker added a commit that referenced this issue Dec 18, 2024
rename dscore to nryant_dscore to avoid name collision (#97)