From d7ca4a053f8657c00fb70d1aaad825317405fb5e Mon Sep 17 00:00:00 2001
From: Shahruk Hossain
Date: Mon, 18 Jul 2022 17:19:21 +0600
Subject: [PATCH] doc: update README example

---
 README.md | 124 +++++++++++++++++++++++++++++++-----------------------
 1 file changed, 72 insertions(+), 52 deletions(-)

diff --git a/README.md b/README.md
index 8724df5..075461d 100644
--- a/README.md
+++ b/README.md
@@ -20,9 +20,12 @@

 ## Usage Examples

-### Evaluating CER
+### Evaluating WER

-- This example evaluates the character error rate (CER) between reference transcripts and hypothesis transcript generated by a ASR system. The example uses Bengali text but SCTK supports most languages since it expects text with UTF-8 encoding.
+- This example evaluates the word error rate (WER) between reference transcripts
+  and hypothesis transcripts generated by an ASR system. The example uses Bengali
+  text but SCTK supports most languages since it expects text with UTF-8
+  encoding.

 ```sh
 # Creating dummy reference transcript file in CSV format.
@@ -40,96 +43,113 @@ spk02-utt02,খেলা ছার টেস্ট শিরিজের চূ
 EOF

 # Getting the sctk CLI tool from this repository and giving it executable permissions.
-version=v0.1.0
+version=v0.2.0
 wget -O sctk https://github.com/shahruk10/go-sctk/releases/download/${version}/sctk
 chmod +x sctk

-# Using sctk CLI to evaluate CER and check errors.
+# Using sctk CLI to evaluate WER and check errors.
 #
 # Setting `--ignore-first=true` to ignore header row.
 # Check `sctk score --help` for documentation of each argument.
-sctk score \
+#
+# To compare characters instead of words, and calculate the
+# character error rate (CER) instead of WER, set --cer=true.
+./sctk score \
   --ignore-first=true \
   --delimiter="," \
   --col-id=0 \
   --col-trn=1 \
-  --case-sensitive=false \
   --normalize-unicode=true \
-  --cer=true \
-  --out=./cer \
+  --cer=false \
+  --out=./report \
   --ref=reference.csv \
   --hyp=hypothesis.csv
 ```

-- Now we can check generated reports in the `./cer` directory.
+- Now we can check generated reports in the `./report` directory.

   ```
-  cer/
+  report/
   ├── hyp1.trn
   ├── hyp1.trn.dtl
-  ├── hyp1.trn.pra
   ├── hyp1.trn.raw
   ├── hyp1.trn.sgml
   ├── hyp1.trn.sys
+  ├── hyp1.trn.pra.html
+  ├── hyp1.trn.pra.md
+  ├── hyp1.trn.pra.csv
+  ├── hyp1.trn.pra.json
+  ├── hyp1.trn.pra
   └── ref.trn
   ```

 - The `*.sys` file contains a table showing a breakdown of the different types of errors.

-  The results are aggregated for each speaker; `Corr`, `Sub`, `Del` and `Ins` stands for
-  the percentage of characters that were correctly decoded, substituted, deleted and inserted
-  in the hypothesis respectively.
+  - The results are aggregated for each speaker; `Corr`, `Sub`, `Del` and `Ins`
+    stand for the percentage of words (characters in case of CER) that were
+    correctly decoded, substituted, deleted and inserted in the hypothesis
+    respectively.

   ```
-  SYSTEM SUMMARY PERCENTAGES by SPEAKER
-
-  ,----------------------------------------------------------------.
-  |                              hyp1                              |
-  |----------------------------------------------------------------|
-  | SPKR   | # Snt # Chr | Corr    Sub    Del    Ins    Err  S.Err |
-  |--------+-------------+-----------------------------------------|
-  | spk01  |     1    25 | 76.0    0.0   24.0    0.0   24.0  100.0 |
-  |--------+-------------+-----------------------------------------|
-  | spk02  |     1    33 | 87.9    6.1    6.1    0.0   12.1  100.0 |
-  |================================================================|
-  | Sum/Avg|     2    58 | 82.8    3.4   13.8    0.0   17.2  100.0 |
-  |================================================================|
-  | Mean   |   1.0  29.0 | 81.9    3.0   15.0    0.0   18.1  100.0 |
-  | S.D.   |   0.0   5.7 |  8.4    4.3   12.7    0.0    8.4    0.0 |
-  | Median |   1.0  29.0 | 81.9    3.0   15.0    0.0   18.1  100.0 |
-  `----------------------------------------------------------------'
+  SYSTEM SUMMARY PERCENTAGES by SPEAKER
+
+  ,----------------------------------------------------------------.
+  |                              hyp1                              |
+  |----------------------------------------------------------------|
+  | SPKR   | # Snt # Wrd | Corr    Sub    Del    Ins    Err  S.Err |
+  |--------+-------------+-----------------------------------------|
+  | spk01  |     1     6 | 50.0   50.0    0.0    0.0   50.0  100.0 |
+  |--------+-------------+-----------------------------------------|
+  | spk02  |     1     6 | 50.0   50.0    0.0    0.0   50.0  100.0 |
+  |================================================================|
+  | Sum/Avg|     2    12 | 50.0   50.0    0.0    0.0   50.0  100.0 |
+  |================================================================|
+  | Mean   |   1.0   6.0 | 50.0   50.0    0.0    0.0   50.0  100.0 |
+  | S.D.   |   0.0   0.0 |  0.0    0.0    0.0    0.0    0.0    0.0 |
+  | Median |   1.0   6.0 | 50.0   50.0    0.0    0.0   50.0  100.0 |
+  `----------------------------------------------------------------'
   ```

+- The `*.pra.md` and `*.pra.html` files show alignments between the reference
+  and hypothesis text in Markdown and HTML format respectively. These alignment
+  files make it easy to see errors in context. In the table below, taken from
+  `hyp1.trn.pra.md`, `S` indicates substitutions. `D` and `I` would represent
+  deletions and insertions respectively.
+
+  | | | | | | | |
+  |:-----|:------:|:---:|:-----:|:-------:|:--------:|:----:|
+  | REF | খেলাটি | চার | টেস্ট | সিরিজের | চূড়ান্ত | ছিল। |
+  | HYP1 | খেলা | ছার | টেস্ট | শিরিজের | চূড়ান্ত | ছিল। |
+  | EVAL | S | S | | S | | |
+
+- These alignments are also available in JSON format in the `*.pra.json` file,
+  which can be easily loaded into different programs and used for analysis or
+  combining different ASR results.
+
+- Furthermore, multiple ASR systems can be evaluated together by providing more than
+  one hypothesis with additional uses of the `--hyp` flag when using the `sctk` CLI.
+
 - The `*.dtl` file shows further details of each type of error. This can reveal systematic
-  errors and patterns in how the ASR system is transcribing the audio.
+  errors and patterns in how the ASR system is transcribing the audio. When evaluating CER,
+  this file will show character level information, instead of word level.

   ```
   ... (other useful stuff)

-  CONFUSION PAIRS Total (2)
-                  With >= 1 occurrences (2)
-    1: 1 -> চ ==> ছ
-    2: 1 -> স ==> শ
-    -------
-    2
+  CONFUSION PAIRS Total (6)
+                  With >= 1 occurrences (6)
+    1: 1 -> ইউরো। ==> ইউর।
+    2: 1 -> খেলাটি ==> খেলা
+    3: 1 -> চার ==> ছার
+    4: 1 -> বার্ষিক ==> বার্
+    5: 1 -> লক্ষ ==> লক
+    6: 1 -> সিরিজের ==> শিরিজের
+    -------
+    6

   ... (other useful stuff)
-
-  DELETIONS Total (6)
-            With >= 1 occurrences (6)
-    1: 2 -> ষ
-    2: 2 -> ি
-    3: 1 -> ক
-    4: 1 -> ট
-    5: 1 -> ো
-    6: 1 -> ্
-    -------
-    8
   ```

--- The `*.pra` file shows alignments between the reference and hypothesis text, which
-  makes it easy to see errors in context.
-
 ---

 ## License
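
The `Err` column in the `*.sys` tables in the patch above is the sum of `Sub`, `Del` and `Ins`, each expressed as a percentage of the number of reference words. The following is a minimal sketch of that arithmetic, assuming a plain unit-cost edit distance over whole words. It is not taken from go-sctk, whose aligner may weight errors differently, and the name `count_errors` is hypothetical. The input is the REF/HYP1 pair from the `hyp1.trn.pra.md` alignment table in the patch.

```python
from typing import List, Tuple


def count_errors(ref: List[str], hyp: List[str]) -> Tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) for one minimum-edit alignment.

    Hypothetical helper for illustration; unit cost for every error type.
    """
    # cost[i][j] = (total_edits, subs, dels, ins) for ref[:i] versus hyp[:j].
    cost = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)  # only deletions when the hypothesis is empty
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)  # only insertions when the reference is empty
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (t, s, d, n)          # correct word, no extra cost
            else:
                diag = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = cost[i - 1][j]
            delete = (t + 1, s, d + 1, n)    # reference word missing from hyp
            t, s, d, n = cost[i][j - 1]
            insert = (t + 1, s, d, n + 1)    # extra word in hyp
            cost[i][j] = min(diag, delete, insert, key=lambda c: c[0])
    _, subs, dels, ins = cost[len(ref)][len(hyp)]
    return subs, dels, ins


# Words from the REF row of the alignment table.
ref = ["খেলাটি", "চার", "টেস্ট", "সিরিজের", "চূড়ান্ত", "ছিল।"]
# HYP1 row; words that were recognised correctly are reused from ref.
hyp = ["খেলা", "ছার", ref[2], "শিরিজের", ref[4], ref[5]]

subs, dels, ins = count_errors(ref, hyp)
wer = 100.0 * (subs + dels + ins) / len(ref)
print(f"S={subs} D={dels} I={ins} WER={wer:.1f}%")
```

For this pair the sketch finds three substitutions and no deletions or insertions across six reference words, which gives 50.0% and agrees with the `spk02` row of the WER table. Passing lists of characters instead of lists of words would give the corresponding character-level counts under the same unit-cost assumption.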