Adjusting documentation and bumping version

revdotcom · Apr 30, 2021 · 702b13e
1 parent 2b0b127

Showing 4 changed files with 165 additions and 167 deletions.
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -45,7 +45,7 @@ if(DYNAMIC_OPENFST)
   )
 else()
   set(OPENFST_LIBRARIES
-    ${OPENFST_ROOT}/lib/libfst.a
+    ${OPENFST_ROOT}/lib/libfst.a -ldl
   )
 endif()
 

diff --git a/README.md b/README.md
@@ -10,18 +10,16 @@
 - [Quickstart](#Quickstart)
   * [WER Subcommand](#WER-Subcommand)
   * [Align Subcommand](#Align-Subcommand)
-- [Inputs](#Inputs)
-- [Outputs](#Outputs)
+- [Advanced Usage](#Advanced-Usage)
 
 ## Overview
 `fstalign` is a tool for creating alignment between two sequences of tokens (hereafter referred to as “reference” and “hypothesis”). It has two key functions: computing word error rate (WER) and aligning [NLP-formatted](https://github.com/revdotcom/fstalign/blob/develop/docs/NLP-Format.md) references with CTM hypotheses.
 
-Due to its use of OpenFST and lazy algorithms for text-based edit-distance alignment, `fstalign` is one of the fastest and most efficient tools for calculating WER. Furthermore, the tool offers additional features to augment error analysis, which will be covered more in depth below.
+Due to its use of OpenFST and lazy algorithms for text-based alignment, `fstalign` is efficient for calculating WER while also providing significant flexibility for different measurement features and error analysis.
 
 ## Installation
 
 ### Dependencies
-
 We use git submodules to manage third-party dependencies. Initialize and update submodules before proceeding to the main build steps.
 ```
 git submodule update --init --recursive
@@ -39,7 +37,6 @@ Additionally, we have dependencies outside of the third-party submodules:
 - OpenFST - currently provided to the build system by setting the $OPENFST_ROOT environment variable or during the CMake command via `-DOPENFST_ROOT`.
 
 ### Build
-
 The current build framework is CMake. Install CMake following the instructions here (https://cmake.org/install/).
 
 To build fstalign, run:
@@ -119,170 +116,12 @@ When run, fstalign will dump a log to STDOUT with summary WER information at the
 [+++] [20:37:10] [wer] best WER: 2/5 = 0.4000 (Total words in reference: 5)
 [+++] [20:37:10] [wer] best WER: INS:0 DEL:0 SUB:2
 [+++] [20:37:10] [wer] best WER: Precision:0.600000 Recall:0.600000
-[+++] [20:37:10] [console] done
 ```
 
 Note that in addition to general WER, the insertion/deletion/substitution breakdown is also printed. fstalign also has other useful outputs, including a JSON log for downstream machine parsing, and a side-by-side view of the alignment and errors generated. For more details, see the [Output](#Output) section below.
 
-Much of the advanced usage and features for fstalign come from providing [NLP file inputs](#NLP) to the references. Some of these features include:
-- Entity category WER and normalization: based on labels in the NLP file, entities are grouped into classes in the WER output
-  - For example: if the NLP has `2020|0||||CA|['0:YEAR']|` you will see
-```s
-[+++] [22:36:50] [approach1] class YEAR         WER: 0/8 = 0.0000
-```
-
-  - Another useful feature here is normalization, which allows tokens with entity labels to have multiple normalizations accepted as correct by fstalign. This functionality is enabled when the tool is invoked with `--ref-json <path_to_norm_sidecar>` (passed in addition to the `--ref`). This enables something like `2020` to be treated equivalent to `twenty twenty`. More details on the specification for this file are specified in the [Inputs](#Inputs) section below. Note that only reference-side normalization is currently supported.
-
-- Speaker-wise WER: since the NLP file contains a speaker column, fstalign logs and output will provide a breakdown of WER by speaker ID if non-null
-
-- Speaker-switch WER: similarly, fstalign will report the error rate of words around a speaker switch
-  - The window size for the context of a speaker switch can be adjusted with the `--speaker-switch-context <int>` flag. By default this is set to 5.
-
-
 ### Align Subcommand
 Usage of the `align` subcommand is almost identical to the `wer` subcommand. The exception is that `align` can only be run if the provided reference is a NLP and the provided hypothesis is a CTM. This is because the core function of the subcommand is to align an NLP without timestamps to a CTM that has timestamps, producing an output of tokens from the reference with timings from the hypothesis.
 
-## Inputs
-### CTM
-Time-marked conversations (CTM) are typical outputs for ASR systems. The format of CTMs that fstalign assumes is that each token is on a new line separated by spaces with the following fields.
-```
-<recording_id> <channel_id> <token_start_ts> <token_end_ts> <token_value>
-```
-Moreover, there is an optional sixth field `<confidence_score>` that is read in if provided. The field does not affect the WER calculation and is primarily there just to support the parsing the common alteration to the basic CTM format.
-
-Example (no confidence scores):
-```
-test.wav 1 1.0 1.0 a
-test.wav 1 3.0 1.0 b
-test.wav 1 5.0 1.0 c
-test.wav 1 7.0 1.0 d
-test.wav 1 9.0 1.0 <unk>
-test.wav 1 11.0 1.0 e
-test.wav 1 13.0 1.0 f
-test.wav 1 15.0 1.0 g
-```
-
-### NLP
-[NLP Format](https://github.com/revdotcom/fstalign/blob/develop/docs/NLP-Format.md)
-
-### FST
-OpenFST FST files can only be passed to the `--hyp` parameter. fstalign will directly use this FST as the hypothesis during alignment. This is useful for something like oracle lattice analysis, where the reference is aligned to the most accurate path present in a lattice.
-
-### Synonyms
-Synonyms allow for reference words to be equivalent to similar forms (determined by the user) for error counting. They are accepted for any input formats and passed into the tool via the `--syn <path_to_synonym_file>` flag. For details see [Synonyms Format](https://github.com/revdotcom/fstalign/blob/develop/docs/Synonyms-Format.md). A standard set of synonyms we use at Rev.ai is available in the repository under `sample_data/synonyms.rules.txt`.
-
-### Normalizations
-Normalizations are a similar concept to synonyms. They allow a token or group of tokens to be represented by alternatives when calculating the WER alignment. Unlike synonyms, they are only accepted for NLP file inputs where the tokens are tagged with a unique ID. The normalizations are specified in a JSON format, with the unique ID as keys. Example to illustrate the schema:
-```
-{
-    "0": {
-        "candidates": [
-            {
-                "probability": 0.5,   // Optional and currently unused field
-                "verbalization": [
-                    "twenty",
-                    "twenty"
-                ]
-            },
-            {
-                "probability": 0.5,
-                "verbalization": [
-                    "two",
-                    "thousand",
-                    "and",
-                    "twenty"
-                ]
-            }
-        ],
-        "class": "YEAR"
-    }
-}
-```
-
-## Outputs
-
-### Text Log
-CLI flag: `--log`
-
-Saves stdout messages to a log file.
-
-### SBS
-CLI flag: `--output-sbs`
-
-Writes a side-by-side alignment of the reference and hypothesis to a file. Useful for debugging and error analysis.
-
-Example:
-```
-           ref_token    hyp_token               IsErr   Class
-                   i    i                               
-                 was    was                             
-                just    just
-               going    going                           
-                  to    to                              
-                 say    say                             
-                 one    one
-               thing    thing
-               <ins>    me                      ERR
-                 i'm    i                       ERR     ___0_CONTRACTION___
-              really    really
-        appreciating    appreciated             ERR
-```
-
-In this example, "i'm" was labeled as `___0_CONTRACTION___` in the reference, so the error will be added when computing the WER specific for `CONTRACTION` entities.
-
-### JSON Log
-CLI flag: `--json-log`
-
-Writes all WER statistics and precision/recall information to a machine-parseable JSON file.
-
-Schema: [json_log_schema.json](https://github.com/revdotcom/fstalign/blob/develop/docs/json_log_schema.json)
-
-Example snippet:
-```
-{
-        "wer" : 
-        {
-                "bestWER" :
-                {
-                        "deletions" : 93,
-                        "insertions" : 47,
-                        "meta" : {},
-                        "numErrors" : 228,
-                        "numWordsInReference" : 1312,
-                        "precision" : 0.89336490631103516,
-                        "recall" : 0.86204266548156738,
-                        "substitutions" : 88,
-                        "wer" : 0.17378048598766327
-                },
-                "classWER" :
-                {
-                        "CARDINAL" :
-                        {
-                                "deletions" : 0,
-                                "insertions" : 0,
-                                "meta" : {},
-                                "numErrors" : 0,
-                                "numWordsInReference" : 7,
-                                "substitutions" : 0,
-                                "wer" : 0.0
-                        }
-                },
-                "bigrams" :
-                {
-                        "amount of" :
-                        {
-                                "correct" : 0,
-                                "deletions" : 1,
-                                "insertions" : 0,
-                                "precision" : 0.0,
-                                "recall" : 0.0,
-                                "substitutions" : 0
-                        },
-```
-The “bigrams” and “unigrams” fields are only populated with unigrams and bigrams that surpass the minimum frequency specified by the `--pr_threshold` flag, which is set to 0 by default.
-
-### NLP
-
-CLI flag: `--output-nlp`
-
-Writes out the reference [NLP](https://github.com/revdotcom/fstalign/blob/develop/docs/NLP-Format.md), but with timings provided by a hypothesis CTM. Mostly relevant for the `align` subcommand.
+## Advanced Usage
+See [the advanced usage doc](https://github.com/revdotcom/fstalign/blob/develop/docs/Advanced-Usage.md) for more details.
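
The summary lines in the README's sample log above (`2/5 = 0.4000`, `INS:0 DEL:0 SUB:2`, `Precision:0.600000 Recall:0.600000`) are consistent with the usual edit-distance definitions. The following is a minimal Python sketch of that arithmetic, not fstalign's actual implementation. The names `WerSummary` and `summarize` are hypothetical, and the precision/recall formulas are an assumption chosen because they reproduce the logged values.

```python
from dataclasses import dataclass


@dataclass
class WerSummary:
    wer: float
    precision: float
    recall: float


def summarize(ref_words: int, ins: int, dels: int, subs: int) -> WerSummary:
    """Derive WER, precision and recall from edit counts (hypothetical helper)."""
    if ref_words <= 0:
        raise ValueError("reference must contain at least one word")
    correct = ref_words - dels - subs   # reference words matched exactly
    hyp_words = correct + subs + ins    # words present in the hypothesis
    wer = (ins + dels + subs) / ref_words
    # Assumed definitions: precision over hypothesis words, recall over reference words.
    precision = correct / hyp_words if hyp_words else 0.0
    recall = correct / ref_words
    return WerSummary(wer, precision, recall)


# Counts taken from the sample log: 5 reference words, 2 substitutions.
print(summarize(ref_words=5, ins=0, dels=0, subs=2))
```

With the counts from the sample log, the sketch yields a WER of 0.4 and precision and recall of 0.6, matching the logged summary. The same formulas also agree, to within single-precision rounding, with the `bestWER` figures in the JSON log example that this commit moves to `docs/Advanced-Usage.md`.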
diff --git a/docs/Advanced-Usage.md b/docs/Advanced-Usage.md
@@ -0,0 +1,159 @@
+## Advanced Usage
+Many of the advanced features for fstalign come from providing [NLP file inputs](#NLP) to the references. Some of these features include:
+- Entity category WER and normalization: based on labels in the NLP file, entities are grouped into classes in the WER output
+  - For example: if the NLP has `2020|0||||CA|['0:YEAR']|` you will see
+```s
+[+++] [22:36:50] [approach1] class YEAR         WER: 0/8 = 0.0000
+```
+
+  - Another useful feature here is normalization, which allows tokens with entity labels to have multiple normalizations accepted as correct by fstalign. This functionality is enabled when the tool is invoked with `--ref-json <path_to_norm_sidecar>` (passed in addition to `--ref`). This enables something like `2020` to be treated as equivalent to `twenty twenty`. The specification for this file is described in the [Inputs](#Inputs) section below. Note that only reference-side normalization is currently supported.
+
+- Speaker-wise WER: since the NLP file contains a speaker column, fstalign logs and output will provide a breakdown of WER by speaker ID if non-null
+
+- Speaker-switch WER: similarly, fstalign will report the error rate of words around a speaker switch
+  - The window size for the context of a speaker switch can be adjusted with the `--speaker-switch-context <int>` flag. By default this is set to 5.
+
+## Inputs
+### CTM
+Time-marked conversations (CTM) are typical outputs for ASR systems. The format of CTMs that fstalign assumes is that each token is on a new line separated by spaces with the following fields.
+```
+<recording_id> <channel_id> <token_start_ts> <token_end_ts> <token_value>
+```
+Moreover, there is an optional sixth field `<confidence_score>` that is read in if provided. The field does not affect the WER calculation and is there primarily to support parsing this common extension of the basic CTM format.
+
+Example (no confidence scores):
+```
+test.wav 1 1.0 1.0 a
+test.wav 1 3.0 1.0 b
+test.wav 1 5.0 1.0 c
+test.wav 1 7.0 1.0 d
+test.wav 1 9.0 1.0 <unk>
+test.wav 1 11.0 1.0 e
+test.wav 1 13.0 1.0 f
+test.wav 1 15.0 1.0 g
+```
+
+### NLP
+[NLP Format](https://github.com/revdotcom/fstalign/blob/develop/docs/NLP-Format.md)
+
+### FST
+OpenFST FST files can only be passed to the `--hyp` parameter. fstalign will directly use this FST as the hypothesis during alignment. This is useful for something like oracle lattice analysis, where the reference is aligned to the most accurate path present in a lattice.
+
+### Synonyms
+Synonyms allow for reference words to be equivalent to similar forms (determined by the user) for error counting. They are accepted for any input formats and passed into the tool via the `--syn <path_to_synonym_file>` flag. For details see [Synonyms Format](https://github.com/revdotcom/fstalign/blob/develop/docs/Synonyms-Format.md). A standard set of synonyms we use at Rev.ai is available in the repository under `sample_data/synonyms.rules.txt`.
+
+### Normalizations
+Normalizations are a similar concept to synonyms. They allow a token or group of tokens to be represented by alternatives when calculating the WER alignment. Unlike synonyms, they are only accepted for NLP file inputs where the tokens are tagged with a unique ID. The normalizations are specified in a JSON format, with the unique ID as keys. Example to illustrate the schema:
+```
+{
+    "0": {
+        "candidates": [
+            {
+                "probability": 0.5,   // Optional and currently unused field
+                "verbalization": [
+                    "twenty",
+                    "twenty"
+                ]
+            },
+            {
+                "probability": 0.5,
+                "verbalization": [
+                    "two",
+                    "thousand",
+                    "and",
+                    "twenty"
+                ]
+            }
+        ],
+        "class": "YEAR"
+    }
+}
+```
+
+## Outputs
+
+### Text Log
+CLI flag: `--log`
+
+Saves stdout messages to a log file.
+
+### SBS
+CLI flag: `--output-sbs`
+
+Writes a side-by-side alignment of the reference and hypothesis to a file. Useful for debugging and error analysis.
+
+Example:
+```
+           ref_token    hyp_token               IsErr   Class
+                   i    i                               
+                 was    was                             
+                just    just
+               going    going                           
+                  to    to                              
+                 say    say                             
+                 one    one
+               thing    thing
+               <ins>    me                      ERR
+                 i'm    i                       ERR     ___0_CONTRACTION___
+              really    really
+        appreciating    appreciated             ERR
+```
+
+In this example, "i'm" was labeled as `___0_CONTRACTION___` in the reference, so the error will be added when computing the WER specific for `CONTRACTION` entities.
+
+### JSON Log
+CLI flag: `--json-log`
+
+Writes all WER statistics and precision/recall information to a machine-parseable JSON file.
+
+Schema: [json_log_schema.json](https://github.com/revdotcom/fstalign/blob/develop/docs/json_log_schema.json)
+
+Example snippet:
+```
+{
+        "wer" : 
+        {
+                "bestWER" :
+                {
+                        "deletions" : 93,
+                        "insertions" : 47,
+                        "meta" : {},
+                        "numErrors" : 228,
+                        "numWordsInReference" : 1312,
+                        "precision" : 0.89336490631103516,
+                        "recall" : 0.86204266548156738,
+                        "substitutions" : 88,
+                        "wer" : 0.17378048598766327
+                },
+                "classWER" :
+                {
+                        "CARDINAL" :
+                        {
+                                "deletions" : 0,
+                                "insertions" : 0,
+                                "meta" : {},
+                                "numErrors" : 0,
+                                "numWordsInReference" : 7,
+                                "substitutions" : 0,
+                                "wer" : 0.0
+                        }
+                },
+                "bigrams" :
+                {
+                        "amount of" :
+                        {
+                                "correct" : 0,
+                                "deletions" : 1,
+                                "insertions" : 0,
+                                "precision" : 0.0,
+                                "recall" : 0.0,
+                                "substitutions" : 0
+                        },
+```
+The “bigrams” and “unigrams” fields are only populated with unigrams and bigrams that surpass the minimum frequency specified by the `--pr_threshold` flag, which is set to 0 by default.
+
+### NLP
+
+CLI flag: `--output-nlp`
+
+Writes out the reference [NLP](https://github.com/revdotcom/fstalign/blob/develop/docs/NLP-Format.md), but with timings provided by a hypothesis CTM. Mostly relevant for the `align` subcommand.
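
The CTM layout described in the Inputs section of `docs/Advanced-Usage.md` above (five space-separated fields plus an optional sixth confidence score) can be read with a few lines of Python. This is a minimal sketch, not the parser fstalign uses. The names `CtmToken` and `parse_ctm_line` are hypothetical. The fourth column is stored under the documented name without interpretation, since the sample values (`1.0` on every line) would also be compatible with a duration.

```python
from typing import NamedTuple, Optional


class CtmToken(NamedTuple):
    recording_id: str
    channel_id: str
    start_ts: float
    end_ts: float          # fourth column, kept as-is without interpretation
    token: str
    confidence: Optional[float] = None


def parse_ctm_line(line: str) -> CtmToken:
    """Parse one space-separated CTM line with five or six fields."""
    fields = line.split()
    if len(fields) not in (5, 6):
        raise ValueError(f"expected 5 or 6 fields, got {len(fields)}: {line!r}")
    rec, chan, start, end, token = fields[:5]
    conf = float(fields[5]) if len(fields) == 6 else None
    return CtmToken(rec, chan, float(start), float(end), token, conf)


# Two lines copied from the documented example (no confidence scores).
sample = """\
test.wav 1 1.0 1.0 a
test.wav 1 9.0 1.0 <unk>
"""
tokens = [parse_ctm_line(l) for l in sample.splitlines() if l.strip()]
for t in tokens:
    print(t.token, t.start_ts)
```

Rejecting lines with any other field count keeps malformed input from being silently misread; a more lenient reader could skip such lines instead.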
diff --git a/src/version.h b/src/version.h
@@ -1,5 +1,5 @@
 #pragma once
 
 #define FSTALIGNER_VERSION_MAJOR 1
-#define FSTALIGNER_VERSION_MINOR 0
+#define FSTALIGNER_VERSION_MINOR 1
 #define FSTALIGNER_VERSION_PATCH 0