Adjusting documentation and bumping version
qmac committed Apr 30, 2021
1 parent 2b0b127 commit 702b13e
Showing 4 changed files with 165 additions and 167 deletions.
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -45,7 +45,7 @@ if(DYNAMIC_OPENFST)
)
else()
set(OPENFST_LIBRARIES
-${OPENFST_ROOT}/lib/libfst.a
+${OPENFST_ROOT}/lib/libfst.a -ldl
)
endif()

169 changes: 4 additions & 165 deletions README.md
@@ -10,18 +10,16 @@
- [Quickstart](#Quickstart)
* [WER Subcommand](#WER-Subcommand)
* [Align Subcommand](#Align-Subcommand)
- [Inputs](#Inputs)
- [Outputs](#Outputs)
- [Advanced Usage](#Advanced-Usage)

## Overview
`fstalign` is a tool for creating an alignment between two sequences of tokens (hereafter referred to as “reference” and “hypothesis”). It has two key functions: computing word error rate (WER) and aligning [NLP-formatted](https://github.com/revdotcom/fstalign/blob/develop/docs/NLP-Format.md) references with CTM hypotheses.

Due to its use of OpenFST and lazy algorithms for text-based edit-distance alignment, `fstalign` is one of the fastest and most efficient tools for calculating WER. Furthermore, the tool offers additional features to augment error analysis, which will be covered more in depth below.
Due to its use of OpenFST and lazy algorithms for text-based alignment, `fstalign` is efficient for calculating WER while also providing significant flexibility for different measurement features and error analysis.

## Installation

### Dependencies

We use git submodules to manage third-party dependencies. Initialize and update submodules before proceeding to the main build steps.
```
git submodule update --init --recursive
@@ -39,7 +37,6 @@ Additionally, we have dependencies outside of the third-party submodules:
- OpenFST - currently provided to the build system by setting the `$OPENFST_ROOT` environment variable or by passing `-DOPENFST_ROOT` to the CMake command.

### Build

The current build framework is CMake. Install CMake following the instructions here (https://cmake.org/install/).

To build fstalign, run:
@@ -119,170 +116,12 @@ When run, fstalign will dump a log to STDOUT with summary WER information at the
[+++] [20:37:10] [wer] best WER: 2/5 = 0.4000 (Total words in reference: 5)
[+++] [20:37:10] [wer] best WER: INS:0 DEL:0 SUB:2
[+++] [20:37:10] [wer] best WER: Precision:0.600000 Recall:0.600000
[+++] [20:37:10] [console] done
```
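
As a rough illustration of how a figure such as `2/5 = 0.4000` arises, the sketch below computes WER with a plain dynamic-programming edit distance. This is a textbook stand-in, not fstalign's FST-based algorithm, and the token sequences are made up for the example.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis: i deletions
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(ref, hyp):
    """Errors divided by the number of reference words."""
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical token sequences with two substitutions.
ref = "a b c d e".split()
hyp = "a x c y e".split()
print(f"WER: {edit_distance(ref, hyp)}/{len(ref)} = {word_error_rate(ref, hyp):.4f}")
```

The denominator is the number of reference words, which matches the `Total words in reference` value shown in the log.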

Note that in addition to general WER, the insertion/deletion/substitution breakdown is also printed. fstalign also has other useful outputs, including a JSON log for downstream machine parsing, and a side-by-side view of the alignment and errors generated. For more details, see the [Outputs](#Outputs) section below.

Many of fstalign's advanced features come from providing [NLP file inputs](#NLP) as references. Some of these features include:
- Entity category WER and normalization: based on labels in the NLP file, entities are grouped into classes in the WER output
- For example: if the NLP has `2020|0||||CA|['0:YEAR']|` you will see
```s
[+++] [22:36:50] [approach1] class YEAR         WER: 0/8 = 0.0000
```

- Another useful feature here is normalization, which allows tokens with entity labels to have multiple normalizations accepted as correct by fstalign. This functionality is enabled when the tool is invoked with `--ref-json <path_to_norm_sidecar>` (passed in addition to `--ref`). This allows something like `2020` to be treated as equivalent to `twenty twenty`. The format of this file is described in the [Inputs](#Inputs) section below. Note that only reference-side normalization is currently supported.

- Speaker-wise WER: since the NLP file contains a speaker column, fstalign logs and output will provide a breakdown of WER by speaker ID if non-null

- Speaker-switch WER: similarly, fstalign will report the error rate of words around a speaker switch
- The window size for the context of a speaker switch can be adjusted with the `--speaker-switch-context <int>` flag. By default this is set to 5.


### Align Subcommand
Usage of the `align` subcommand is almost identical to the `wer` subcommand. The exception is that `align` can only be run if the provided reference is an NLP and the provided hypothesis is a CTM. This is because the core function of the subcommand is to align an NLP without timestamps to a CTM that has timestamps, producing an output of tokens from the reference with timings from the hypothesis.

## Inputs
### CTM
Time-marked conversation (CTM) files are a typical output of ASR systems. fstalign assumes one token per line, with the following space-separated fields:
```
<recording_id> <channel_id> <token_start_ts> <token_end_ts> <token_value>
```
An optional sixth field, `<confidence_score>`, is read in if provided. It does not affect the WER calculation and exists primarily to support parsing this common extension of the basic CTM format.

Example (no confidence scores):
```
test.wav 1 1.0 1.0 a
test.wav 1 3.0 1.0 b
test.wav 1 5.0 1.0 c
test.wav 1 7.0 1.0 d
test.wav 1 9.0 1.0 <unk>
test.wav 1 11.0 1.0 e
test.wav 1 13.0 1.0 f
test.wav 1 15.0 1.0 g
```

### NLP
[NLP Format](https://github.com/revdotcom/fstalign/blob/develop/docs/NLP-Format.md)

### FST
OpenFST FST files can only be passed to the `--hyp` parameter. fstalign will directly use this FST as the hypothesis during alignment. This is useful for something like oracle lattice analysis, where the reference is aligned to the most accurate path present in a lattice.

### Synonyms
Synonyms allow reference words to be treated as equivalent to user-defined alternative forms when counting errors. They are accepted for any input format and are passed to the tool via the `--syn <path_to_synonym_file>` flag. For details see [Synonyms Format](https://github.com/revdotcom/fstalign/blob/develop/docs/Synonyms-Format.md). A standard set of synonyms we use at Rev.ai is available in the repository under `sample_data/synonyms.rules.txt`.

### Normalizations
Normalizations are a similar concept to synonyms. They allow a token or group of tokens to be represented by alternatives when calculating the WER alignment. Unlike synonyms, they are only accepted for NLP file inputs where the tokens are tagged with a unique ID. The normalizations are specified in JSON, keyed by the unique ID. The following example illustrates the schema:
```
{
"0": {
"candidates": [
{
"probability": 0.5, // Optional and currently unused field
"verbalization": [
"twenty",
"twenty"
]
},
{
"probability": 0.5,
"verbalization": [
"two",
"thousand",
"and",
"twenty"
]
}
],
"class": "YEAR"
}
}
```

## Outputs

### Text Log
CLI flag: `--log`

Saves stdout messages to a log file.

### SBS
CLI flag: `--output-sbs`

Writes a side-by-side alignment of the reference and hypothesis to a file. Useful for debugging and error analysis.

Example:
```
ref_token hyp_token IsErr Class
i i
was was
just just
going going
to to
say say
one one
thing thing
<ins> me ERR
i'm i ERR ___0_CONTRACTION___
really really
appreciating appreciated ERR
```

In this example, "i'm" was labeled as `___0_CONTRACTION___` in the reference, so the error is counted when computing the class-specific WER for `CONTRACTION` entities.

### JSON Log
CLI flag: `--json-log`

Writes all WER statistics and precision/recall information to a machine-parseable JSON file.

Schema: [json_log_schema.json](https://github.com/revdotcom/fstalign/blob/develop/docs/json_log_schema.json)

Example snippet:
```
{
"wer" :
{
"bestWER" :
{
"deletions" : 93,
"insertions" : 47,
"meta" : {},
"numErrors" : 228,
"numWordsInReference" : 1312,
"precision" : 0.89336490631103516,
"recall" : 0.86204266548156738,
"substitutions" : 88,
"wer" : 0.17378048598766327
},
"classWER" :
{
"CARDINAL" :
{
"deletions" : 0,
"insertions" : 0,
"meta" : {},
"numErrors" : 0,
"numWordsInReference" : 7,
"substitutions" : 0,
"wer" : 0.0
}
},
"bigrams" :
{
"amount of" :
{
"correct" : 0,
"deletions" : 1,
"insertions" : 0,
"precision" : 0.0,
"recall" : 0.0,
"substitutions" : 0
},
```
The “bigrams” and “unigrams” fields are only populated with unigrams and bigrams that surpass the minimum frequency specified by the `--pr_threshold` flag, which is set to 0 by default.

### NLP

CLI flag: `--output-nlp`

Writes out the reference [NLP](https://github.com/revdotcom/fstalign/blob/develop/docs/NLP-Format.md), but with timings provided by a hypothesis CTM. Mostly relevant for the `align` subcommand.
## Advanced Usage
See [the advanced usage doc](https://github.com/revdotcom/fstalign/blob/develop/docs/Advanced-Usage.md) for more details.
159 changes: 159 additions & 0 deletions docs/Advanced-Usage.md
@@ -0,0 +1,159 @@
## Advanced Usage
Many of fstalign's advanced features come from providing [NLP file inputs](#NLP) as references. Some of these features include:
- Entity category WER and normalization: based on labels in the NLP file, entities are grouped into classes in the WER output
- For example: if the NLP has `2020|0||||CA|['0:YEAR']|` you will see
```s
[+++] [22:36:50] [approach1] class YEAR         WER: 0/8 = 0.0000
```

- Another useful feature here is normalization, which allows tokens with entity labels to have multiple normalizations accepted as correct by fstalign. This functionality is enabled when the tool is invoked with `--ref-json <path_to_norm_sidecar>` (passed in addition to `--ref`). This allows something like `2020` to be treated as equivalent to `twenty twenty`. The format of this file is described in the [Inputs](#Inputs) section below. Note that only reference-side normalization is currently supported.

- Speaker-wise WER: since the NLP file contains a speaker column, fstalign logs and output will provide a breakdown of WER by speaker ID if non-null

- Speaker-switch WER: similarly, fstalign will report the error rate of words around a speaker switch
- The window size for the context of a speaker switch can be adjusted with the `--speaker-switch-context <int>` flag. By default this is set to 5.
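
To make the idea of a speaker-switch context window concrete, here is a minimal sketch that selects the token positions near a speaker change. The exact window definition used by fstalign is not stated here, so the symmetric window below (the `context` tokens before the switch and the `context` tokens from the switch onward) is an assumption, and `speaker_switch_indices` is a hypothetical helper, not part of the tool.

```python
def speaker_switch_indices(speakers, context=5):
    """Return sorted token indices lying within `context` tokens of a speaker change.

    `speakers` holds one speaker ID per reference token. The symmetric
    window is an assumption made for illustration only.
    """
    selected = set()
    for i in range(1, len(speakers)):
        if speakers[i] != speakers[i - 1]:  # speaker changes at token i
            lo = max(0, i - context)
            hi = min(len(speakers), i + context)
            selected.update(range(lo, hi))
    return sorted(selected)


# Hypothetical speaker column: speaker "A" for three tokens, then "B".
speakers = ["A", "A", "A", "B", "B", "B"]
print(speaker_switch_indices(speakers, context=1))
```

A WER restricted to these positions would then count only the errors aligned to the selected reference tokens.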

## Inputs
### CTM
Time-marked conversation (CTM) files are a typical output of ASR systems. fstalign assumes one token per line, with the following space-separated fields:
```
<recording_id> <channel_id> <token_start_ts> <token_end_ts> <token_value>
```
An optional sixth field, `<confidence_score>`, is read in if provided. It does not affect the WER calculation and exists primarily to support parsing this common extension of the basic CTM format.

Example (no confidence scores):
```
test.wav 1 1.0 1.0 a
test.wav 1 3.0 1.0 b
test.wav 1 5.0 1.0 c
test.wav 1 7.0 1.0 d
test.wav 1 9.0 1.0 <unk>
test.wav 1 11.0 1.0 e
test.wav 1 13.0 1.0 f
test.wav 1 15.0 1.0 g
```
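
For readers who want to inspect CTM files outside of fstalign, the following sketch parses the five-field layout described above, plus the optional confidence score. The `CtmToken` class and function names are hypothetical, the field names simply mirror the format line, and the third sample row with a confidence value is invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CtmToken:
    recording_id: str
    channel_id: str
    start_ts: float
    end_ts: float  # named after <token_end_ts> in the format line
    value: str
    confidence: Optional[float] = None


def parse_ctm_line(line: str) -> CtmToken:
    """Parse one space-separated CTM line with 5 or 6 fields."""
    fields = line.split()
    if len(fields) not in (5, 6):
        raise ValueError(f"expected 5 or 6 fields, got {len(fields)}: {line!r}")
    rec, chan, start, end, value = fields[:5]
    conf = float(fields[5]) if len(fields) == 6 else None
    return CtmToken(rec, chan, float(start), float(end), value, conf)


def parse_ctm(text: str) -> List[CtmToken]:
    """Parse every non-blank line of a CTM document."""
    return [parse_ctm_line(line) for line in text.splitlines() if line.strip()]


sample = """test.wav 1 1.0 1.0 a
test.wav 1 9.0 1.0 <unk>
test.wav 1 11.0 1.0 e 0.87
"""
tokens = parse_ctm(sample)
print([t.value for t in tokens])
```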

### NLP
[NLP Format](https://github.com/revdotcom/fstalign/blob/develop/docs/NLP-Format.md)

### FST
OpenFST FST files can only be passed to the `--hyp` parameter. fstalign will directly use this FST as the hypothesis during alignment. This is useful for something like oracle lattice analysis, where the reference is aligned to the most accurate path present in a lattice.

### Synonyms
Synonyms allow reference words to be treated as equivalent to user-defined alternative forms when counting errors. They are accepted for any input format and are passed to the tool via the `--syn <path_to_synonym_file>` flag. For details see [Synonyms Format](https://github.com/revdotcom/fstalign/blob/develop/docs/Synonyms-Format.md). A standard set of synonyms we use at Rev.ai is available in the repository under `sample_data/synonyms.rules.txt`.

### Normalizations
Normalizations are a similar concept to synonyms. They allow a token or group of tokens to be represented by alternatives when calculating the WER alignment. Unlike synonyms, they are only accepted for NLP file inputs where the tokens are tagged with a unique ID. The normalizations are specified in JSON, keyed by the unique ID. The following example illustrates the schema:
```
{
"0": {
"candidates": [
{
"probability": 0.5, // Optional and currently unused field
"verbalization": [
"twenty",
"twenty"
]
},
{
"probability": 0.5,
"verbalization": [
"two",
"thousand",
"and",
"twenty"
]
}
],
"class": "YEAR"
}
}
```
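
As a quick way to check what a sidecar file would accept for a given entity, the sketch below loads the schema shown above and lists each candidate as a space-joined string. The helper name `accepted_verbalizations` is hypothetical. Python's `json` module rejects `//` comments, so the sample here omits the inline comment from the illustration above.

```python
import json
from typing import Dict, List

sidecar_text = """
{
  "0": {
    "candidates": [
      {"probability": 0.5, "verbalization": ["twenty", "twenty"]},
      {"probability": 0.5, "verbalization": ["two", "thousand", "and", "twenty"]}
    ],
    "class": "YEAR"
  }
}
"""


def accepted_verbalizations(sidecar: Dict[str, dict], entity_id: str) -> List[str]:
    """Return each candidate verbalization for an entity as one string."""
    entry = sidecar[entity_id]
    return [" ".join(cand["verbalization"]) for cand in entry["candidates"]]


sidecar = json.loads(sidecar_text)
print(sidecar["0"]["class"], accepted_verbalizations(sidecar, "0"))
```

The key `"0"` corresponds to the ID in an NLP label such as `['0:YEAR']`.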

## Outputs

### Text Log
CLI flag: `--log`

Saves stdout messages to a log file.

### SBS
CLI flag: `--output-sbs`

Writes a side-by-side alignment of the reference and hypothesis to a file. Useful for debugging and error analysis.

Example:
```
ref_token hyp_token IsErr Class
i i
was was
just just
going going
to to
say say
one one
thing thing
<ins> me ERR
i'm i ERR ___0_CONTRACTION___
really really
appreciating appreciated ERR
```

In this example, "i'm" was labeled as `___0_CONTRACTION___` in the reference, so the error is counted when computing the class-specific WER for `CONTRACTION` entities.
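
A side-by-side file like the one above can be summarized with a few lines of scripting. The sketch below assumes whitespace-separated columns, a single header row, tokens without internal whitespace, and class labels of the form `___<id>_<CLASS>___`; the real file layout may differ, and the function names are hypothetical.

```python
import re
from collections import Counter

CLASS_LABEL = re.compile(r"___(\d+)_(.+?)___")


def parse_sbs_row(line):
    """Split one SBS row into (ref, hyp, is_err, class_names)."""
    fields = line.split()
    ref, hyp, rest = fields[0], fields[1], fields[2:]
    is_err = "ERR" in rest
    classes = [m.group(2) for m in map(CLASS_LABEL.fullmatch, rest) if m]
    return ref, hyp, is_err, classes


def errors_by_class(text):
    """Count total errors and errors per entity class, skipping the header."""
    total, per_class = 0, Counter()
    for line in text.splitlines()[1:]:
        if not line.strip():
            continue
        _, _, is_err, classes = parse_sbs_row(line)
        if is_err:
            total += 1
            per_class.update(classes)
    return total, per_class


sample = """ref_token hyp_token IsErr Class
i i
<ins> me ERR
i'm i ERR ___0_CONTRACTION___
appreciating appreciated ERR
"""
total, per_class = errors_by_class(sample)
print(total, dict(per_class))
```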

### JSON Log
CLI flag: `--json-log`

Writes all WER statistics and precision/recall information to a machine-parseable JSON file.

Schema: [json_log_schema.json](https://github.com/revdotcom/fstalign/blob/develop/docs/json_log_schema.json)

Example snippet:
```
{
"wer" :
{
"bestWER" :
{
"deletions" : 93,
"insertions" : 47,
"meta" : {},
"numErrors" : 228,
"numWordsInReference" : 1312,
"precision" : 0.89336490631103516,
"recall" : 0.86204266548156738,
"substitutions" : 88,
"wer" : 0.17378048598766327
},
"classWER" :
{
"CARDINAL" :
{
"deletions" : 0,
"insertions" : 0,
"meta" : {},
"numErrors" : 0,
"numWordsInReference" : 7,
"substitutions" : 0,
"wer" : 0.0
}
},
"bigrams" :
{
"amount of" :
{
"correct" : 0,
"deletions" : 1,
"insertions" : 0,
"precision" : 0.0,
"recall" : 0.0,
"substitutions" : 0
},
```
The “bigrams” and “unigrams” fields are only populated with unigrams and bigrams that surpass the minimum frequency specified by the `--pr_threshold` flag, which is set to 0 by default.
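
The counts in the JSON log are enough to rederive the summary rates, which can be a useful sanity check in downstream tooling. The sketch below uses a trimmed copy of the `bestWER` object from the example snippet. The precision and recall formulas are an assumption (correct words over hypothesis words and over reference words, respectively) that happens to reproduce the logged figures for this example.

```python
import json

log_text = """
{"wer": {"bestWER": {"deletions": 93, "insertions": 47, "numErrors": 228,
                     "numWordsInReference": 1312, "substitutions": 88,
                     "precision": 0.89336490631103516,
                     "recall": 0.86204266548156738,
                     "wer": 0.17378048598766327}}}
"""


def recompute(stats):
    """Rederive summary rates from the raw error counts."""
    d, i, s = stats["deletions"], stats["insertions"], stats["substitutions"]
    n = stats["numWordsInReference"]
    correct = n - d - s  # reference words matched exactly
    return {
        "numErrors": d + i + s,
        "wer": (d + i + s) / n,
        "precision": correct / (correct + s + i),  # assumed definition
        "recall": correct / n,                     # assumed definition
    }


best = json.loads(log_text)["wer"]["bestWER"]
derived = recompute(best)
for key, value in derived.items():
    print(f"{key}: logged={best[key]} recomputed={value:.6f}")
```

The logged values differ from the recomputed ones only in the trailing digits.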

### NLP

CLI flag: `--output-nlp`

Writes out the reference [NLP](https://github.com/revdotcom/fstalign/blob/develop/docs/NLP-Format.md), but with timings provided by a hypothesis CTM. Mostly relevant for the `align` subcommand.
2 changes: 1 addition & 1 deletion src/version.h
@@ -1,5 +1,5 @@
#pragma once

#define FSTALIGNER_VERSION_MAJOR 1
-#define FSTALIGNER_VERSION_MINOR 0
+#define FSTALIGNER_VERSION_MINOR 1
#define FSTALIGNER_VERSION_PATCH 0
