Modify kenlm dependency for pypi compatibility (#2)
gkucsko authored Jun 12, 2021
1 parent 60bbb29 commit 24b61ba
Showing 7 changed files with 44 additions and 32 deletions.
4 changes: 1 addition & 3 deletions .coveragerc
@@ -1,8 +1,6 @@
[run]
omit = tests/*
dynamic_context = test_function
omit =
# No coverage for tests
pyctcdecode/tests/*

[report]
# Regexes for lines to exclude from consideration
2 changes: 2 additions & 0 deletions .github/workflows/tests_and_lint.yml
@@ -29,6 +29,7 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install https://github.com/kpu/kenlm/archive/master.zip
pip install -e .[dev]
- name: Run lint checks
run: |
@@ -47,6 +48,7 @@
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install https://github.com/kpu/kenlm/archive/master.zip
pip install -e .[dev]
- name: Test with pytest
run: |
36 changes: 21 additions & 15 deletions README.md
@@ -1,10 +1,12 @@
<a href="http://www.repostatus.org/#active"><img src="http://www.repostatus.org/badges/latest/active.svg" /></a>
<a href="https://github.com/kensho-technologies/pyctcdecode/actions?query=workflow%3A%22Tests+and+lint%22"><img src="https://github.com/kensho-technologies/pyctcdecode/workflows/Tests%20and%20lint/badge.svg" /></a>
<a href="https://codecov.io/gh/kensho-technologies/pyctcdecode"><img src="https://codecov.io/gh/kensho-technologies/pyctcdecode/branch/main/graph/badge.svg" /></a>
<a href="https://opensource.org/licenses/Apache-2.0"><img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" /></a>
<a href="http://www.repostatus.org/#active"><img src="http://www.repostatus.org/badges/latest/active.svg" /></a>
<a href="https://github.com/psf/black"><img src="https://img.shields.io/badge/code%20style-black-000000.svg" /></a>

## pyctcdecode

A fast and feature-rich CTC beam search decoder for speech recognition written in Python, offering n-gram (kenlm) language model support similar to DeepSpeech, but incorporating many new features such as byte pair encoding to support modern architectures like Nvidia's [Conformer-CTC](tutorials/01_pipeline_nemo.ipynb) or Facebooks's [Wav2Vec2](tutorials/02_asr_huggingface.ipynb).
A fast and feature-rich CTC beam search decoder for speech recognition written in Python, providing n-gram (kenlm) language model support similar to PaddlePaddle's decoder, but incorporating many new features such as byte pair encoding and real-time decoding to support models like Nvidia's [Conformer-CTC](tutorials/01_pipeline_nemo.ipynb) or Facebook's [Wav2Vec2](tutorials/02_asr_huggingface.ipynb).

``` bash
pip install pyctcdecode
@@ -15,10 +17,10 @@ pip install pyctcdecode
- 🔥 hotword boosting
- 🤖 handling of BPE vocabulary
- 👥 multi-LM support for 2+ models
- 🕒 stateful LM for realtime decoding
- 🕒 stateful LM for real-time decoding
- ✨ native frame index annotation of words
- 💨 fast runtime, comparable to C++ implementation
- 🐍 easy to modify Python code
- 🐍 easy-to-modify Python code

### Quick Start:

Expand All @@ -45,7 +47,7 @@ decoder = build_ctcdecoder(
text = decoder.decode(logits)
```

if the vocabulary is BPE based, adjust the labels and set the `is_bpe` flag (merging of tokens for the LM is handled automatically):
If the vocabulary is BPE based, adjust the labels and set the `is_bpe` flag (merging of tokens for the LM is handled automatically):

``` python
labels = ["<unk>", "▁bug", "s", "▁bunny"]
@@ -58,14 +60,18 @@ decoder = build_ctcdecoder(
text = decoder.decode(logits)
```

improve domain specificity by adding hotwords during inference:
Improve domain specificity by adding important contextual words ("hotwords") during inference:

``` python
hotwords = ["looney tunes", "anthropomorphic"]
text = decoder.decode(logits, hotwords=hotwords)
text = decoder.decode(
logits,
hotwords=hotwords,
hotwords_weight=10.0,
)
```

batch support via multiprocessing:
Batch support via multiprocessing:

``` python
from multiprocessing import Pool
@@ -74,7 +80,7 @@ with Pool() as pool:
text_list = decoder.decode_batch(logits_list, pool)
```

use `pyctcdecode` for a production Conformer-CTC model:
Use `pyctcdecode` for a pretrained Conformer-CTC model:

``` python
import nemo.collections.asr as nemo_asr
@@ -88,25 +94,25 @@ decoder = build_ctcdecoder(asr_model.decoder.vocabulary, is_bpe=True)
decoder.decode(logits)
```

The tutorials folder contains many well documented notebook examples on how to run speech recognition from scratch using pretrained models from Nvidia's [NeMo](https://github.com/NVIDIA/NeMo) and Huggingface/Facebook's [Wav2Vec2](https://huggingface.co/transformers/model_doc/wav2vec2.html).
The tutorials folder contains many well documented notebook examples on how to run speech recognition using pretrained models from Nvidia's [NeMo](https://github.com/NVIDIA/NeMo) and Huggingface/Facebook's [Wav2Vec2](https://huggingface.co/transformers/model_doc/wav2vec2.html).

For more details on how to use all of pyctcdecode's features, have a look at our [main tutorial](tutorials/00_basic_usage.ipynb).

### Why pyctcdecode?

The flexibility of using Python allows us to implement various new features while keeping runtime competitive through little tricks like caching and beam pruning. When comparing pyctcdecode's runtime and accuracy to a standard C++ decoders, we see favorable trade offs between speed and accuracy, see code [here](tutorials/03_eval_performance.ipynb).
In scientific computing, there’s often a tension between a language’s performance and its ease of use for prototyping and experimentation. Although C++ is the conventional choice for CTC decoders, we decided to try building one in Python. This choice allowed us to easily implement experimental features, while keeping runtime competitive through optimizations like caching and beam pruning. We compare the performance of `pyctcdecode` to an industry standard C++ decoder at various beam widths (shown as inline annotations), allowing us to visualize the trade-off of word error rate (y-axis) vs runtime (x-axis). For beam widths of 10 or greater, pyctcdecode yields strictly superior performance, with lower error rates in less time, see code [here](tutorials/03_eval_performance.ipynb).

<p align="center"><img src="docs/images/performance.png"></p>

Python also allows us to do nifty things like hotword support (at inference time) with only a few lines of code.
The use of Python allows us to easily implement features like hotword support with only a few lines of code.

<p align="center"><img width="800px" src="docs/images/hotwords.png"></p>

The full beam results contain the language model state to enable real time inference as well as word based logit indices (frames) to calculate timing and confidence scores of individual words natively through the decoding process.
`pyctcdecode` can return either a single transcript, or the full results of the beam search algorithm. The latter provides the language model state to enable real-time inference as well as word-based logit indices (frames) to enable word-based timing and confidence score calculations natively through the decoding process.

<p align="center"><img width="450px" src="docs/images/beam_output.png"></p>

Additional features such as BPE vocabulary as well as examples of pyctcdecode as part of a full speech recognition pipeline can be found in the [tutorials section](tutorials).
Additional features such as BPE vocabulary, as well as examples of `pyctcdecode` as part of a full speech recognition pipeline, can be found in the [tutorials section](tutorials).

### Resources:

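The "full beam results" paragraph in the README changes above describes `decode_beams`, the variant of `decode` that returns the LM state and word frame indices. A toy, hedged sketch of its use; the labels, logits, and output tuple layout are assumed for illustration, not taken from this commit:

``` python
import numpy as np
from pyctcdecode import build_ctcdecoder

# Toy labels; a real model supplies its own vocabulary.
labels = ["", " ", "a", "b", "n", "u", "y"]
decoder = build_ctcdecoder(labels)

# Random log-probabilities standing in for acoustic model output
# (shape: time steps x vocabulary size).
logits = np.log(np.random.dirichlet(np.ones(len(labels)), size=30))

beams = decoder.decode_beams(logits)
text = beams[0][0]  # best transcript; the remaining fields carry the LM state
                    # and per-word frame indices used for timing and confidence
```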
10 changes: 7 additions & 3 deletions pyctcdecode/decoder.py
@@ -25,10 +25,16 @@
from .language_model import AbstractLanguageModel, HotwordScorer, LanguageModel


logger = logging.getLogger(__name__)


try:
import kenlm # type: ignore
except ImportError:
pass
logger.warning(
"kenlm python bindings are not installed. Most likely you want to install it using: "
"pip install https://github.com/kpu/kenlm/archive/master.zip"
)


# type hints
@@ -53,8 +59,6 @@
NULL_FRAMES: Frames = (-1, -1) # placeholder that gets replaced with positive integer frame indices
EMPTY_START_BEAM: Beam = ("", "", "", None, [], NULL_FRAMES, 0.0)

logger = logging.getLogger(__name__)


def _normalize_whitespace(text: str) -> str:
"""Efficiently normalize whitespace."""
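The change above swaps a silent `pass` on ImportError for a logged installation hint. A minimal standalone sketch of the same pattern; the `kenlm = None` fallback and the `score_sentence` helper are illustrative additions, not part of this commit:

``` python
import logging

logger = logging.getLogger(__name__)

try:
    import kenlm  # type: ignore
except ImportError:
    kenlm = None  # variant: the commit itself only warns and leaves the name unset
    logger.warning(
        "kenlm python bindings are not installed. Most likely you want to install it using: "
        "pip install https://github.com/kpu/kenlm/archive/master.zip"
    )


def score_sentence(model_path: str, sentence: str) -> float:
    """Hypothetical helper: fail fast with a clear error if kenlm is absent."""
    if kenlm is None:
        raise ImportError("kenlm is required for language model scoring")
    return kenlm.Model(model_path).score(sentence)
```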
9 changes: 8 additions & 1 deletion pyctcdecode/language_model.py
@@ -2,6 +2,7 @@
from __future__ import division

import abc
import logging
import re
from typing import Iterable, List, Optional, Pattern, Tuple

@@ -19,10 +20,16 @@
)


logger = logging.getLogger(__name__)


try:
import kenlm # type: ignore
except ImportError:
pass
logger.warning(
"kenlm python bindings are not installed. Most likely you want to install it using: "
"pip install https://github.com/kpu/kenlm/archive/master.zip"
)


def _get_empty_lm_state() -> kenlm.State:
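The tail of this hunk shows `_get_empty_lm_state() -> kenlm.State`, pointing at the stateful scoring API this module builds on. As background, a short sketch of that kenlm API; the model path is a placeholder and the bindings are assumed to be installed:

``` python
import kenlm

model = kenlm.Model("path/to/model.arpa")  # placeholder path

# Stateful scoring: score words one at a time, carrying state between calls,
# which is what enables real-time decoding in pyctcdecode.
state_in, state_out = kenlm.State(), kenlm.State()
model.BeginSentenceWrite(state_in)  # seed with start-of-sentence context
log10_prob = model.BaseScore(state_in, "bugs", state_out)
state_in, state_out = state_out, kenlm.State()
log10_prob += model.BaseScore(state_in, "bunny", state_out)
```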
5 changes: 0 additions & 5 deletions setup.py
@@ -7,10 +7,6 @@
from setuptools import find_packages, setup # type: ignore


# https://packaging.python.org/guides/single-sourcing-package-version/
# #single-sourcing-the-version


logger = logging.getLogger(__name__)


@@ -57,7 +53,6 @@ def find_long_description():
"codecov",
"flake8",
"jupyter",
"kenlm@https://github.com/kpu/kenlm/archive/master.zip",
"mypy",
"nbconvert",
"nbformat",
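The removed `kenlm@https://...` line is a PEP 508 direct-URL requirement, which PyPI rejects at upload time; that is the pypi compatibility problem this commit works around by installing kenlm separately in CI and warning at import time. A minimal sketch of the resulting extras, with the package metadata abbreviated and assumed:

``` python
from setuptools import find_packages, setup

# Sketch, not the full setup.py: after this change the "dev" extra contains
# no direct-URL dependency, so the sdist/wheel can be uploaded to PyPI.
setup(
    name="pyctcdecode",
    version="0.0.0",  # placeholder
    packages=find_packages(),
    extras_require={
        "dev": ["codecov", "flake8", "jupyter", "mypy", "nbconvert", "nbformat"],
    },
)
```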
10 changes: 5 additions & 5 deletions tutorials/02_pipeline_huggingface.ipynb
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## How to use pyctcdecode when working with a Hugginface model"
"## How to use pyctcdecode when working with a Huggingface model"
]
},
{
@@ -74,9 +74,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The vocabulary is in a slighly unconventional shape so we will replace `\"<pad>\"` with `\"\"` and `\"|\"` with `\" \"` as well as the other special tokens (which are essentially unused)\n",
"The vocabulary is in a slightly unconventional shape so we will replace `\"<pad>\"` with `\"\"` and `\"|\"` with `\" \"` as well as the other special tokens (which are essentially unused)\n",
"\n",
"We need to standaradize the special tokens and then specifically pass which index is the ctc blank token index (since it's not the last). For that reason we have to manually build the Alphabet and the decoder instead of using the convenience wrapper `build_ctcdecoder`."
"We need to standardize the special tokens and then specifically pass which index is the ctc blank token index (since it's not the last). For that reason we have to manually build the Alphabet and the decoder instead of using the convenience wrapper `build_ctcdecoder`."
]
},
{
@@ -108,8 +108,8 @@
"vocab_list[3] = \"\"\n",
"# convert space character representation\n",
"vocab_list[4] = \" \"\n",
"# specify ctc blank char index, since conventially it is the last entry of the logit matrix\n",
"alphabet = Alphabet.build_bpe_alphabet(vocab_list, ctc_token_idx=0)\n",
"# specify ctc blank char index, since conventionally it is the last entry of the logit matrix\n",
"alphabet = Alphabet.build_alphabet(vocab_list, ctc_token_idx=0)\n",
"\n",
"# build the decoder and decode the logits\n",
"decoder = BeamSearchDecoderCTC(alphabet)\n",
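As a self-contained illustration of the corrected cell (`Alphabet.build_alphabet` with an explicit `ctc_token_idx`), a toy decode; the vocabulary, logits, and top-level import paths are assumed:

``` python
import numpy as np
from pyctcdecode import Alphabet, BeamSearchDecoderCTC  # import paths assumed

# Toy vocabulary with the CTC blank at index 0, mirroring ctc_token_idx=0
# in the notebook cell above.
vocab_list = ["", " ", "a", "b", "n", "u", "y"]
alphabet = Alphabet.build_alphabet(vocab_list, ctc_token_idx=0)
decoder = BeamSearchDecoderCTC(alphabet)

# Random log-probabilities standing in for the model's logit matrix.
logits = np.log(np.random.dirichlet(np.ones(len(vocab_list)), size=20))
text = decoder.decode(logits)
```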
