Awesome Disfluency Detection

A curated list of awesome disfluency detection publications along with their released code (if available) and bibliography. A chronologically ordered list of the published papers is available here.

Contributing

Please feel free to send me pull requests or email me to add a new resource.

Table of Contents

Papers

Studies on disfluency detection are categorized as follows (some papers belong to more than one category):

Noisy Channel Models

The main idea behind a noisy channel model of speech disfluency is that we assume there is a fluent source utterance x to which some noise has been added, resulting in a disfluent utterance y. Given y, the goal is to find the most likely fluent source sentence, i.e. the x that maximizes p(x|y).
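As a short worked step, the maximization above is usually rewritten with Bayes' rule. Since p(y) does not depend on x, it can be dropped from the argmax, leaving a channel model p(y|x) and a language model p(x) over fluent sentences. The symbol \hat{x} for the selected fluent sentence is notation introduced here for illustration:

```latex
\hat{x} = \arg\max_{x} \, p(x \mid y)
        = \arg\max_{x} \, \frac{p(y \mid x)\, p(x)}{p(y)}
        = \arg\max_{x} \, p(y \mid x)\, p(x)
```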

Sequence Tagging Models

The task of disfluency detection is framed as a word token classification problem, where each word token is labeled either as disfluent/fluent or with a begin-inside-outside (BIO) tagging scheme.
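To make the two labeling schemes concrete, here is a minimal sketch that turns disfluent spans into BIO tags and then into binary disfluent/fluent labels. The function names, the half-open span convention, and the `D`/`F` label strings are assumptions made for this illustration, not the format of any particular paper or corpus release.

```python
from typing import List, Tuple


def spans_to_bio(n_tokens: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Convert half-open disfluent spans [start, end) into BIO tags."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        if not 0 <= start < end <= n_tokens:
            raise ValueError(f"invalid span: ({start}, {end})")
        tags[start] = "B"                 # first disfluent token of the span
        for i in range(start + 1, end):
            tags[i] = "I"                 # remaining disfluent tokens
    return tags


def bio_to_binary(tags: List[str]) -> List[str]:
    """Collapse BIO tags into disfluent (D) / fluent (F) labels."""
    return ["F" if tag == "O" else "D" for tag in tags]


def remove_disfluencies(tokens: List[str], tags: List[str]) -> List[str]:
    """Keep only the tokens tagged as fluent (O)."""
    return [tok for tok, tag in zip(tokens, tags) if tag == "O"]


tokens = "I want a flight to Boston uh I mean to Denver".split()
# Hypothetical annotation: "to Boston uh I mean" (tokens 4..8) is disfluent.
tags = spans_to_bio(len(tokens), [(4, 9)])
print(tags)
print(bio_to_binary(tags))
print(" ".join(remove_disfluencies(tokens, tags)))  # I want a flight to Denver
```

A tagger trained on either scheme predicts one label per token; deleting the tokens predicted as disfluent yields the fluent sentence.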

Translation Based Models

Translation-based approaches for disfluency detection are commonly formulated as encoder-decoder systems, where the encoder learns a representation of the input sentence containing disfluencies and the decoder learns to generate the underlying fluent version of the input.

Parsing Based Models

Parsing-based approaches detect disfluencies while simultaneously identifying the syntactic or semantic structure of the sentence. Training a parsing-based model requires large annotated treebanks that contain both disfluencies and syntactic/semantic structures.

Using Acoustic/Prosodic Cues

The speech signal carries information beyond the words themselves, which can provide useful cues for disfluency detection models. Some studies have explored combining acoustic/prosodic cues with lexical features for detecting disfluencies.

Data Augmentation Techniques

Disfluency detection models are usually trained and evaluated on the Switchboard corpus. Switchboard is the largest disfluency-annotated dataset; however, only about 6% of the words in Switchboard are disfluent. Some studies have suggested new data augmentation techniques to mitigate the scarcity of gold disfluency-labeled data.
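One simple way to picture augmentation is to synthesize labeled disfluent sentences from fluent text by inserting fillers and word repetitions. The sketch below is a toy illustration under that assumption and is not the method of any specific paper in this list; the function name, the filler inventory, and the probabilities are all hypothetical.

```python
import random
from typing import List, Tuple

FILLERS = ["uh", "um"]  # assumed filler inventory, for illustration only


def add_synthetic_disfluencies(
    tokens: List[str],
    rng: random.Random,
    p_repeat: float = 0.1,
    p_filler: float = 0.1,
) -> Tuple[List[str], List[int]]:
    """Return (noisy_tokens, labels), where label 1 marks an inserted token."""
    out_tokens: List[str] = []
    out_labels: List[int] = []
    for tok in tokens:
        if rng.random() < p_filler:
            out_tokens.append(rng.choice(FILLERS))  # inserted filler
            out_labels.append(1)
        if rng.random() < p_repeat:
            out_tokens.append(tok)                  # extra copy acts as a repetition
            out_labels.append(1)
        out_tokens.append(tok)                      # original fluent token
        out_labels.append(0)
    return out_tokens, out_labels


rng = random.Random(0)  # fixed seed so the example is reproducible
fluent = "i want a flight to denver".split()
noisy, labels = add_synthetic_disfluencies(fluent, rng, p_repeat=0.3, p_filler=0.3)
print(list(zip(noisy, labels)))
```

Because every inserted token carries label 1, dropping those tokens always recovers the original fluent sentence, so the synthetic pairs can be used as additional training data for a tagger.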

Incremental Disfluency Detection

Most disfluency detection models are developed under the assumptions that the full sequence context and rich transcriptions, including pre-segmentation information, are available. These assumptions, however, do not hold in real-time scenarios where the input to the disfluency detector is a live transcript generated by a streaming ASR model. In such cases, a disfluency detector is expected to label the input transcript incrementally, token by token, as it arrives. Some studies have proposed new incremental disfluency detectors.
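The incremental setting can be sketched as an object that receives one token at a time and returns labels for the prefix seen so far, possibly revising an earlier label once more input arrives. The class below is a hypothetical toy: its exact-repetition rule merely stands in for a learned model, and the `D`/`F` labels are an assumption of this example.

```python
from typing import List


class IncrementalRepetitionTagger:
    """Toy incremental tagger: marks a token as disfluent (D) once the
    next token turns out to be an exact repetition of it."""

    def __init__(self) -> None:
        self.tokens: List[str] = []
        self.labels: List[str] = []

    def push(self, token: str) -> List[str]:
        """Consume one token and return the labels for the current prefix."""
        if self.tokens and self.tokens[-1].lower() == token.lower():
            self.labels[-1] = "D"   # revise the previous label in hindsight
        self.tokens.append(token)
        self.labels.append("F")     # a new token is fluent until shown otherwise
        return list(self.labels)


tagger = IncrementalRepetitionTagger()
for word in "I I want a a flight".split():
    print(word, tagger.push(word))
```

The point of the sketch is the interface rather than the rule: output is produced after every token without waiting for the end of the utterance, and earlier decisions may change as the right context grows.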

E2E Speech Recognition and Disfluency Removal

Most disfluency detectors are applied as an intermediate step between a speech recognizer and a downstream task. Unlike the conventional pipeline models, some studies have explored end-to-end speech recognition and disfluency removal.

E2E Speech Translation and Disfluency Removal

While most end-to-end speech translation studies have explored translating read speech, a few studies examine end-to-end conversational speech translation, where the task is to directly translate disfluent source speech into fluent target text.

Others

Theses

Contact

Paria Jamshid Lou paria.jamshid-lou@hdr.mq.edu.au