A curated list of awesome disfluency detection publications along with their released code (if available) and bibliography. A chronological list of the published papers is available here.
Please feel free to send me pull requests or email me to add a new resource.
Studies on disfluency detection are categorized as follows (some papers belong to more than one category):
The main idea behind a noisy channel model of speech disfluency is that we assume there is a fluent source utterance x to which some noise has been added, resulting in a disfluent utterance y. Given y, the goal is to find the most likely fluent source sentence x, such that p(x|y) is maximized.
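Applying Bayes' rule makes the two components of such a model explicit. The decomposition below is the standard noisy channel formulation; calling the two factors a "channel model" and a "language model" follows common usage and is not tied to any single paper listed here.

```latex
\hat{x} = \arg\max_{x} \, p(x \mid y)
        = \arg\max_{x} \, \frac{p(y \mid x)\, p(x)}{p(y)}
        = \arg\max_{x} \, p(y \mid x)\, p(x)
```

Here p(y|x) is the channel model, which scores how likely the disfluent utterance y is given the fluent source x, and p(x) is a language model over fluent utterances. The denominator p(y) can be dropped because it does not depend on x.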
- Disfluency detection using a noisy channel model and a deep neural language model. Jamshid Lou et al. ACL 2017. [bib]
- The impact of language models and loss functions on repair disfluency detection. Zwarts et al. ACL 2011. [bib]
- An improved model for recognizing disfluencies in conversational speech. Johnson et al. Rich Transcription Workshop 2004. [bib]
- A TAG-based noisy channel model of speech repair. Johnson et al. ACL 2004. [bib]
Disfluency detection is framed as a token classification problem, where each word token is labeled either as disfluent/fluent or with a begin-inside-outside (BIO) tagging scheme.
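As a minimal sketch of the two labeling schemes, the snippet below converts annotated disfluent spans into BIO tags and into binary disfluent/fluent labels. The function names, the span convention (start inclusive, end exclusive), and the label strings `B`/`I`/`O` and `D`/`F` are assumptions made for illustration; the papers in this category use a variety of tag sets.

```python
from typing import List, Tuple


def spans_to_bio(num_tokens: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Convert disfluent spans (start inclusive, end exclusive) to BIO tags."""
    tags = ["O"] * num_tokens
    for start, end in spans:
        if not 0 <= start < end <= num_tokens:
            raise ValueError(f"invalid span ({start}, {end})")
        tags[start] = "B"              # first token of a disfluent span
        for i in range(start + 1, end):
            tags[i] = "I"              # remaining tokens inside the span
    return tags


def bio_to_binary(tags: List[str]) -> List[str]:
    """Collapse BIO tags into disfluent (D) / fluent (F) labels."""
    return ["F" if tag == "O" else "D" for tag in tags]


def remove_disfluencies(tokens: List[str], tags: List[str]) -> List[str]:
    """Keep only the tokens tagged as outside any disfluent span."""
    return [tok for tok, tag in zip(tokens, tags) if tag == "O"]


tokens = "i want a flight to boston uh i mean to denver".split()
# Hypothetical annotation: "to boston uh i mean" (tokens 4..8) is disfluent.
tags = spans_to_bio(len(tokens), [(4, 9)])
print(list(zip(tokens, tags)))
print(" ".join(remove_disfluencies(tokens, tags)))  # i want a flight to denver
```

Deleting the tokens tagged as disfluent yields the fluent version of the utterance, which is how tagging output is typically turned into cleaned-up text.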
- Joint prediction of punctuation and disfluency in speech transcripts. Lin et al. INTERSPEECH 2020. [bib]
- Giving attention to the unexpected: using prosody innovations in disfluency detection. Zayats et al. NAACL 2019. [bib] [code]
- Disfluency detection based on speech-aware token-by-token sequence labeling with BLSTM-CRFs and attention mechanisms. Tanaka et al. APSIPA 2019. [bib]
- Noisy BiLSTM-based models for disfluency detection. Bach et al. INTERSPEECH 2019. [bib]
- Disfluency detection using auto-correlational neural networks. Jamshid Lou et al. EMNLP 2018. [bib] [code]
- Robust cross-domain disfluency detection with pattern match networks. Zayats et al. arXiv 2018. [bib] [code]
- Disfluency detection using a bidirectional LSTM. Zayats et al. INTERSPEECH 2016. [bib]
- Multi-domain disfluency and repair detection. Zayats et al. INTERSPEECH 2014. [bib]
- A sequential repetition model for improved disfluency detection. Ostendorf et al. INTERSPEECH 2013. [bib]
- The role of disfluencies in topic classification of human-human conversations. Liu et al. IEEE Transactions on Speech & Audio Processing 2006. [bib]
- Automatic disfluency identification in conversational speech using multiple knowledge sources. Liu et al. Eurospeech 2003. [bib]
- Automatic punctuation and disfluency detection in multi-party meetings using prosodic and lexical cues. Baron et al. ICSLP 2002. [bib]
Translation-based approaches to disfluency detection are commonly formulated as encoder-decoder systems, where the encoder learns a representation of the input sentence containing disfluencies and the decoder learns to generate the underlying fluent version of the input.
- Adapting translation models for transcript disfluency detection. Dong et al. AAAI 2019. [bib]
- Semi-supervised disfluency detection. Wang et al. COLING 2018. [bib]
- A neural attention model for disfluency detection. Wang et al. COLING 2016. [bib]
Parsing-based approaches detect disfluencies while simultaneously identifying the syntactic or semantic structure of the sentence. Training a parsing-based model requires large annotated treebanks that contain both disfluencies and syntactic/semantic structures.
- Semantic parsing of disfluent speech. Sen et al. EACL 2021.
- Improving disfluency detection by self-training a self-attentive model. Jamshid Lou et al. ACL 2020. [bib] [code]
- Neural constituency parsing of speech transcripts. Jamshid Lou et al. NAACL 2019. [bib] [code]
- On the role of style in parsing speech with neural models. Tran et al. INTERSPEECH 2019. [bib] [code]
- Parsing speech: a neural approach to integrating lexical and acoustic-prosodic information. Tran et al. NAACL 2018. [bib] [code]
- Transition-based disfluency detection using LSTMs. Wang et al. EMNLP 2017. [bib] [code]
- Joint transition-based dependency parsing and disfluency detection for automatic speech recognition texts. Yoshikawa et al. EMNLP 2016. [bib]
- Joint incremental disfluency detection and dependency parsing. Honnibal et al. TACL 2014. [bib]
- Joint parsing and disfluency detection in linear time. Rasooli et al. EMNLP 2013. [bib]
- Edit detection and parsing for transcribed speech. Charniak et al. NAACL 2001. [bib]
The speech signal carries extra information beyond the words, which can provide useful cues for disfluency detection models. Some studies have explored integrating acoustic/prosodic cues with lexical features for detecting disfluencies.
- On the role of style in parsing speech with neural models. Tran et al. INTERSPEECH 2019. [bib] [code]
- Disfluency detection based on speech-aware token-by-token sequence labeling with BLSTM-CRFs and attention mechanisms. Tanaka et al. APSIPA 2019. [bib]
- Giving attention to the unexpected: using prosody innovations in disfluency detection. Zayats et al. NAACL 2019. [bib] [code]
- Parsing speech: a neural approach to integrating lexical and acoustic-prosodic information. Tran et al. NAACL 2018. [bib] [code]
- Automatic disfluency identification in conversational speech using multiple knowledge sources. Liu et al. Eurospeech 2003. [bib]
- Automatic punctuation and disfluency detection in multi-party meetings using prosodic and lexical cues. Baron et al. ICSLP 2002. [bib]
Disfluency detection models are usually trained and evaluated on the Switchboard corpus. Switchboard is the largest disfluency-annotated dataset; however, only about 6% of the words in Switchboard are disfluent. Some studies have suggested new data augmentation techniques to mitigate the scarcity of gold disfluency-labeled data.
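To give a feel for what synthetic augmentation can look like, the sketch below injects repetition disfluencies into a fluent sentence and returns token-level labels. It is a deliberately simple heuristic written for illustration, not the method of any paper listed here; the function name `add_repetitions` and the parameters `rate` and `max_len` are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def add_repetitions(
    tokens: List[str],
    rate: float = 0.2,
    max_len: int = 3,
    seed: Optional[int] = None,
) -> Tuple[List[str], List[int]]:
    """Insert synthetic repetitions; label 1 = disfluent, 0 = fluent."""
    rng = random.Random(seed)  # local generator keeps runs reproducible
    out_tokens: List[str] = []
    out_labels: List[int] = []
    i = 0
    while i < len(tokens):
        if rng.random() < rate:
            # Pick how many upcoming tokens to repeat (at least one).
            n = rng.randint(1, min(max_len, len(tokens) - i))
            chunk = tokens[i:i + n]
            # Reparandum: an inserted copy of the chunk, labelled disfluent.
            out_tokens.extend(chunk)
            out_labels.extend([1] * n)
            # Repair: the original chunk, labelled fluent.
            out_tokens.extend(chunk)
            out_labels.extend([0] * n)
            i += n
        else:
            out_tokens.append(tokens[i])
            out_labels.append(0)
            i += 1
    return out_tokens, out_labels


fluent = "i want a flight to denver".split()
noisy, labels = add_repetitions(fluent, rate=0.3, seed=0)
print(list(zip(noisy, labels)))
```

By construction, deleting every token labelled 1 recovers the original fluent sentence, so each generated pair can serve as a labelled training example. Real disfluencies are far more varied than exact repetitions, which is why the studies below rely on richer generation or self-training schemes.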
- Disfluency detection with unlabeled data and small BERT models. Rocholl et al. Submitted to INTERSPEECH 2021.
- Planning and generating natural and diverse disfluent texts as augmentation for disfluency detection. Yang et al. EMNLP 2020. [bib] [code]
- Combining self-training and self-supervised learning for unsupervised disfluency detection. Wang et al. EMNLP 2020. [bib] [code]
- Improving disfluency detection by self-training a self-attentive model. Jamshid Lou et al. ACL 2020. [bib] [code] [data]
- Auxiliary sequence labeling tasks for disfluency detection. Lee et al. arXiv 2020.
- Multi-task self-supervised learning for disfluency detection. Wang et al. AAAI 2020. [bib]
- Noisy BiLSTM-based models for disfluency detection. Bach et al. INTERSPEECH 2019. [bib]
- Semi-supervised disfluency detection. Wang et al. COLING 2018. [bib]
Most disfluency detection models assume that the full sequence context and rich transcriptions, including pre-segmentation information, are available. These assumptions, however, do not hold in real-time scenarios where the input to the disfluency detector is a live transcript generated by a streaming ASR model. In such cases, a disfluency detector is expected to label the transcript incrementally, token by token, as it arrives. Some studies have proposed new incremental disfluency detectors.
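The incremental setting can be illustrated with a toy tagger that consumes one token at a time and only ever sees the left context. The class below is a hypothetical sketch: the filler list, the `D`/`F` labels, and the repetition rule are assumptions chosen to keep the example short, and they stand in for the learned models used in the papers listed here.

```python
from typing import List

FILLERS = {"uh", "um"}  # assumed toy filler list


class IncrementalTagger:
    """Toy token-by-token tagger: D = disfluent, F = fluent."""

    def __init__(self) -> None:
        self.tokens: List[str] = []
        self.labels: List[str] = []

    def add_token(self, token: str) -> List[str]:
        """Consume one token and return labels for the prefix seen so far."""
        if token in FILLERS:
            label = "D"
        else:
            label = "F"
            # An immediate repetition is only visible once the second copy
            # arrives, so the label of the previous token is revised.
            if self.tokens and self.tokens[-1] == token and self.labels[-1] == "F":
                self.labels[-1] = "D"
        self.tokens.append(token)
        self.labels.append(label)
        return list(self.labels)


tagger = IncrementalTagger()
for tok in "i i want uh a flight".split():
    print(tok, tagger.add_token(tok))
```

The point of the sketch is the interface rather than the rule: labels are emitted as soon as each token arrives, and earlier labels may have to be revised when later tokens reveal a repair, which is one of the trade-offs incremental detectors have to manage.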
- Re-framing incremental deep language models for dialogue processing with multi-task learning. Rohanian et al. COLING 2020. [bib] [code]
- Recurrent neural networks for incremental disfluency detection. Hough et al. INTERSPEECH 2015. [bib]
- Joint incremental disfluency detection and dependency parsing. Honnibal et al. TACL 2014. [bib]
Most disfluency detectors are applied as an intermediate step between speech recognition and a downstream task. Unlike conventional pipeline models, some studies have explored end-to-end speech recognition and disfluency removal.
- Improved robustness to disfluencies in RNN-Transducer based speech recognition. Mendelev et al. arXiv 2020. [bib]
- End-to-end speech recognition and disfluency removal. Jamshid Lou et al. EMNLP Findings 2020. [bib] [code]
While most end-to-end speech translation studies have explored translating read speech, a few studies examine end-to-end conversational speech translation, where the task is to directly translate disfluent source speech into fluent target text.
- NAIST’s machine translation systems for IWSLT 2020 conversational speech translation task. Fukuda et al. IWSLT 2020. [bib]
- Generating fluent translations from disfluent text without access to fluent references: IIT Bombay@IWSLT2020. Saini et al. IWSLT 2020. [bib]
- Fluent translations from disfluent speech in end-to-end speech translation. Salesky et al. NAACL 2019. [bib] [data]
- Segmentation and disfluency removal for conversational speech translation. Hassan et al. INTERSPEECH 2014. [bib]
- Analysis of disfluency in children’s speech. Tran et al. INTERSPEECH 2020. [bib]
- Speech disfluencies occur at higher perplexities. Sen. Cognitive Aspects of the Lexicon Workshop 2020. [bib]
- Controllable time-delay transformer for real-time punctuation prediction and disfluency detection. Chen et al. ICASSP 2020. [bib]
- Expectation and locality effects in the prediction of disfluent fillers and repairs in English speech. Dammalapati et al. NAACL Student Research Workshop 2019. [bib]
- Disfluencies and human speech transcription errors. Zayats et al. INTERSPEECH 2019. [bib] [data]
- Unediting: detecting disfluencies without careful transcripts. Zayats et al. NAACL 2015. [bib]
- The role of disfluencies in topic classification of human-human conversations. Boulis et al. AAAI Workshop 2005.
- Preliminaries to a theory of speech disfluencies. Shriberg. PhD Thesis 1994. [bib]
- Disfluent Speech Segments Detection and Remediation. Arbajian. PhD Thesis 2019.
Paria Jamshid Lou paria.jamshid-lou@hdr.mq.edu.au