List of benchmarks to evaluate the quality of the intent-matching and entity-recognition components of your chatbot. Given the myriad of NLP/NLU libraries available to build your own chatbot (DialogFlow, Amazon Lex, Rasa, NLP.js, Xatkit, BESSER Bot Framework...), it's important to have datasets we can use to benchmark them.
To evaluate the quality of intent-matching and entity recognition components, we cannot just use raw NLP datasets. We need datasets that include:
- The user utterance
- The intent that should be matched given that utterance
- The list of entities that should be identified in that utterance
Even better, the dataset should already come with separate training, validation and test splits, so that different authors/vendors can replicate and report comparable evaluation results for a given library. The sketch below shows what such an annotated example could look like.
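As a reference, here is a minimal sketch of the kind of annotated example and splits such a dataset provides. The field names are hypothetical; every dataset uses its own schema:

```python
# A minimal sketch of an annotated example for intent matching + entity
# recognition. Field names are hypothetical; each dataset defines its own schema.
example = {
    "utterance": "Turn on the kitchen lights at 7 pm",
    "intent": "SwitchLightOn",
    "entities": [
        {"type": "room", "value": "kitchen", "start": 12, "end": 19},
        {"type": "time", "value": "7 pm", "start": 30, "end": 34},
    ],
}

# Ideally the dataset also ships with predefined splits:
dataset = {"train": [example], "validation": [], "test": []}
```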
- NLU Evaluation Corpora. Three corpora that can be used to evaluate chatbots or other conversational interfaces. Two of the corpora were extracted from StackExchange and one from a Telegram chatbot. For instance, these corpora have been used in this benchmark.
- Home automation corpora. Natural language data for human-robot interaction in the home domain, with 25K entries. The SLURP dataset adds the corresponding acoustic data on top of this textual data, so it can also be used to test voice bots.
- MASSIVE. A parallel dataset of > 1M utterances across 52 languages. Utterances span 60 intents and include 55 slot types. MASSIVE was created by localizing the SLURP dataset mentioned above.
- Clinc. An evaluation dataset for intent classification, with a focus on testing out-of-scope predictions.
- Kaggle dataset for intent classification and NER. It covers 7 intents, and the data is in JSON format with each entity tagged in the utterance.
- HINT3. Three new datasets created from live chatbots in diverse domains. Intent-matching data only.
- Banking77. A fine-grained set of intents in the banking domain. It comprises 13,083 customer service queries labeled with 77 intents (see the evaluation sketch after this list).
- XitXat. A conversational dataset in Catalan, made of 950 chatbot conversations across 10 different domains.
- OpenAssistant Conversations Dataset (OASST1). A human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees.
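To illustrate how these datasets can drive a benchmark, here is a rough sketch of an intent-matching evaluation loop over Banking77. It assumes the dataset is published on the Hugging Face Hub under the `banking77` identifier (with `text` and integer `label` columns); `predict_intent` is a hypothetical placeholder for whatever library you are evaluating:

```python
# Rough sketch of an intent-matching benchmark, assuming Banking77 is
# available on the Hugging Face Hub as "banking77" with "text"/"label" columns.
from datasets import load_dataset


def predict_intent(utterance: str) -> str:
    """Hypothetical placeholder: call the intent-matching component of the
    library under test (DialogFlow, Lex, Rasa, NLP.js, ...) here."""
    return "unknown"  # dummy prediction so the sketch runs end to end


def evaluate(split: str = "test") -> float:
    ds = load_dataset("banking77", split=split)
    intent_names = ds.features["label"].names  # map integer label -> intent name
    correct = sum(
        predict_intent(row["text"]) == intent_names[row["label"]] for row in ds
    )
    return correct / len(ds)


if __name__ == "__main__":
    print(f"Intent-matching accuracy: {evaluate():.2%}")
```

The same loop generalizes to entity recognition by comparing predicted spans against the annotated ones, typically reporting precision/recall/F1 instead of plain accuracy.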
Research works discussing, proposing or comparing NLP benchmarks:
- Benchmarking Commercial Intent Detection Services with Practice-Driven Evaluations
- Benchmarking Natural Language Understanding Services for building Conversational Agents
- Datasets for other NLP tasks: https://github.com/niderhoff/nlp-datasets
- Great post to get you started on the fascinating world of building your own chatbot platform
Feel free to open an issue or submit a pull request with any NLP dataset for chatbots that we may be missing (thanks!).