Open ASR Corpora

A list of open(ish) corpora for Automatic Speech Recognition research and development.

This list has a preference for free (i.e. no $ cost) and truly open corpora (i.e. some kind of Creative Commons license).

Not every corpus listed here meets those criteria, but all of them are accessible and usable for research and/or commercial use. Some paid corpora with restrictive licenses (e.g. from the LDC) may be included, given their wide use in research and industry.

Feel free to propose additions to the list!

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| CommonVoice English | English | 582 hours (validated); 803 hours (total) | 33,541 speakers (reported: 10% female / 41% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| CommonVoice German | German | 140 hours (validated); 146 hours (total) | 2,249 speakers (reported: 5% female / 76% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| CommonVoice French | French | 74 hours (validated); 79 hours (total) | 1,697 speakers (reported: 7% female / 72% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| CommonVoice Welsh | Welsh | 21 hours (validated); 22 hours (total) | 365 speakers (reported: 26% female / 43% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| CommonVoice Breton | Breton | 2 hours (validated); 7 hours (total) | 82 speakers (reported: 2% female / 43% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| CommonVoice Chuvash | Chuvash | <1 hour (validated); 2 hours (total) | 33 speakers (reported: 0% female / 46% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| CommonVoice Turkish | Turkish | 5 hours (validated); 6 hours (total) | 203 speakers (reported: 7% female / 75% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| CommonVoice Tatar | Tatar | 20 hours (validated); 20 hours (total) | 117 speakers (reported: 2% female / 80% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| CommonVoice Kyrgyz | Kyrgyz | 5 hours (validated); 6 hours (total) | 63 speakers (reported: 6% female / 80% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| CommonVoice Irish | Irish | 1 hour (validated); 1 hour (total) | 30 speakers (reported: 22% female / 57% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| CommonVoice Kabyle | Kabyle | 92 hours (validated); 98 hours (total) | 382 speakers (reported: 17% female / 53% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| CommonVoice Catalan | Catalan | 92 hours (validated); 98 hours (total) | 1,639 speakers (reported: 44% female / 38% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| CommonVoice Chinese (Taiwan) | Mandarin (Taiwan) | 19 hours (validated); 28 hours (total) | 695 speakers (reported: 35% female / 38% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| CommonVoice Slovenian | Slovenian | 1 hour (validated); 3 hours (total) | 18 speakers (reported: 17% female / 82% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| CommonVoice Italian | Italian | 15 hours (validated); 19 hours (total) | 313 speakers (reported: 7% female / 67% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| CommonVoice Dutch | Dutch | 12 hours (validated); 13 hours (total) | 373 speakers (reported: 2% female / 74% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| CommonVoice Hakha Chin | Hakha Chin | 2 hours (validated); 4 hours (total) | 253 speakers (reported: 22% female / 26% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| CommonVoice Esperanto | Esperanto | 4 hours (validated); 6 hours (total) | 53 speakers (reported: 10% female / 21% male) | https://voice.mozilla.org/en/datasets | CC-0 |
| Yesno | Hebrew | 6 mins | one male | http://www.openslr.org/1/ | CC-0 |

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| African Speech Technology English-English Speech Corpus | English | ~21 hours | | https://repo.sadilar.org/handle/20.500.12185/283 | CC-BY 2.5 South Africa |
| African Speech Technology isiXhosa Speech Corpus | isiXhosa | ~26 hours | | https://repo.sadilar.org/handle/20.500.12185/305 | CC-BY 2.5 South Africa |
| NCHLT Afrikaans | Afrikaans | 56 hours | 210 speakers (98 female / 112 male) | https://repo.sadilar.org/handle/20.500.12185/280 | CC-BY 3.0 |
| NCHLT English | English | 56 hours | 210 speakers (100 female / 110 male) | https://repo.sadilar.org/handle/20.500.12185/274 | CC-BY 3.0 |
| NCHLT isiNdebele | isiNdebele | 56 hours | 148 speakers (78 female / 70 male) | https://repo.sadilar.org/handle/20.500.12185/272 | CC-BY 3.0 |
| NCHLT isiXhosa | isiXhosa | 56 hours | 209 speakers (106 female / 103 male) | https://repo.sadilar.org/handle/20.500.12185/279 | CC-BY 3.0 |
| NCHLT isiZulu | isiZulu | 56 hours | 210 speakers (98 female / 112 male) | https://repo.sadilar.org/handle/20.500.12185/275 | CC-BY 3.0 |
| NCHLT Sepedi | Sepedi | 56 hours | 210 speakers (100 female / 110 male) | https://repo.sadilar.org/handle/20.500.12185/270 | CC-BY 3.0 |
| NCHLT Sesotho | Sesotho | 56 hours | 210 speakers (113 female / 97 male) | https://repo.sadilar.org/handle/20.500.12185/278 | CC-BY 3.0 |
| NCHLT Setswana | Setswana | 56 hours | 210 speakers (109 female / 101 male) | https://repo.sadilar.org/handle/20.500.12185/281 | CC-BY 3.0 |
| NCHLT Siswati | Siswati | 56 hours | 197 speakers (96 female / 101 male) | https://repo.sadilar.org/handle/20.500.12185/271 | CC-BY 3.0 |
| NCHLT Tshivenda | Tshivenda | 56 hours | 208 speakers (83 female / 125 male) | https://repo.sadilar.org/handle/20.500.12185/276 | CC-BY 3.0 |
| NCHLT Xitsonga | Xitsonga | 56 hours | 198 speakers (95 female / 103 male) | https://repo.sadilar.org/handle/20.500.12185/277 | CC-BY 3.0 |
| Lwazi II Cross-lingual Proper Name Corpus | Afrikaans; English; isiZulu; Sesotho | 2 hours 5 mins | 20 speakers | https://repo.sadilar.org/handle/20.500.12185/445 | CC-BY 3.0 |
| Lwazi II Proper Name Call Routing Telephone Corpus | English | 2 hours 7 mins | | https://repo.sadilar.org/handle/20.500.12185/448 | CC-BY 3.0 |
| Lwazi II Afrikaans Trajectory Tracking Corpus | Afrikaans | 4 hours | one male | https://repo.sadilar.org/handle/20.500.12185/442 | CC-BY 3.0 |
| LibriSpeech | English | ~1000 hours | 2484 speakers (1201 female / 1283 male) | http://www.openslr.org/12/ | CC-BY 4.0 |
| Zeroth-Korean | Korean | 52.8 hours | 115 speakers | http://www.openslr.org/40/ | CC-BY 4.0 |
| Speech Commands | English | 17.8 hours | >1,000 speakers | https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html | CC-BY 4.0 |

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Iban | Iban | 8 hours | | http://www.openslr.org/24/ https://github.com/sarahjuan/iban | CC-BY-SA 2.0 |
| Vystadial | English; Czech | 41 hours; 15 hours | | http://www.openslr.org/6/ | CC-BY-SA 3.0 US |
| Free Spoken Digit Dataset | English | 2,000 isolated digits | 4 speakers | https://github.com/Jakobovski/free-spoken-digit-dataset | CC-BY-SA 4.0 |
| Google Javanese | Javanese | 296 hours | 1019 speakers | http://www.openslr.org/35/ | CC-BY-SA 4.0 |
| Google Nepali | Nepali | 165 hours | 527 speakers | http://www.openslr.org/54/ | CC-BY-SA 4.0 |
| Google Bengali | Bengali | 229 hours | 508 speakers | http://www.openslr.org/53/ | CC-BY-SA 4.0 |
| Google Sinhala | Sinhala | 224 hours | 478 speakers | http://www.openslr.org/52/ | CC-BY-SA 4.0 |
| Google Sundanese | Sundanese | 333 hours | 542 speakers | http://www.openslr.org/36/ | CC-BY-SA 4.0 |
| SWC-2017 | English; German; Dutch | 182 hours; 249 hours; 79 hours | 395 speakers; 339 speakers; 145 speakers | https://nats.gitlab.io/swc/ | CC-BY-SA 4.0 |
| Chuvash TTS | Chuvash | 4 hours | 1 speaker | https://github.com/ftyers/Turkic_TTS | CC-BY-SA 4.0 |

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| IBM Recorded Debates v1 | English | 5 hours | 10 speakers | https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis | CC-BY-ND |
| IBM Recorded Debates v2 | English | ~14 hours | 14 speakers | https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis | CC-BY-ND |

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| CHiME-Home | English | 6.8 hours | | https://archive.org/details/chime-home | CC-BY-NC-SA 3.0 |

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Tatoeba-Eng | English | ~250 hours (rough estimate) | 6 speakers | https://voice.mozilla.org/en/datasets | CC BY-NC 4.0 (some audio) / CC BY-NC-ND 3.0 (most audio) / CC BY 2.0 (all text) |
| TED-LIUM | English | 118 hours | 685 speakers (36h female / 81h male) | http://www.openslr.org/7/ | CC-BY-NC-ND 3.0 |
| TED-LIUM-2 | English | 207 hours | 1242 speakers (66h female / 141h male) | http://www.openslr.org/19/ | CC-BY-NC-ND 3.0 |
| TED-LIUM-3 | English | 452 hours | 2028 speakers (134h female / 316h male) | http://www.openslr.org/51/ | CC-BY-NC-ND 3.0 |
| Pansori TEDxKR | Korean | 3 hours | 41 speakers | http://www.openslr.org/58/ | CC-BY-NC-ND 4.0 |
| Primewords Mandarin | Mandarin | 100 hours | 296 speakers | http://www.openslr.org/47/ | CC-BY-NC-ND 4.0 |

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| VoxForge | English | ~120 hours | ~2966 speakers | http://www.voxforge.org/home/downloads https://voice.mozilla.org/en/datasets | GNU-GPL 3.0 |

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| AISHELL-1 | Mandarin | 170 hours | 400 speakers | http://www.openslr.org/33/ | Apache 2.0 |
| Tunisian_MSA | Modern Standard Arabic (Tunisia) | 11.2 hours | 118 speakers | http://www.openslr.org/46/ | Apache 2.0 |
| African Accented French | French | 22 hours | 232 speakers | http://www.openslr.org/57/ | Apache 2.0 |

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| ALFFA | Amharic; Hausa (paid); Swahili; Wolof | | | http://www.openslr.org/25/ https://github.com/besacier/ALFFA_PUBLIC | MIT |

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| CMU Wilderness | 700 Langs | Alignments distributed without audio or text; total: ~14,000 hours; per lang: ~20 hours | | https://github.com/festvox/datasets-CMU_Wilderness | Questionable Legality: https://live.bible.is/terms |
| CHiME-5 | English | 50 hours | 48 speakers | http://spandh.dcs.shef.ac.uk/chime_challenge/data.html | CHiME-5 License |
| FalaBrasil-LAPS-Constituicao | Brazilian-Portuguese | 9 hours | 1 speaker | https://drive.google.com/uc?export=download&confirm=SrvW&id=1Nf849u-27CYRzJqedLaI-FaZfMRO7FT | "Transcribed audio datasets and normalized text datasets (no punctuation, numbers written out in full, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free datasets are being made available." (translated from Portuguese) |
| FalaBrasil-LaPSMail | Brazilian-Portuguese | 1 hour | 25 speakers | https://drive.google.com/uc?export=download&confirm=PecV&id=1B_Vq8MDSE4fBQefVxqCGSl-EcKAcjJLb | "Transcribed audio datasets and normalized text datasets (no punctuation, numbers written out in full, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free datasets are being made available." (translated from Portuguese) |
| FalaBrasil-LaPS Benchmark | Brazilian-Portuguese | 1 hour | 1 speaker | https://drive.google.com/uc?export=download&confirm=XFfF&id=1nZ8L9nJTt4blFC0RGT9Y7XRu02aAvDIo | "Transcribed audio datasets and normalized text datasets (no punctuation, numbers written out in full, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free datasets are being made available." (translated from Portuguese) |
| Fearless Steps Corpus | English | 19,000 hours (20 hours transcribed) | ~450 speakers | http://fearlesssteps.exploreapollo.org/ | |
| Microsoft Speech Corpus (Indian languages) | Telugu; Tamil; Gujarati | | | https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e | Non-Commercial Microsoft Speech Corpus (Indian Languages) License |
| Microsoft Speech Language Translation Corpus | English; Chinese; Japanese | | | https://msropendata.com/datasets/54813518-4ea6-4c39-9bb2-b0d1e5f0c187 | Non-Commercial Microsoft Research Data License Agreement |
| Hey Snips Corpus | English | 11K positive "Hey Snips" (~4.4 hours) and 87K negative (~89 hours) utterances | 2215 speakers (positive & negative) and 4028 speakers (negative only) | https://research.snips.ai/datasets/keyword-spotting | Snips Data License |
| Snips SLU Corpus | English; French | 1660 "Smart Lights EN" (~1.3 hours), 1286 "Smart Speaker EN" (~55 minutes), 1138 "Smart Speaker FR" (~50 minutes) utterances | English: 69 speakers; French: 30 speakers | https://research.snips.ai/datasets/spoken-language-understanding | Snips Data License |