This file provides information about Persian ASR datasets. Each dataset has a corresponding folder that includes scripts to download and prepare the data for use in training ASR models. All dataset folders follow the same structure for consistency.
Each dataset folder contains the following:
- `download.sh`: Shell script to download the dataset to its corresponding folder.
- `prepare.py`: Python script to prepare the dataset by creating subset files in JSON Lines (`.jsonl`) format.
This project uses the NeMo toolkit's convention for representing dataset subsets. Each dataset subset (such as `train` and `test`) is represented by a corresponding `[subset].jsonl` file. Each line in a `.jsonl` file is a JSON object with the following fields:
- `id`: Unique identifier for the audio sample.
- `text`: The transcription of the audio.
- `duration`: Duration of the audio in seconds.
- `audio_filepath`: Path to the audio file.
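As an illustration of the format described above, here is a minimal sketch of one manifest entry and a reader for a `.jsonl` file. The function name `read_manifest` and all values in `example_entry` are hypothetical and chosen only for this example; they do not come from any of the datasets.

```python
import json
from pathlib import Path


def read_manifest(path):
    """Read a .jsonl manifest and return a list of dicts, one per audio sample."""
    entries = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                entries.append(json.loads(line))
    return entries


# A hypothetical entry; every value here is made up for illustration.
example_entry = {
    "id": "sample_0001",
    "text": "سلام دنیا",
    "duration": 2.35,
    "audio_filepath": "clips/sample_0001.wav",
}

# ensure_ascii=False keeps the Persian text readable in the file
# instead of escaping it as \uXXXX sequences.
example_line = json.dumps(example_entry, ensure_ascii=False)
```

Each line of a subset file would then look like `example_line`, one JSON object per line with no enclosing array.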
The `prepare.py` script in each dataset folder creates these `.jsonl` files.
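The following is a rough sketch of the kind of work such a script might do, not the actual `prepare.py` from any dataset folder. It assumes the audio is PCM WAV (so the standard-library `wave` module can read it) and that transcripts are already available as `(path, text)` pairs. The helper names `wav_duration` and `write_manifest`, and the choice of the file stem as `id`, are assumptions made for this example.

```python
import json
import wave
from pathlib import Path


def wav_duration(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()


def write_manifest(samples, out_path):
    """Write (audio_path, text) pairs as a JSON Lines manifest.

    `samples` is a hypothetical iterable of (path, transcript) tuples.
    """
    with Path(out_path).open("w", encoding="utf-8") as f:
        for audio_path, text in samples:
            audio_path = Path(audio_path)
            entry = {
                "id": audio_path.stem,  # assumption: file stem is unique
                "text": text,
                "duration": round(wav_duration(audio_path), 3),
                "audio_filepath": str(audio_path),
            }
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

Datasets distributed in other audio formats (for example MP3) would need a different way to measure duration; that part is outside this sketch.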
- Navigate to the desired dataset folder.
- Run `download.sh` to download the dataset: `./download.sh`
  Note: For the Common Voice Fa dataset, there is no direct link for automatic download. Please manually download the dataset from the Mozilla Common Voice website before running `download.sh`.
- Run `prepare.py` to prepare the data and generate the required `.jsonl` files: `python prepare.py`
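After the steps above, it can be useful to sanity-check a generated file. The sketch below is one possible check, not part of the repository: it verifies that every entry carries the four documented fields and reports the sample count and total duration in hours. The name `summarize_manifest` is hypothetical.

```python
import json

REQUIRED_FIELDS = ("id", "text", "duration", "audio_filepath")


def summarize_manifest(path):
    """Return (num_samples, total_hours) for a .jsonl manifest.

    Raises ValueError if an entry lacks one of the required fields.
    """
    count, seconds = 0, 0.0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)
            missing = [k for k in REQUIRED_FIELDS if k not in entry]
            if missing:
                raise ValueError(f"{path}:{lineno}: missing fields {missing}")
            count += 1
            seconds += float(entry["duration"])
    return count, seconds / 3600
```

For example, `summarize_manifest("train.jsonl")` could be run inside a dataset folder once `prepare.py` has finished.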
The following datasets are included in this directory:
Name | Sample Rate | Duration | Samples | Speakers |
---|---|---|---|---|
Shenasa AI¹ | 16 kHz | 200 GB | Crawled | - |
Common Voice Fa | - | 300-400 hrs | - | - |
ArmanAV | - | 220 hrs | - | 1700 |
Deepmine | - | - | 370K | >1400 |
ASR Farsi Youtube¹ | - | - | >140K | Crawled |
Farsdat² | 22.5 kHz | - | - | 300 |
ShEmo | 44.1 kHz | 3.5 hrs | 3000 | 87 |
Persian Speech Corpus | - | 2.5 hrs | 399 | 1 |
SFAVD | - | - | - | - |

¹ These datasets were crawled from the internet and do not have exact labels.

² These datasets are not free.