Persian ASR Datasets

This README describes the Persian automatic speech recognition (ASR) datasets in this directory. Each dataset has its own folder containing scripts to download the data and prepare it for training ASR models. All dataset folders follow the same structure for consistency.

Directory Structure

Each dataset folder contains the following:

download.sh: Shell script to download the dataset to its corresponding folder.
prepare.py: Python script to prepare the dataset by creating subset files in JSON Lines (.jsonl) format.

Dataset Representation

This project uses the NeMo toolkit's convention for representing dataset subsets. Each dataset subset (such as train and test) is represented by a corresponding [subset].jsonl file. Each line in a .jsonl file is a JSON object with the following fields:

id: Unique identifier for the audio sample.
text: The transcription of the audio.
duration: Duration of the audio in seconds.
audio_filepath: Path to the audio file.

The prepare.py script in each dataset folder creates these .jsonl files.
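
As an illustration of the format described above, here is a minimal sketch of how a script could write such a file using only the Python standard library. The function name `write_manifest` and the shape of the input records are assumptions made for this example; the actual prepare.py scripts may be organized differently.

```python
import json
from pathlib import Path


def write_manifest(samples, manifest_path):
    """Write one JSON object per line with the four fields described above.

    `samples` is assumed to be an iterable of dicts that already carry
    id, text, duration (in seconds), and audio_filepath.
    """
    manifest_path = Path(manifest_path)
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    with manifest_path.open("w", encoding="utf-8") as f:
        for sample in samples:
            entry = {
                "id": sample["id"],
                "text": sample["text"],
                "duration": round(float(sample["duration"]), 3),
                "audio_filepath": str(sample["audio_filepath"]),
            }
            # ensure_ascii=False keeps Persian text readable in the file
            # instead of turning it into \uXXXX escape sequences.
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

For example, `write_manifest([{"id": "utt_0001", "text": "سلام", "duration": 1.52, "audio_filepath": "audio/utt_0001.wav"}], "train.jsonl")` would produce a train.jsonl file with a single line. The identifier and path here are hypothetical.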

Usage

1. Navigate to the desired dataset folder.
2. Run download.sh to download the dataset.
   ```
   ./download.sh
   ```
   Note: For the Common Voice Fa dataset, there is no direct link for automatic download. Please manually download the dataset from the Mozilla Common Voice website before running download.sh.
3. Run prepare.py to prepare the data and generate the required .jsonl files.
   ```
   python prepare.py
   ```
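
After prepare.py finishes, it can be useful to sanity-check a generated .jsonl file. The following is a minimal sketch, not part of the repository: `summarize_manifest` is a hypothetical helper that checks each line for the four fields described under Dataset Representation and reports the sample count and total duration.

```python
import json

# Fields every line is expected to carry, per the Dataset Representation section.
REQUIRED_FIELDS = ("id", "text", "duration", "audio_filepath")


def summarize_manifest(manifest_path):
    """Return (number of samples, total duration in hours) for a .jsonl file."""
    count = 0
    total_seconds = 0.0
    with open(manifest_path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            missing = [key for key in REQUIRED_FIELDS if key not in entry]
            if missing:
                raise ValueError(
                    f"{manifest_path}:{line_number}: missing fields {missing}"
                )
            count += 1
            total_seconds += float(entry["duration"])
    return count, total_seconds / 3600.0
```

Calling `summarize_manifest("train.jsonl")` returns a tuple such as the sample count and the number of hours, which can be compared against the figures in the Datasets table.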

Datasets

The following datasets are included in this directory:

| Name | Sample Rate | Duration / Size | Samples | Speakers |
| --- | --- | --- | --- | --- |
| Shenasa AI¹ | 16 kHz | 200 GB | Crawled | - |
| Common Voice Fa | - | 300-400 hrs | - | - |
| ArmanAV | - | 220 hrs | - | 1700 |
| Deepmine | - | - | 370K | >1400 |
| ASR Farsi Youtube¹ | - | - | >140K | Crawled |
| Farsdat² | 22.5 kHz | - | - | 300 |
| ShEmo | 44.1 kHz | 3.5 hrs | 3000 | 87 |
| Persian Speech Corpus | - | 2.5 hrs | 399 | 1 |
| SFAVD | - | - | - | - |

¹ These datasets were crawled from the internet and do not have exact labels.

² This dataset is not free.
