Persian ASR Datasets

This README describes the Persian ASR (automatic speech recognition) datasets in this directory. Each dataset has its own folder containing scripts that download the data and prepare it for training ASR models. All dataset folders follow the same structure.

Directory Structure

Each dataset folder contains the following:

  • download.sh: Shell script to download the dataset to its corresponding folder.
  • prepare.py: Python script to prepare the dataset by creating subset files in JSON Lines (.jsonl) format.
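
Because every dataset folder is expected to hold the same two scripts, the layout can be checked mechanically. The following is a minimal sketch, not part of the repository: the helper name `missing_files` is hypothetical, and it assumes every non-hidden subdirectory of the given root is a dataset folder.

```python
from pathlib import Path

# Files each dataset folder is expected to contain, per the list above.
REQUIRED_FILES = ("download.sh", "prepare.py")


def missing_files(root):
    """Map each dataset folder under `root` to the required files it lacks.

    Assumption: every non-hidden subdirectory of `root` is a dataset folder.
    """
    problems = {}
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir() or folder.name.startswith("."):
            continue
        missing = [name for name in REQUIRED_FILES if not (folder / name).is_file()]
        if missing:
            problems[folder.name] = missing
    return problems
```

Calling `missing_files(".")` from the directory that holds the dataset folders returns an empty dictionary when every folder is complete.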

Dataset Representation

This project uses the NeMo toolkit's convention for representing dataset subsets. Each dataset subset (such as train and test) is represented by a corresponding [subset].jsonl file. Each line in a .jsonl file is a JSON object with the following fields:

  • id: Unique identifier for the audio sample.
  • text: The transcription of the audio.
  • duration: Duration of the audio in seconds.
  • audio_filepath: Path to the audio file.

The prepare.py script in each dataset folder creates these .jsonl files.
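
To make the record layout concrete, here is a minimal sketch of writing such a file in Python. It is not the code from any prepare.py in this repository. The helper name `write_manifest` and every value in the sample record are made up for illustration; only the four field names come from the list above.

```python
import json


def write_manifest(records, manifest_path):
    """Write one JSON object per line (JSON Lines)."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps Persian characters readable
            # instead of emitting \uXXXX escapes.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical sample: the id, text, duration, and path are invented.
example = {
    "id": "sample_0001",
    "text": "سلام دنیا",
    "duration": 2.4,
    "audio_filepath": "audio/sample_0001.wav",
}
```

For instance, `write_manifest([example], "train.jsonl")` would produce a one-line train.jsonl file containing the record above.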

Usage

  1. Navigate to the desired dataset folder.
  2. Run download.sh to download the dataset.
    ./download.sh
    Note: The Common Voice Fa dataset has no direct link for automatic download. Download it manually from the Mozilla Common Voice website before running download.sh.
  3. Run prepare.py to prepare the data and generate the required .jsonl files.
    python prepare.py
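
After the steps above, a quick pass over a generated file can catch malformed lines before training. The following is a minimal sketch under stated assumptions: the helper name `summarize_manifest` is hypothetical, and the file name in the usage note assumes a subset called train, which may differ per dataset.

```python
import json

# Field names taken from the Dataset Representation section.
REQUIRED_FIELDS = ("id", "text", "duration", "audio_filepath")


def summarize_manifest(manifest_path):
    """Return (sample_count, total_hours); raise if a line lacks a field."""
    count = 0
    total_seconds = 0.0
    with open(manifest_path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [key for key in REQUIRED_FIELDS if key not in record]
            if missing:
                raise ValueError(f"line {line_number}: missing fields {missing}")
            count += 1
            total_seconds += float(record["duration"])
    return count, total_seconds / 3600.0
```

For example, `summarize_manifest("train.jsonl")` returns the number of samples and the total audio duration in hours, which can be compared against the figures in the table below.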

Datasets

The following datasets are included in this directory:

| Name | Sample Rate | Duration | Samples | Speakers |
|---|---|---|---|---|
| Shenasa AI ¹ | 16 kHz | 200 GB | Crawled | - |
| Common Voice Fa | - | 300-400 hrs | - | - |
| ArmanAV | - | 220 hrs | - | 1700 |
| Deepmine | - | - | 370K | >1400 |
| ASR Farsi Youtube ¹ | - | - | >140K | Crawled |
| Farsdat ² | 22.5 kHz | - | - | 300 |
| ShEmo | 44.1 kHz | 3.5 hrs | 3000 | 87 |
| Persian Speech Corpus | - | 2.5 hrs | 399 | 1 |
| SFAVD | - | - | - | - |

1 These datasets were crawled from the internet and do not have exact labels.

2 These datasets are not free.