This tool simplifies the creation of a Persian ASR dataset by aligning utterances within large audio files. It employs the CTC-segmentation algorithm (Ludwig Kürzinger et al., 2020) to determine the most probable alignment between corresponding text and speech. To address the
- Create a corresponding transcript (one sentence per line) for each audio file.
- Create a CSV file that contains relative paths to the audio files and their transcripts. Sample metadata.csv:
audio_path,transcript_path
audios/1.mp3,transcripts/1.txt
audios/2.mp3,transcripts/2.txt
The sample_input directory inside the repository contains an example.
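If the audio files and transcripts already share file names (as in the sample above, where 1.mp3 pairs with 1.txt), the metadata file can be generated rather than written by hand. The following is a minimal sketch, not part of this repository: the helper name build_metadata, the default directory names, and the pairing-by-file-stem rule are all assumptions made for illustration.

```python
import csv
from pathlib import Path


def build_metadata(root, audio_dir="audios", transcript_dir="transcripts",
                   out_name="metadata.csv", audio_ext=".mp3"):
    """Hypothetical helper: pair each audio file with the transcript sharing its stem."""
    root = Path(root)
    rows = []
    for audio in sorted((root / audio_dir).glob(f"*{audio_ext}")):
        transcript = root / transcript_dir / f"{audio.stem}.txt"
        if not transcript.is_file():
            continue  # skip audio files that have no matching transcript
        # Store paths relative to `root`, using forward slashes as in the sample.
        rows.append((audio.relative_to(root).as_posix(),
                     transcript.relative_to(root).as_posix()))
    with open(root / out_name, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["audio_path", "transcript_path"])
        writer.writerows(rows)
    return rows
```

Calling build_metadata(".") from the directory that holds audios/ and transcripts/ writes metadata.csv next to them and returns the list of paired paths. Audio files without a transcript are skipped silently, so comparing the returned list against the audio directory is a quick way to spot missing transcripts.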
python segment.py \
--metadata metadata.csv \
--output_dir output \
--device cuda
Run python segment.py -h for more information about the arguments.
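Because segmenting large audio files can take a while, it may be worth checking the metadata file before starting a run. The sketch below is an illustrative pre-flight check, not something segment.py provides: check_metadata is a hypothetical name, and resolving paths relative to the CSV's own directory is an assumption (pass base_dir explicitly if the paths are relative to somewhere else).

```python
import csv
from pathlib import Path


def check_metadata(metadata_path, base_dir=None):
    """Hypothetical pre-flight check: list problems found in a metadata CSV."""
    metadata_path = Path(metadata_path)
    # Assumption: relative paths are resolved against the CSV's directory
    # unless the caller supplies a different base directory.
    base = Path(base_dir) if base_dir is not None else metadata_path.parent
    problems = []
    with open(metadata_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # Line 1 is the header, so data rows start at line 2.
        for line_no, row in enumerate(reader, start=2):
            for column in ("audio_path", "transcript_path"):
                value = (row.get(column) or "").strip()
                if not value:
                    problems.append(f"line {line_no}: missing {column}")
                elif not (base / value).is_file():
                    problems.append(f"line {line_no}: {value} not found")
    return problems
```

An empty list from check_metadata("metadata.csv") means every listed audio file and transcript was found; otherwise each entry names the offending line and path.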