This file exists only to support development.
AUDIO
[TODO]:
- speech to text
- to transcribe speech into text
- INPUT:
- a datasets object with the audio recordings in the "audio" column
- the name of the audio column (default = "audio")
- the speech to text service to use (including the name, the version, the revision, and the language of the transcription model we want to use; the language applies to some services only and is sometimes optional)
- PREPROCESSING:
- adapt the language to the service format
- organize the dataset into batches
- PROCESSING:
- transcribe the dataset
- POSTPROCESSING:
- format the transcripts to follow a standard structure
- OUTPUT:
- a new dataset containing only the transcripts of the audio recordings in a standardized JSON format (plus an index?)
- TESTS:
- test input errors (a field is missing, the audio column is missing or does not contain audio objects, params are missing)
- test that the transcript of a test file is correct
- test that the language is supported (and that the tool handles the error when it is not)
- INPUT:
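
The PREPROCESSING steps of the speech to text task could be sketched as below. This is a minimal sketch in plain Python, not the planned implementation: the names `LANGUAGE_FORMATS`, `adapt_language`, and `batch_rows`, the services `service_a` and `service_b`, and their language formats are all hypothetical, and rows are modeled as plain dicts rather than as a datasets object.

```python
from typing import Dict, Iterator, List, Optional, Sequence

# Hypothetical per-service language formats; real services define their own.
LANGUAGE_FORMATS: Dict[str, Dict[str, str]] = {
    "service_a": {"en": "en", "fr": "fr"},        # plain two-letter codes
    "service_b": {"en": "en-US", "fr": "fr-FR"},  # locale-style codes
}


def adapt_language(service: str, language: Optional[str]) -> Optional[str]:
    """Translate a two-letter language code into the format a service expects."""
    if service not in LANGUAGE_FORMATS:
        raise ValueError(f"Unknown service: {service!r}")
    if language is None:
        # The language is optional for some services, so None passes through.
        return None
    try:
        return LANGUAGE_FORMATS[service][language.lower()]
    except KeyError:
        raise ValueError(
            f"Language {language!r} is not supported by {service!r}"
        ) from None


def batch_rows(
    rows: Sequence[dict], batch_size: int, audio_column: str = "audio"
) -> Iterator[List[dict]]:
    """Yield consecutive batches of rows after checking the audio column."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Validate every row before yielding the first batch.
    for index, row in enumerate(rows):
        if audio_column not in row:
            raise KeyError(f"Row {index} has no {audio_column!r} column")
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])
```

Because `batch_rows` is a generator, its input checks run when the first batch is requested, not when the function is called. Raising on an unsupported language, instead of silently falling back to a default, is one way to make the "language is supported" test straightforward to write.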
- to compute word error rate
- INPUT:
- a dataset object with the "transcript" and the "groundtruth" columns
- a service with a name (default is jitter)
- PROCESSING:
- compute the per-row WER between the two columns
- OUTPUT:
- a dataset with the "WER" column
- TESTS:
- test input errors (one or more fields are missing, the two columns don't contain strings)
- test that the output is correct
- INPUT:
-
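
The per-row WER computation could be sketched as below. This is a minimal sketch in plain Python that does not call the service named in the notes: it computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the ground truth. The function names `word_error_rate` and `add_wer_column` are hypothetical, rows are modeled as plain dicts, and rejecting an empty ground truth is an assumed convention.

```python
from typing import Dict, List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    if not isinstance(reference, str) or not isinstance(hypothesis, str):
        raise TypeError("reference and hypothesis must both be strings")
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        # Assumed convention: WER is undefined for an empty reference.
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the reference words consumed so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


def add_wer_column(
    rows: List[Dict[str, str]],
    transcript_column: str = "transcript",
    groundtruth_column: str = "groundtruth",
) -> List[dict]:
    """Return copies of the rows with an extra "WER" column."""
    scored = []
    for index, row in enumerate(rows):
        for column in (transcript_column, groundtruth_column):
            if column not in row:
                raise KeyError(f"Row {index} is missing the {column!r} column")
        wer = word_error_rate(row[groundtruth_column], row[transcript_column])
        scored.append({**row, "WER": wer})
    return scored
```

Splitting on whitespace with no case or punctuation normalization is a deliberate simplification here; a real service may normalize the text first, which would change the scores.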