This project builds a new database of audio files, each containing a single word, from a standard speech corpus. The Python scripts in this repository take a speech corpus (e.g. LibriSpeech) and split its audio files, which contain full utterances, into audio files that each contain a single word. The output of the scripts is therefore a set of audio files with one word per file.
- Run Google Cloud Speech-to-Text, or any standard ASR API, on the speech corpus.
- Calculate the utterance-level WER by comparing the obtained hypothesis with the available ground-truth transcript.
- If WER == 0:
  - Perform utterance-to-word trimming based on the word time tags.
  - Store the meaningful meta-information from the original dataset, such as speaker ID, speaker gender, etc.
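The WER check and the time-tag trimming described in the steps above can be sketched as follows. This is a minimal illustration and not the repository's own code: the function names `word_error_rate` and `trim_words`, the lowercase-only text normalization, and the `(word, start_sec, end_sec)` timing format are all assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercasing is enough normalization for this comparison.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def trim_words(samples, sample_rate, word_timings):
    """Cut one utterance into per-word segments.

    samples:      sequence of audio samples for the whole utterance
    sample_rate:  samples per second (16000 for LibriSpeech)
    word_timings: list of (word, start_sec, end_sec) tuples (hypothetical
                  format; adapt it to what your ASR returns)
    """
    segments = []
    for word, start_sec, end_sec in word_timings:
        first = int(round(start_sec * sample_rate))
        last = int(round(end_sec * sample_rate))
        segments.append((word, samples[first:last]))
    return segments


# Only utterances that the ASR transcribed perfectly are trimmed.
reference = "THE CAT SAT"
hypothesis = "the cat sat"
if word_error_rate(reference, hypothesis) == 0:
    utterance = list(range(16000))  # one second of placeholder samples
    timings = [("the", 0.0, 0.3), ("cat", 0.3, 0.6), ("sat", 0.6, 1.0)]
    words = trim_words(utterance, 16000, timings)
```

Keeping only utterances with WER == 0 means every word in the hypothesis matches the reference, so the ASR time tags can be trusted to line up with the ground-truth words.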
- In most spoken languages, the boundaries between lexical units are difficult to identify; there are hardly any pauses between consecutive words.
- The inter-word spaces used by many written languages rarely correspond to pauses in their spoken form. Only in very slow speech does the speaker deliberately insert such pauses.
- In normal speech, one typically finds many consecutive words said with no pauses between them, and the final sounds of one word often blend or fuse with the initial sounds of the next word (the co-articulation effect).
- By default, the scripts are written for the LibriSpeech corpus. To use a different corpus, check config.py and the instructions given below.
- Set the LibriSpeech corpus path in config.py.
- Set the path of the Google API key, which is required to run Google Cloud ASR.
- Create the conda environment from environment_utt_to_words.yaml, then run the top-level script main_utt_to_words.py. In your shell (terminal):
$ conda env create -f environment_utt_to_words.yaml
$ source activate utterance_to_word
$ python3 main_utt_to_words.py
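As a rough illustration of the path settings mentioned above, a config.py might contain entries like the ones below. This is a sketch under stated assumptions: `path_to_corpus` is the setting named in this README, while `google_api_key_path`, the helper `export_google_credentials`, and both example paths are hypothetical. The Google Cloud client libraries read the key location from the `GOOGLE_APPLICATION_CREDENTIALS` environment variable; the repository may set it differently.

```python
import os
from pathlib import Path

# Example locations only; replace them with the paths on your machine.
path_to_corpus = Path("/data/LibriSpeech/train-clean-100")

# Hypothetical setting name for the service-account key file.
google_api_key_path = Path("/home/user/keys/google_asr_key.json")


def export_google_credentials(key_path: Path) -> None:
    """Point the Google Cloud client libraries at the key file."""
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(key_path)


export_google_credentials(google_api_key_path)
```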
- First, set path_to_corpus in config.py.
- Set speaker_meta_info_file_name and its path in config.py.
- Set the path of the Google API key in config.py; it is required to run Google Cloud ASR.
- corpus_functions.py contains two corpus-specific functions, extract_reference_transcript and find_speaker_gender. Modify them to match the layout of your corpus.
- If the file format or sampling rate of the audio files in your corpus differs from LibriSpeech, change the encoding and sample_rate_hertz values in the following portion of the run_ASR method in transcribe.py: config = types.RecognitionConfig(encoding=enums.RecognitionConfig.AudioEncoding.FLAC, sample_rate_hertz=16000, language_code='en-US', enable_word_time_offsets=True)
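To give a feel for what adapting the two corpus-specific functions involves, here is a minimal sketch of how they could look for LibriSpeech's own layout. The function names come from this README, but the signatures and bodies below are assumptions for illustration and are not the repository's implementation. The sketch relies on two LibriSpeech conventions: each `*.trans.txt` file holds one `utterance_id TRANSCRIPT` pair per line, and SPEAKERS.TXT holds `|`-separated fields (ID, sex, subset, ...) with comment lines starting with `;`.

```python
def extract_reference_transcript(trans_file, utterance_id):
    """Return the transcript of utterance_id from a *.trans.txt file.

    Hypothetical signature; returns None if the utterance is not listed.
    """
    with open(trans_file, encoding="utf-8") as handle:
        for line in handle:
            # Each line is "<utterance_id> <TRANSCRIPT WORDS ...>".
            uid, _, text = line.strip().partition(" ")
            if uid == utterance_id:
                return text
    return None


def find_speaker_gender(speakers_file, speaker_id):
    """Return the sex field ('M' or 'F') for speaker_id from SPEAKERS.TXT.

    Hypothetical signature; returns None if the speaker is not listed.
    """
    with open(speakers_file, encoding="utf-8") as handle:
        for line in handle:
            # Skip comment lines and blank lines.
            if line.startswith(";") or not line.strip():
                continue
            fields = [field.strip() for field in line.split("|")]
            if len(fields) >= 2 and fields[0] == str(speaker_id):
                return fields[1]
    return None
```

For another corpus, only the parsing inside these two functions should need to change, as long as they still return the transcript text and the speaker gender.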
For questions, please contact: samuisuman@gmail.com